
Conversation

@IvanYashchuk (Collaborator) commented Nov 13, 2025

The HybridChunkedCache used in the inference benchmark is deprecated and should be replaced with StaticCache (https://github.com/huggingface/transformers/blob/ce40ca0d4c7d2e0a3f8bd3ddc30f29c6a105efb5/src/transformers/cache_utils.py#L1356).

This PR also removes unused keyword arguments when initializing StaticCache.

cc @crcrpar

@IvanYashchuk (Collaborator, Author) commented:

@kshitij12345, @riccardofelluga could you please review the change?

@riccardofelluga self-requested a review on November 13, 2025, 14:06
@kshitij12345 (Collaborator) left a comment:

With pjnl-20251113 (and transformers version 4.55.4), running python thunder/benchmarks/benchmark_inference.py --model-name meta-llama/Llama-4-Maverick-17B-128E --mode eager --input-length 1024 --output-length 32 --batch-size 1 --num-iterations 20 --num-layers 2

leads to

Warming up with 10 iterations...
Traceback (most recent call last):
  File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 733, in <module>
    main()
  File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 722, in main
    benchmark.run_benchmark()
  File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 458, in run_benchmark
    input_ids, past_key_values = self.generate_batch()
                                 ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_inference.py", line 342, in generate_batch
    past_key_values = StaticCache(
                      ^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/cache_utils.py", line 1451, in __init__
    super().__init__(layer_classes=StaticLayer, *args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/transformers/cache_utils.py", line 1110, in __init__
    self.append_new_layers(self.num_hidden_layers - 1)
  File "/usr/local/lib/python3.12/dist-packages/transformers/cache_utils.py", line 1172, in append_new_layers
    new_layer = new_layer_class(**kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: StaticLayer.__init__() missing 1 required positional argument: 'batch_size'

# Transformers deprecated HybridChunkedCache in favour of static in 4.55.x
past_key_values = StaticCache(
    config=self.hf_config,
    max_batch_size=input_ids.shape[0],
Collaborator:

Looking at the error here, I think max_batch_size is required.

@IvanYashchuk (Collaborator, Author) replied:

Thank you for running it with transformers version 4.55.4! I was running with the latest release. I need to update the requirements pin before merging this change.
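
For illustration only, the pin bump would be a one-line change to the transformers requirement, something like the entry below; the file location and exact lower bound are placeholders, not the actual change:

# hypothetical requirements entry; the real file and version bound may differ
transformers>=4.56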

    max_batch_size=input_ids.shape[0],
    max_cache_len=input_ids.shape[1] + self.config.output_length,
    device=DEVICE,
    dtype=torch.bfloat16,
Collaborator:

Also, device and dtype seem necessary -

from transformers.cache_utils import StaticCache
from transformers import AutoConfig
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E"
config = AutoConfig.from_pretrained(model_id)

if hasattr(config, "text_config"):
    config = config.text_config

config.num_hidden_layers = 2

past_key_values = StaticCache(config=config, max_batch_size=1, max_cache_len=256)

print(past_key_values.layers[0].keys.dtype)  # torch.float32
print(past_key_values.layers[0].keys.device)  # cpu

past_key_values = StaticCache(config=config, max_batch_size=1, max_cache_len=256, dtype=torch.bfloat16, device="cuda")

print(past_key_values.layers[0].keys.dtype)  # torch.bfloat16
print(past_key_values.layers[0].keys.device)  # cuda:0

@riccardofelluga (Collaborator) left a comment:

Good idea to move on to the StaticCache. Just need a couple of fixes on the arguments of the object.

Does perf improve?

dummy_key_states = torch.empty(1, self.hf_config.num_key_value_heads // WORLD_SIZE, 1, 1, device=DEVICE)
past_key_values.initialise_cache_layer(layer_idx, dummy_key_states)
past_key_values = StaticCache(
    config=self.hf_config,
Collaborator:

Suggested change
-    config=self.hf_config,
+    config=self.hf_config,
+    max_batch_size=input_ids.shape[0],

past_key_values.initialise_cache_layer(layer_idx, dummy_key_states)
past_key_values = StaticCache(
    config=self.hf_config,
    max_cache_len=input_ids.shape[1] + self.config.output_length,
Collaborator:

Also device and dtype seem to be required:

RuntimeError: Expected all tensors to be on the same device, but got mat2 is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA_bmm)
Suggested change
-    max_cache_len=input_ids.shape[1] + self.config.output_length,
+    max_cache_len=input_ids.shape[1] + self.config.output_length,
+    device=DEVICE,
+    dtype=torch.bfloat16,
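
That RuntimeError arises because the cache tensors default to CPU float32 while the model and inputs live on CUDA. A minimal standalone illustration of the same failure mode (not the benchmark code):

import torch

a = torch.randn(1, 2, 3, device="cuda")  # stands in for activations on the GPU
b = torch.randn(1, 3, 4)                 # defaults to CPU, like a cache allocated without device=
torch.bmm(a, b)                          # raises a device-mismatch RuntimeError like the one quoted above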

@IvanYashchuk (Collaborator, Author) commented:

Good idea to move on to the StaticCache.

It's not moving on; StaticCache is already used because of the if LooseVersion(transformers.__version__) >= LooseVersion("4.55"): line.
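
Roughly, that gate selects the cache class as sketched below; the LooseVersion import source and the pre-4.55 fallback shown here are assumptions, not the benchmark's exact code:

import transformers
from looseversion import LooseVersion  # assumed import source; the benchmark may import it differently

if LooseVersion(transformers.__version__) >= LooseVersion("4.55"):
    from transformers.cache_utils import StaticCache as CacheClass
else:
    # older transformers releases, where HybridChunkedCache was still the choice (assumed fallback)
    from transformers.cache_utils import HybridChunkedCache as CacheClass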

for layer_idx in range(self.hf_config.num_hidden_layers):
    # key_states.shape[1] is used to retrieve the number of key value heads, all other dimensions can be 1 and ignored
    # https://github.com/huggingface/transformers/blob/9300728665aaeb0ebf4db99f9d9fbce916b4a183/src/transformers/cache_utils.py#L1822
    dummy_key_states = torch.empty(1, self.hf_config.num_key_value_heads // WORLD_SIZE, 1, 1, device=DEVICE)
Collaborator:

We also need to preserve hf_config.num_key_value_heads // WORLD_SIZE for the distributed setting.

The patch can be something like the following:

diff --git a/thunder/benchmarks/benchmark_inference.py b/thunder/benchmarks/benchmark_inference.py
index 212f5f8e..13af8175 100644
--- a/thunder/benchmarks/benchmark_inference.py
+++ b/thunder/benchmarks/benchmark_inference.py
@@ -339,9 +339,15 @@ class InferenceBenchmark:
         input_length = self.config.input_length
 
         input_ids = torch.randint(0, self.vocab_size, (batch_size, input_length), device=DEVICE)
+        import copy
+        hf_config = copy.copy(self.hf_config)
+        hf_config.num_key_value_heads //= WORLD_SIZE
         past_key_values = StaticCache(
-            config=self.hf_config,
+            config=hf_config,
             max_cache_len=input_ids.shape[1] + self.config.output_length,
+            max_batch_size=batch_size,
+            dtype=torch.bfloat16,
+            device=DEVICE,
         )
 
         return input_ids, past_key_values
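
Spelled out without the diff markup, and with the rationale as comments (same content as the patch above, inside generate_batch; the comments are my reading of it):

import copy

# Shallow-copy the config so self.hf_config itself stays untouched, then fold the
# per-rank split into the copy so StaticCache allocates num_key_value_heads // WORLD_SIZE
# heads on each rank in the distributed setting.
hf_config = copy.copy(self.hf_config)
hf_config.num_key_value_heads //= WORLD_SIZE

past_key_values = StaticCache(
    config=hf_config,
    max_cache_len=input_ids.shape[1] + self.config.output_length,
    max_batch_size=batch_size,
    dtype=torch.bfloat16,
    device=DEVICE,
)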
