Phi-3.5-mini QNN example broken by latest transformers #2051

@jake-leland-dell

Description

Describe the bug
The Olive GptqQuantizer pass fails with RuntimeError: The size of tensor a (32) must match the size of tensor b (96) at non-singleton dimension 3
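The failure occurs during AutoGPTQ's layer-by-layer quantization: layer 1/32 quantizes cleanly, and the error is raised on the forward pass of layer 2/32, inside transformers' apply_rotary_pos_emb for Phi-3 (see logs below).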

To Reproduce
Follow the Phi-3.5-mini example: https://github.com/microsoft/Olive/tree/main/examples/phi3_5
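In a fresh environment this amounts to installing the example's dependencies per its README and then running olive run --config qnn_config.json with transformers 4.54.1 installed.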

Expected behavior
olive run --config qnn_config.json should complete successfully.

Olive config
Phi-3.5-mini example config: https://github.com/microsoft/Olive/blob/main/examples/phi3_5/qnn_config.json

Olive logs

...
[2025-08-04 19:30:21,431] [INFO] [engine.py:686:_run_pass] Running pass g:gptqquantizer
WARNING - AutoGPTQ has stopped development. Please transition to GPTQModel: https://github.com/ModelCloud/GPTQModel
GPTQModel has been merged into Transformers/Optimum and full deprecation of AutoGPTQ within HF frameworks is planned in the near-future.
/venv-quant/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:410: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/venv-quant/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:418: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/venv-quant/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 160.16it/s]
INFO - Start quantizing layer 1/32
INFO - Quantizing self_attn.qkv_proj in layer 1/32...
INFO - Quantizing self_attn.o_proj in layer 1/32...
INFO - Quantizing mlp.gate_up_proj in layer 1/32...
INFO - Quantizing mlp.down_proj in layer 1/32...
INFO - Start quantizing layer 2/32
[2025-08-04 19:30:30,591] [ERROR] [engine.py:755:_run_pass] Pass run failed.
Traceback (most recent call last):
  File "/venv-quant/lib/python3.12/site-packages/olive/engine/engine.py", line 743, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/systems/local.py", line 29, in run_pass
    output_model = the_pass.run(model, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/passes/olive_pass.py", line 242, in run
    output_model = self._run_for_config(model, self.config, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/passes/pytorch/autogptq.py", line 175, in _run_for_config
    quantized_model.quantize(dataset)
  File "/venv-quant/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/auto_gptq/modeling/_base.py", line 334, in quantize
    layer(*layer_input, **additional_layer_inputs)
  File "/venv-quant/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/transformers/models/phi3/modeling_phi3.py", line 260, in forward
    hidden_states, self_attn_weights = self.self_attn(
                                       ^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/transformers/models/phi3/modeling_phi3.py", line 185, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/transformers/models/phi3/modeling_phi3.py", line 139, in apply_rotary_pos_emb
    q_embed = torch.cat([(q_rot * cos) + (rotate_half(q_rot) * sin), q_pass], dim=-1)
                          ~~~~~~^~~~~
RuntimeError: The size of tensor a (32) must match the size of tensor b (96) at non-singleton dimension 3
[2025-08-04 19:30:30,593] [WARNING] [engine.py:318:run_accelerator] Failed to run Olive on npu-qnn.
Traceback (most recent call last):
  File "/venv-quant/lib/python3.12/site-packages/olive/engine/engine.py", line 314, in run_accelerator
    output_footprint = self._run_no_search(input_model_config, input_model_id, accelerator_spec, output_dir)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/engine/engine.py", line 358, in _run_no_search
    should_prune, signal, model_ids = self._run_passes(input_model_config, input_model_id, accelerator_spec)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/engine/engine.py", line 642, in _run_passes
    model_config, model_id = self._run_pass(
                             ^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/engine/engine.py", line 743, in _run_pass
    output_model_config = host.run_pass(p, input_model_config, output_model_path)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/systems/local.py", line 29, in run_pass
    output_model = the_pass.run(model, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/passes/olive_pass.py", line 242, in run
    output_model = self._run_for_config(model, self.config, output_model_path)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/olive/passes/pytorch/autogptq.py", line 175, in _run_for_config
    quantized_model.quantize(dataset)
  File "/venv-quant/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/auto_gptq/modeling/_base.py", line 334, in quantize
    layer(*layer_input, **additional_layer_inputs)
  File "/venv-quant/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/transformers/models/phi3/modeling_phi3.py", line 260, in forward
    hidden_states, self_attn_weights = self.self_attn(
                                       ^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/transformers/models/phi3/modeling_phi3.py", line 185, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv-quant/lib/python3.12/site-packages/transformers/models/phi3/modeling_phi3.py", line 139, in apply_rotary_pos_emb
    q_embed = torch.cat([(q_rot * cos) + (rotate_half(q_rot) * sin), q_pass], dim=-1)
                          ~~~~~~^~~~~
RuntimeError: The size of tensor a (32) must match the size of tensor b (96) at non-singleton dimension 3
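
For illustration, the failing operation is q_rot * cos inside transformers' apply_rotary_pos_emb: the rotary slice of the query states and the cos table disagree in their last dimension (32 vs. 96). A minimal sketch that reproduces the same broadcast error with hypothetical shapes:

import torch

# Shapes are hypothetical, chosen only to match the error message: the rotary
# slice of the query states has last dimension 32, while the cos table covers
# the full head dimension of 96, so elementwise multiplication cannot broadcast.
q_rot = torch.randn(1, 32, 8, 32)  # (batch, num_heads, seq_len, rotary_dim)
cos = torch.randn(1, 1, 8, 96)     # (batch, 1, seq_len, head_dim)
q_rot * cos
# RuntimeError: The size of tensor a (32) must match the size of tensor b (96)
# at non-singleton dimension 3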

Other information

  • OS: Ubuntu 22.04.4 LTS
  • Olive version: main (70b5beb)
  • Transformers package version: transformers==4.54.1

Additional context
Downgrading to transformers==4.53.* avoids the error.
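A temporary workaround is to pin the dependency before running the pass, for example pip install "transformers==4.53.*". The break is likely related to changes in the Phi-3 rotary-embedding code path in transformers 4.54 (note the q_rot/q_pass split in the traceback above), so the example or the GptqQuantizer pass presumably needs an update for the newer release.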
