
[Bug]: Another issue with Inductor partition codegen for attn+nvfp4 quant fusion #26988

@ProExpertProg

Description


Your current environment

The output of python collect_env.py

🐛 Describe the bug

Happens on:

Command:

pytest tests/compile/test_fusions_e2e.py -s -v

# not tested but should be reproducible on just #26738 with:
python examples/offline_inference/basic/generate.py --model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 --kv-cache-dtype=fp8 -O.pass_config='{"enable_noop":true, "enable_attn_fusion": true}' -O.use_inductor_graph_partition

# tested, also reproduces:
python examples/offline_inference/basic/generate.py --model RedHatAI/Qwen3-30B-A3B-NVFP4 --kv-cache-dtype=fp8 --no-enable-prefix-caching -O.pass_config='{"enable_attn_fusion":true,"enable_noop":true}' -O.use_inductor_graph_partition=True -O.cudagraph_mode=FULL_AND_PIECEWISE

# this works:
chg run -g=1 -- python examples/offline_inference/basic/generate.py --model RedHatAI/Qwen3-30B-A3B-NVFP4 --kv-cache-dtype=fp8 --no-enable-prefix-caching -O.pass_config='{"enable_attn_fusion":false,"enable_noop":true}' -O.use_inductor_graph_partition=True -O.cudagraph_mode=FULL_AND_PIECEWISE

Failing test; note that the same model with FP8 quant (`nvidia/Llama-4-Scout-17B-16E-Instruct-FP8`) succeeds, and the same model also works without Inductor partition:

FAILED tests/compile/test_fusions_e2e.py::test_attn_quant[True-nvidia/Llama-4-Scout-17B-16E-Instruct-FP4-model_kwargs2-_Backend.FLASHINFER-48-96-] - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
============================ 1 failed, 5 passed, 8 skipped, 2 warnings in 832.85s (0:13:52) ============================

The last part of the stack trace:

(EngineCore_DP0 pid=2090223)   File "/home/ProExpertProg/git/vllm2/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=2090223)     return compiled_fn(runtime_args)
(EngineCore_DP0 pid=2090223)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2090223)   File "/home/ProExpertProg/git/vllm2/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 613, in __call__
(EngineCore_DP0 pid=2090223)     return self.current_callable(inputs)
(EngineCore_DP0 pid=2090223)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2090223)   File "/tmp/torchinductor_ProExpertProg/tmp5ccrosm8/3w/c3w6voddnusogdjqye5gnwhaq6h5yvgtn33akkevdv44fkyfjaqn.py", line 34387, in call
(EngineCore_DP0 pid=2090223)     del buf15
(EngineCore_DP0 pid=2090223)         ^^^^^
(EngineCore_DP0 pid=2090223) UnboundLocalError: cannot access local variable 'buf15' where it is not associated with a value
[rank0]:[W1016 00:28:16.398287624 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
FAILED

Upon inspection, there is a stray `del buf15` in the output code even though `buf15` is never allocated in that scope. If I had to guess, it is a leftover relic of a node/tensor corresponding to the unquantized attention output.
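The failure mode can be reproduced in isolation: in Python, `del` on a name counts as a binding in the enclosing function, so executing `del buf15` before (or without) any assignment raises exactly the `UnboundLocalError` seen in the trace. A minimal sketch (the function name and comments are hypothetical, just mirroring the shape of the generated Inductor wrapper):

```python
# Minimal sketch of the bad codegen pattern, assuming the fusion pass
# removed the allocation of buf15 but left its deallocation line behind.

def generated_call_sketch():
    # In the unfused graph the wrapper would allocate the buffer here, e.g.:
    #     buf15 = empty_strided_cuda(...)
    # After attn+nvfp4 fusion that allocation is gone, but the cleanup remains:
    try:
        del buf15  # raises UnboundLocalError: never bound in this scope
    except UnboundLocalError as e:
        return str(e)

print(generated_call_sketch())
```

This matches the traceback: the generated `call` function only ever references `buf15` in the `del` statement, so Python treats it as an unassigned local.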

