Your current environment
The output of python collect_env.py
Your output of `python collect_env.py` here
🐛 Describe the bug
Happens on:
- commit: 3ae0f1936cb4218be90a1d00d29ad55d0a2ccef0
- PRs: [DO NOT MERGE] 2.9, Inductor partition, standalone compile, monkeypatch fix(es) #26738 + [torch.compile] Enable attention and allreduce fusion without custom ops enabled #24604
Command:
pytest tests/compile/test_fusions_e2e.py -s -v
# not tested but should be reproducible on just #26738 with:
python examples/offline_inference/basic/generate.py --model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4 --kv-cache-dtype=fp8 -O.pass_config='{"enable_noop":true, "enable_attn_fusion": true}' -O.use_inductor_graph_partition
# tested, also reproduces:
python examples/offline_inference/basic/generate.py --model RedHatAI/Qwen3-30B-A3B-NVFP4 --kv-cache-dtype=fp8 --no-enable-prefix-caching -O.pass_config='{"enable_attn_fusion":true,"enable_noop":true}' -O.use_inductor_graph_partition=True -O.cudagraph_mode=FULL_AND_PIECEWISE
# this works:
chg run -g=1 -- python examples/offline_inference/basic/generate.py --model RedHatAI/Qwen3-30B-A3B-NVFP4 --kv-cache-dtype=fp8 --no-enable-prefix-caching -O.pass_config='{"enable_attn_fusion":false,"enable_noop":true}' -O.use_inductor_graph_partition=True -O.cudagraph_mode=FULL_AND_PIECEWISE
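For reference, a rough Python equivalent of the failing invocation (a sketch only; the compilation_config field names are assumed to match the CompilationConfig API at the commit above):

from vllm import LLM

# Sketch: mirrors the failing CLI command above. The pass_config,
# use_inductor_graph_partition, and cudagraph_mode fields are assumed
# to match the config schema at commit 3ae0f19.
llm = LLM(
    model="RedHatAI/Qwen3-30B-A3B-NVFP4",
    kv_cache_dtype="fp8",
    enable_prefix_caching=False,
    compilation_config={
        "pass_config": {"enable_attn_fusion": True, "enable_noop": True},
        "use_inductor_graph_partition": True,
        "cudagraph_mode": "FULL_AND_PIECEWISE",
    },
)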
Failing test; note that the same model with FP8 quantization (`nvidia/Llama-4-Scout-17B-16E-Instruct-FP8`) succeeds, and the same model also works without inductor partition:
FAILED tests/compile/test_fusions_e2e.py::test_attn_quant[True-nvidia/Llama-4-Scout-17B-16E-Instruct-FP4-model_kwargs2-_Backend.FLASHINFER-48-96-] - RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
============================ 1 failed, 5 passed, 8 skipped, 2 warnings in 832.85s (0:13:52) ============================
The last part of the stack trace:
(EngineCore_DP0 pid=2090223) File "/home/ProExpertProg/git/vllm2/.venv/lib/python3.12/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 526, in wrapper
(EngineCore_DP0 pid=2090223) return compiled_fn(runtime_args)
(EngineCore_DP0 pid=2090223) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2090223) File "/home/ProExpertProg/git/vllm2/.venv/lib/python3.12/site-packages/torch/_inductor/output_code.py", line 613, in __call__
(EngineCore_DP0 pid=2090223) return self.current_callable(inputs)
(EngineCore_DP0 pid=2090223) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=2090223) File "/tmp/torchinductor_ProExpertProg/tmp5ccrosm8/3w/c3w6voddnusogdjqye5gnwhaq6h5yvgtn33akkevdv44fkyfjaqn.py", line 34387, in call
(EngineCore_DP0 pid=2090223) del buf15
(EngineCore_DP0 pid=2090223) ^^^^^
(EngineCore_DP0 pid=2090223) UnboundLocalError: cannot access local variable 'buf15' where it is not associated with a value
[rank0]:[W1016 00:28:16.398287624 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
FAILED
Upon inspection, there's just a `del buf15` in the output code even though `buf15` no longer exists. If I had to guess, it is a leftover relic of a node/tensor corresponding to the unquantized output.
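At the Python level, the failure reduces to `del` on a name that is never bound inside the generated call() function; a minimal sketch (buffer name kept from the trace, everything else hypothetical):

def call(args):
    # In the real ~34k-line generated file, the node that would have
    # produced buf15 appears to have been removed (presumably by the
    # attn+quant fusion), but the `del` that frees it was still emitted.
    del buf15  # UnboundLocalError: cannot access local variable 'buf15'

call([])  # raises the same UnboundLocalError as in the stack trace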
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.