
Handle vLLM 0.17 rank trace filenames #7

Draft
powderluv wants to merge 1 commit into AMD-AGI:main from powderluv:fix-vllm017-trace-rank-detection

Summary

  • match both legacy -rank-0 and current vLLM rank0 torch profiler filenames
  • exclude async_llm coordinator traces from per-rank gap analysis discovery
  • keep the fallback path limited to non-async trace files

Problem

With vLLM 0.17.0, the torch profiler writes per-rank traces with names like:

  • dp0_pp0_tp0_dcp0_ep0_rank0....pt.trace.json.gz
  • dp0_pp0_tp1_dcp0_ep1_rank1....pt.trace.json.gz

and also emits a coordinator trace like:

  • ...async_llm....pt.trace.json.gz

GapAnalyzer.detect_trace_files() matched only the older -rank-0 style. As a result, rank detection failed, the code fell back to globbing *.json.gz, and the async_llm coordinator trace was incorrectly counted as an extra rank.
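For illustration, the matching rule described above could look roughly like the sketch below. This is not the code in the patch: `RANK_RE`, `rank_of`, and `find_rank_traces` are hypothetical names, and the example filenames in the usage note are made-up stand-ins for the elided parts of the real names.

```python
import re

# Hypothetical pattern: "rank" not preceded by a letter, an optional hyphen,
# then the rank digits. Covers both "-rank-0" and "_rank0".
RANK_RE = re.compile(r"(?<![A-Za-z])rank-?(\d+)")


def rank_of(filename):
    """Return the rank encoded in a trace filename, or None if there is none."""
    # Coordinator traces are never treated as per-rank traces.
    if "async_llm" in filename:
        return None
    match = RANK_RE.search(filename)
    return int(match.group(1)) if match else None


def find_rank_traces(filenames):
    """Map rank -> filename for every name that looks like a per-rank trace."""
    ranks = {}
    for name in filenames:
        rank = rank_of(name)
        if rank is not None:
            ranks[rank] = name
    return dict(sorted(ranks.items()))
```

With this sketch, a made-up name such as `dp0_pp0_tp1_dcp0_ep1_rank1.demo.pt.trace.json.gz` resolves to rank 1, `demo-rank-0.pt.trace.json.gz` resolves to rank 0, and any name containing `async_llm` is skipped.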

Validation

Validated against a real vLLM 0.17.0 Kimi K2.5 trace directory from an MI355 run:

  • before: fallback path picked up 5 traces including async_llm
  • after: detector returns exactly 4 rank traces (rank0..rank3)

Notes

I did not add a unit test in this patch because I was validating against the live trace artifacts first. The next clean step would be a small test covering both filename styles and the async_llm exclusion.
