[NV] perf: update MiniMax-M3 FP4 B300 vLLM #1990

Open
anish-shanbhag wants to merge 4 commits into main from codex/minimax-m3-b300-fp4-vllm-update

Conversation

@anish-shanbhag

Collaborator

No description provided.

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.



Comment thread perf-changelog.yaml
Comment on lines +4430 to +4435
- config-keys:
    - minimaxm3-fp4-b300-vllm
  description:
    - "Update Minimax M3 b300 vllm image tag"
    - "Update search space to cover more configs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update

Contributor


🟡 The new perf-changelog.yaml entry has two issues: (1) pr-link uses a branch URL (tree/codex/minimax-m3-b300-fp4-vllm-update) instead of the canonical pull/1990 form used by every other entry in the file — this will 404 once the branch is deleted after merge; (2) the description says "Update search space to cover more configs", but the diff actually narrows the sweep (isl 1024: 7→4 entries, isl 8192: 6→4 entries, all TP8/EP8 and TP4/EP4 lanes dropped). Consider phrasing similar to the neighboring dsr1-fp4-b200-sglang entry (e.g. "Refocus/narrow the search space and drop TP8/EP8 and TP4/EP4 sweeps").
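A minimal sketch of a check that would catch the first issue, assuming the changelog entries have already been parsed into Python dicts. The function name `non_canonical_pr_links` and the two sample entries are hypothetical and only illustrate the pull/<num> convention; nothing in this thread indicates such a check exists in the repo.

```python
import re

# Canonical form described in this review: .../InferenceX/pull/<num>
CANONICAL_PR_LINK = re.compile(
    r"https://github\.com/SemiAnalysisAI/InferenceX/pull/\d+"
)


def non_canonical_pr_links(entries):
    """Return (index, link) pairs whose pr-link is not a pull/<num> URL.

    `entries` is assumed to be a list of dicts already loaded from
    perf-changelog.yaml; a missing pr-link is reported as an empty string.
    """
    bad = []
    for index, entry in enumerate(entries):
        link = entry.get("pr-link", "")
        if not CANONICAL_PR_LINK.fullmatch(link):
            bad.append((index, link))
    return bad


# Hypothetical sample: the entry from this PR, then its corrected form.
entries = [
    {
        "config-keys": ["minimaxm3-fp4-b300-vllm"],
        "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update",
    },
    {
        "config-keys": ["minimaxm3-fp4-b300-vllm"],
        "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/1990",
    },
]

for index, link in non_canonical_pr_links(entries):
    print(f"entry {index}: non-canonical pr-link {link!r}")
```

Only the branch-style link in the first sample entry is reported; the pull/1990 form passes.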

Extended reasoning...

Issue 1: branch URL in pr-link

The new entry at perf-changelog.yaml line 4435 sets:

    pr-link: https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update

Of the ~554 pr-link values in this file, this is the only one that points to a branch tree/... URL; all others use the canonical https://github.com/SemiAnalysisAI/InferenceX/pull/<num> form (see the four most recent entries at lines 4402–4428 for examples). Since branches are typically deleted after merge, this link will 404 shortly after this PR lands. The correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/1990.

Issue 2: description contradicts the diff

The entry claims:

    - "Update search space to cover more configs"

But the diff clearly narrows the sweep. Step-by-step count from the diff on .github/configs/nvidia-master.yaml:

isl 1024, osl 1024:
- Before (7 entries): {tp:8, c1-64}, {tp:8/ep:8, c1-512}, {tp:4, c1-64}, {tp:4/ep:4, c64-512}, {tp:4/ep:4/dp-attn, c128-512}, {tp:2/ep:2, c16-128}, {tp:8/ep:8/dp-attn, c256-1024}
- After (4 entries): {tp:8, c1-2}, {tp:4, c2-2}, {tp:2, c4-256}, {tp:2/ep:2/dp-attn, c512-1024}

isl 8192, osl 1024:
- Before (6 entries): {tp:8, c1-64}, {tp:8/ep:8, c1-512}, {tp:4, c1-128}, {tp:4/ep:4, c64-256}, {tp:4/ep:4/dp-attn, c64-128}, {tp:8/ep:8/dp-attn, c128-256}
- After (4 entries): {tp:8, c1-2}, {tp:4, c2-2}, {tp:2, c4-256}, {tp:2/ep:2/dp-attn, c512-1024}

Every TP8/EP8 and TP4/EP4 lane is dropped, and the TP8/TP4 lanes are drastically clipped (tp:8 from c1-64 down to c1-2; tp:4 from c1-64 down to a single c2 point). While tp:2 and tp:2/ep:2/dp-attn are genuinely new, the aggregate coverage is significantly smaller (roughly 38 concurrency points → ~12 at isl 1024). Calling this "cover more configs" is the opposite of what the change actually does.

Impact

Both issues are purely documentation quality: no runtime effect, and no CI schema check enforces the URL form. But the perf-changelog is a user-facing history document that people read to understand why a config changed; a stale link plus a misleading description defeats that purpose.

Fix

    - config-keys:
        - minimaxm3-fp4-b300-vllm
      description:
        - "Update Minimax M3 b300 vllm image tag"
        - "Refocus/narrow the search space and drop TP8/EP8 and TP4/EP4 sweeps."
      pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1990

The neighboring dsr1-fp4-b200-sglang entry at lines 4422–4428 uses similar honest phrasing when reducing coverage ("Drop the TP8/EP8 single-concurrency points.").
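The concurrency-point totals quoted for Issue 2 can be reproduced with a short sketch. It assumes each cA-B range expands by doubling the concurrency from A up to B; the thread does not state how the harness expands ranges, so that step rule is an assumption, as are the helper names `concurrency_points` and `total_points`. Under that assumption the isl 1024 counts come out to 38 before and 12 after.

```python
def concurrency_points(lo, hi):
    """Count sweep points from lo to hi inclusive.

    Assumption: concurrency doubles at each step (lo, 2*lo, 4*lo, ...).
    """
    count = 0
    c = lo
    while c <= hi:
        count += 1
        c *= 2
    return count


def total_points(lanes):
    """Sum the point counts over a list of (lo, hi) concurrency ranges."""
    return sum(concurrency_points(lo, hi) for lo, hi in lanes)


# Concurrency ranges transcribed from the before/after lists in this comment.
before_1k = [(1, 64), (1, 512), (1, 64), (64, 512), (128, 512), (16, 128), (256, 1024)]
before_8k = [(1, 64), (1, 512), (1, 128), (64, 256), (64, 128), (128, 256)]
after_both = [(1, 2), (2, 2), (4, 256), (512, 1024)]

print(total_points(before_1k), total_points(after_both))  # 38 12
print(total_points(before_8k), total_points(after_both))  # 32 12
```

With the same doubling rule, the isl 8192 sweep goes from 32 points to 12, so the reduction holds for both sequence lengths.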

@@ -12809,7 +12809,7 @@ minimaxm3-fp8-b300-vllm:
# weights are pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 (added to
# the STAGED_MODELS allow-list in launch_b300-nv.sh).
minimaxm3-fp4-b300-vllm:
-  image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
+  image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226

Contributor


🟡 The block comment just above this line still says the NVFP4 support 'is baked into the perf container image', but the PR swaps the image from the bespoke vllm-minimax-m3-perf-* tag to a mainline nightly-* tag. Update the comment to say the support has landed in vLLM main and is picked up from nightly so future readers don't chase a nonexistent perf image. Note: the same wording appears at lines 12839-12840 for the EAGLE3 variant (whose image is NOT changed in this PR), so that comment remains accurate and should be left as-is.

Extended reasoning...

What the bug is

At .github/configs/nvidia-master.yaml lines 12805-12810, immediately above the minimaxm3-fp4-b300-vllm entry, there is a block comment that states:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) is baked into the perf container image, so no runtime patch is needed.

That wording was true when the config used the bespoke perf-tagged image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. This PR replaces that tag with vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226 — a mainline nightly image, not a perf-variant image. After the swap, the phrase "perf container image" no longer describes what is actually being pulled.

Why the invariant still holds, but the comment is misleading

The underlying claim — "no runtime patch is needed" — is still true, because vllm-project/vllm PR #46380 has presumably landed in vLLM main and the nightly image picks it up. So there is no runtime effect and nothing breaks at benchmark time. The issue is purely that the comment cites the wrong reason: a future reader inspecting this entry will look for a "perf container image" that no longer exists in the config and be confused.

Step-by-step proof

  1. Before this PR, line 12812 reads image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. The tag literally contains minimax-m3-perf, matching "perf container image".
  2. This PR changes line 12812 to image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226. The tag is nightly-<sha>; there is no "perf" in the tag name.
  3. Lines 12805-12810 (comment above the entry) still literally say baked into the perf container image, unchanged by this PR.
  4. Therefore the comment and the code disagree about which image class is in use.

Fix

Update the comment to reflect that MiniMax-M3 NVFP4 support has landed upstream and is now picked up from the mainline nightly image, e.g.:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) has landed in vLLM main and is picked up from the nightly image below, so no runtime patch is needed.

Scope note

The same wording appears at lines 12839-12840 above the EAGLE3 variant (minimaxm3-fp4-eagle3-b300-vllm), which does not change its image in this PR. That comment remains accurate and should be left as-is; only the copy above minimaxm3-fp4-b300-vllm (lines 12805-12810) needs updating.

Severity

This is a documentation-only inconsistency with no runtime effect on the benchmark. Marking as nit.
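The mismatch argued in the step-by-step proof can be expressed as a small check. This is only an illustrative sketch: `image_kind` and `comment_matches_image` are hypothetical helpers, and treating a tag as "nightly" when it is `nightly-` followed by a 40-character hex SHA is an assumption based solely on the tag shown in this PR.

```python
import re

# Assumed shape of a nightly tag, taken from the tag in this PR.
NIGHTLY_TAG = re.compile(r"nightly-[0-9a-f]{40}")


def image_kind(image):
    """Classify an image reference by its tag: 'nightly', 'perf', or 'other'."""
    _, _, tag = image.partition(":")
    if NIGHTLY_TAG.fullmatch(tag):
        return "nightly"
    if "perf" in tag:
        return "perf"
    return "other"


def comment_matches_image(comment, image):
    """Return False when the comment cites the perf image but another kind is configured."""
    cites_perf = "perf container image" in comment
    return not (cites_perf and image_kind(image) != "perf")


old_image = "vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41"
new_image = "vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226"
old_comment = "support is baked into the perf container image, so no runtime patch is needed."
new_comment = "support has landed in vLLM main and is picked up from the nightly image below."

print(comment_matches_image(old_comment, old_image))  # True
print(comment_matches_image(old_comment, new_image))  # False
print(comment_matches_image(new_comment, new_image))  # True
```

The middle case is the state this PR leaves the file in: the old comment paired with the new nightly image.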

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

@Ankur-singh

Collaborator

/reuse-sweep-run

@Ankur-singh

Collaborator

As a PR reviewer and CODEOWNER, I have reviewed this and have:

  • Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this. Link
  • Verified that this PR passes evals. Please link to GitHub Action workflow that shows this. Link
  • Verified that speculative decoding PRs use chat templates to align the AL distribution to real-world usage
  • If a company claims that they support vLLM/SGLang as first-class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

  • Single-node vLLM AGG submission (minimaxm3-fp4-b300-vllm, nvidia/MiniMax-M3-NVFP4, B300). The upstream vLLM recipe PR vllm-project/recipes#577 adds the NVFP4 Blackwell (B200/B300) variant to the MiniMax-M3 recipe (MTP + non-MTP); this InferenceX PR updates the image tag + search space and enables FP8 KV cache and the trtllm all-reduce backend, consistent with that recipe variant.

Signed: ankur-singh

@Klaud-Cold

Collaborator

Verdict: PASS — all sign-off checks independently verified at head f7286c1.

  • Check 0 (CODEOWNER): PASS — @Ankur-singh is a listed owner of .github/configs/nvidia-master.yaml; the other two changed files fall under the catch-all, covered by any recognized CODEOWNER.
  • Check 1 (sweep on in-PR commit): PASS — the PR's only commit f7286c1 carries green executed single-node 1k1k /, single-node 8k1k /, and eval / check-runs from run 28612763227.
  • Check 2 (eval accuracy): PASS — run artifacts show GSM8K em_strict ≈ 0.954 (n=1319) on all three eval lanes, on the PR's image (vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226).
  • Check 3 (recipe): PASS — linked vllm-project/recipes#577 matches all major args: nvidia/MiniMax-M3-NVFP4 on Blackwell B300, TP / TP+EP / DP+EP parallelism, NVFP4 quant (auto-detected from checkpoint), --kv-cache-dtype fp8 (recipe's FP8 KV Cache section), --block-size 128, --language-model-only. Informational only: VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm (vLLM default auto) and harness knobs (gpu-mem-util, cudagraph capture size, prefix-caching off) are InferenceX-specific tuning not in the recipe; the newer nightly image is consistent with the recipe's own note to use a nightly once vLLM #46380 lands.
  • Check 4 (reuse command): PASS — /reuse-sweep-run posted by Ankur-singh (COLLABORATOR).

@anish-shanbhag changed the title from "[WIP] perf: update MiniMax-M3 FP4 B300 vLLM" to "[NV] perf: update MiniMax-M3 FP4 B300 vLLM" on Jul 2, 2026