[NV] perf: update MiniMax-M3 FP4 B300 vLLM by anish-shanbhag · Pull Request #1990 · SemiAnalysisAI/InferenceX

anish-shanbhag · 2026-07-02T18:28:35Z

No description provided.

github-actions · 2026-07-02T18:28:43Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

感谢你的贡献！对于 vLLM 与 SGLang，请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致，请先创建一个 PR，之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准，使整个 ML 社区都能从你的辛勤工作中受益！谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。如果选择重新运行失败的任务，PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档：https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言，PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准，然后再请求核心维护者审阅。

如需更多帮助，PR 作者可通过 Slack 联系核心维护者。

claude · 2026-07-02T18:33:29Z

+- config-keys:
+    - minimaxm3-fp4-b300-vllm
+  description:
+    - "Update Minimax M3 b300 vllm image tag"
+    - "Update search space to cover more configs"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update


🟡 The new perf-changelog.yaml entry has two issues: (1) pr-link uses a branch URL (tree/codex/minimax-m3-b300-fp4-vllm-update) instead of the canonical pull/1990 form used by every other entry in the file — this will 404 once the branch is deleted after merge; (2) the description says "Update search space to cover more configs", but the diff actually narrows the sweep (isl 1024: 7→4 entries, isl 8192: 6→4 entries, all TP8/EP8 and TP4/EP4 lanes dropped). Consider phrasing similar to the neighboring dsr1-fp4-b200-sglang entry (e.g. "Refocus/narrow the search space and drop TP8/EP8 and TP4/EP4 sweeps").

Extended reasoning...

Issue 1 — branch URL in pr-link:\n\nThe new entry at perf-changelog.yaml line 4435 sets:\n\nyaml\npr-link: https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update\n\n\nOf the ~554 pr-link values in this file, this is the only one that points to a branch tree/... URL — all others use the canonical https://github.com/SemiAnalysisAI/InferenceX/pull/<num> form (see the four most recent entries at lines 4402–4428 for examples). Since branches are typically deleted after merge, this link will 404 shortly after this PR lands. The correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/1990.\n\nIssue 2 — description contradicts the diff:\n\nThe entry claims:\n\nyaml\n- "Update search space to cover more configs"\n\n\nBut the diff clearly narrows the sweep. Step-by-step count from the diff on .github/configs/nvidia-master.yaml:\n\nisl 1024, osl 1024:\n- Before (7 entries): {tp:8, c1-64}, {tp:8/ep:8, c1-512}, {tp:4, c1-64}, {tp:4/ep:4, c64-512}, {tp:4/ep:4/dp-attn, c128-512}, {tp:2/ep:2, c16-128}, {tp:8/ep:8/dp-attn, c256-1024}\n- After (4 entries): {tp:8, c1-2}, {tp:4, c2-2}, {tp:2, c4-256}, {tp:2/ep:2/dp-attn, c512-1024}\n\nisl 8192, osl 1024:\n- Before (6 entries): {tp:8, c1-64}, {tp:8/ep:8, c1-512}, {tp:4, c1-128}, {tp:4/ep:4, c64-256}, {tp:4/ep:4/dp-attn, c64-128}, {tp:8/ep:8/dp-attn, c128-256}\n- After (4 entries): {tp:8, c1-2}, {tp:4, c2-2}, {tp:2, c4-256}, {tp:2/ep:2/dp-attn, c512-1024}\n\nEvery TP8/EP8 and TP4/EP4 lane is dropped, and the TP8/TP4 lanes are drastically clipped (tp:8 from c1-64 down to c1-2; tp:4 from c1-64 down to a single c2 point). While tp:2 and tp:2/ep:2/dp-attn are genuinely new, the aggregate coverage is significantly smaller (roughly 38 concurrency points → ~12 at isl 1024). Calling this "cover more configs" is the opposite of what the change actually does.\n\nImpact:\n\nBoth issues are purely documentation quality — no runtime effect, and no CI schema check enforces the URL form. But the perf-changelog is a user-facing history document that people read to understand why a config changed; a stale link plus a misleading description defeats that purpose.\n\nFix:\n\nyaml\n- config-keys:\n - minimaxm3-fp4-b300-vllm\n description:\n - "Update Minimax M3 b300 vllm image tag"\n - "Refocus/narrow the search space and drop TP8/EP8 and TP4/EP4 sweeps."\n pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1990\n\n\nThe neighboring dsr1-fp4-b200-sglang entry at lines 4422–4428 uses similar honest phrasing when reducing coverage ("Drop the TP8/EP8 single-concurrency points.").

claude · 2026-07-02T18:33:31Z

@@ -12809,7 +12809,7 @@ minimaxm3-fp8-b300-vllm:
 # weights are pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 (added to
 # the STAGED_MODELS allow-list in launch_b300-nv.sh).
 minimaxm3-fp4-b300-vllm:
-  image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
+  image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226


🟡 The block comment just above this line still says the NVFP4 support 'is baked into the perf container image', but the PR swaps the image from the bespoke vllm-minimax-m3-perf-* tag to a mainline nightly-* tag. Update the comment to say the support has landed in vLLM main and is picked up from nightly so future readers don't chase a nonexistent perf image. Note: the same wording appears at lines 12839-12840 for the EAGLE3 variant (whose image is NOT changed in this PR), so that comment remains accurate and should be left as-is.

Extended reasoning...

What the bug is

At .github/configs/nvidia-master.yaml lines 12805-12810, immediately above the minimaxm3-fp4-b300-vllm entry, there is a block comment that states:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) is baked into the perf container image, so no runtime patch is needed.

That wording was true when the config used the bespoke perf-tagged image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. This PR replaces that tag with vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226 — a mainline nightly image, not a perf-variant image. After the swap, the phrase "perf container image" no longer describes what is actually being pulled.

Why the invariant still holds, but the comment is misleading

The underlying claim — "no runtime patch is needed" — is still true, because vllm-project/vllm PR #46380 has presumably landed in vLLM main and the nightly image picks it up. So there is no runtime effect and nothing breaks at benchmark time. The issue is purely that the comment cites the wrong reason: a future reader inspecting this entry will look for a "perf container image" that no longer exists in the config and be confused.

Step-by-step proof

Before this PR, line 12812 reads image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. The tag literally contains minimax-m3-perf, matching "perf container image".

This PR changes line 12812 to image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226. The tag is nightly-<sha>; there is no "perf" in the tag name.

Lines 12805-12810 (comment above the entry) still literally say baked into the perf container image, unchanged by this PR.

Therefore the comment and the code disagree about which image class is in use.

Fix

Update the comment to reflect that MiniMax-M3 NVFP4 support has landed upstream and is now picked up from the mainline nightly image, e.g.:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) has landed in vLLM main and is picked up from the nightly image below, so no runtime patch is needed.

Scope note

The same wording appears at lines 12839-12840 above the EAGLE3 variant (minimaxm3-fp4-eagle3-b300-vllm), which does not change its image in this PR. That comment remains accurate and should be left as-is; only the copy above minimaxm3-fp4-b300-vllm (lines 12805-12810) needs updating.

Severity

This is a documentation-only inconsistency with no runtime effect on the benchmark. Marking as nit.

github-actions · 2026-07-02T19:53:27Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28612763227
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28612763227

Ankur-singh · 2026-07-02T20:03:13Z

/reuse-sweep-run

Ankur-singh · 2026-07-02T20:08:01Z

As a PR reviewer and CODEOWNER, I have reviewed this and have:

Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this. Link
Verified that this PR passes evals. Please link to GitHub Action workflow that shows this. Link
Verified that speculative decoding PRs uses chat templates to align the AL distribution to real world
If an company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
Verified that the single-node recipes are similar to the official vLLM recipes and/or theSGLang cookbook:
- If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

Single-node vLLM AGG submission (minimaxm3-fp4-b300-vllm, nvidia/MiniMax-M3-NVFP4, B300). The upstream vLLM recipe PR vllm-project/recipes#577 adds the NVFP4 Blackwell (B200/B300) variant to the MiniMax-M3 recipe (MTP + non-MTP); this InferenceX PR updates the image tag + search space and enables FP8 KV cache and the trtllm all-reduce backend, consistent with that recipe variant.

Signed: ankur-singh

Klaud-Cold · 2026-07-02T20:12:13Z

Verdict: PASS — all sign-off checks independently verified at head f7286c1.

Check 0 (CODEOWNER): PASS — @Ankur-singh is a listed owner of .github/configs/nvidia-master.yaml; the other two changed files fall under the catch-all, covered by any recognized CODEOWNER.
Check 1 (sweep on in-PR commit): PASS — the PR's only commit f7286c1 carries green executed single-node 1k1k /, single-node 8k1k /, and eval / check-runs from run 28612763227.
Check 2 (eval accuracy): PASS — run artifacts show GSM8K em_strict ≈ 0.954 (n=1319) on all three eval lanes, on the PR's image (vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226).
Check 3 (recipe): PASS — linked vllm-project/recipes#577 matches all major args: nvidia/MiniMax-M3-NVFP4 on Blackwell B300, TP / TP+EP / DP+EP parallelism, NVFP4 quant (auto-detected from checkpoint), --kv-cache-dtype fp8 (recipe's FP8 KV Cache section), --block-size 128, --language-model-only. Informational only: VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm (vLLM default auto) and harness knobs (gpu-mem-util, cudagraph capture size, prefix-caching off) are InferenceX-specific tuning not in the recipe; the newer nightly image is consistent with the recipe's own note to use a nightly once vLLM #46380 lands.
Check 4 (reuse command): PASS — /reuse-sweep-run posted by Ankur-singh (COLLABORATOR).

perf: update MiniMax-M3 FP4 B300 vLLM

f7286c1

anish-shanbhag requested a review from a team July 2, 2026 18:28

anish-shanbhag requested review from Ankur-singh, jgangani and kedarpotdar-nv as code owners July 2, 2026 18:28

github-project-automation Bot added this to InferenceMAX Board Jul 2, 2026

anish-shanbhag added the full-sweep-fail-fast label Jul 2, 2026

claude Bot reviewed Jul 2, 2026

View reviewed changes

Ankur-singh approved these changes Jul 2, 2026

View reviewed changes

anish-shanbhag added 2 commits July 2, 2026 13:19

matrix

42bb823

matrix

3269312

anish-shanbhag changed the title ~~[WIP] perf: update MiniMax-M3 FP4 B300 vLLM~~ [NV] perf: update MiniMax-M3 FP4 B300 vLLM Jul 2, 2026

merge main

cc1315e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[NV] perf: update MiniMax-M3 FP4 B300 vLLM#1990

[NV] perf: update MiniMax-M3 FP4 B300 vLLM#1990
anish-shanbhag wants to merge 4 commits into
mainfrom
codex/minimax-m3-b300-fp4-vllm-update

anish-shanbhag commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

claude Bot Jul 2, 2026

Uh oh!

claude Bot Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

Ankur-singh commented Jul 2, 2026

Uh oh!

Ankur-singh commented Jul 2, 2026

Uh oh!

Klaud-Cold commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

anish-shanbhag commented Jul 2, 2026

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

claude Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 2, 2026

Choose a reason for hiding this comment

What the bug is

Why the invariant still holds, but the comment is misleading

Step-by-step proof

Fix

Scope note

Severity

Uh oh!

github-actions Bot commented Jul 2, 2026

Uh oh!

Ankur-singh commented Jul 2, 2026

Uh oh!

Ankur-singh commented Jul 2, 2026

Additional detail section:

Uh oh!

Klaud-Cold commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants