Add MiniMax-M3 NVFP4 B200 single-node aggregated vLLM benchmark by Ankur-singh · Pull Request #1932 · SemiAnalysisAI/InferenceX

Ankur-singh · 2026-06-25T17:31:44Z

Adds the minimaxm3-fp4-b200-vllm config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc), no spec decode.

Config: nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc); sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-1024.
Recipe: benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh — overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --block-size 128 (MSA), --language-model-only.
Weights: pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 — added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh that resolves MODEL_PATH there (the launcher rewrites MODEL to it and bind-mounts it).
perf-changelog entry appended.
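
The MODEL_PATH branch described in the Weights item might look roughly like the following. This is a minimal sketch, not the actual contents of launch_b200-dgxc.sh: the function name `resolve_model_path` and its two positional arguments are hypothetical, and only the staged path is taken from the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map (model prefix, precision) to a pre-staged weights dir.
# Only the MiniMax-M3 NVFP4 path below comes from the PR description.
resolve_model_path() {
  local model_prefix=$1 precision=$2
  if [ "$model_prefix" = "minimaxm3" ] && [ "$precision" = "fp4" ]; then
    echo "/scratch/fsw/models/MiniMax-M3-NVFP4"
  else
    echo "no pre-staged weights for ${model_prefix}/${precision}" >&2
    return 1
  fi
}

# Example: resolve the path, or abort before launching anything.
MODEL_PATH=$(resolve_model_path minimaxm3 fp4) || exit 1
echo "MODEL_PATH=${MODEL_PATH}"
```

Returning non-zero for unknown combinations lets the caller stop early instead of launching with an empty path.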

New minimaxm3-fp4-b200-vllm config (fp4 vLLM aggregated on b200-dgxc). The benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve. Weights are pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added a minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh).

claude · 2026-06-25T17:41:28Z

+for f in \
+  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
+  model_executor/layers/quantization/modelopt.py \
+  model_executor/layers/quantization/utils/flashinfer_utils.py
+do
+  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
+done
+python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"


🔴 The NVFP4 overlay-patch loop at lines 27-33 downloads 3 files from raw.githubusercontent.com with no error handling: the script has no set -e, benchmark_lib.sh does not set it either, and there is no || exit after curl -fsSL. If only modelopt.py or flashinfer_utils.py fails to download (transient 5xx, rate limit, network blip), curl writes no file and the loop continues — the verification at line 34 only imports from the first file (trtllm_nvfp4_moe.py), so the failure is not caught and the benchmark dies much later inside vllm serve with an opaque unrecognized-NVFP4-quant-config error. Fix: add || exit 1 to the curl invocation, or set -euo pipefail at the top — matching the || { echo ...; exit 1; } pattern the sibling minimaxm3_fp8_b200.sh already uses.

Extended reasoning...

What the bug is

The new script benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh overlays vllm-project/vllm PR #46380 onto the installed vLLM package by curl-fetching three source files from raw.githubusercontent.com:

for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

The loop has no error handling. The script does not declare set -e (the only set in the file is set -x on line 65, which is shell tracing), and benchmark_lib.sh sourced at line 9 does not enable -e globally either (its only set -e/set +e calls are scoped inside a single function around lines 1265/1270). The curl invocation also has no || exit / || { …; exit 1; } trailer.
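
The same audit can be approximated mechanically by grepping a script for a top-level errexit setting. A rough sketch: `enables_errexit` is a hypothetical helper, and the pattern is a heuristic that only matches unindented `set -e…` or `set -o errexit` lines, so it ignores scoped uses inside functions and would miss errexit enabled by other means.

```shell
#!/usr/bin/env bash
# Hypothetical heuristic: succeed if the file enables errexit at column 0.
enables_errexit() {
  # Matches `set -e`, `set -xe`, `set -euo pipefail`, and `set -o errexit`.
  # Does not match `set -x`, `set -o pipefail`, or indented (scoped) lines.
  grep -Eq '^set (-[a-df-z]*e[a-z]*|-o errexit)( |$)' "$1"
}

# Example on a throwaway file that only enables tracing.
tmp=$(mktemp)
printf '%s\n' '#!/usr/bin/env bash' 'source benchmark_lib.sh' 'set -x' > "$tmp"
if enables_errexit "$tmp"; then echo "errexit on"; else echo "errexit off"; fi
rm -f "$tmp"
```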

The specific failure path

curl -fsSL returns non-zero on HTTP errors (the -f flag), and crucially, with -f curl writes no output file on failure — the existing site-packages file from the image stays in place untouched. With no set -e and no explicit error check, the loop simply moves to the next iteration; the script then proceeds.

The post-patch verification at line 34 only imports TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe.py — the first file in the loop. If the second (modelopt.py) or third (flashinfer_utils.py) download fails, the verification still passes, because the original stock-vLLM files those names reference are still valid Python modules; they simply lack the NVFP4 quant-config support that PR #46380 added. The benchmark then proceeds to vllm serve, which fails opaquely much later with an unrecognized-NVFP4-quant-config error or an ImportError — far from the actual patch step.
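
The masking effect can be reproduced without any network access. In this sketch `fake_fetch` is a hypothetical stand-in for `curl -fsSL` that fails only for modelopt.py; the loop still runs all three iterations and the function still reports success.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `curl -fsSL ... -o ...`: non-zero only for modelopt.py.
fake_fetch() {
  [ "$1" != "modelopt.py" ]
}

# The body runs in a subshell with errexit explicitly off, mirroring a script
# that never calls `set -e`.
unchecked_loop() (
  set +e
  attempted=0
  for f in trtllm_nvfp4_moe.py modelopt.py flashinfer_utils.py; do
    fake_fetch "$f"   # the failure for modelopt.py is silently dropped
    attempted=$((attempted + 1))
  done
  echo "attempted=${attempted}"
)

unchecked_loop
echo "exit status: $?"
```

Nothing in the loop observes the failed call, so the only place it could be detected is a later check that actually depends on the second file.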

Step-by-step proof

1. raw.githubusercontent.com returns a transient 503 (or rate-limits) for modelopt.py, which is realistic during GitHub Actions runner storms.
2. curl -fsSL …/modelopt.py -o …/modelopt.py exits 22, prints nothing (-s), and writes nothing (-f suppresses the output file on HTTP failure). The stock modelopt.py in the image is untouched.
3. The for-loop ignores the non-zero exit and continues to flashinfer_utils.py (which may end up patched or left as the original).
4. python3 -c "from …trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular" succeeds: file #1 was overwritten correctly, and the symbol the new file defines is importable. It prints [nvfp4-patch] OK.
5. vllm serve nvidia/MiniMax-M3-NVFP4 … starts. Inside vLLM, the modelopt loader is reached for the NVFP4 quant config, but the unpatched modelopt.py does not recognise the NVFP4 variant from PR #46380, so startup fails with an opaque error well after the patch step.

Why existing code doesn't prevent this

The verification command is correct for confirming file #1's overwrite, but is silent on files #2 and #3. The sibling script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh shows the project's own defensive convention for an analogous patching block — its Python heredoc patch is followed by || { echo "… patch failed" >&2; exit 1; } (around line 30). This new fp4 recipe diverges from that convention.
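
A check that covers all three files, rather than only the first, could compare each destination against a marker created just before the download loop. A minimal sketch under stated assumptions: `verify_overlay` and the stamp file are hypothetical, and a freshness test only shows that a file was rewritten, not that its contents are correct.

```shell
#!/usr/bin/env bash
# verify_overlay <stamp_file> <dir> <relative_path>...
# Succeeds only if every listed file exists, is non-empty, and is newer than the stamp.
verify_overlay() {
  local stamp=$1 dir=$2 f
  shift 2
  for f in "$@"; do
    if [ ! -s "${dir}/${f}" ] || [ ! "${dir}/${f}" -nt "$stamp" ]; then
      echo "[nvfp4-patch] ${f} was not overwritten" >&2
      return 1
    fi
  done
}

# Intended use (not executed here):
#   stamp=$(mktemp)            # created before the download loop
#   ...download loop...
#   verify_overlay "$stamp" "$VLLM_DIR" \
#     model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
#     model_executor/layers/quantization/modelopt.py \
#     model_executor/layers/quantization/utils/flashinfer_utils.py || exit 1
```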

Impact

When the failure hits, the benchmark fails inside vllm serve startup with a confusing NVFP4-quant-config or ImportError, not at the patch step where the actual problem lives. That's the worst kind of CI failure — expensive to triage and easy to mis-attribute to a vLLM or model-config issue. raw.githubusercontent.com 5xx / rate limiting / DNS blips during runner storms are realistic, not hypothetical.

Fix

Either of the following one-line fixes works:

# Option A: fail-fast inside the loop
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" || { echo "[nvfp4-patch] failed to fetch ${f}" >&2; exit 1; }

# Option B: enable strict mode at the top of the script
set -euo pipefail

Option A matches the project's existing pattern in minimaxm3_fp8_b200.sh. Option B is broader and would also catch other unchecked failures (hf download, the python3 -c …vllm.__file__ lookup, etc.).
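
Option A can also be factored so the fail-fast behaviour is testable without network access. This is a sketch only: `fetch_all` and `fetch_from_pr` are hypothetical names, and passing the fetcher as an argument is an illustration choice, not something the recipe does.

```shell
#!/usr/bin/env bash
# fetch_all <fetch_cmd> <dest_dir> <relative_path>...
# <fetch_cmd> is invoked as: <fetch_cmd> <relative_path> <dest_file>
# Stops at the first failure and names the file that could not be fetched.
fetch_all() {
  local fetch_cmd=$1 dest=$2 f
  shift 2
  for f in "$@"; do
    "$fetch_cmd" "$f" "${dest}/${f}" || {
      echo "[nvfp4-patch] failed to fetch ${f}" >&2
      return 1
    }
  done
}

# Real fetcher, using the URL and curl flags from the script under review.
fetch_from_pr() {
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/$1" -o "$2"
}

# Intended use (not executed here):
#   fetch_all fetch_from_pr "$VLLM_DIR" \
#     model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
#     model_executor/layers/quantization/modelopt.py \
#     model_executor/layers/quantization/utils/flashinfer_utils.py || exit 1
```

With a stub in place of `fetch_from_pr`, a failure on the second file can be simulated and the early return confirmed.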

functionstackx

fix patchwork as discussed in slack
missing vllm recipes

github-actions · 2026-06-25T20:11:58Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28189599852
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28189599852

github-actions · 2026-06-26T14:02:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28197744776
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28197744776

github-actions · 2026-06-26T17:58:36Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28197744776
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28197744776

Ankur-singh · 2026-06-26T21:44:09Z

/reuse-sweep-run

Ankur-singh · 2026-06-26T21:48:07Z

As a PR reviewer and CODEOWNER, I have reviewed this and have:

Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this. Link
Verified that this PR passes evals. Please link to GitHub Action workflow that shows this. Link
Verified that speculative decoding PRs use chat templates to align the AL distribution to real world
If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
- If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
[] If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

Recipe: Add MiniMax-M3 NVFP4 variant (MTP + non-MTP) vllm-project/recipes#577

Signed: ankur-singh

Klaud-Cold · 2026-06-26T21:50:06Z

PASS — sign-off verified at pinned head 436111d; clear to merge.

Check 0 (CODEOWNER): PASS — @Ankur-singh is a direct owner of .github/configs/nvidia-master.yaml; the other changed paths (benchmarks/..., perf-changelog.yaml, runners/...) fall under the * @InferenceX/core catch-all, covered by any recognized CODEOWNER.
Check 1 (sweep+evals on in-PR commit): PASS — head commit 436111d (in the PR) carries green, executed single-node 1k1k/8k1k * and eval / check-runs. Run 28197744776.
Check 2 (eval accuracy): PASS — 9 gsm8k eval / results, em_strict 0.948–0.959, on MiniMax-M3-NVFP4 under image vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 (matches the PR config image).
Check 3 (recipe linked + complete): PASS — linked vllm-project/recipes#577. Major args match: model nvidia/MiniMax-M3-NVFP4, Blackwell/B200, TP/EP/DP+EP (--enable-expert-parallel / --data-parallel-size), NVFP4 (auto-detected from checkpoint), --block-size 128, --language-model-only. InferenceX harness knobs (gpu-mem-util, max-num-batched-tokens, stream-interval, prefix-caching, image tag) are sweep-specific and not required to match.
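
The major arguments listed in Check 3 can be pictured as a small argument builder. A sketch only: `build_serve_args` is a hypothetical helper, the harness knobs and the DP+EP case are omitted, and whether the actual recipe assembles its flags this way is an assumption.

```shell
#!/usr/bin/env bash
# build_serve_args <tp> <ep: true|false>
# Emits the flags named in Check 3 for one TP/EP sweep point.
build_serve_args() {
  local tp=$1 ep=$2
  local args="--tensor-parallel-size ${tp} --block-size 128 --language-model-only"
  if [ "$ep" = "true" ]; then
    args="${args} --enable-expert-parallel"
  fi
  echo "$args"
}

# Print (not run) the command for one sweep point: TP=8 with expert parallelism.
echo "vllm serve nvidia/MiniMax-M3-NVFP4 $(build_serve_args 8 true)"
```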

Ankur-singh requested a review from a team June 25, 2026 17:31

Ankur-singh requested review from jgangani and kedarpotdar-nv as code owners June 25, 2026 17:31

github-project-automation Bot added this to InferenceMAX Board Jun 25, 2026

Update perf-changelog pr-link for #1932

f2c156b

claude Bot reviewed Jun 25, 2026

Merge remote-tracking branch 'origin/main' into minimaxm3-fp4-b200-vllm

e65ca33

# Conflicts: # perf-changelog.yaml

Ankur-singh added the full-sweep-enabled label Jun 25, 2026

functionstackx requested changes Jun 25, 2026

Ankur-singh added 2 commits June 25, 2026 13:15

Drop runtime NVFP4 patch; bump perf image to ...-8b00f41

ef18622

The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the benchmark script no longer overwrites vLLM files at runtime.

Merge remote-tracking branch 'origin/main' into minimaxm3-fp4-b200-vllm

436111d

# Conflicts: # perf-changelog.yaml

Merge remote-tracking branch 'origin/main' into minimaxm3-fp4-b200-vllm

fe2a535

# Conflicts: # perf-changelog.yaml

adibarra approved these changes Jun 26, 2026

adibarra merged commit a37dbbd into main Jun 26, 2026
27 checks passed

adibarra deleted the minimaxm3-fp4-b200-vllm branch June 26, 2026 22:31

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 26, 2026

claude Bot mentioned this pull request Jul 2, 2026

[WIP] [do not merge] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config #1982

Open

adibarra merged 6 commits into main from minimaxm3-fp4-b200-vllm
