
Add MiniMax-M3 NVFP4 B200 single-node aggregated vLLM benchmark #1932

Merged
adibarra merged 6 commits into main from minimaxm3-fp4-b200-vllm on Jun 26, 2026

Conversation

@Ankur-singh

Collaborator

Adds the minimaxm3-fp4-b200-vllm config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc), no spec decode.

  • Config: nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc); sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-1024.
  • Recipe: benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh — overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --block-size 128 (MSA), --language-model-only.
  • Weights: pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 — added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh that resolves MODEL_PATH there (the launcher rewrites MODEL to it and bind-mounts it).
  • perf-changelog entry appended.
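
The launcher change described in the Weights bullet can be pictured roughly as follows. This is a minimal sketch, not the actual contents of launch_b200-dgxc.sh: the function name resolve_model_path and its two arguments are hypothetical, and only the minimaxm3 && fp4 condition and the /scratch/fsw/models/MiniMax-M3-NVFP4 path are taken from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print the pre-staged weights
# directory for a (model, precision) pair, or return 1 if none is known.
resolve_model_path() {
  local model="$1" precision="$2"
  if [ "$model" = "minimaxm3" ] && [ "$precision" = "fp4" ]; then
    # Path taken from the PR description.
    printf '%s\n' "/scratch/fsw/models/MiniMax-M3-NVFP4"
    return 0
  fi
  return 1
}

# Usage: rewrite MODEL to the local path only when a pre-staged copy exists.
MODEL="nvidia/MiniMax-M3-NVFP4"
if MODEL_PATH="$(resolve_model_path minimaxm3 fp4)"; then
  MODEL="$MODEL_PATH"
fi
echo "serving from: ${MODEL}"
```

Returning a non-zero status for unknown pairs lets the caller keep the original model name instead of serving from an empty path.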

New minimaxm3-fp4-b200-vllm config (fp4 vLLM aggregated on b200-dgxc). The
benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4
support, commit 6c08558) before serve. Weights are pre-staged at
/scratch/fsw/models/MiniMax-M3-NVFP4 (added a minimaxm3-fp4 MODEL_PATH branch to
launch_b200-dgxc.sh).
Comment on lines +27 to +34
for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

Contributor

🔴 The NVFP4 overlay-patch loop at lines 27-33 downloads 3 files from raw.githubusercontent.com with no error handling: the script has no set -e, benchmark_lib.sh does not set it either, and there is no || exit after curl -fsSL. If only modelopt.py or flashinfer_utils.py fails to download (transient 5xx, rate limit, network blip), curl writes no file and the loop continues — the verification at line 34 only imports from the first file (trtllm_nvfp4_moe.py), so the failure is not caught and the benchmark dies much later inside vllm serve with an opaque unrecognized-NVFP4-quant-config error. Fix: add || exit 1 to the curl invocation, or set -euo pipefail at the top — matching the || { echo ...; exit 1; } pattern the sibling minimaxm3_fp8_b200.sh already uses.

Extended reasoning...

What the bug is

The new script benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh overlays vllm-project/vllm PR #46380 onto the installed vLLM package by curl-fetching three source files from raw.githubusercontent.com:

for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

The loop has no error handling. The script does not declare set -e (the only set in the file is set -x on line 65, which is shell tracing), and benchmark_lib.sh sourced at line 9 does not enable -e globally either (its only set -e/set +e calls are scoped inside a single function around lines 1265/1270). The curl invocation also has no || exit / || { …; exit 1; } trailer.

The specific failure path

curl -fsSL returns non-zero on HTTP errors (the -f flag), and crucially, with -f curl writes no output file on failure — the existing site-packages file from the image stays in place untouched. With no set -e and no explicit error check, the loop simply moves to the next iteration; the script then proceeds.

The post-patch verification at line 34 only imports TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe.py, the first file in the loop. If the second (modelopt.py) or third (flashinfer_utils.py) download fails, the verification still passes, because the stock vLLM copies of those two files remain in place and are still valid Python modules; they simply lack the NVFP4 quant-config support that PR #46380 added. The benchmark then proceeds to vllm serve, which fails opaquely much later with an unrecognized-NVFP4-quant-config error or an ImportError, far from the actual patch step.

Step-by-step proof

  1. raw.githubusercontent.com returns a transient 503 (or rate-limits) for modelopt.py — realistic during GitHub Actions runner storms.
  2. curl -fsSL …/modelopt.py -o …/modelopt.py exits 22, prints nothing (-s), writes nothing (-f suppresses the output file on HTTP failure). The stock modelopt.py in the image is untouched.
  3. The for-loop ignores the non-zero exit and continues to flashinfer_utils.py (which may also be patched or original).
  4. python3 -c "from …trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular" succeeds: file #1 was overwritten correctly, and the symbol the new file defines is importable. Prints [nvfp4-patch] OK.
  5. vllm serve nvidia/MiniMax-M3-NVFP4 … starts. Inside vLLM, the modelopt loader is reached for the NVFP4 quant config, but the unpatched modelopt.py does not recognise the NVFP4 variant from PR #46380 — startup fails with an opaque error well after the patch step.
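
The steps above can be reproduced without network access. The sketch below is an illustration, not the recipe itself: fake_fetch is a hypothetical stand-in for curl -fsSL that fails only for modelopt.py, the shortened file paths are placeholders, and the loop is written twice, once unchecked as in the PR and once with a || trailer.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `curl -fsSL`: fails with status 22 for
# modelopt.py only, mimicking a transient HTTP error on the second file.
fake_fetch() {
  case "$1" in
    */modelopt.py) return 22 ;;
    *) echo "fetched $1" ;;
  esac
}

# Shortened placeholder paths for the three overlaid files.
FILES="experts/trtllm_nvfp4_moe.py quantization/modelopt.py utils/flashinfer_utils.py"

# Unchecked loop, as in the PR: the failure is ignored and the loop finishes.
unchecked_loop() {
  for f in $FILES; do
    fake_fetch "$f"
  done
  echo "[nvfp4-patch] OK"   # reached even though modelopt.py was never fetched
}

# Checked loop: the first failed fetch stops the function with status 1.
checked_loop() {
  for f in $FILES; do
    fake_fetch "$f" || { echo "[nvfp4-patch] failed to fetch $f" >&2; return 1; }
  done
  echo "[nvfp4-patch] OK"
}

unchecked_loop
checked_loop || echo "checked loop stopped with status $?"
```

The unchecked variant reports OK despite the missing file, which is the silent-failure path described in steps 2 to 4.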

Why existing code doesn't prevent this

The verification command is correct for confirming file #1's overwrite, but is silent on files #2 and #3. The sibling script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh shows the project's own defensive convention for an analogous patching block — its Python heredoc patch is followed by || { echo "… patch failed" >&2; exit 1; } (around line 30). This new fp4 recipe diverges from that convention.
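
One way to close that gap is to check every overlaid file, not just the first. The sketch below assumes a stage-then-install layout that is not part of the PR: files are first fetched into a scratch directory, verify_staged (a hypothetical helper) confirms each one exists and is non-empty, and only then would they be copied over the installed package.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): succeed only if every listed file
# exists under the staging directory and is non-empty.
verify_staged() {
  local stage_dir="$1"
  shift
  for f in "$@"; do
    if [ ! -s "${stage_dir}/${f}" ]; then
      echo "[nvfp4-patch] missing or empty: ${f}" >&2
      return 1
    fi
  done
  return 0
}

# Demo with a throwaway directory standing in for the download target.
STAGE="$(mktemp -d)"
printf 'x = 1\n' > "${STAGE}/trtllm_nvfp4_moe.py"
printf 'y = 2\n' > "${STAGE}/modelopt.py"
# flashinfer_utils.py is deliberately absent, as if its download had failed.

if verify_staged "$STAGE" trtllm_nvfp4_moe.py modelopt.py flashinfer_utils.py; then
  echo "all files staged; safe to install"
else
  echo "aborting before touching the installed package"
fi
```

This only checks presence, not content. A stronger check would compare checksums against known-good values, which the PR does not provide.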

Impact

When the failure hits, the benchmark fails inside vllm serve startup with a confusing NVFP4-quant-config or ImportError, not at the patch step where the actual problem lives. That's the worst kind of CI failure — expensive to triage and easy to mis-attribute to a vLLM or model-config issue. raw.githubusercontent.com 5xx / rate limiting / DNS blips during runner storms are realistic, not hypothetical.

Fix

Either of the following one-line fixes works:

# Option A: fail-fast inside the loop
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" || { echo "[nvfp4-patch] failed to fetch ${f}" >&2; exit 1; }

# Option B: enable strict mode at the top of the script
set -euo pipefail

Option A matches the project's existing pattern in minimaxm3_fp8_b200.sh. Option B is broader and would also catch other unchecked failures (hf download, the python3 -c …vllm.__file__ lookup, etc.).
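
To see what Option B changes in practice, the sketch below runs the same failing loop in two child shells, one without strict mode and one with set -euo pipefail. The fetch function is a hypothetical stand-in for a download that fails on the second item; nothing here is taken from the actual recipe.

```shell
#!/usr/bin/env bash
# Script body shared by both runs. `fetch` is a stand-in for curl that
# fails with status 22 on item "b".
SCRIPT='
fetch() { if [ "$1" = b ]; then return 22; fi; echo "fetched $1"; }
for f in a b c; do
  fetch "$f"
done
echo "[nvfp4-patch] OK"
'

# Run once as-is and once with strict mode prepended, capturing the
# output and exit status of each child shell.
lax_rc=0
lax_out="$(bash -c "$SCRIPT")" || lax_rc=$?
strict_rc=0
strict_out="$(bash -c "set -euo pipefail; $SCRIPT")" || strict_rc=$?

echo "without strict mode: rc=${lax_rc}"
echo "with strict mode:    rc=${strict_rc}"
```

Note that set -e does not fire for commands used as an if condition or placed before || or &&, so an explicit trailer as in Option A is still useful where a failure needs a specific message.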

@functionstackx functionstackx left a comment

Collaborator

  1. fix patchwork as discussed in slack
  2. missing vllm recipes

@github-actions

Contributor

The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in
MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the benchmark
script no longer overwrites vLLM files at runtime.

@Ankur-singh

Collaborator Author

/reuse-sweep-run

@Ankur-singh

Collaborator Author

As a PR reviewer and CODEOWNER, I have reviewed this and have:

  • Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this. Link
  • Verified that this PR passes evals. Please link to GitHub Action workflow that shows this. Link
  • Verified that speculative decoding PRs use chat templates to align the AL distribution to the real world
  • If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • [] If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

Signed: ankur-singh

@Klaud-Cold

Collaborator

PASS — sign-off verified at pinned head 436111d; clear to merge.

  • Check 0 (CODEOWNER): PASS — @Ankur-singh is a direct owner of .github/configs/nvidia-master.yaml; the other changed paths (benchmarks/..., perf-changelog.yaml, runners/...) fall under the * @InferenceX/core catch-all, covered by any recognized CODEOWNER.
  • Check 1 (sweep+evals on in-PR commit): PASS — head commit 436111d (in the PR) carries green, executed single-node 1k1k/8k1k * and eval / check-runs. Run 28197744776.
  • Check 2 (eval accuracy): PASS — 9 gsm8k eval / results, em_strict 0.948–0.959, on MiniMax-M3-NVFP4 under image vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 (matches the PR config image).
  • Check 3 (recipe linked + complete): PASS — linked vllm-project/recipes#577. Major args match: model nvidia/MiniMax-M3-NVFP4, Blackwell/B200, TP/EP/DP+EP (--enable-expert-parallel / --data-parallel-size), NVFP4 (auto-detected from checkpoint), --block-size 128, --language-model-only. InferenceX harness knobs (gpu-mem-util, max-num-batched-tokens, stream-interval, prefix-caching, image tag) are sweep-specific and not required to match.
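
The major arguments listed in Check 3 could be assembled along these lines. This is a hedged sketch, not the recipe script: build_serve_args and the mode names tp, tp-ep, and dp-ep are hypothetical, and the mapping from each mode to flags is an assumption. The flag names other than --tensor-parallel-size appear in the check above; --tensor-parallel-size is the standard vLLM flag for TP.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print one `vllm serve` argument
# per line for a parallelism mode ("tp", "tp-ep", "dp-ep") and a GPU count.
build_serve_args() {
  local mode="$1" gpus="$2"
  # Reject unknown modes before printing anything.
  case "$mode" in
    tp|tp-ep|dp-ep) ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  # Arguments common to every mode, as listed in Check 3.
  printf '%s\n' "nvidia/MiniMax-M3-NVFP4" "--block-size" "128" "--language-model-only"
  # Assumed mapping from mode to parallelism flags.
  case "$mode" in
    tp)    printf '%s\n' "--tensor-parallel-size" "$gpus" ;;
    tp-ep) printf '%s\n' "--tensor-parallel-size" "$gpus" "--enable-expert-parallel" ;;
    dp-ep) printf '%s\n' "--data-parallel-size" "$gpus" "--enable-expert-parallel" ;;
  esac
}

# Print the command instead of running it, so the sketch works without vLLM.
echo "vllm serve $(build_serve_args tp-ep 8 | tr '\n' ' ')"
```

Harness knobs such as gpu-mem-util or max-num-batched-tokens are left out on purpose, since the check treats them as sweep-specific.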

@adibarra adibarra merged commit a37dbbd into main Jun 26, 2026
27 checks passed
@adibarra adibarra deleted the minimaxm3-fp4-b200-vllm branch June 26, 2026 22:31