Merged
4 changes: 3 additions & 1 deletion benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300.sh
@@ -38,6 +38,7 @@ SERVER_LOG=/workspace/server.log

export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_FLOAT32_MATMUL_PRECISION=high
+export VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm

if [ "${DP_ATTENTION}" = "true" ]; then
PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel"
@@ -56,8 +57,9 @@ start_gpu_monitor
set -x
vllm serve "$MODEL_PATH" --served-model-name "$MODEL" --host 0.0.0.0 --port $PORT \
$PARALLEL_ARGS \
---gpu-memory-utilization 0.90 \
+--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
+--kv-cache-dtype fp8 \
--block-size 128 \
--language-model-only \
--max-cudagraph-capture-size 2048 \
23 changes: 10 additions & 13 deletions configs/nvidia-master.yaml
@@ -12809,7 +12809,7 @@ minimaxm3-fp8-b300-vllm:
# weights are pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 (added to
# the STAGED_MODELS allow-list in launch_b300-nv.sh).
minimaxm3-fp4-b300-vllm:
-image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
+image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226

🟡 The block comment just above this line still says the NVFP4 support 'is baked into the perf container image', but the PR swaps the image from the bespoke vllm-minimax-m3-perf-* tag to a mainline nightly-* tag. Update the comment to say the support has landed in vLLM main and is picked up from nightly so future readers don't chase a nonexistent perf image. Note: the same wording appears at lines 12839-12840 for the EAGLE3 variant (whose image is NOT changed in this PR), so that comment remains accurate and should be left as-is.

Extended reasoning...

What the bug is

At .github/configs/nvidia-master.yaml lines 12805-12810, immediately above the minimaxm3-fp4-b300-vllm entry, there is a block comment that states:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) is baked into the perf container image, so no runtime patch is needed.

That wording was true when the config used the bespoke perf-tagged image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. This PR replaces that tag with vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226 — a mainline nightly image, not a perf-variant image. After the swap, the phrase "perf container image" no longer describes what is actually being pulled.

Why the invariant still holds, but the comment is misleading

The underlying claim — "no runtime patch is needed" — is still true, because vllm-project/vllm PR #46380 has presumably landed in vLLM main and the nightly image picks it up. So there is no runtime effect and nothing breaks at benchmark time. The issue is purely that the comment cites the wrong reason: a future reader inspecting this entry will look for a "perf container image" that no longer exists in the config and be confused.

Step-by-step proof

  1. Before this PR, line 12812 reads image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. The tag literally contains minimax-m3-perf, matching "perf container image".
  2. This PR changes line 12812 to image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226. The tag is nightly-<sha>; there is no "perf" in the tag name.
  3. Lines 12805-12810 (comment above the entry) still literally say baked into the perf container image, unchanged by this PR.
  4. Therefore the comment and the code disagree about which image class is in use.
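
The mismatch in steps 1-4 can be checked mechanically. The following is a minimal sketch, not code from this repository: the function names `image_class` and `comment_matches_image` are hypothetical, and the classification relies only on the two tag patterns visible in this PR (`*-perf-*` and `nightly-*`).

```shell
# Hypothetical helper (not part of the repo): classify an image reference
# by the tag-naming patterns that appear in this PR.
image_class() {
  tag="${1##*:}"             # keep only the text after the last ':'
  case "$tag" in
    nightly-*) echo "nightly" ;;
    *-perf-*)  echo "perf" ;;
    *)         echo "other" ;;
  esac
}

# Hypothetical helper: succeeds (exit 0) unless the comment text mentions the
# "perf container image" while the image tag is not a perf tag.
comment_matches_image() {
  case "$1" in
    *"perf container image"*)
      [ "$(image_class "$2")" = "perf" ]
      ;;
    *)
      return 0
      ;;
  esac
}
```

With the pre-PR tag the check succeeds; with the nightly tag from this PR it fails, which is the disagreement described in step 4.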

Fix

Update the comment to reflect that MiniMax-M3 NVFP4 support has landed upstream and is now picked up from the mainline nightly image, e.g.:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) has landed in vLLM main and is picked up from the nightly image below, so no runtime patch is needed.

Scope note

The same wording appears at lines 12839-12840 above the EAGLE3 variant (minimaxm3-fp4-eagle3-b300-vllm), which does not change its image in this PR. That comment remains accurate and should be left as-is; only the copy above minimaxm3-fp4-b300-vllm (lines 12805-12810) needs updating.
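
To confirm which copies of the wording exist before editing, a fixed-string search is enough. This is an illustrative sketch: `find_phrase_lines` is a hypothetical helper, and the file path is passed in by the caller. Because the comment blocks are wrapped, the phrase may be split across two lines in the file, so a shorter search string such as `perf container` may be needed.

```shell
# Hypothetical helper (not part of the repo): print the line number of every
# line in a file that contains the given fixed string.
#   $1 = phrase to search for (treated literally, not as a regex)
#   $2 = path to the file to search
find_phrase_lines() {
  grep -nF -- "$1" "$2" | cut -d: -f1
}
```

For example, `find_phrase_lines "perf container image" path/to/nvidia-master.yaml` would list the candidate lines; only the copy above minimaxm3-fp4-b300-vllm should then be edited.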

Severity

This is a documentation-only inconsistency with no runtime effect on the benchmark. Marking as nit.

model: nvidia/MiniMax-M3-NVFP4
model-prefix: minimaxm3
runner: b300
@@ -12821,22 +12821,19 @@ minimaxm3-fp4-b300-vllm:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 1, conc-end: 64 }
- { tp: 8, ep: 8, conc-start: 1, conc-end: 512 }
- { tp: 4, conc-start: 1, conc-end: 64 }
- { tp: 8, conc-start: 1, conc-end: 2 }
- { tp: 4, conc-start: 1, conc-end: 2 }
- { tp: 2, conc-start: 4, conc-end: 256 }
- { tp: 4, conc-start: 64, conc-end: 64 }
- { tp: 4, ep: 4, conc-start: 64, conc-end: 512 }
- { tp: 4, ep: 4, dp-attn: true, conc-start: 128, conc-end: 512 }
- { tp: 2, ep: 2, conc-start: 16, conc-end: 128 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 1024 }
- { tp: 2, ep: 2, dp-attn: true, conc-start: 512, conc-end: 512 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 1, conc-end: 64 }
- { tp: 8, ep: 8, conc-start: 1, conc-end: 512 }
- { tp: 4, conc-start: 1, conc-end: 128 }
- { tp: 4, ep: 4, conc-start: 64, conc-end: 256 }
- { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 128 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 256 }
- { tp: 8, conc-start: 1, conc-end: 2 }
- { tp: 4, conc-start: 1, conc-end: 2 }
- { tp: 2, conc-start: 4, conc-end: 256 }
- { tp: 2, ep: 2, dp-attn: true, conc-start: 512, conc-end: 512 }

# EAGLE3 speculative-decoding (spec-decoding: mtp) variant of MiniMax-M3 NVFP4
# (nvidia/MiniMax-M3-NVFP4) B300 single-node vLLM, pairing the target with the
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -4426,3 +4426,10 @@
- "Use 1k/1k TP4/EP1 c1-c32 and TP4/EP4 c64-c256; use 8k/1k TP4/EP1 c1-c32 and TP4/EP4 DP-attention c64-c256."
- "Drop the TP8/EP8 single-concurrency points."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1975
+
+- config-keys:
+- minimaxm3-fp4-b300-vllm
+description:
+- "Update Minimax M3 b300 vllm image tag"
+- "Update search space to cover more configs"
+pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1990