[WIP] [do not merge] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config by jasonlizhengjian · Pull Request #1982 · SemiAnalysisAI/InferenceX

jasonlizhengjian · 2026-07-02T14:40:46Z

Summary

add a MiniMax-M3 NVFP4 B200 Dynamo-vLLM disaggregated master configuration using the b200-multinode runner
port the 8k1k B300 DEP2/DEP8 recipe to a B200 2P1D topology at concurrency 2048
stage the new MiniMax-M3 B200 recipe through the B200 DGXC Slurm launcher
omit max-cudagraph-capture-size and max-num-batched-tokens from the prefill configuration

Configuration

prefill: 2 workers, DEP2, 2 GPUs per worker, 1 node
decode: 1 worker, DEP8, 8 GPUs, 1 node
workload: 8192 input tokens / 1024 output tokens, concurrency 2048
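
The GPU footprint implied by this topology can be tallied with a short sketch. The `Role` dataclass and its field names are illustrative only and are not part of the recipe schema; the worker counts, GPUs per worker, and node counts are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """Hypothetical container for one side of the disaggregated topology."""
    name: str
    workers: int
    gpus_per_worker: int
    nodes: int

    @property
    def total_gpus(self) -> int:
        # Each worker owns its own GPUs, so the footprint is a simple product.
        return self.workers * self.gpus_per_worker


# Values taken from the Configuration list above.
prefill = Role("prefill", workers=2, gpus_per_worker=2, nodes=1)
decode = Role("decode", workers=1, gpus_per_worker=8, nodes=1)

total_gpus = prefill.total_gpus + decode.total_gpus
total_nodes = prefill.nodes + decode.nodes

for role in (prefill, decode):
    print(f"{role.name}: {role.total_gpus} GPUs on {role.nodes} node(s)")
print(f"total: {total_gpus} GPUs on {total_nodes} nodes")
```

Under these numbers the prefill node uses 4 GPUs and the decode node uses 8.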

Validation

exact generate_sweep_configs.py slice for MiniMax-M3 FP4 Dynamo-vLLM on b200-multinode
python -m pytest utils/matrix_logic/ -v (163 passed)
recipe topology and removed-prefill-field assertions
bash -n runners/launch_b200-dgxc-slurm.sh
git diff --check

github-actions · 2026-07-02T14:40:58Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


claude · 2026-07-02T14:47:46Z

Claude finished @jasonlizhengjian's task in 3m 55s

Review: MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config

Gather context (diff, launcher routing, source recipe)
Validate master config ↔ perf-changelog coupling
Validate image accessibility
Validate launcher routing + recipe path
Review recipe correctness vs. ported B300 recipe

LGTM — no blocking issues found. This is a clean, well-structured port. Details verified:

Master config + changelog: nvidia-master.yaml was edited and perf-changelog.yaml has a matching entry appended at the end with the PR link. ✅
Image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 is a public Docker Hub reference, and the master image: matches the recipe container:. ✅
Launcher routing: the new dynamo-vllm/minimaxm3/fp4 branch in launch_b200-dgxc.sh copies .../minimax-m3/b200-fp4 → recipes/vllm/minimax-m3/b200-fp4, which matches the CONFIG_FILE=recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml in the config. The MODEL_PATH/SRT_SLURM_MODEL_PREFIX override for minimaxm3+fp4 already exists on main. Note: launch_b200-dgxc-slurm.sh (referenced in the PR description) is a symlink to launch_b200-dgxc.sh, so the single edit covers both names. ✅
Seq-len budget: max-model-len: 9472 ≥ ISL 8192 + OSL 1024 = 9216. ✅
Topology: prefill 2×2 GPUs + decode 1×8 GPUs on 1+1 nodes is internally consistent between the resources block and data-parallel-size. Decode max-num-seqs: 1024 is per DP rank (×8 = 8192 aggregate), comfortably above concurrency 4096. ✅
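
The two numeric checks in this list (sequence-length budget and decode capacity) reduce to simple arithmetic. The sketch below restates them with the values quoted in the review; the constant names are illustrative and are not keys from the recipe or from vLLM.

```python
# Values quoted in the review above; constant names are illustrative.
ISL = 8192                     # input sequence length
OSL = 1024                     # output sequence length
MAX_MODEL_LEN = 9472           # max-model-len
MAX_NUM_SEQS_PER_RANK = 1024   # decode max-num-seqs, per DP rank
DECODE_DP_SIZE = 8             # decode data-parallel-size
CONCURRENCY = 4096             # concurrency at the time of this review

required_len = ISL + OSL
headroom = MAX_MODEL_LEN - required_len
aggregate_slots = MAX_NUM_SEQS_PER_RANK * DECODE_DP_SIZE

# Seq-len budget: the model length must cover prompt plus generation.
assert MAX_MODEL_LEN >= required_len
# Decode capacity: aggregate sequence slots must cover the client concurrency.
assert aggregate_slots >= CONCURRENCY

print(f"required={required_len}, headroom={headroom}, slots={aggregate_slots}")
```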

Minor (non-blocking, FYI only): the perf-changelog.yaml description says "Port the B300 4P2D DEP2/DEP8 recipe", but the actual source is the 2P2D recipe (b300-fp4/8k1k/2p2d-dep2-dep8-8k1k.yaml) and the new topology is 2P1D. Worth correcting the "4P2D" wording for accuracy, but it doesn't affect behavior.
· branch codex/add-minimaxm3-fp4-b200-dynamo-vllm

github-actions · 2026-07-02T14:50:24Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28599128708
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28599128708

claude · 2026-07-02T14:53:14Z

Claude finished @jasonlizhengjian's task in 1m 47s

Re-review: MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config

Gather context (diff, launcher routing, source recipe)
Validate master config ↔ perf-changelog coupling
Validate image accessibility
Validate launcher routing + recipe path
Review recipe correctness & topology

LGTM — no blocking issues found.

Since the previous review, the only new change is commit 3d84ee1f (concurrency 4096 → 2048). Verified against the B300 source (b300-fp4/8k1k/2p2d-dep2-dep8-8k1k.yaml, diff is clean):

Concurrency: master config conc-list: [2048] matches recipe concurrencies: "2048". Decode aggregate max-num-seqs = 1024 × DP8 = 8192, comfortably above 2048; max-cudagraph-capture-size: 4096 ≥ concurrency. ✅
Topology: prefill 2 workers × 2 GPUs (DP2) = 4 GPUs / 1 node; decode 1 worker × DP8 = 8 GPUs / 1 node. resources block and data-parallel-size values are internally consistent. ✅
Prefill omissions: max-cudagraph-capture-size and max-num-batched-tokens correctly dropped from prefill only (kept in decode), matching the PR description. ✅
Launcher + config path: new dynamo-vllm/minimaxm3/fp4 branch copies .../minimax-m3/b200-fp4 → recipes/vllm/minimax-m3/b200-fp4, resolving CONFIG_FILE=recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml. ✅
Master config / changelog / image: master edited + perf-changelog.yaml entry appended at the end with PR link; image: is a public Docker Hub reference matching the recipe container:; max-model-len: 9472 ≥ 9216 (ISL 8192 + OSL 1024). ✅

Minor (non-blocking, still open from prior review): the perf-changelog.yaml description says "Port the B300 4P2D DEP2/DEP8 recipe", but the cleanest source match and new topology are 2P2D → 2P1D. Cosmetic only.
· branch codex/add-minimaxm3-fp4-b200-dynamo-vllm

github-actions · 2026-07-02T14:53:30Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28599246606
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28599246606

claude · 2026-07-02T15:08:06Z

+model:
+  path: "nvidia/MiniMax-M3-NVFP4"
+  container: "vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41"
+  precision: "fp4"


🔴 The launcher's model-path alias key and the recipe's model.path don't match for this new b200-dgxc + minimaxm3-fp4 pairing: runners/launch_b200-dgxc.sh:74 exports SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4", but the new recipe uses path: "nvidia/MiniMax-M3-NVFP4" at line 4, so srtctl's model_paths lookup misses. Every other analogous case in the tree (b300-nv minimaxm3-fp4, b200 minimaxm2.5-fp4/fp8, b200 dsv4-fp4) has the launcher prefix exactly equal to the recipe path. Fix by changing one side to match the other: either set SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4" for the minimaxm3-fp4 branch on b200-dgxc (mirroring runners/launch_b300-nv.sh:52), or change the new recipe's model.path to "minimax-m3-nvfp4".

Extended reasoning...

The mismatch

runners/launch_b200-dgxc.sh:71-74 (pre-existing from PR #1932) sets up the minimaxm3-fp4 model resolution:

elif [[ $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
    # NVFP4 checkpoint, pre-staged on the b200-dgxc scratch tree.
    export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
    export SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"

The launcher then writes srtslurm.yaml (around line 157):

model_paths:
  "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4"

But the new recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml:4 uses:

model:
  path: "nvidia/MiniMax-M3-NVFP4"

srtctl looks up "nvidia/MiniMax-M3-NVFP4" in model_paths — the only registered alias is "minimax-m3-nvfp4", so the lookup misses.
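
A minimal sketch of the miss, assuming the alias table behaves like an exact-match dictionary. `resolve_model_path` is a hypothetical helper written for illustration and is not srtctl's actual code; what srtctl does on a miss is not confirmed here, so the sketch simply returns `None`.

```python
from typing import Dict, Optional

# Alias table as the launcher writes it into srtslurm.yaml.
model_paths: Dict[str, str] = {
    "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4",
}


def resolve_model_path(recipe_path: str, aliases: Dict[str, str]) -> Optional[str]:
    """Hypothetical exact-match lookup; returns None when the alias is unknown."""
    return aliases.get(recipe_path)


# The recipe's model.path is not a registered key, so the lookup misses.
missed = resolve_model_path("nvidia/MiniMax-M3-NVFP4", model_paths)
# The launcher's own alias would have resolved to the pre-staged checkpoint.
resolved = resolve_model_path("minimax-m3-nvfp4", model_paths)

print(missed)
print(resolved)
```

The match is by exact string, so the differences in case and in the `nvidia/` prefix are both enough to cause the miss.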

Why this PR is the trigger

The pre-existing lines 71-74 were previously only exercised through the single-node code path (the else branch at the bottom of the launcher), which never touches SRT_SLURM_MODEL_PREFIX — it mounts $MODEL_PATH directly via --container-mounts and sets MODEL=$MODEL_PATH. So the mismatch was benign.

This PR adds the new elif at launch_b200-dgxc.sh:116-121, which for the first time routes minimaxm3-fp4 through the srtctl / srt-slurm code path. That is the path that actually consumes SRT_SLURM_MODEL_PREFIX as an alias key in srtslurm.yaml, so the pre-existing but previously latent mismatch becomes load-bearing exactly at this PR.

Cross-check with every other b200/b300 case

Launcher case	SRT_SLURM_MODEL_PREFIX	Recipe model.path	Match?
b300-nv minimaxm3-fp4 (launch_b300-nv.sh:52)	nvidia/MiniMax-M3-NVFP4	nvidia/MiniMax-M3-NVFP4	✅
b300-nv minimaxm3-fp8 (launch_b300-nv.sh:55)	MiniMaxAI/MiniMax-M3-MXFP8	MiniMaxAI/MiniMax-M3-MXFP8	✅
b200-dgxc minimaxm2.5-fp4	minimax-m2.5-nvfp4	minimax-m2.5-nvfp4	✅
b200-dgxc minimaxm2.5-fp8	minimax-m2.5-fp8	minimax-m2.5-fp8	✅
b200-dgxc dsv4-fp4	deepseek-v4-pro	deepseek-v4-pro	✅
b200-dgxc minimaxm3-fp4 (this PR)	minimax-m3-nvfp4	nvidia/MiniMax-M3-NVFP4	❌

Every other pairing in the tree matches exactly; the new b200-dgxc minimaxm3-fp4 case is the sole outlier. The b300 minimaxm3-fp4 case is especially instructive: the new b200 recipe is a direct port of the B300 8k1k DEP2/DEP8 recipe (per the PR description), so it inherits nvidia/MiniMax-M3-NVFP4, which matches on b300 but not on b200.
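
The cross-check above can be reproduced mechanically. The sketch below hard-codes the rows of the table rather than parsing the launchers or recipes; `PAIRINGS` and `find_mismatches` are illustrative names, not existing repository utilities.

```python
from typing import List, Tuple

# (launcher case, SRT_SLURM_MODEL_PREFIX, recipe model.path), copied from the table.
PAIRINGS: List[Tuple[str, str, str]] = [
    ("b300-nv minimaxm3-fp4", "nvidia/MiniMax-M3-NVFP4", "nvidia/MiniMax-M3-NVFP4"),
    ("b300-nv minimaxm3-fp8", "MiniMaxAI/MiniMax-M3-MXFP8", "MiniMaxAI/MiniMax-M3-MXFP8"),
    ("b200-dgxc minimaxm2.5-fp4", "minimax-m2.5-nvfp4", "minimax-m2.5-nvfp4"),
    ("b200-dgxc minimaxm2.5-fp8", "minimax-m2.5-fp8", "minimax-m2.5-fp8"),
    ("b200-dgxc dsv4-fp4", "deepseek-v4-pro", "deepseek-v4-pro"),
    ("b200-dgxc minimaxm3-fp4 (this PR)", "minimax-m3-nvfp4", "nvidia/MiniMax-M3-NVFP4"),
]


def find_mismatches(pairings: List[Tuple[str, str, str]]) -> List[str]:
    """Return the launcher cases whose alias key differs from the recipe path."""
    return [case for case, prefix, path in pairings if prefix != path]


mismatches = find_mismatches(PAIRINGS)
print(mismatches)
```

A check of this shape could in principle run in CI against the real launcher and recipe files, though extracting the pairs from shell and YAML is left out here.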

Step-by-step proof of the failure

1. CI dispatches minimaxm3-fp4-b200-dynamo-vllm on the b200-multinode runner.
2. launch_b200-dgxc.sh runs with IS_MULTINODE=true, MODEL_PREFIX=minimaxm3, PRECISION=fp4, FRAMEWORK=dynamo-vllm.
3. Line 74 exports SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4" and MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4".
4. The new elif at lines 116-121 fires, clones srt-slurm, and copies the recipe into recipes/vllm/minimax-m3/b200-fp4/.
5. The cat > srtslurm.yaml <<EOF block writes model_paths: { "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4" }.
6. srtctl apply -f $CONFIG_FILE is invoked; srtctl parses the recipe and reads model.path: "nvidia/MiniMax-M3-NVFP4".
7. srtctl checks model_paths for the key "nvidia/MiniMax-M3-NVFP4", which is not present.
8. One of two things then happens. Outcome A: srtctl errors on the unknown alias and the job fails immediately at srtctl apply. Outcome B: srtctl treats the unmatched value as a HuggingFace hub identifier and attempts to download nvidia/MiniMax-M3-NVFP4 from the hub on every job invocation, negating the pre-staging that the comment on launch_b200-dgxc.sh:72 explicitly relies on.

Either outcome makes the full-sweep check fail — either the job errors out at model resolution, or the HF pull blows the container FS / times out the runner. The PR is labeled full-sweep-fail-fast-no-canary, so this will surface as a fail-fast failure.

Fix

One-liner in either direction. The b300 side of the tree is the reference implementation, so the least surprising change is to line 74 of runners/launch_b200-dgxc.sh:

 elif [[ $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
     # NVFP4 checkpoint, pre-staged on the b200-dgxc scratch tree.
     export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
-    export SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"
+    export SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4"

This also matches runners/launch_b300-nv.sh:52 verbatim, keeping the two clusters consistent for the same model+precision.

github-actions · 2026-07-02T15:37:07Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28599474801
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28599474801

feat: add MiniMax-M3 FP4 B200 disagg config

cdb22b2

github-project-automation Bot added this to InferenceMAX Board Jul 2, 2026

chore: add MiniMax-M3 B200 benchmark trigger

d6a1d6a

jasonlizhengjian marked this pull request as ready for review July 2, 2026 14:47

jasonlizhengjian requested a review from a team July 2, 2026 14:47

jasonlizhengjian requested review from Ankur-singh, jgangani and kedarpotdar-nv as code owners July 2, 2026 14:47

jasonlizhengjian added the full-sweep-fail-fast-no-canary label (Full sweep, no canary gate; first failure in a matrix cancels that matrix) Jul 2, 2026

jasonlizhengjian changed the title ~~[WIP] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config~~ [WIP] [do not merge] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config Jul 2, 2026

fix: scale B200 disagg concurrency

3d84ee1

jasonlizhengjian marked this pull request as draft July 2, 2026 14:49

jasonlizhengjian marked this pull request as ready for review July 2, 2026 14:52

claude Bot reviewed Jul 2, 2026


jasonlizhengjian wants to merge 3 commits into main from codex/add-minimaxm3-fp4-b200-dynamo-vllm
