[WIP] [do not merge] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config#1982

Open
jasonlizhengjian wants to merge 3 commits into main from codex/add-minimaxm3-fp4-b200-dynamo-vllm


@jasonlizhengjian jasonlizhengjian commented Jul 2, 2026

Collaborator

Summary

  • add a MiniMax-M3 NVFP4 B200 Dynamo-vLLM disaggregated master configuration using the b200-multinode runner
  • port the 8k1k B300 DEP2/DEP8 recipe to a B200 2P1D topology at concurrency 2048
  • stage the new MiniMax-M3 B200 recipe through the B200 DGXC Slurm launcher
  • omit max-cudagraph-capture-size and max-num-batched-tokens from the prefill configuration

Configuration

  • prefill: 2 workers, DEP2, 2 GPUs per worker, 1 node
  • decode: 1 worker, DEP8, 8 GPUs, 1 node
  • workload: 8192 input tokens / 1024 output tokens, concurrency 2048
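The GPU totals implied by this topology can be checked with a few lines of arithmetic. The following is a minimal sketch: the `Role` dataclass and its field names are hypothetical helpers for illustration and are not part of the recipe schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """One side of the disaggregated deployment (hypothetical helper, not the recipe schema)."""
    workers: int
    gpus_per_worker: int

    @property
    def total_gpus(self) -> int:
        # Each worker owns its own set of GPUs, so the total is a simple product.
        return self.workers * self.gpus_per_worker


# Values copied from the Configuration list above.
prefill = Role(workers=2, gpus_per_worker=2)  # DEP2
decode = Role(workers=1, gpus_per_worker=8)   # DEP8

total_gpus = prefill.total_gpus + decode.total_gpus
print(prefill.total_gpus, decode.total_gpus, total_gpus)  # 4 8 12
```

Under these numbers the prefill side uses 4 GPUs on its node and the decode side uses 8 GPUs on its node, 12 GPUs in total.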

Validation

  • exact generate_sweep_configs.py slice for MiniMax-M3 FP4 Dynamo-vLLM on b200-multinode
  • python -m pytest utils/matrix_logic/ -v (163 passed)
  • recipe topology and removed-prefill-field assertions
  • bash -n runners/launch_b200-dgxc-slurm.sh
  • git diff --check


github-actions Bot commented Jul 2, 2026

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.



@jasonlizhengjian jasonlizhengjian marked this pull request as ready for review July 2, 2026 14:47
@jasonlizhengjian jasonlizhengjian requested a review from a team July 2, 2026 14:47
@jasonlizhengjian jasonlizhengjian added the full-sweep-fail-fast-no-canary label (Full sweep, no canary gate; first failure in a matrix cancels that matrix) Jul 2, 2026
@jasonlizhengjian jasonlizhengjian changed the title [WIP] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config [WIP] [do not merge] Add MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config Jul 2, 2026

claude Bot commented Jul 2, 2026

Contributor

Claude finished @jasonlizhengjian's task in 3m 55s —— View job


Review: MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config

  • Gather context (diff, launcher routing, source recipe)
  • Validate master config ↔ perf-changelog coupling
  • Validate image accessibility
  • Validate launcher routing + recipe path
  • Review recipe correctness vs. ported B300 recipe

LGTM — no blocking issues found. This is a clean, well-structured port. Details verified:

  • Master config + changelog: nvidia-master.yaml was edited and perf-changelog.yaml has a matching entry appended at the end with the PR link. ✅
  • Image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 is a public Docker Hub reference, and the master image: matches the recipe container:. ✅
  • Launcher routing: the new dynamo-vllm/minimaxm3/fp4 branch in launch_b200-dgxc.sh copies .../minimax-m3/b200-fp4 → recipes/vllm/minimax-m3/b200-fp4, which matches the CONFIG_FILE=recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml in the config. The MODEL_PATH/SRT_SLURM_MODEL_PREFIX override for minimaxm3+fp4 already exists on main. Note: launch_b200-dgxc-slurm.sh (referenced in the PR description) is a symlink to launch_b200-dgxc.sh, so the single edit covers both names. ✅
  • Seq-len budget: max-model-len: 9472 ≥ ISL 8192 + OSL 1024 = 9216. ✅
  • Topology: prefill 2×2 GPUs + decode 1×8 GPUs on 1+1 nodes is internally consistent between the resources block and data-parallel-size. Decode max-num-seqs: 1024 is per DP rank (×8 = 8192 aggregate), comfortably above concurrency 4096. ✅
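The two numeric checks in this review (sequence-length budget and decode capacity) reduce to simple arithmetic. The sketch below reproduces them; the function names are illustrative, and the assumption that max-num-seqs applies per data-parallel rank is taken from the review text rather than verified against vLLM.

```python
def fits_context(max_model_len: int, isl: int, osl: int) -> bool:
    """True when prompt plus generated tokens fit in the context window."""
    return isl + osl <= max_model_len


def decode_capacity(max_num_seqs_per_rank: int, dp_size: int) -> int:
    """Aggregate concurrent sequences, assuming max-num-seqs is a per-DP-rank limit."""
    return max_num_seqs_per_rank * dp_size


# Values quoted in the review above.
required = 8192 + 1024          # ISL + OSL
headroom = 9472 - required      # tokens left over under max-model-len
capacity = decode_capacity(1024, 8)

print(fits_context(9472, 8192, 1024), required, headroom, capacity)  # True 9216 256 8192
```

With these values the context window has 256 tokens of headroom, and the aggregate decode capacity of 8192 sequences exceeds the concurrency of 4096 that was configured at the time of this review.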

Minor (non-blocking, FYI only): the perf-changelog.yaml description says "Port the B300 4P2D DEP2/DEP8 recipe", but the actual source is the 2P1D recipe (b300-fp8/1k1k/2p1d-dep2-dep8-1k1k.yaml) and the new topology is 2P1D. Worth correcting the "4P2D" wording for accuracy, but it doesn't affect behavior.
· branch codex/add-minimaxm3-fp4-b200-dynamo-vllm

@jasonlizhengjian jasonlizhengjian marked this pull request as draft July 2, 2026 14:49

@jasonlizhengjian jasonlizhengjian marked this pull request as ready for review July 2, 2026 14:52

claude Bot commented Jul 2, 2026

Contributor

Claude finished @jasonlizhengjian's task in 1m 47s —— View job


Re-review: MiniMax-M3 FP4 B200 Dynamo-vLLM disagg config

  • Gather context (diff, launcher routing, source recipe)
  • Validate master config ↔ perf-changelog coupling
  • Validate image accessibility
  • Validate launcher routing + recipe path
  • Review recipe correctness & topology

LGTM — no blocking issues found.

Since the previous review, the only new change is commit 3d84ee1f (concurrency 4096 → 2048). Verified against the B300 source (b300-fp4/8k1k/2p2d-dep2-dep8-8k1k.yaml, diff is clean):

  • Concurrency: master config conc-list: [2048] matches recipe concurrencies: "2048". Decode aggregate max-num-seqs = 1024 × DP8 = 8192, comfortably above 2048; max-cudagraph-capture-size: 4096 ≥ concurrency. ✅
  • Topology: prefill 2 workers × 2 GPUs (DP2) = 4 GPUs / 1 node; decode 1 worker × DP8 = 8 GPUs / 1 node. resources block and data-parallel-size values are internally consistent. ✅
  • Prefill omissions: max-cudagraph-capture-size and max-num-batched-tokens correctly dropped from prefill only (kept in decode), matching the PR description. ✅
  • Launcher + config path: new dynamo-vllm/minimaxm3/fp4 branch copies .../minimax-m3/b200-fp4 → recipes/vllm/minimax-m3/b200-fp4, resolving CONFIG_FILE=recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml. ✅
  • Master config / changelog / image: master edited + perf-changelog.yaml entry appended at the end with PR link; image: is a public Docker Hub reference matching the recipe container:; max-model-len: 9472 ≥ 9216 (ISL 8192 + OSL 1024). ✅

Minor (non-blocking, still open from prior review): the perf-changelog.yaml description says "Port the B300 4P2D DEP2/DEP8 recipe", but the cleanest source match and new topology are 2P2D → 2P1D. Cosmetic only.
· branch codex/add-minimaxm3-fp4-b200-dynamo-vllm


github-actions Bot commented Jul 2, 2026

Contributor

Comment on lines +3 to +6
model:
  path: "nvidia/MiniMax-M3-NVFP4"
  container: "vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41"
  precision: "fp4"


🔴 The launcher's model-path alias key and the recipe's model.path don't match for this new b200-dgxc + minimaxm3-fp4 pairing: runners/launch_b200-dgxc.sh:74 exports SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4" but the new recipe at line 4 uses path: "nvidia/MiniMax-M3-NVFP4", so srtctl's model_paths lookup misses. Every other analogous case in the tree (b300-nv minimaxm3-fp4, b200 minimaxm2.5-fp4/fp8, b200 dsv4-fp4) has the launcher prefix exactly equal to the recipe path. Fix by changing one side to match the other, e.g. set SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4" for the minimaxm3-fp4 branch on b200-dgxc (mirroring runners/launch_b300-nv.sh:52), or change the new recipe's model.path to "minimax-m3-nvfp4".

Extended reasoning...

The mismatch

runners/launch_b200-dgxc.sh:71-74 (pre-existing from PR #1932) sets up the minimaxm3-fp4 model resolution:

elif [[ $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
    # NVFP4 checkpoint, pre-staged on the b200-dgxc scratch tree.
    export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
    export SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"

The launcher then writes srtslurm.yaml (around line 157):

model_paths:
  "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4"

But the new recipe at benchmarks/multi_node/srt-slurm-recipes/vllm/minimax-m3/b200-fp4/8k1k/2p1d-dep2-dep8-8k1k.yaml:4 uses:

model:
  path: "nvidia/MiniMax-M3-NVFP4"

srtctl looks up "nvidia/MiniMax-M3-NVFP4" in model_paths — the only registered alias is "minimax-m3-nvfp4", so the lookup misses.

Why this PR is the trigger

The pre-existing lines 71-74 were previously only exercised through the single-node code path (the else branch at the bottom of the launcher), which never touches SRT_SLURM_MODEL_PREFIX — it mounts $MODEL_PATH directly via --container-mounts and sets MODEL=$MODEL_PATH. So the mismatch was benign.

This PR adds the new elif at launch_b200-dgxc.sh:116-121 that first routes minimaxm3-fp4 through the srtctl / srt-slurm code path, which is the path that actually consumes SRT_SLURM_MODEL_PREFIX as an alias key in srtslurm.yaml. So the pre-existing but previously-latent mismatch becomes load-bearing exactly at this PR.

Cross-check with every other b200/b300 case

| Launcher case | SRT_SLURM_MODEL_PREFIX | Recipe model.path | Match? |
| --- | --- | --- | --- |
| b300-nv minimaxm3-fp4 (launch_b300-nv.sh:52) | nvidia/MiniMax-M3-NVFP4 | nvidia/MiniMax-M3-NVFP4 | ✅ |
| b300-nv minimaxm3-fp8 (launch_b300-nv.sh:55) | MiniMaxAI/MiniMax-M3-MXFP8 | MiniMaxAI/MiniMax-M3-MXFP8 | ✅ |
| b200-dgxc minimaxm2.5-fp4 | minimax-m2.5-nvfp4 | minimax-m2.5-nvfp4 | ✅ |
| b200-dgxc minimaxm2.5-fp8 | minimax-m2.5-fp8 | minimax-m2.5-fp8 | ✅ |
| b200-dgxc dsv4-fp4 | deepseek-v4-pro | deepseek-v4-pro | ✅ |
| b200-dgxc minimaxm3-fp4 (this PR) | minimax-m3-nvfp4 | nvidia/MiniMax-M3-NVFP4 | ❌ |

Every other pairing in the tree matches exactly; the new b200-dgxc minimaxm3-fp4 case is the sole outlier. The b300 minimaxm3-fp4 case in particular is instructive because the new b200 recipe is a direct port of the b300 4p2d-dep2-dep8-8k1k recipe (per the PR description), so it inherits nvidia/MiniMax-M3-NVFP4 — which matches on b300 but not on b200.
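The cross-check above can be expressed as a small script that flags any launcher case whose prefix differs from its recipe path. This is a sketch only: the `PAIRINGS` list is copied by hand from the comparison in this comment, not read from the repository.

```python
# (launcher case, SRT_SLURM_MODEL_PREFIX, recipe model.path), copied from the cross-check above.
PAIRINGS = [
    ("b300-nv minimaxm3-fp4", "nvidia/MiniMax-M3-NVFP4", "nvidia/MiniMax-M3-NVFP4"),
    ("b300-nv minimaxm3-fp8", "MiniMaxAI/MiniMax-M3-MXFP8", "MiniMaxAI/MiniMax-M3-MXFP8"),
    ("b200-dgxc minimaxm2.5-fp4", "minimax-m2.5-nvfp4", "minimax-m2.5-nvfp4"),
    ("b200-dgxc minimaxm2.5-fp8", "minimax-m2.5-fp8", "minimax-m2.5-fp8"),
    ("b200-dgxc dsv4-fp4", "deepseek-v4-pro", "deepseek-v4-pro"),
    ("b200-dgxc minimaxm3-fp4", "minimax-m3-nvfp4", "nvidia/MiniMax-M3-NVFP4"),
]

# The comparison is an exact string match, mirroring an alias-key lookup.
mismatches = [case for case, prefix, path in PAIRINGS if prefix != path]
print(mismatches)  # ['b200-dgxc minimaxm3-fp4']
```

A check of this shape could in principle run in CI against the real launcher scripts and recipes, but extracting the values from those files is left out here.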

Step-by-step proof of the failure

  1. CI dispatches minimaxm3-fp4-b200-dynamo-vllm on the b200-multinode runner.
  2. launch_b200-dgxc.sh runs with IS_MULTINODE=true, MODEL_PREFIX=minimaxm3, PRECISION=fp4, FRAMEWORK=dynamo-vllm.
  3. Line 74 exports SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4" and MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4".
  4. The new elif at lines 116-121 fires, clones srt-slurm, and copies the recipe into recipes/vllm/minimax-m3/b200-fp4/.
  5. The cat > srtslurm.yaml <<EOF block writes model_paths: { "minimax-m3-nvfp4": "/scratch/fsw/models/MiniMax-M3-NVFP4" }.
  6. srtctl apply -f $CONFIG_FILE is invoked; srtctl parses the recipe and reads model.path: "nvidia/MiniMax-M3-NVFP4".
  7. srtctl checks model_paths for the key "nvidia/MiniMax-M3-NVFP4" — not present.
  8. Outcome A: srtctl errors on unknown alias and the job fails immediately at srtctl apply. Outcome B: srtctl treats the unmatched value as a HuggingFace hub identifier and attempts to download nvidia/MiniMax-M3-NVFP4 from the hub on every job invocation, negating the pre-staging that the comment on launch_b200-dgxc.sh:72 explicitly relies on.

Either outcome makes the full-sweep check fail — either the job errors out at model resolution, or the HF pull blows the container FS / times out the runner. The PR is labeled full-sweep-fail-fast-no-canary, so this will surface as a fail-fast failure.
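The two outcomes can be illustrated with a toy model of alias resolution. This is a hypothetical stand-in, not srtctl's actual code: `resolve_model`, its `strict` flag, and the `hub:` prefix are all assumptions made for the sketch, since which of the two behaviors srtctl implements is not established here.

```python
def resolve_model(path: str, model_paths: dict, strict: bool) -> str:
    """Hypothetical alias resolution; NOT srtctl's real implementation."""
    if path in model_paths:
        return model_paths[path]  # alias hit: use the pre-staged local directory
    if strict:
        raise KeyError(f"unknown model alias: {path}")  # Outcome A: fail at apply time
    return f"hub:{path}"  # Outcome B: treat the value as a hub identifier


recipe_path = "nvidia/MiniMax-M3-NVFP4"
local_dir = "/scratch/fsw/models/MiniMax-M3-NVFP4"

current = {"minimax-m3-nvfp4": local_dir}   # what the launcher writes today
fixed = {"nvidia/MiniMax-M3-NVFP4": local_dir}  # after aligning the prefix with the recipe

print(resolve_model(recipe_path, current, strict=False))  # hub:nvidia/MiniMax-M3-NVFP4
print(resolve_model(recipe_path, fixed, strict=True))     # /scratch/fsw/models/MiniMax-M3-NVFP4
```

In either mode, the pre-staged checkpoint is only used once the alias key and the recipe's model.path are identical strings.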

Fix

One-liner in either direction. The b300 side of the tree is the reference implementation, so the least surprising change is to line 74 of runners/launch_b200-dgxc.sh:

 elif [[ $MODEL_PREFIX == "minimaxm3" && $PRECISION == "fp4" ]]; then
     # NVFP4 checkpoint, pre-staged on the b200-dgxc scratch tree.
     export MODEL_PATH="/scratch/fsw/models/MiniMax-M3-NVFP4"
-    export SRT_SLURM_MODEL_PREFIX="minimax-m3-nvfp4"
+    export SRT_SLURM_MODEL_PREFIX="nvidia/MiniMax-M3-NVFP4"

This also matches runners/launch_b300-nv.sh:52 verbatim, keeping the two clusters consistent for the same model+precision.
