Add GLM-5-FP8 GB300 multinode dynamo-sglang MTP benchmark #1907
base: main
Changes from all commits
85f0419
7f7d765
927159c
c609fa2
6ec9dce
20c089e
ba34232
217e156
e74544c
16f9e33
98925a7
f3ede16
b37a8dd
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,152 @@ | ||
| name: gb300-fp8-glm5-mtp_1k1k_hightpt_0 | ||
| | ||
| model: | ||
| path: glm-5-fp8 | ||
| container: "lmsysorg/sglang:v0.5.11-cu130" | ||
| precision: fp8 | ||
| | ||
| resources: | ||
| gpu_type: gb300 | ||
| gpus_per_node: 4 | ||
| prefill_nodes: 12 | ||
| prefill_workers: 12 | ||
| decode_nodes: 6 | ||
| decode_workers: 1 | ||
| frontend: | ||
| type: dynamo | ||
| enable_multiple_frontends: true | ||
| num_additional_frontends: 9 | ||
| dynamo: | ||
| version: 1.2.1 | ||
| | ||
| backend: | ||
| prefill_environment: | ||
| TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' | ||
| PYTHONUNBUFFERED: '1' | ||
| DYN_SKIP_SGLANG_LOG_FORMATTING: '1' | ||
| SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' | ||
| MC_TE_METRIC: 'true' | ||
| MC_FORCE_MNNVL: '1' | ||
| NCCL_MNNVL_ENABLE: '1' | ||
| NCCL_CUMEM_ENABLE: '1' | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' | ||
| SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' | ||
| SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' | ||
| DYN_REQUEST_PLANE: nats | ||
| | ||
| decode_environment: | ||
| TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800' | ||
| PYTHONUNBUFFERED: '1' | ||
| DYN_SKIP_SGLANG_LOG_FORMATTING: '1' | ||
| SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000' | ||
| SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000' | ||
| SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000' | ||
| MC_TE_METRIC: 'true' | ||
| MC_FORCE_MNNVL: '1' | ||
| NCCL_MNNVL_ENABLE: '1' | ||
| NCCL_CUMEM_ENABLE: '1' | ||
| SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True' | ||
| SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' | ||
| SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' | ||
| DYN_REQUEST_PLANE: nats | ||
| # DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size). | ||
| # Default 128 overflows with large DP + batch (e.g. 4096/24 ~= 171 > 128). Limit 1024. | ||
| SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '1024' | ||
| | ||
| sglang_config: | ||
| prefill: | ||
| # Model configuration | ||
| served-model-name: GLM-5-FP8 | ||
| trust-remote-code: true | ||
| quantization: fp8 | ||
| kv-cache-dtype: fp8_e4m3 | ||
| | ||
| # Disaggregation mode | ||
| disaggregation-mode: prefill | ||
| disaggregation-transfer-backend: nixl | ||
| | ||
| # Size limits | ||
| max-running-requests: 256 | ||
| cuda-graph-max-bs: 256 | ||
| mem-fraction-static: 0.7 | ||
| context-length: 9600 | ||
| chunked-prefill-size: 32768 | ||
| max-prefill-tokens: 8192 | ||
| | ||
| # Parallelism | ||
| tensor-parallel-size: 4 | ||
| data-parallel-size: 4 | ||
| expert-parallel-size: 1 | ||
| enable-dp-attention: true | ||
| enable-dp-lm-head: true | ||
| load-balance-method: total_tokens | ||
| | ||
| # Backend | ||
| nsa-decode-backend: trtllm | ||
| nsa-prefill-backend: trtllm | ||
| moe-runner-backend: flashinfer_trtllm | ||
| | ||
| # Other flags | ||
| enable-flashinfer-allreduce-fusion: true | ||
| disable-radix-cache: true | ||
| weight-loader-prefetch-checkpoints: true | ||
| model-loader-extra-config: '{"enable_multithread_load": true}' | ||
| | ||
| decode: | ||
| # Model configuration | ||
| served-model-name: GLM-5-FP8 | ||
| trust-remote-code: true | ||
| | ||
| quantization: fp8 | ||
| kv-cache-dtype: fp8_e4m3 | ||
| | ||
| # Disaggregation mode | ||
| disaggregation-mode: decode | ||
| disaggregation-transfer-backend: nixl | ||
| | ||
| # Memory and token limits | ||
| mem-fraction-static: 0.8 | ||
| context-length: 9600 | ||
| | ||
| # Backend | ||
| nsa-decode-backend: trtllm | ||
| nsa-prefill-backend: trtllm | ||
| # moe-runner-backend: "cutedsl" | ||
| | ||
| # Detokenizer | ||
| skip-tokenizer-init: true | ||
| stream-interval: 30 | ||
| | ||
| # Other flags | ||
| disable-radix-cache: true | ||
| weight-loader-prefetch-checkpoints: true | ||
| model-loader-extra-config: '{"enable_multithread_load": true}' | ||
| tensor-parallel-size: 24 | ||
| expert-parallel-size: 24 | ||
| data-parallel-size: 24 | ||
| enable-dp-lm-head: true | ||
| enable-dp-attention: true | ||
| moe-dense-tp-size: 1 | ||
| ep-num-redundant-experts: 32 | ||
| ep-dispatch-algorithm: static | ||
| moe-a2a-backend: deepep | ||
| deepep-mode: low_latency | ||
| deepep-config: /configs/deepep_config.json | ||
| max-running-requests: 8192 | ||
| cuda-graph-max-bs: 512 | ||
| speculative-algorithm: "EAGLE" | ||
| speculative-num-steps: 2 | ||
| speculative-eagle-topk: 1 | ||
| speculative-num-draft-tokens: 3 | ||
| health_check: | ||
| max_attempts: 360 | ||
| interval_seconds: 10 | ||
| | ||
| benchmark: | ||
| type: sa-bench | ||
| req_rate: inf | ||
| isl: 1024 | ||
| osl: 1024 | ||
| concurrencies: '8192' | ||
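
The sizing rule in the `decode_environment` comment and the node and worker counts under `resources` can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the numbers are copied from this recipe (`cuda-graph-max-bs: 512`, `data-parallel-size: 24`, `gpus_per_node: 4`).

```python
import math


def min_dispatch_tokens_per_rank(cuda_graph_max_bs: int, dp_size: int) -> int:
    """Smallest per-rank DeepEP dispatch buffer allowed by the rule in the
    recipe comment: ceil(cuda_graph_max_bs / dp_size)."""
    return math.ceil(cuda_graph_max_bs / dp_size)


def dispatch_buffer_ok(configured: int, cuda_graph_max_bs: int, dp_size: int,
                       limit: int = 1024) -> bool:
    """True when the configured value covers the requirement and stays within
    the limit of 1024 stated in the recipe comment."""
    required = min_dispatch_tokens_per_rank(cuda_graph_max_bs, dp_size)
    return required <= configured <= limit


def gpus_per_worker(nodes: int, gpus_per_node: int, workers: int) -> int:
    """GPUs available to each worker, assuming GPUs are split evenly."""
    total = nodes * gpus_per_node
    if total % workers != 0:
        raise ValueError("GPUs do not divide evenly across workers")
    return total // workers


# Decode block of this recipe: 512 / 24 rounds up to 22, well under 1024.
print(min_dispatch_tokens_per_rank(512, 24))
# Example from the recipe comment: 4096 / 24 rounds up to 171, above the default of 128.
print(min_dispatch_tokens_per_rank(4096, 24))
print(dispatch_buffer_ok(1024, 512, 24))

# Prefill: 12 nodes * 4 GPUs / 12 workers; decode: 6 nodes * 4 GPUs / 1 worker.
print(gpus_per_worker(12, 4, 12), gpus_per_worker(6, 4, 1))
```

In this recipe the per-worker GPU counts (4 for prefill, 24 for decode) happen to equal the `tensor-parallel-size` values of the two blocks; whether the launcher enforces that relationship is not stated in the diff.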
Comment on lines +147 to +152

Contributor

🔴 All 14 new MTP recipe YAMLs under [...]

Extended reasoning...

What the bug is

Every MTP YAML in this PR (14 files under [...])

benchmark:
  type: sa-bench
  req_rate: inf
  isl: <isl>
  osl: 1024
  concurrencies: '<N>'

There is no [...]

Why this is mandatory for MTP

For multi-node recipes consumed by sa-bench, the YAML key [...]

Why existing code doesn't prevent it

Nothing in the loader or runtime cross-checks [...]

Precedent

Every existing sglang multi-node MTP recipe in the repo sets this field: [...]

GLM-5 specifically requires chat-template formatting for EAGLE to perform as intended; the per-platform MTP scripts in this repo already encode that rule.

Step-by-step proof

Take [...]

Repeat verbatim for the other 13 files; same structure, same omission.

Fix

Add one line under the benchmark block:

benchmark:
  type: sa-bench
  req_rate: inf
  isl: <isl>
  osl: 1024
  concurrencies: '<N>'
  use_chat_template: true

Files to update: [...]
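
The comment notes that nothing in the loader or runtime performs this cross-check. A minimal sketch of one follows, assuming the recipe has already been parsed into nested dicts, that `sglang_config` sits under `backend`, and that `benchmark` is a top-level key. The indentation is not recoverable from the diff view, so that nesting is an assumption, and the function name is hypothetical rather than part of sa-bench or srt-slurm.

```python
def missing_chat_template(recipe: dict) -> bool:
    """Return True when a recipe enables speculative decoding on the decode
    worker but its benchmark block does not set use_chat_template: true.

    Assumed layout (not confirmed by the diff view):
      backend.sglang_config.decode["speculative-algorithm"]
      benchmark["use_chat_template"]
    """
    backend = recipe.get("backend") or {}
    decode = (backend.get("sglang_config") or {}).get("decode") or {}
    benchmark = recipe.get("benchmark") or {}

    uses_speculation = "speculative-algorithm" in decode
    return uses_speculation and benchmark.get("use_chat_template") is not True


# A recipe shaped like the one in this PR: EAGLE on decode, no chat template.
mtp_recipe = {
    "backend": {"sglang_config": {"decode": {"speculative-algorithm": "EAGLE"}}},
    "benchmark": {"type": "sa-bench", "isl": 1024, "osl": 1024},
}
print(missing_chat_template(mtp_recipe))  # flagged

# The same recipe after the fix proposed in the comment.
fixed = {**mtp_recipe, "benchmark": {**mtp_recipe["benchmark"], "use_chat_template": True}}
print(missing_chat_template(fixed))  # not flagged
```

A check of this kind could run over every file in an `mtp/` directory before submission; recipes without speculative flags pass through untouched.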
🔴 All 14 new GB300 MTP recipe YAMLs omit `SGLANG_ENABLE_SPEC_V2: '1'` from both `prefill_environment` and `decode_environment` blocks, even though every other GLM-5 / SGLang MTP path in this repo sets it explicitly: every existing `dsr1/b200-fp4/{1k1k,8k1k}/disagg/mtp/*.yaml` recipe, every single-node `*_mtp.sh` launcher including `benchmarks/single_node/fixed_seq_len/glm5_fp8_b300_mtp.sh:39`, and `benchmarks/multi_node/amd_utils/env.sh:156`. `runners/launch_gb300-nv.sh` does not inject it either, so the recipe YAML is the only entry point. Without it, EAGLE on `lmsysorg/sglang:v0.5.11-cu130` will run via the legacy spec-decoding path (or silently no-op with the NSA + DeepEP + DPA decode topology), producing decode behavior inconsistent with every other validated MTP benchmark in the repo and invalidating the new measurements. Fix: add `SGLANG_ENABLE_SPEC_V2: '1'` to both env blocks in every new MTP recipe (matching the `dsr1` MTP precedent).

Extended reasoning...
What the bug is

All 14 new MTP recipe YAMLs added by this PR (`benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{1k1k,8k1k}/disagg/mtp/*.yaml`) are missing the `SGLANG_ENABLE_SPEC_V2: '1'` environment variable in both `prefill_environment` and `decode_environment` blocks. This env var is the documented enablement gate for EAGLE/MTP speculative decoding in SGLang across this repo.

Why this is a bug: overwhelming precedent
Every other MTP launch path in this repo sets this variable explicitly:

- `benchmarks/multi_node/srt-slurm-recipes/sglang/dsr1/b200-fp4/{1k1k,8k1k}/disagg/mtp/*.yaml` sets `SGLANG_ENABLE_SPEC_V2: '1'` in both prefill and decode env blocks (e.g. `dsr1/b200-fp4/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml:48` and `:66`; 20 hits across 10 files).
- `benchmarks/single_node/fixed_seq_len/glm5_fp8_b300_mtp.sh:39` runs `export SGLANG_ENABLE_SPEC_V2=1` immediately before `sglang.launch_server --speculative-algorithm EAGLE`. Same in `glm5_fp8_b200_mtp.sh`, `glm5_fp8_mi355x_mtp.sh`, `glm5_fp4_b300_mtp.sh`, `glm5_fp4_b200_mtp.sh`.
- `benchmarks/multi_node/amd_utils/env.sh:156` exports it.
- `perf-changelog.yaml` describes every prior GLM-5 MTP entry (b300/b200/mi355x FP8 and FP4 variants) verbatim as "adds EAGLE speculative decoding ... behind SGLANG_ENABLE_SPEC_V2=1"; this is the maintainer-documented contract.

Why existing code doesn't catch it
`runners/launch_gb300-nv.sh` contains zero references to `SPEC_V2`, `speculative`, or `MTP`; it only runs `srtctl apply` on the recipe YAML. The recipe YAML's `prefill_environment` / `decode_environment` is the only place SGLang env vars reach the worker containers on this launch path. A missing entry is not silently filled in.

Root cause (confirmed by PR description)
The PR description states the new recipes are "byte-identical to the existing `stp/` siblings except for the EAGLE speculative-decoding flags on the decode block." The STP siblings don't need this env var (no spec decoding), so the copy carried the STP environment forward and the new EAGLE-specific env var was never added. Spot-check: `diff stp/1k1k_stp_hightpt_0.yaml mtp/1k1k_mtp_hightpt_0.yaml` shows the new MTP file is byte-identical to STP except for the name change and the four `speculative-*` keys appended to the decode block.

Step-by-step proof of impact
1. [...] `launch_gb300-nv.sh` for `glm5-fp8-gb300-dynamo-sglang-mtp`.
2. [...] `recipes/sglang/glm5/gb300-fp8/.../mtp/*.yaml` into srt-slurm and runs `srtctl apply`. It does not inject `SGLANG_ENABLE_SPEC_V2`.
3. [...] `prefill_environment` / `decode_environment` from the YAML and exports them into the worker containers. Neither block contains `SGLANG_ENABLE_SPEC_V2`.
4. [...] `--speculative-algorithm EAGLE` but without `SGLANG_ENABLE_SPEC_V2=1`; it routes EAGLE through the legacy v1 spec-decoding code path (or silently disables spec for the NSA + DeepEP + DPA decode topology, since v2 is the implementation that supports this combination in v0.5.11).
5. [...] `perf-changelog.yaml`, and different from the GB300 single-node sibling `glm5_fp8_b300_mtp.sh`.

The whole point of `-mtp` is to measure EAGLE MTP performance; without `SPEC_V2=1` the published numbers do not represent the intended config, defeating the purpose of the entry and breaking the apples-to-apples comparison with the existing MTP benchmarks.

Fix
Add `SGLANG_ENABLE_SPEC_V2: '1'` to both `prefill_environment` and `decode_environment` in every new MTP recipe (28 env blocks across 14 files), matching the `dsr1` MTP recipes exactly.
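
Since the same two-block change repeats across 14 files, it may be easier to apply and verify programmatically. The following is a rough sketch, assuming each recipe is already loaded as a nested dict with `prefill_environment` and `decode_environment` under `backend` (an assumption, since the diff view does not preserve indentation). The helper names are hypothetical and do not exist in this repo.

```python
import copy

ENV_BLOCKS = ("prefill_environment", "decode_environment")
FLAG = "SGLANG_ENABLE_SPEC_V2"


def blocks_missing_flag(recipe: dict) -> list[str]:
    """List the env blocks in which the flag is absent or not set to '1'."""
    backend = recipe.get("backend") or {}
    return [name for name in ENV_BLOCKS
            if (backend.get(name) or {}).get(FLAG) != "1"]


def with_flag(recipe: dict) -> dict:
    """Return a copy of the recipe with the flag set to '1' in both env
    blocks; the input dict is left unmodified."""
    patched = copy.deepcopy(recipe)
    backend = patched.setdefault("backend", {})
    for name in ENV_BLOCKS:
        backend.setdefault(name, {})[FLAG] = "1"
    return patched


# Env blocks abbreviated from the recipe in this PR.
recipe = {
    "backend": {
        "prefill_environment": {"PYTHONUNBUFFERED": "1"},
        "decode_environment": {"PYTHONUNBUFFERED": "1"},
    }
}
print(blocks_missing_flag(recipe))             # both blocks reported
print(blocks_missing_flag(with_flag(recipe)))  # nothing reported
```

Keeping the value as the string `'1'` mirrors how the other environment entries in the recipe are quoted.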