DFlash: support vLLM inference backend #125

Merged: yubofredwang merged 1 commit into lightseekorg:main from zixi-qi:dflash-vllm-support on Jun 23, 2026

@zixi-qi (Contributor) commented Jun 23, 2026

Summary

Adds vLLM as a supported inference backend for DFlash draft training. Previously DFlash was gated to inference_engine_type='sgl' only.

  • torchspec/train_entry.py: _validate_and_configure_dflash now accepts inference_engine_type in ('vllm', 'sgl') instead of sgl-only, with a clearer error message naming the allowed values.
  • configs/vllm_qwen3_8b_dflash.yaml: new DFlash training config for Qwen3-8B on the vLLM backend. It is the vLLM counterpart to the existing sglang_qwen3_8b_dflash.yaml, with the same DFlash architecture and hyperparameters; target hidden states are generated by vLLM via extract_hidden_states + MooncakeHiddenStatesConnector. 4-GPU layout: 2 inference (tp_size=2) + 2 training (FSDP FULL_SHARD).
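
For readers who have not opened the diff, the relaxed gate can be pictured roughly as follows. This is a minimal sketch, not the code in torchspec/train_entry.py: the helper name `check_dflash_engine`, the constant `ALLOWED_DFLASH_ENGINES`, and the exact error wording are assumptions made for illustration; only the allowed values ('vllm', 'sgl') come from the PR.

```python
# Hypothetical stand-in for the backend check inside
# _validate_and_configure_dflash; names and message text are illustrative.
ALLOWED_DFLASH_ENGINES = ("vllm", "sgl")


def check_dflash_engine(inference_engine_type: str) -> str:
    """Return the engine type if DFlash supports it, else raise ValueError."""
    if inference_engine_type not in ALLOWED_DFLASH_ENGINES:
        # Name the allowed values so a misconfiguration is easy to fix.
        allowed = ", ".join(repr(e) for e in ALLOWED_DFLASH_ENGINES)
        raise ValueError(
            f"DFlash requires inference_engine_type in ({allowed}), "
            f"got {inference_engine_type!r}"
        )
    return inference_engine_type
```

Under this sketch, `check_dflash_engine("vllm")` passes through unchanged, while any other value raises a `ValueError` that lists both accepted backends.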

Test plan

  • ruff check + ruff format --check pass
  • pre-commit run --files ... passes (trailing-whitespace, end-of-file-fixer, check-yaml, ruff, ruff-format)
  • New config parses as valid YAML
  • End-to-end DFlash+vLLM training run on a 4-GPU node

E2E verification

Ran the new config end-to-end on a GB300 node (aarch64, CUDA 13, torch 2.11, vLLM 0.23.0):

./examples/qwen3-8b-single-node/run.sh configs/vllm_qwen3_8b_dflash.yaml

Result: completed cleanly — 3 epochs / 750 steps in 288.7s, checkpoint saved to outputs/qwen3-8b-dflash-vllm/checkpoints/iter_0000751, vLLM engine shut down gracefully.

| Metric | Value (final / steady-state) |
| --- | --- |
| Loss | 0.018 |
| Accuracy | 0.992 |
| Accepted length (acc_len) | ~14.3 (block_size 16) |
| Throughput | ~4.5 step/s |
| Per-step compute | fwd ~83ms · bwd ~110ms · opt ~14ms (step ~0.21s) |
| Inference / training balance | avg 10.4 / 10.4 entries/s, wait=0.0s throughout |
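
As a quick consistency check on the timing figures above, the arithmetic can be spelled out as below. The numbers are taken from this PR; treating the gap between the compute-only bound and the reported throughput as non-compute overhead is an assumption, and the 288.7s wall-clock figure may include startup and shutdown that the steady-state numbers do not.

```python
# Per-step timings reported in the table, in milliseconds.
fwd_ms, bwd_ms, opt_ms = 83, 110, 14

# Sum of the three phases: 207 ms, which matches "step ~0.21s".
step_ms = fwd_ms + bwd_ms + opt_ms

# Upper bound on throughput if a step were pure compute (about 4.83 step/s).
compute_bound_steps_per_s = 1000 / step_ms

# Reported steady-state throughput as a fraction of that bound (about 0.93).
reported_steps_per_s = 4.5
efficiency = reported_steps_per_s / compute_bound_steps_per_s

# Whole-run average from "750 steps in 288.7s" (about 2.6 step/s).
# Assumption: the lower value reflects time outside steady-state training.
whole_run_steps_per_s = 750 / 288.7
```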

The disaggregated pipeline behaved as designed: the vLLM engine (tp_size=2) streamed target hidden states through Mooncake, and the sample-pool backpressure cycled normally (fills to 64/64 → pauses generation → resumes at ~52–56), with the trainer never starving (wait=0.0s).
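
The fill, pause, and resume cycle described above is a high/low watermark (hysteresis) pattern. A minimal sketch of that pattern follows; it is not the repository's sample pool. The class name `SamplePoolGate`, its methods, and the resume threshold of 56 are assumptions for illustration (the run only shows generation resuming somewhere around 52 to 56 entries).

```python
from dataclasses import dataclass


@dataclass
class SamplePoolGate:
    """Hysteresis gate: pause at capacity, resume at or below a low watermark."""

    capacity: int = 64      # pool size observed in the run (64/64)
    resume_at: int = 56     # assumed low watermark, hypothetical value
    size: int = 0
    paused: bool = False

    def _update(self) -> None:
        # Between the two watermarks the previous state is kept (hysteresis).
        if self.size >= self.capacity:
            self.paused = True
        elif self.size <= self.resume_at:
            self.paused = False

    def try_put(self) -> bool:
        """Producer side: add one entry unless generation is paused."""
        if self.paused:
            return False
        self.size += 1
        self._update()
        return True

    def take(self, n: int = 1) -> int:
        """Consumer side: remove up to n entries and return the count taken."""
        taken = min(n, self.size)
        self.size -= taken
        self._update()
        return taken
```

Keeping a gap between the pause and resume thresholds avoids toggling generation on every single entry the trainer consumes, which is one plausible reason for the 64 versus roughly 52 to 56 levels seen in the logs.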

🤖 Generated with Claude Code

Relax the DFlash backend gate in `_validate_and_configure_dflash` to accept
`inference_engine_type` in ('vllm', 'sgl') instead of 'sgl'-only, and add a
vLLM DFlash training config for Qwen3-8B (`configs/vllm_qwen3_8b_dflash.yaml`)
as the vLLM counterpart to the existing SGLang DFlash config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: zixi-qi <zixi@inferact.ai>
@zixi-qi force-pushed the dflash-vllm-support branch from 2335a76 to a3614b9 on June 23, 2026 01:00
@zixi-qi marked this pull request as ready for review on June 23, 2026 01:02
@yubofredwang merged commit a81000a into lightseekorg:main on Jun 23, 2026
2 checks passed