DFlash: support vLLM inference backend by zixi-qi · Pull Request #125 · lightseekorg/TorchSpec

zixi-qi · 2026-06-23T00:58:12Z

Summary

Adds vLLM as a supported inference backend for DFlash draft training. Previously DFlash was gated to inference_engine_type='sgl' only.

torchspec/train_entry.py — _validate_and_configure_dflash now accepts inference_engine_type in ('vllm', 'sgl') instead of sgl-only, with a clearer error message naming the allowed values.
configs/vllm_qwen3_8b_dflash.yaml — new DFlash training config for Qwen3-8B on the vLLM backend; the vLLM counterpart to the existing sglang_qwen3_8b_dflash.yaml (same DFlash architecture/hyperparameters, target hidden states generated by vLLM via extract_hidden_states + MooncakeHiddenStatesConnector). 4-GPU layout: 2 inference (tp_size=2) + 2 training (FSDP FULL_SHARD).
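As a rough illustration of the relaxed gate described above, the check could look something like the following. This is a minimal sketch, not the code in `torchspec/train_entry.py`: the helper name `validate_dflash_engine`, the constant `ALLOWED_DFLASH_ENGINES`, and the error wording are all hypothetical; only the allowed values `'vllm'` and `'sgl'` come from this PR.

```python
# Hypothetical stand-in for the backend check inside _validate_and_configure_dflash.
# Only the two allowed values are taken from the PR description.
ALLOWED_DFLASH_ENGINES = ("vllm", "sgl")


def validate_dflash_engine(inference_engine_type: str) -> str:
    """Return the engine type if DFlash supports it, else raise ValueError."""
    if inference_engine_type not in ALLOWED_DFLASH_ENGINES:
        allowed = ", ".join(repr(e) for e in ALLOWED_DFLASH_ENGINES)
        # Name the allowed values so a misconfiguration is easy to fix.
        raise ValueError(
            f"DFlash requires inference_engine_type in ({allowed}); "
            f"got {inference_engine_type!r}"
        )
    return inference_engine_type
```

Keeping the allowed values in one tuple means the membership test and the error message cannot drift apart if another backend is added later.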

Test plan

ruff check + ruff format --check pass
pre-commit run --files ... passes (trailing-whitespace, end-of-file-fixer, check-yaml, ruff, ruff-format)
New config parses as valid YAML
End-to-end DFlash+vLLM training run on a 4-GPU node

E2E verification

Ran the new config end-to-end on a GB300 node (aarch64, CUDA 13, torch 2.11, vLLM 0.23.0):

./examples/qwen3-8b-single-node/run.sh configs/vllm_qwen3_8b_dflash.yaml

Result: completed cleanly — 3 epochs / 750 steps in 288.7s, checkpoint saved to outputs/qwen3-8b-dflash-vllm/checkpoints/iter_0000751, vLLM engine shut down gracefully.

| Metric | Value (final / steady-state) |
|---|---|
| Loss | 0.018 |
| Accuracy | 0.992 |
| Accepted length (`acc_len`) | ~14.3 (block_size 16) |
| Throughput | ~4.5 step/s |
| Per-step compute | fwd ~83ms · bwd ~110ms · opt ~14ms (step ~0.21s) |
| Inference / training balance | avg 10.4 / 10.4 entries/s, `wait=0.0s` throughout |
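The timing figures above can be cross-checked with a little arithmetic. The sketch below uses only numbers reported in this PR; the variable names are illustrative, and the reading of the gap between the whole-run and steady-state rates (startup and checkpointing time, presumably) is an assumption, since the PR does not break down the 288.7s wall time.

```python
# Per-step phase timings reported in the table, in milliseconds.
fwd_ms, bwd_ms, opt_ms = 83, 110, 14

# Sum of the phases: 207 ms, consistent with the reported "step ~0.21s".
step_s = (fwd_ms + bwd_ms + opt_ms) / 1000

# Upper bound on step rate if nothing but compute took time (~4.83 step/s),
# slightly above the reported steady-state ~4.5 step/s.
compute_bound_steps_per_s = 1 / step_s

# Whole-run average: 750 steps in 288.7 s (~2.60 step/s). This is lower than
# the steady-state rate, presumably because the wall time also covers engine
# startup and checkpoint saving (an assumption, not stated in the PR).
overall_steps_per_s = 750 / 288.7

print(f"step time: {step_s:.3f} s")
print(f"compute-bound rate: {compute_bound_steps_per_s:.2f} step/s")
print(f"whole-run rate: {overall_steps_per_s:.2f} step/s")
```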

The disaggregated pipeline behaved as designed: the vLLM engine (tp_size=2) streamed target hidden states through Mooncake, and the sample-pool backpressure cycled normally (fills to 64/64 → pauses generation → resumes at ~52–56), with the trainer never starving (wait=0.0s).
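The fill, pause, resume cycle described above is a high/low watermark (hysteresis) pattern. The following is a minimal single-threaded sketch of that pattern, not TorchSpec's sample pool: the class `SamplePool`, its methods, and the `resume_at=56` threshold are hypothetical, chosen to mirror the observed 64/64 fill and the ~52–56 resume range.

```python
from collections import deque


class SamplePool:
    """Bounded pool that pauses the producer when full and resumes it
    once the consumer has drained it to a lower watermark."""

    def __init__(self, capacity: int = 64, resume_at: int = 56):
        self.capacity = capacity    # high watermark: pause generation here
        self.resume_at = resume_at  # low watermark: resume generation here
        self._items = deque()
        self.paused = False

    def __len__(self) -> int:
        return len(self._items)

    def put(self, item) -> None:
        """Producer side (inference). Refuses new entries while paused."""
        if self.paused:
            raise RuntimeError("generation is paused")
        self._items.append(item)
        if len(self._items) >= self.capacity:
            self.paused = True

    def get(self):
        """Consumer side (trainer). Unpauses once the pool drains enough."""
        item = self._items.popleft()
        if self.paused and len(self._items) <= self.resume_at:
            self.paused = False
        return item
```

The gap between the two watermarks keeps the producer from toggling on every consumed sample: after filling to 64, generation stays paused until eight entries have been consumed.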

🤖 Generated with Claude Code

Relax the DFlash backend gate in `_validate_and_configure_dflash` to accept `inference_engine_type` in ('vllm', 'sgl') instead of 'sgl'-only, and add a vLLM DFlash training config for Qwen3-8B (`configs/vllm_qwen3_8b_dflash.yaml`) as the vLLM counterpart to the existing SGLang DFlash config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: zixi-qi <zixi@inferact.ai>

zixi-qi force-pushed the dflash-vllm-support branch from 2335a76 to a3614b9 on June 23, 2026 01:00

zixi-qi marked this pull request as ready for review June 23, 2026 01:02

yubofredwang approved these changes Jun 23, 2026

yubofredwang merged commit a81000a into lightseekorg:main Jun 23, 2026
2 checks passed
