Add Domino draft model training by SeaTuKeMa · Pull Request #118 · lightseekorg/TorchSpec

SeaTuKeMa · 2026-06-10T20:11:46Z

Summary

- Add a Domino draft model with a DFlash backbone and a causal low-rank correction head.
- Add the base-anchored curriculum training wrapper and trainer dispatch.
- Register Domino configuration/model loading and provide a reference draft config.
- Add CPU tests for configuration, forward behavior, curriculum semantics, gradient flow, and learning.

Why

This adds training support for the Domino speculative decoding method while reusing TorchSpec's existing DFlash anchor sampling, masking, metrics, and distributed trainer path.

Addresses #114.

Usage

# Prepare data
python scripts/tools/prepare_perfectblend.py \
  --output data/perfectblend_50k.jsonl --sample-size 50000

# Train (8x H100: 4 inference + 4 training FSDP)
# Use a persistent volume such as /workspace/checkpoints on RunPod.
RUN_NAME=qwen3-8b-domino-8h100-500 \
TRAIN_DATA_PATH=data/perfectblend_50k.jsonl \
OUTPUT_ROOT=/workspace/checkpoints \
NUM_TRAIN_STEPS=500 \
SAVE_INTERVAL=250 \
MAX_CHECKPOINTS=1 \
SAVE_OPTIMIZER=false \
./examples/qwen3-8b-domino-8h100/run.sh

# Verify checkpoint output
cat /workspace/checkpoints/qwen3-8b-domino-8h100-500/checkpoints/latest_checkpointed_iteration.txt
ls /workspace/checkpoints/qwen3-8b-domino-8h100-500/checkpoints/iter_NNNNNNN
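
The same check can be scripted. The sketch below assumes, based only on the run results reported in this PR, that `latest_checkpointed_iteration.txt` holds a single integer and that checkpoint directories are named `iter_` followed by the iteration zero-padded to seven digits. The function name `resolve_latest_checkpoint` is hypothetical and is not part of TorchSpec.

```python
from pathlib import Path


def resolve_latest_checkpoint(checkpoint_dir: str) -> Path:
    """Return the checkpoint directory named by the latest-step marker.

    Assumptions (not confirmed by the TorchSpec source): the marker file
    contains one integer, and directories follow the iter_NNNNNNN pattern
    with seven digits, as in iter_0000501.
    """
    root = Path(checkpoint_dir)
    marker = root / "latest_checkpointed_iteration.txt"
    iteration = int(marker.read_text().strip())
    target = root / f"iter_{iteration:07d}"
    if not target.is_dir():
        raise FileNotFoundError(f"marker points at a missing directory: {target}")
    return target
```

Calling it on `/workspace/checkpoints/qwen3-8b-domino-8h100-500/checkpoints` would raise if the marker and the directory listing disagree, which is the failure the two manual commands above are meant to catch.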

The run script accepts normal config overrides after the optional config path, for example:

./examples/qwen3-8b-domino-8h100/run.sh \
  configs/sglang_qwen3_8b_domino_2gpu.yaml \
  training.num_train_steps=20
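
For readers unfamiliar with dotted overrides, here is a minimal sketch of how a `key.subkey=value` argument can be folded into a nested config. It is an illustration only: the PR does not show how TorchSpec parses overrides, and `apply_overrides` and `parse_scalar` are hypothetical names.

```python
from typing import Any


def parse_scalar(raw: str) -> Any:
    """Convert an override value to bool, int, or float when possible."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw  # fall back to the raw string


def apply_overrides(config: dict[str, Any], overrides: list[str]) -> dict[str, Any]:
    """Apply dotted key=value overrides to a nested dict in place."""
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections that do not exist yet.
            node = node.setdefault(name, {})
        node[leaf] = parse_scalar(raw)
    return config


config = {"training": {"num_train_steps": 500}}
apply_overrides(config, ["training.num_train_steps=20"])
```

After the call, `config["training"]["num_train_steps"]` is the integer `20`, mirroring the command-line example above.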

Validation

- pytest tests/test_domino.py tests/test_dflash.py -q (76 passed)
- ruff check on all changed Python files
- git diff --check
- JSON and Python syntax validation

GPU Phase 1

Environment: Qwen3-8B Domino training with SGLang target inference on 2x A100-SXM4-80GB.
Stability:
- 100-step calibration completed; checkpoints saved at steps 50 and 100.
- 500-step run completed; checkpoint rotation saved through final iter_0000501.
- Final checkpoint state was verified: model, optimizer, LR scheduler, RNG, metadata, and latest-step marker.
Notes:
- SGLang max-context truncation warnings appeared for a few 3070-3072 token requests but did not stop training.
- A Ray queue actor warning appeared during shutdown after the final checkpoint had already been saved.

Run	Data	Steps	Time	Inference	Training	Checkpoint result
Phase 1A	2K PerfectBlend subset	100/100	705.5s	2.4 entries/s	2.3 entries/s	step 50, step 100
Phase 1B	10K PerfectBlend subset	500/500	2541.2s	3.2 entries/s	3.1 entries/s	final `iter_0000501`

Phase 1B 20-step window	Avg loss	Avg acceptance	Avg accepted length
Steps 1-20	13.504	0.019	0.020
Steps 81-100	6.569	0.083	0.108
Steps 181-200	5.484	0.175	0.303
Steps 281-300	4.917	0.229	0.436
Steps 381-400	4.666	0.251	0.499
Steps 481-500	4.557	0.253	0.512
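
The windowed figures above are averages over inclusive step ranges. A minimal sketch of that aggregation follows, assuming per-step metrics are available as a plain list indexed from step 1. `window_mean` is a hypothetical helper, not the code used to produce the table, and the sample values are synthetic.

```python
def window_mean(values: list[float], first_step: int, last_step: int) -> float:
    """Mean of per-step values over an inclusive, 1-indexed step range."""
    if not 1 <= first_step <= last_step <= len(values):
        raise ValueError("window lies outside the logged range")
    window = values[first_step - 1:last_step]
    return sum(window) / len(window)


# Synthetic per-step losses, for illustration only (not data from this PR).
synthetic_loss = [4.0, 3.0, 2.0, 1.0]
print(window_mean(synthetic_loss, 3, 4))  # prints 1.5
```

For the table above, the calls would be of the form `window_mean(loss, 481, 500)` over the logged per-step loss.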

GPU Phase 2

Environment: Qwen3-8B Domino training with SGLang target inference on 8x H100-SXM.
Configuration: 4 SGLang inference engines plus 4 FSDP training ranks via examples/qwen3-8b-domino-8h100/run.sh.
Data: PerfectBlend 50K request produced 47,484 valid normalized samples and 47,265 tokenized train samples.
Result:
- 500/500 steps completed in 614.6s.
- Checkpoints were written directly to a RunPod Network Volume under /workspace/checkpoints.
- Step 250 checkpoint saved successfully at iter_0000251.
- Final checkpoint saved successfully at iter_0000501; latest_checkpointed_iteration.txt reported 501.
- Run completed with average inference throughput 13.1 entries/s and training throughput 13.0 entries/s.
Profiling:
- TorchSpec perf metrics were enabled for all steps.
- 1-second GPU sampling was collected during the run for later bottleneck review.
- Per-step progress reported steady-state training throughput around 20-22 entries/s while the async sample pool stayed full.

Phase 2 window	Avg loss	Avg acceptance	Avg accepted length
Steps 1-20	13.987	0.017	0.017
Steps 231-250	5.147	0.207	0.360
Steps 451-500	4.447	0.271	0.541
Steps 481-500	4.426	0.271	0.546

loss:       13.987 -> 5.147 -> 4.426
acceptance:  0.017 -> 0.207 -> 0.271
acc_len:     0.017 -> 0.360 -> 0.546
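
To put these trajectories in relative terms, the short sketch below computes the fractional change of each metric from the first to the last window, plus the average wall-clock time per step for Phase 1B and Phase 2, using only numbers reported in this description. The helper names are hypothetical. Because the two phases differ in hardware and data, the time ratio is only a rough comparison.

```python
def relative_change(start: float, end: float) -> float:
    """Fractional change from start to end (negative means a decrease)."""
    return (end - start) / start


def seconds_per_step(total_seconds: float, steps: int) -> float:
    """Average wall-clock seconds per training step."""
    return total_seconds / steps


# First-window and last-window averages from the Phase 2 table.
phase2 = {
    "loss": (13.987, 4.426),
    "acceptance": (0.017, 0.271),
    "acc_len": (0.017, 0.546),
}
for name, (start, end) in phase2.items():
    print(f"{name}: {relative_change(start, end):+.1%}")

# Reported totals: Phase 1B took 2541.2s and Phase 2 took 614.6s, both for 500 steps.
phase1b_step = seconds_per_step(2541.2, 500)
phase2_step = seconds_per_step(614.6, 500)
print(f"Phase 1B: {phase1b_step:.2f} s/step, Phase 2: {phase2_step:.2f} s/step")
print(f"wall-clock ratio: {phase1b_step / phase2_step:.2f}x")
```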

Scope

This PR implements training support. Serving/export integration for the fused Domino head remains follow-up work.

jianuo-huang · 2026-06-16T05:59:37Z

Thanks for adding Domino support! This is exciting to see and very encouraging.
I’ll try it out when I get some time and share any feedback or issues I find.

Signed-off-by: TukeMa <fivedguy001@gmail.com>

SeaTuKeMa · 2026-06-23T08:43:34Z

@jianuo-huang @yubofredwang

I finished running experiments on an 8x H100 node and verified that the implementation works. The results are attached in the PR description; this is ready for review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d021fcc277


chatgpt-codex-connector · 2026-06-23T08:45:26Z

+        objective_weights = weight_mask
+        if (
+            self.loss_objective == "decay"


Honor dpace objective for Domino loss

When a Domino run is configured with training.dflash_loss_objective=dpace, DFlashModel.__init__ accepts that value but this forward path only applies the decay branch; objective_weights stays equal to the binary mask, so the requested D-PACE weighting is silently disabled for both base and final losses. This affects any Domino experiment that reuses the existing DFlash dpace objective override and makes its results incomparable to DFlash D-PACE runs; either implement the dpace branch here or reject the option for Domino.
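
One way to take the "reject the option" route is to validate the objective at construction time so an unsupported value fails loudly instead of being silently ignored. The sketch below is a standalone illustration: `check_loss_objective` is a hypothetical helper, and the supported set passed in the example is an assumption, since the full list of objectives Domino handles is not shown in this thread.

```python
def check_loss_objective(objective: str, supported: frozenset[str]) -> str:
    """Return the objective unchanged, or raise if this model cannot honor it."""
    if objective not in supported:
        raise ValueError(
            f"loss objective {objective!r} is not implemented for this model; "
            f"supported values: {sorted(supported)}"
        )
    return objective


# Assumed supported set, for illustration only.
domino_supported = frozenset({"decay"})

check_loss_objective("decay", domino_supported)  # passes
try:
    check_loss_objective("dpace", domino_supported)
except ValueError as err:
    print(err)
```

Failing at configuration time keeps a Domino run from being labeled as D-PACE when only the binary mask was applied.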


yubofredwang · 2026-06-27T01:52:41Z

@SeaTuKeMa thanks for the great work! I will run e2e over the weekend

SeaTuKeMa force-pushed the domino-implement branch from 955f7c3 to 2fbf244 on June 10, 2026 20:12

SeaTuKeMa changed the title ~~[codex] Add Domino draft model training~~ Add Domino draft model training Jun 10, 2026

SeaTuKeMa mentioned this pull request Jun 10, 2026

Support train Domino draft model? #114

Open

feat: add Domino draft model training

2175da7

Signed-off-by: TukeMa <fivedguy001@gmail.com>

SeaTuKeMa force-pushed the domino-implement branch from b7af636 to 2175da7 on June 23, 2026 02:53

Add Domino 8-GPU validation launcher

d021fcc

SeaTuKeMa marked this pull request as ready for review June 23, 2026 08:41

chatgpt-codex-connector Bot reviewed Jun 23, 2026

SeaTuKeMa wants to merge 2 commits into lightseekorg:main from SeaTuKeMa:domino-implement


3 participants
