Add Domino draft model training #118

Open
SeaTuKeMa wants to merge 2 commits into lightseekorg:main from SeaTuKeMa:domino-implement

@SeaTuKeMa SeaTuKeMa commented Jun 10, 2026

Summary

  • add a Domino draft model with a DFlash backbone and causal low-rank correction head
  • add the base-anchored curriculum training wrapper and trainer dispatch
  • register Domino configuration/model loading and provide a reference draft config
  • add CPU tests for configuration, forward behavior, curriculum semantics, gradient flow, and learning

Why

This adds training support for the Domino speculative decoding method while reusing TorchSpec's existing DFlash anchor sampling, masking, metrics, and distributed trainer path.

Addresses #114.

Usage

# Prepare data
python scripts/tools/prepare_perfectblend.py \
  --output data/perfectblend_50k.jsonl --sample-size 50000

# Train (8x H100: 4 inference + 4 training FSDP)
# Use a persistent volume such as /workspace/checkpoints on RunPod.
RUN_NAME=qwen3-8b-domino-8h100-500 \
TRAIN_DATA_PATH=data/perfectblend_50k.jsonl \
OUTPUT_ROOT=/workspace/checkpoints \
NUM_TRAIN_STEPS=500 \
SAVE_INTERVAL=250 \
MAX_CHECKPOINTS=1 \
SAVE_OPTIMIZER=false \
./examples/qwen3-8b-domino-8h100/run.sh

# Verify checkpoint output
cat /workspace/checkpoints/qwen3-8b-domino-8h100-500/checkpoints/latest_checkpointed_iteration.txt
ls /workspace/checkpoints/qwen3-8b-domino-8h100-500/checkpoints/iter_NNNNNNN
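The verification step above can also be scripted. The following is a minimal sketch, assuming the layout shown in this description: `latest_checkpointed_iteration.txt` holds a bare integer, and checkpoint directories are named `iter_` followed by the iteration zero-padded to 7 digits (as in `iter_0000501`). The helper name `resolve_latest_checkpoint` is hypothetical and is not part of TorchSpec.

```python
from pathlib import Path


def resolve_latest_checkpoint(checkpoint_dir: str) -> Path:
    """Return the checkpoint directory named by the latest-iteration marker.

    Assumptions (taken from the paths in this PR description, not from the
    TorchSpec source): the marker file contains a single integer, and
    directories are named iter_<7-digit zero-padded iteration>.
    """
    root = Path(checkpoint_dir)
    marker = root / "latest_checkpointed_iteration.txt"
    iteration = int(marker.read_text().strip())
    candidate = root / f"iter_{iteration:07d}"
    if not candidate.is_dir():
        raise FileNotFoundError(f"marker points at a missing directory: {candidate}")
    return candidate
```

Calling `resolve_latest_checkpoint("/workspace/checkpoints/qwen3-8b-domino-8h100-500/checkpoints")` would fail loudly if the marker and the directory listing disagree, which is the condition the `cat` and `ls` commands check by eye.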

The run script accepts normal config overrides after the optional config path, for example:

./examples/qwen3-8b-domino-8h100/run.sh \
  configs/sglang_qwen3_8b_domino_2gpu.yaml \
  training.num_train_steps=20
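To illustrate what a dotted override such as `training.num_train_steps=20` means, here is a minimal sketch of applying one `key.path=value` string to a nested dictionary. This is an illustration only: the PR does not show how the run script parses overrides, and the functions `apply_override` and `_parse_scalar` are hypothetical names, not TorchSpec APIs.

```python
from typing import Any


def _parse_scalar(text: str) -> Any:
    """Interpret a raw override value as bool, int, float, or plain string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def apply_override(config: dict[str, Any], override: str) -> None:
    """Apply one 'a.b.c=value' override to a nested dict in place."""
    dotted_key, sep, raw_value = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        # Create intermediate sections that do not exist yet.
        node = node.setdefault(name, {})
    node[leaf] = _parse_scalar(raw_value)


config = {"training": {"num_train_steps": 500}}
apply_override(config, "training.num_train_steps=20")
```

After the call, `config["training"]["num_train_steps"]` holds the integer 20, so later overrides on the command line replace values loaded from the YAML file.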

Validation

  • pytest tests/test_domino.py tests/test_dflash.py -q (76 passed)
  • ruff check on all changed Python files
  • git diff --check
  • JSON and Python syntax validation

GPU Phase 1

  • Environment: Qwen3-8B Domino training with SGLang target inference on 2x A100-SXM4-80GB.
  • Stability:
    • 100-step calibration completed; checkpoints saved at steps 50 and 100.
    • 500-step run completed; checkpoint rotation saved through final iter_0000501.
    • Final checkpoint state was verified: model, optimizer, LR scheduler, RNG, metadata, and latest-step marker.
  • Notes:
    • SGLang max-context truncation warnings appeared for a few 3070-3072 token requests but did not stop training.
    • A Ray queue actor warning appeared during shutdown after the final checkpoint had already been saved.

| Run | Data | Steps | Time | Inference | Training | Checkpoint result |
|---|---|---|---|---|---|---|
| Phase 1A | 2K PerfectBlend subset | 100/100 | 705.5s | 2.4 entries/s | 2.3 entries/s | step 50, step 100 |
| Phase 1B | 10K PerfectBlend subset | 500/500 | 2541.2s | 3.2 entries/s | 3.1 entries/s | final iter_0000501 |

| Phase 1B 20-step window | Avg loss | Avg acceptance | Avg accepted length |
|---|---|---|---|
| Steps 1-20 | 13.504 | 0.019 | 0.020 |
| Steps 81-100 | 6.569 | 0.083 | 0.108 |
| Steps 181-200 | 5.484 | 0.175 | 0.303 |
| Steps 281-300 | 4.917 | 0.229 | 0.436 |
| Steps 381-400 | 4.666 | 0.251 | 0.499 |
| Steps 481-500 | 4.557 | 0.253 | 0.512 |

GPU Phase 2

  • Environment: Qwen3-8B Domino training with SGLang target inference on 8x H100-SXM.
  • Configuration: 4 SGLang inference engines plus 4 FSDP training ranks via examples/qwen3-8b-domino-8h100/run.sh.
  • Data: PerfectBlend 50K request produced 47,484 valid normalized samples and 47,265 tokenized train samples.
  • Result:
    • 500/500 steps completed in 614.6s.
    • Checkpoints were written directly to a RunPod Network Volume under /workspace/checkpoints.
    • Step 250 checkpoint saved successfully at iter_0000251.
    • Final checkpoint saved successfully at iter_0000501; latest_checkpointed_iteration.txt reported 501.
    • Run completed with average inference throughput 13.1 entries/s and training throughput 13.0 entries/s.
  • Profiling:
    • TorchSpec perf metrics were enabled for all steps.
    • 1-second GPU sampling was collected during the run for later bottleneck review.
    • Per-step progress reported steady-state training throughput around 20-22 entries/s while the async sample pool stayed full.

| Phase 2 window | Avg loss | Avg acceptance | Avg accepted length |
|---|---|---|---|
| Steps 1-20 | 13.987 | 0.017 | 0.017 |
| Steps 231-250 | 5.147 | 0.207 | 0.360 |
| Steps 451-500 | 4.447 | 0.271 | 0.541 |
| Steps 481-500 | 4.426 | 0.271 | 0.546 |

loss:       13.987 -> 5.147 -> 4.426
acceptance:  0.017 -> 0.207 -> 0.271
acc_len:     0.017 -> 0.360 -> 0.546
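
For a rough sense of scale across the reported runs, the wall-clock totals can be turned into average seconds per step. This is a back-of-the-envelope sketch using only the step counts and times quoted in the Phase 1 and Phase 2 results. It assumes the totals cover the whole run (possibly including checkpoint saves), and because the runs differ in hardware and data, the ratio is a coarse comparison rather than a controlled benchmark.

```python
# (steps, total wall-clock seconds) copied from the GPU Phase 1 and Phase 2 results.
runs = {
    "Phase 1A (2x A100)": (100, 705.5),
    "Phase 1B (2x A100)": (500, 2541.2),
    "Phase 2 (8x H100)": (500, 614.6),
}


def seconds_per_step(steps: int, total_seconds: float) -> float:
    """Average wall-clock seconds per training step."""
    return total_seconds / steps


for name, (steps, total_seconds) in runs.items():
    print(f"{name}: {seconds_per_step(steps, total_seconds):.2f} s/step")

# Ratio of total wall-clock time for the two 500-step runs.
speedup = runs["Phase 1B (2x A100)"][1] / runs["Phase 2 (8x H100)"][1]
print(f"Phase 1B to Phase 2 wall-clock ratio: {speedup:.2f}x")
```

The ratio comes out slightly above 4x, which is in line with the reported training throughput moving from 3.1 to 13.0 entries/s.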

Scope

This PR implements training support. Serving/export integration for the fused Domino head remains follow-up work.

@SeaTuKeMa SeaTuKeMa changed the title from "[codex] Add Domino draft model training" to "Add Domino draft model training" Jun 10, 2026
@jianuo-huang


Thanks for adding Domino support! This is exciting to see and very encouraging.
I’ll try it out when I get some time and share any feedback or issues I find.

Signed-off-by: TukeMa <fivedguy001@gmail.com>
@SeaTuKeMa SeaTuKeMa marked this pull request as ready for review June 23, 2026 08:41
@SeaTuKeMa

Author

@jianuo-huang @yubofredwang

I finished running experiments on an 8x H100 node and verified that the implementation works. The results are attached in the PR description; this is ready for review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d021fcc277

Comment on lines +157 to +159
objective_weights = weight_mask
if (
self.loss_objective == "decay"

P2: Honor dpace objective for Domino loss

When a Domino run is configured with training.dflash_loss_objective=dpace, DFlashModel.__init__ accepts that value but this forward path only applies the decay branch; objective_weights stays equal to the binary mask, so the requested D-PACE weighting is silently disabled for both base and final losses. This affects any Domino experiment that reuses the existing DFlash dpace objective override and makes its results incomparable to DFlash D-PACE runs; either implement the dpace branch here or reject the option for Domino.
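
The second option in the review comment (rejecting the unsupported value) could look roughly like the sketch below. It is an illustration under stated assumptions, not the fix adopted in this PR: the names `UNIMPLEMENTED_DOMINO_OBJECTIVES` and `check_domino_loss_objective` are hypothetical, and the only fact taken from the review is that `dpace` is accepted by the configuration but has no branch in the Domino forward path.

```python
# Assumption from the review comment: "dpace" has no Domino implementation yet.
UNIMPLEMENTED_DOMINO_OBJECTIVES = frozenset({"dpace"})


def check_domino_loss_objective(loss_objective: str) -> str:
    """Fail fast instead of silently falling back to unweighted masking."""
    if loss_objective in UNIMPLEMENTED_DOMINO_OBJECTIVES:
        raise ValueError(
            f"loss objective {loss_objective!r} is not implemented for Domino; "
            "choose a supported objective or add the missing branch"
        )
    return loss_objective
```

Calling a check like this during model construction would turn the silent behavior change into a configuration error, so a Domino run could not be mistaken for a D-PACE-weighted one.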


@yubofredwang

Collaborator

@SeaTuKeMa thanks for the great work! I will run e2e over the weekend
