👉 TL;DR: Open-dLLM is the most open release of a diffusion-based large language model to date —
including pretraining, evaluation, inference, and checkpoints.
Open-dLLM supports representation alignment for adapting autoregressive LMs into diffusion LMs with 4x speedup. This feature is based on our recent paper, Don’t Retrain—Align: Adapting Autoregressive LMs to Diffusion LMs via Representation Alignment. Check out Representation Alignment Tutorial.
💻 Code | 📖 Blog | 🤗 Model
QuickSort generation using Open-dCoder (0.5B)
- 🏋️ Pretraining pipeline + open datasets
- ⚡ Inference scripts — easy sampling & generation
- 📊 Evaluation suite — HumanEval, MBPP, Infilling (lm-eval-harness + custom metrics)
- 📦 Weights + checkpoints on Hugging Face
- 🤝 Transparent configs for full reproducibility
Most diffusion LLM repos (e.g., LLaDA, Dream) only release inference scripts + weights, which limits reproducibility.
Open-dLLM is the first to open-source the entire stack for diffusion LLMs.
👉 With Open-dLLM, you can go from raw data → training → checkpoints → evaluation → inference, all in one repo.
All configs were trained on 50 FineWeb examples for 10 epochs. See docs/reproduce.md for the full reproduction guide.
| Config | Script | Trainable Params | Inference tok/s (8 steps, 128 tok) |
|---|---|---|---|
| Random Masking (baseline) | train_torch.py | 1.7B | 1131 |
| + Repr-Align (4 layers) | train_torch.py | 1.7B | 1147 |
| + Repr-Align (all 28 layers) | train_torch.py | 1.7B | 1178 |
| + Repr-Align + d3LLM Trajectory | train_torch.py | 1.7B | 1183 |
| LDLM (Perceiver+DiT) | train_ldlm.py | ~200M | 951 |
| VFM (noise adapter) | train_vfm.py | ~100M | 923 |
| Cola DLM (VAE+DiT head) | train_torch.py | 1.7B + ~50M | TBD |
Key takeaway: All Repr-Align paths have near-identical inference speed (same architecture); the 1147-1183 tok/s spread in the table is about 3%. The benefit comes from needing fewer denoising steps after training, not from faster per-step execution.
| Steps | tok/s (128 new tokens) | Total time |
|---|---|---|
| 8 | 115 | 1.1s |
| 16 | 57 | 2.2s |
| 32 | 29 | 4.4s |
| 64 | 14 | 8.9s |
| 128 | 7 | 17.9s |
Per-step cost: ~138ms (model-bound, 27B NF4 QLoRA on RTX 5090).
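The sweep above is close to what a fixed per-step cost predicts. The sketch below is a back-of-the-envelope model, not the benchmark script: it assumes total time is exactly steps × 138 ms with no other overhead, and `estimate_throughput` is a hypothetical helper name.

```python
PER_STEP_S = 0.138  # ~138 ms per denoising step, taken from the text above


def estimate_throughput(steps: int, new_tokens: int = 128,
                        per_step_s: float = PER_STEP_S):
    """Return (total_seconds, tokens_per_second) under a fixed cost per step."""
    total_s = steps * per_step_s
    return total_s, new_tokens / total_s


# Print the estimate for each step count used in the table.
for steps in (8, 16, 32, 64, 128):
    total_s, tok_s = estimate_throughput(steps)
    print(f"{steps:>3} steps: {total_s:5.1f} s total, {tok_s:6.1f} tok/s")
```

Because the per-step cost is model-bound, halving the number of steps roughly doubles tok/s, which is why training methods that tolerate fewer steps matter more than small per-step differences. The measured rows differ slightly from this estimate.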
All metrics logged to wandb.ai/snoozie/open-dllm-27b and wandb.ai/snoozie/open-dllm-compare.
End-to-end recipe to train Qwen3.6-27B with entropy-based trajectory-guided masking.
Download the Qwen3.6-Plus reasoning dataset (500 examples, ~7 MB):
from datasets import load_dataset
import json

ds = load_dataset("khazarai/qwen3.6-plus-high-reasoning-500x", split="train")
with open("train.jsonl", "w") as f:
    for i in range(len(ds)):
        msg = ds[i]["messages"]
        f.write(json.dumps({
            "idx": i,
            "prompt": msg[0]["content"],
            "response": msg[1]["content"]
        }) + "\n")

Each example has a user prompt (63-105 tokens) plus a rich assistant reasoning response (2,300-4,100 tokens), which makes it well suited to trajectory distillation.
Generate the unmasking order by running the 27B model in entropy-threshold decode mode over each response:
CUDA_VISIBLE_DEVICES=0 .venv/bin/python scripts/gen_trajectories_reasoning.py \
--data_path /path/to/train.jsonl \
--output_dir /path/to/trajectories/ \
--model_path /path/to/Qwen3.6-27B \
--num_steps 32 \
--max_seq_len 2048

Output: trajectories.jsonl, with one entry per example containing idx, trajectory (list of token-ID sequences at each decode step), and nfe (number of steps).
Each trajectory step records which positions are still mask tokens. The collator uses these to determine training-time masking positions. Step 0 has all response tokens masked; final step has all unmasked.
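To make the trajectory format concrete, here is a minimal sketch of reading mask positions out of one trajectory. It is not the repo's DataCollatorWithTrajectoryMasking: `MASK_ID`, `masked_positions`, and `first_unmask_step` are hypothetical names, and the toy trajectory covers response tokens only.

```python
MASK_ID = -1  # hypothetical mask-token ID; the real value comes from the tokenizer


def masked_positions(step_tokens, mask_id=MASK_ID):
    """Positions that are still masked at one decode step."""
    return [i for i, tok in enumerate(step_tokens) if tok == mask_id]


def first_unmask_step(trajectory, mask_id=MASK_ID):
    """For each position, the first step index at which it is no longer masked."""
    order = [None] * len(trajectory[0])
    for step_idx, tokens in enumerate(trajectory):
        for pos, tok in enumerate(tokens):
            if order[pos] is None and tok != mask_id:
                order[pos] = step_idx
    return order


# Toy trajectory: step 0 is fully masked, the final step is fully unmasked.
trajectory = [
    [MASK_ID, MASK_ID, MASK_ID, MASK_ID],
    [MASK_ID, 11, MASK_ID, MASK_ID],
    [10, 11, MASK_ID, 13],
    [10, 11, 12, 13],
]

print(masked_positions(trajectory[1]))  # positions still masked after step 1
print(first_unmask_step(trajectory))    # decode step at which each position appears
```

Picking a trajectory step then amounts to picking a mask pattern that the model would actually encounter at inference time.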
Use the prepared config configs/pretrain/d3llm_27b_reasoning.yaml:
CUDA_VISIBLE_DEVICES=0 \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python tasks/train_torch.py configs/pretrain/d3llm_27b_reasoning.yaml

The config wraps Qwen3.6-27B in NF4 QLoRA (r=4) with use_hf_native: true for Gated DeltaNet compatibility. Trajectory-guided masking is active via DataCollatorWithTrajectoryMasking. Every 200 steps, three wandb panels are logged to wandb.ai/snoozie/open-dllm-27b:
| Panel | Key | What it shows |
|---|---|---|
| Trajectory ordering | d3llm/trajectory | Mask heatmap (rows=decode steps, cols=positions) + entropy bars |
| Prediction quality | d3llm/prediction | RGB strip (green=correct, red=wrong, grey=unmasked) + confidence |
| Training history | d3llm/history | Mask-ratio × CE-loss scatter + loss curves |
# d3llm_27b_reasoning.yaml highlights
model:
  model_path: /path/to/Qwen3.6-27B
  attn_implementation: sdpa
  enable_qlorafy: true
  qlorafy_config:
    use_hf_native: true # Required for Gated DeltaNet
    r: 4 # r=4 fits better on 32 GB
train:
  max_seq_len: 512 # Increase to 1024 if VRAM allows
  enable_masking: true
  repr_align_wt: 0.0 # Pure d3LLM, no repr-align
  trajectory_data_path: /path/to/trajectories.jsonl
  trajectory_min_mask_ratio: 0.1
  trajectory_max_mask_ratio: 0.5
  trajectory_entropy_weight: 0.1

Open-dLLM/
├── tasks/ # Training entry points
│ ├── train_torch.py # Standard / Repr-Align / Cola DLM training
│ ├── train_ldlm.py # LDLM (Perceiver encoder/decoder + DiT head)
│ ├── benchmark_ldlm.py # 27B LDLM inference benchmark
│ ├── benchmark_ldlm_35b.py # 35B-A3B LDLM inference benchmark
│ ├── infer.py # Generation entry point
│ └── sample.py # Interactive sampling
│
├── configs/pretrain/ # Training configs (YAML)
│ ├── compare_50x_no_align.yaml # Baseline: random masking
│ ├── compare_50x_with_align.yaml# Repr-Align (4 layers)
│ ├── compare_50x_with_align_all_layers.yaml # Repr-Align (all layers)
│ ├── compare_50x_with_trajectory.yaml # Repr-Align + d3LLM trajectories
│ ├── compare_50x_ldlm.yaml # LDLM comparison
│ ├── compare_50x_vfm.yaml # VFM comparison
│ ├── compare_50x_cola.yaml # Cola DLM comparison
│ ├── qwen3_6_27b_repr_align_100k.yaml # 27B Repr-Align (100K, single 5090)
│ ├── qwen3_6_27b_qlora_repr_align.yaml # 27B QLoRA Repr-Align
│ ├── d3llm_27b_100_traj.yaml # 27B d3LLM + trajectories (100 ex)
│ └── d3llm_27b_4k.yaml # 27B d3LLM, seq_len=4096
│
├── veomni/
│ ├── models/
│ │ ├── transformers/ # Model implementations
│ │ │ ├── qwen2/ # Qwen2 / Open-dCoder
│ │ │ ├── qwen3/ # Qwen3
│ │ │ ├── qwen3_5/ # Qwen3.5/3.6 (Gated DeltaNet)
│ │ │ └── qwen3_5_moe/ # Qwen3.5/3.6 MoE (256 experts)
│ │ ├── ldlm/ # LDLM autoencoder + diffusion head
│ │ ├── hf_mdm_qlora.py # HF-native QLoRA + MDM wrapper
│ │ ├── cached_teacher.py # CachedTeacher for Repr-Align
│ │ └── auto.py # Model dispatcher
│ ├── distributed/ # Parallel strategies
│ │ ├── deepspeed_init.py # DeepSpeed ZeRO-3 + NVMe offload
│ │ ├── moe/ # Expert parallelism
│ │ └── sequence_parallel/ # Ulysses sequence parallelism
│ └── ops/
│ ├── trajectory_extractor.py # d3LLM trajectory precomputation
│ └── loss.py # Fused cross-entropy
│
├── scripts/
│ ├── benchmark_inference.py # 27B inference throughput sweep
│ ├── benchmark_inference_post.py# Post-training benchmark (wandb)
│ ├── compare_step_quality.py # Step count vs output quality
│ ├── precompute_anchor.py # Repr-Align teacher cache
│ ├── precompute_trajectories.py # d3LLM trajectories (entropy + LR modes)
│ └── run_comparison.sh # Orchestrate full 7-config comparison
│
├── docs/
│ ├── reproduce.md # Full reproduce guide (this commit)
│ ├── representation_alignment.md # Repr-Align tutorial
│ ├── cloud_training.md # Vast.ai setup guide
│ ├── ldlm.md # LDLM architecture, training recipe, benchmarks
│ ├── multi_block_decoder.md # Multi-block decoder API + status
│ └── hardware.md # System requirements, hardware investigation
│
└── eval/
    ├── eval_completion/ # HumanEval, MBPP
    └── eval_infill/ # Code infilling
| Project | Data | Training Code | Inference | Evaluation | Weights |
|---|---|---|---|---|---|
| Open-dLLM / Open-dCoder (ours) | ✅ | ✅ | ✅ | ✅ | ✅ |
| LLaDA | ❌ | ❌ | ✅ | ✅ | |
| Dream | ❌ | ❌ | ✅ | ✅ | |
| Gemini-Diffusion | ❌ | ❌ | ❌ | ❌ | ❌ (API only) |
| Seed Diffusion | ❌ | ❌ | ❌ | ❌ | ❌ (API only) |
| Mercury | ❌ | ❌ | ❌ | ❌ | ❌ (API only) |
✅ = fully available · ❌ = not provided ·
We use micromamba for environment management (feel free to adapt to conda):
micromamba install -c nvidia/label/cuda-12.3.0 cuda-toolkit -y
pip install ninja
# install the newest torch with cu121
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu121
pip install "flash-attn==2.7.4.post1" \
--extra-index-url https://github.com/Dao-AILab/flash-attention/releases/download
# version specifiers are quoted so the shell does not treat ">" as a redirect
pip install --upgrade --no-cache-dir \
  tensordict torchdata "triton>=3.1.0" \
  transformers==4.54.1 accelerate datasets peft hf-transfer \
  codetiming hydra-core pandas "pyarrow>=15.0.0" pylatexenc \
  wandb ninja liger-kernel==0.5.8
# optional
pip install pytest yapf py-spy pyext pre-commit ruff packaging
pip install -e .
pip install lm-evaluation-harness/ human-eval-infilling/

from transformers import AutoTokenizer
from veomni.models.transformers.qwen2.modeling_qwen2 import Qwen2ForCausalLM
from veomni.models.transformers.qwen2.generation_utils import MDMGenerationConfig
import torch
model_id = "fredzzp/open-dcoder-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load tokenizer + model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = Qwen2ForCausalLM.from_pretrained(
model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
).to(device).eval()
# Prompt
prompt = "Write a quick sort algorithm in python."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
# Generation config
gen_cfg = MDMGenerationConfig(max_new_tokens=128, steps=200, temperature=0.7)
with torch.no_grad():
    outputs = model.diffusion_generate(inputs=input_ids, generation_config=gen_cfg)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

👉 For full logging, history tracking, and file output:
python sample.py

For 27B+ models that don't fit in GPU memory at full precision, use QLoRA Repr-Align: NF4 quantized base (frozen) + LoRA adapters (trainable). Fits in ~25 GB VRAM with r=32.
No separate quantization step is needed. bitsandbytes quantizes weights on-the-fly during from_pretrained() via BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"). The original bf16 weights on disk are never modified — quantization happens in GPU memory at load time. The 55 GB bf16 checkpoint becomes ~7 GB in VRAM.
- Download model weights (~55 GB):
huggingface-cli download Qwen/Qwen3.6-27B --local-dir /path/to/Qwen3.6-27B
- Prepare training data (plaintext JSONL with a text field):
python -c "
from datasets import load_dataset
import json
ds = load_dataset('HuggingFaceFW/fineweb', name='sample-10BT', split='train', streaming=True)
with open('data.jsonl', 'w') as f:
    for i, ex in enumerate(ds):
        if i >= 100000: break
        f.write(json.dumps({'text': ex['text']}) + '\n')
"
- Precompute teacher anchor cache (one-time). This runs the frozen teacher model on your training data and caches hidden states for selected layers. The cached anchors are reused every training step, so no live teacher is needed during training.
For a smoke test (1000 examples, 4 layers, ~2 min):
CUDA_VISIBLE_DEVICES=0 python scripts/precompute_anchor.py \
--model_path /path/to/Qwen3.6-27B \
--data_path /path/to/data.jsonl \
--output_dir /path/to/anchors/qwen3.6-27b \
--layers 16,32,48,64 \
--max_seq_len 1024 \
--max_examples 1000

For production (100K examples, all 64 layers; recommended for best alignment quality). This requires a GPU with ≥32 GB VRAM or a cloud instance:
CUDA_VISIBLE_DEVICES=0 python scripts/precompute_anchor.py \
--model_path /path/to/Qwen3.6-27B \
--data_path /path/to/data.jsonl \
--output_dir /path/to/anchors/qwen3.6-27b-all64 \
--layers 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64 \
--max_seq_len 1024

Note: The --layers argument uses 1-indexed layer numbers. --max_examples limits the number of training examples cached; omit it to cache the full dataset. The cache is stored as one .safetensors file per sequence chunk, keyed by the SHA-256 of input_ids. Re-running with the same arguments skips already-cached chunks.
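Two small helpers illustrate the note above. The first builds the 1-indexed --layers string instead of typing it by hand. The second shows one way a content-addressed cache key could be derived. Both names are hypothetical, and the byte layout fed to SHA-256 (little-endian int64) is an assumption; scripts/precompute_anchor.py may serialize input_ids differently.

```python
import hashlib
import struct


def layers_arg(num_layers: int) -> str:
    """Comma-separated, 1-indexed layer list, e.g. '1,2,3,4' for num_layers=4."""
    return ",".join(str(i) for i in range(1, num_layers + 1))


def cache_key(input_ids) -> str:
    """SHA-256 hex digest of the token IDs (assumed little-endian int64 packing)."""
    payload = struct.pack(f"<{len(input_ids)}q", *input_ids)
    return hashlib.sha256(payload).hexdigest()


print(layers_arg(64))               # value to pass to --layers for a 64-layer model
print(cache_key([101, 2023, 102]))  # same IDs always map to the same key
```

Keying on content rather than on example index is what lets a re-run skip chunks that are already on disk, even if the dataset order changes.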
- Run training (single 32 GB GPU):
CUDA_VISIBLE_DEVICES=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
nohup .venv/bin/torchrun --nproc_per_node=1 \
tasks/train_torch.py configs/pretrain/qlorafy_27b_train.yaml \
> /tmp/qlorafy_train.log 2>&1 &
echo $! > /tmp/qlorafy_train.pid
# Monitor:
tail -f /tmp/qlorafy_train.log

Before launching, edit configs/pretrain/qlorafy_27b_train.yaml to point to your local paths:
model:
  model_path: /path/to/Qwen3.6-27B # step 1
data:
  train_path: /path/to/data.jsonl # step 2
  eval_size: 50 # hold out 50 examples for perplexity eval
train:
  anchor_cache_dir: /path/to/anchors/qwen3.6-27b # step 3
  eval_every: 100 # run eval every 100 steps
  wandb_project: your-wandb-project
  wandb_name: qlorafy-27b-run1

| Setting | Value | Why |
|---|---|---|
| enable_qlorafy: true | NF4 base + LoRA r=32 | 27B → ~7 GB in VRAM, r=32 fits 32 GB GPU |
| language_model_only | Auto-set by qlorafy.py | Loads text-only Qwen3_5ForCausalLM, skips 4.7 GB vision encoder |
| repr_align_wt: 1.0 | Alignment loss weight | Bidirectional adaptation |
| align_layers: "16,32,48,64" | 4 of 64 layers | Matches anchor cache; use all 64 for production |
| repr_align_sub_sample_ratio: 0.25 | 25% of tokens | 4× gradient memory reduction |
| save_epochs: 0 | Skip DCP checkpoint | DCP can't serialize Params4bit |
| eval_size: 50 | Hold out 50 examples | Perplexity eval every eval_every steps |
| PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | Required for r=32 | Reduces memory fragmentation on 32 GB GPU |
| Rank | LoRA Params | Trainable % | Fits RTX 5090 (32 GB)? |
|---|---|---|---|
| 16 | 73M | 0.27% | Yes (24 GB) |
| 32 | 147M | 0.54% | Yes, needs expandable_segments:True (28 GB) |
| 64 | 294M | 1.09% | No — OOMs during first forward pass |
| 128 | 587M | 2.17% | No |
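The rank table scales almost linearly because each LoRA adapter adds r × (d_in + d_out) parameters per adapted weight matrix. The sketch below extrapolates from the r=16 row; the 27e9 base parameter count is an assumption (the nominal "27B"), and the helper names are hypothetical.

```python
BASE_PARAMS = 27e9    # assumption: nominal parameter count of the 27B base model
PARAMS_AT_R16 = 73e6  # LoRA parameters at rank 16, from the table above


def lora_params(rank: int) -> float:
    """Estimated LoRA parameter count, scaling linearly with rank."""
    return PARAMS_AT_R16 * rank / 16


def trainable_pct(rank: int) -> float:
    """Estimated trainable share of all parameters, in percent."""
    return 100 * lora_params(rank) / BASE_PARAMS


for rank in (16, 32, 64, 128):
    print(f"r={rank:>3}: ~{lora_params(rank) / 1e6:.0f}M params, "
          f"{trainable_pct(rank):.2f}% trainable")
```

The estimates land within about 1% of the table values. Memory does not scale the same way: activations and optimizer state for the adapters grow too, which is consistent with r=64 running out of memory even though it is only ~1% of the parameters.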
| Metric | Value |
|---|---|
| Peak VRAM | ~28 GB allocated |
| Speed | ~19 s/step (micro_batch=1, 16 grad_accum) |
| Throughput | ~440 tok/s, MFU ~19% |
| Loss (20 steps) | 5.2 → 5.1 (stabilizing) |
| Grad norm | 59 → 2.4 (rapidly converging) |
| Eval | Perplexity logged to wandb every 100 steps |
Checkpoint limitation: DCP (torch.distributed.checkpoint) cannot serialize Params4bit objects from bitsandbytes. Set save_epochs: 0 during training. To save LoRA weights, use save_hf_weights: true (exports PEFT adapter weights only, not the NF4 base).
Wandb metrics logged: training/loss, training/grad_norm, training/lr, qlora/grad_norm, qlora/param_norm, qlora/grad_to_param_ratio, eval/loss, eval/perplexity, flops_achieved(T), flops_promised(T), mfu, tokens_per_second, system/vram_allocated_gb, system/vram_reserved_gb. A generation probe runs every 100 steps via generation/sample.
Open-dLLM supports three approaches for converting an autoregressive LM into a diffusion LM.
Uses pre-computed entropy trajectories from the model itself to guide training-time masking. Trains faster and with better convergence than random masking. See d3LLM Training section for the full recipe using Qwen3.6-27B with QLoRA.
Paper: Don't Retrain—Align: Adapting AR LMs to Diffusion LMs via Representation Alignment
The key insight: AR models like Qwen already learn strong language representations. You don't need to retrain from scratch — just preserve those representations while switching from causal (left-to-right) to bidirectional (any-order) generation.
How it works:
- Load a pretrained AR model (e.g., Qwen3.6-35B-A3B)
- Flip the attention mask from causal → bidirectional (this is the "student")
- Keep a frozen copy as the "teacher" (causal attention, clean input)
- Train with two losses:
- Masked denoising loss: Randomly mask tokens → student predicts them using bidirectional context
- Representation alignment loss: Cosine similarity between student and teacher hidden states at every layer
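The two losses can be sketched in plain Python on nested lists. This is an illustration under stated assumptions, not the repo's implementation: the alignment term is written as 1 − cosine similarity averaged over layers and positions, the denoising term as mean negative log-likelihood over masked positions, and all function names are hypothetical. Only repr_align_wt is taken from the configs in this README.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two hidden-state vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def alignment_loss(student_layers, teacher_layers):
    """Mean (1 - cosine similarity) over every layer and token position."""
    terms = [
        1.0 - cosine_similarity(s_vec, t_vec)
        for s_layer, t_layer in zip(student_layers, teacher_layers)
        for s_vec, t_vec in zip(s_layer, t_layer)
    ]
    return sum(terms) / len(terms)


def masked_denoising_loss(true_token_log_probs, masked_positions):
    """Mean negative log-likelihood of the true token, masked positions only."""
    total = sum(true_token_log_probs[i] for i in masked_positions)
    return -total / len(masked_positions)


def total_loss(true_token_log_probs, masked_positions,
               student_layers, teacher_layers, repr_align_wt=1.0):
    """Denoising loss plus weighted alignment loss."""
    mdm = masked_denoising_loss(true_token_log_probs, masked_positions)
    align = alignment_loss(student_layers, teacher_layers)
    return mdm + repr_align_wt * align


# One layer, two positions, hidden size 2: the student matches the teacher at
# position 0 and is orthogonal to it at position 1.
teacher = [[[1.0, 0.0], [0.0, 1.0]]]
student = [[[1.0, 0.0], [1.0, 0.0]]]
print(alignment_loss(student, teacher))
```

Setting repr_align_wt to 0.0 recovers pure masked-diffusion training, which is how the d3LLM config earlier in this README disables alignment.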
Why it's faster:
- No new architecture to train — uses the existing model weights directly
- 3-4× faster convergence vs. training from scratch (per the paper)
- Works on small datasets (as few as 0.8B tokens)
- Optional freeze_layers: "mlp" gives ~2× throughput with minimal quality loss
Quick start (2 GPUs):
export TOKENIZERS_PARALLELISM=false
torchrun --nproc_per_node=2 tasks/train_torch.py \
configs/pretrain/qwen2_5_coder_500M.yaml \
--data.train_path=/run/media/johndpope/12TB/open_dllm/ldlm_data/data.jsonl \
--model.model_path=Qwen/Qwen3.6-35B-A3B \
--train.enable_masking=true \
--train.repr_align_wt=1.0 \
--train.micro_batch_size=1 \
--train.global_batch_size=16 \
--train.output_dir=/run/media/johndpope/12TB/open_dllm/checkpoints/35b_a3b_repr_align \
--train.save_steps=500

The Repr-Align alignment loss scales with the number of layers and the sequence length: at 27B with 64 layers and long sequences, computing cosine similarity for every layer at every step becomes non-trivial. Two independent knobs reduce this cost.
What the knobs do:
| Knob | YAML field | Effect |
|---|---|---|
| Token subsampling | repr_align_sub_sample_ratio: 0.25 | Random 25% of positions each step → 4× fewer alignment gradient tokens |
| Layer subsampling | repr_align_num_sample_layers: 4 | Random 4 of N configured layers each step → N/4× fewer alignment losses |
Both are unbiased gradient estimates — every position/layer is covered over time. The hook-based implementation (not output_hidden_states=True) means gradient checkpointing is preserved for non-alignment layers.
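A minimal sketch of what the two knobs select on each step, assuming uniform sampling without replacement. `subsample_alignment` is a hypothetical helper; the repo's hook-based implementation is not shown in this README and may differ in detail.

```python
import random


def subsample_alignment(align_layers, seq_len, num_sample_layers=4,
                        sub_sample_ratio=0.25, rng=random):
    """Pick the layers and token positions that receive alignment loss this step."""
    k_layers = min(num_sample_layers, len(align_layers))
    layers = sorted(rng.sample(align_layers, k_layers))
    k_tokens = max(1, int(seq_len * sub_sample_ratio))
    positions = sorted(rng.sample(range(seq_len), k_tokens))
    return layers, positions


rng = random.Random(0)                 # seeded only to make the demo repeatable
all_layers = list(range(1, 29))        # 1-indexed layers of a 28-layer model
layers, positions = subsample_alignment(all_layers, seq_len=2048, rng=rng)
print(layers, len(positions))
```

Because a fresh subset is drawn every step, the mean loss over the sampled layers and positions is an unbiased estimate of the mean over all of them, and every layer and position is eventually covered.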
Validated setup — all layers in pool, subsampled:
# Step 1: precompute anchor cache for all 28 layers, 20-example smoke set
CUDA_VISIBLE_DEVICES=0 .venv/bin/python scripts/precompute_anchor.py \
--model_path Qwen/Qwen3-1.7B \
--data_path /tmp/smoke_20.jsonl \
--output_dir /home/johndpope/ds_offload/anchors/qwen3-1.7b-all28-smoke20 \
--layers 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28 \
--max_seq_len 2048
# → 20 chunks, 1 GB, ~1s
# Step 2: run training smoke test (5 steps)
CUDA_VISIBLE_DEVICES=0 .venv/bin/torchrun --nproc_per_node=1 \
tasks/train_torch.py \
configs/pretrain/qwen3_1_7b_alllayers_subsample_smoke.yaml

Config (configs/pretrain/qwen3_1_7b_alllayers_subsample_smoke.yaml):
train:
  align_layers: "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28"
  repr_align_num_sample_layers: 4 # 4 of 28 sampled each step
  repr_align_sub_sample_ratio: 0.25 # 25% of tokens
  data_parallel_mode: deepspeed
  ds_zero_stage: 2
  ds_offload_optimizer: cpu
  optimizer: adamw
  enable_gradient_checkpointing: true

Measured results (Qwen3-1.7B, RTX 5090, 5 steps):
| Step | loss | repr_align | grad_norm |
|---|---|---|---|
| 1 | 13.12 | 0.56 | 0.00 |
| 2 | 13.19 | 0.62 | 127.02 |
| 3 | 11.88 | 0.72 | 177.57 |
| 4 | 11.69 | 0.63 | 177.57 |
| 5 | 8.66 | 0.52 | 177.57 |
| Config | Peak VRAM |
|---|---|
| All 28 layers, sample 4, ratio 0.25 | 9.34 GB |
| Alignment OFF (baseline) | 9.34 GB |
Finding: at 1.7B scale, the subsampling is effectively free. The alignment tensors (4 layers × ~500 tokens × 2048 hidden × bf16 ≈ 8 MB) are negligible against the 9+ GB model + optimizer footprint. No measurable VRAM difference.
Where the savings are expected to matter — 27B (unverified):
At 27B, each layer hidden state is 5120-wide. Full alignment on all 64 layers at seq=2048 would be:
- 64 layers × 5120 × 2048 × 2 bytes = 1.3 GB of alignment activations per step
- With gradient accumulation, these accumulate
With 4-of-64 layer sampling + 0.25 token ratio:
- 4 × 5120 × 512 × 2 bytes = 21 MB → ~60× reduction
This 60× figure is calculated, not measured. Whether it translates to a real training OOM difference on the cloud 27B setup (2× RTX PRO 6000, ZeRO-3) has not been validated. The 1.7B results confirm correctness (no NaN, gradient coverage) but not VRAM impact. Verification requires running cloud_27b.yaml with and without subsampling and comparing step logs.
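The arithmetic behind the calculated figure can be reproduced in a few lines. This only counts one bf16 hidden-state tensor per aligned layer, exactly as the estimate above does; it says nothing about real allocator behavior. The helper name is hypothetical.

```python
BYTES_BF16 = 2  # bytes per bf16 element


def alignment_activation_bytes(num_layers, hidden_size, num_tokens,
                               bytes_per_elem=BYTES_BF16):
    """Bytes for one hidden-state tensor per aligned layer."""
    return num_layers * hidden_size * num_tokens * bytes_per_elem


# Full alignment: all 64 layers, all 2048 positions, hidden size 5120.
full = alignment_activation_bytes(64, 5120, 2048)
# Subsampled: 4 of 64 layers, 25% of 2048 positions.
sub = alignment_activation_bytes(4, 5120, int(2048 * 0.25))

print(f"full: {full / 1e9:.2f} GB")
print(f"subsampled: {sub / 1e6:.1f} MB")
print(f"reduction: {full / sub:.0f}x")
```

The exact ratio is 16 (layers) × 4 (tokens) = 64, which the text rounds to ~60×.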
Bugs fixed in this work:
- all_reduce on a single-element tuple returned a scalar, crashing single-component loss configs (e.g. pure MDM with no alignment) at step 2. Fixed in tasks/train_torch.py.
Paper: Latent Diffusion Language Models
Trains new components from scratch (Perceiver encoder/decoder + diffusion head) on top of a frozen AR encoder. More expressive but significantly more expensive — requires training 1.39B-6.75B new parameters.
See the full LDLM section below for details.
| | Repr-Align | LDLM |
|---|---|---|
| New parameters | 0 (reuses AR model) | 1.39B–6.75B |
| Training speed | 3-4× faster | Baseline |
| Data needed | As low as 0.8B tokens | More data beneficial |
| Architecture change | Attention mask only | New Perceiver + DiT head |
| When to use | Default choice for converting existing models | When you need latent-space diffusion |
Bottom line: If you have an off-the-shelf AR model and want diffusion capabilities with minimal compute, use Repr-Align. It's already built into the Qwen3.6 model implementations (modeling_qwen3_5_moe.py, modeling_qwen3.py, modeling_qwen2.py).
Open-dLLM implements d3LLM (ICML 2026) trajectory-guided masking for MDM training. Instead of uniformly random masks, the pre-computed unmasking order from the teacher model determines which tokens are masked at each training step — aligning training-time masking with inference-time decoding behavior.
Key insight: The trajectory captures which response tokens the model is confident about first (lowest entropy). Those tokens are decoded early during inference and should be predicted first during training. See d3LLM Training section for the full end-to-end recipe.
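The ordering idea can be illustrated with a one-shot simplification: rank positions by predictive entropy, then mask the least confident tail for a given mask ratio. This is a sketch under assumptions, not the repo's extractor. Real trajectories are produced iteratively (entropies change as context fills in), and the mapping from mask ratio to masked positions shown here is one plausible reading of the description above. All names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one position's predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def unmask_order(position_probs):
    """Positions sorted from most confident (lowest entropy) to least."""
    entropies = [entropy(p) for p in position_probs]
    return sorted(range(len(entropies)), key=lambda i: entropies[i])


def mask_for_ratio(order, mask_ratio):
    """Mask the positions that would be decoded last; keep early ones visible."""
    num_masked = round(len(order) * mask_ratio)
    if num_masked == 0:
        return []
    return sorted(order[len(order) - num_masked:])


# Toy distributions over a 2-token vocabulary for three response positions.
position_probs = [[0.5, 0.5], [0.99, 0.01], [0.7, 0.3]]
order = unmask_order(position_probs)
print(order)                         # most confident position first
print(mask_for_ratio(order, 1 / 3))  # the least confident position stays masked
```

Under this reading, a low mask ratio trains the model on the hard, late-decoded tokens with most of the context visible, which mirrors the final inference steps.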
Key differences from the replay buffer:
- Replay buffer stores past batches to prevent forgetting (uniform sampling)
- Trajectory distillation uses the teacher's inference-time unmasking order to guide masking (curriculum learning)
- They are complementary — both can be enabled simultaneously
Pipelined parallel decoding (ICML 2026) — inference-side counterpart to trajectory-guided masking. Up to ~5× speedup over AR decoding via block-causal attention, entropy-thresholded token selection, and pipelined block progression.
See docs/multi_block_decoder.md for full API, usage, and current status (KV-cache 🔴 blocked, trajectory-aware 📝 future).
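For readers unfamiliar with block-causal attention, the sketch below builds the usual form of the mask: tokens attend bidirectionally within their own block and causally to all earlier blocks. This is the textbook definition, offered as an assumption about what the decoder uses; the function name is hypothetical and the actual masking code may differ.

```python
def block_causal_mask(seq_len: int, block_size: int):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With block_size=1 this reduces to an ordinary causal mask, and with block_size=seq_len it becomes fully bidirectional, which is why the block size trades AR-style caching against diffusion-style parallel decoding.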
A Perceiver-based latent diffusion approach (arXiv:2605.07933) that jointly trains a latent encoder, diffusion model, and decoder on top of a frozen pre-trained LM.
See docs/ldlm.md for architecture comparison table (paper vs 35B-A3B vs 27B), training recipe (MSE loss, warmup, adaptive timestep sampling), inference benchmarks (up to 6,500 tok/s on 35B-A3B), and step-by-step training instructions.
┌──────────────────────────────────┐
│ AR Foundation Models │
│ (Qwen2 / Qwen3 / Qwen3.5 │
│ Gated DeltaNet / MoE) │
└───────────┬──────────────────────┘
│ frozen anchor
▼
┌──────────────────────────────────┐
│ CachedTeacher │
│ precompute_anchor.py │
│ 4-64 layers, up to 160K ctx │
│ (2.7 TB for 100K @ 4 layers) │
└────┬──────────┬──────────┬───────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Repr- │ │ LDLM │ │ VFM │
│ Align │ │ train_ │ │ train_ │
│ train_ │ │ ldlm.py │ │ vfm.py │
│ torch.py │ │ 1.39- │ │ ~100M │
│ 0 new │ │ 6.75B │ │ adapter │
│ params │ │ new │ │ │
│ (1147 │ │ params │ │ (923 │
│ tok/s) │ │ (951 │ │ tok/s) │
│ │ │ tok/s) │ │ │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
├────────────┼────────────┤
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────┐
│ d3LLM Trajectory Guidance │
│ trajectory_extractor.py │
│ (entropy + LR modes, 16-256 steps) │
│ 1.7B: +4.6% tok/s (1183) │
└────────────────┬────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Inference: mdm_generate │
│ + multi_block_generation.py │
│ 1.7B: 1131-1183 tok/s (8 steps) │
│ 27B: 115 tok/s (8 steps, NF4) │
│ Per-step: ~138ms (27B NF4 5090) │
└─────────────────────────────────────┘
Parent Nodes: Repr-Align paper, d3LLM ICML 2026, LDLM paper, Cola DLM
New Node Type: Synthesis — Comparison Grid + Infrastructure Maturation
Claim: The systematic comparison grid (7 configs, 50 examples, wandb-logged) establishes which diffusion path wins for given hardware/quality budgets. Repr-Align dominates for speed+quality; d3LLM trajectories add marginal (~4.6%) inference speedup on 1.7B; LDLM/VFM trade throughput for architectural flexibility. The chunked CE fix unlocks 4096-seq-len training.
Validation Plan: Run the full comparison at 27B scale (blocked on compute — see L2).
| # | Direction | Target Metric | Status | Rationale |
|---|---|---|---|---|
| L1 | Reduce per-step cost (KV-cache, fused kernels) | ≥2× tok/s (27B: 115→230) | 🟢 active | ~138ms/step is model-bound; KV-cache or fused DeltaNet attention could halve it. Highest single-lever gain. |
| L2 | 27B comparison grid (reproduce 1.7B findings at scale) | ppl ≤ 2.0, ≥0.7× baseline throughput | 🟡 blocked (compute) | The 1.7B findings need verification at 27B. Requires 2× Blackwell or cloud rental. |
| L3 | d3LLM trajectory training at 27B (4K ctx, QLoRA) | ppl vs random-mask baseline | 🟢 active | Configs d3llm_27b_4k.yaml and d3llm_27b_100_traj.yaml exist. Trajectories precomputable via precompute_trajectories.py --mode entropy --quantize 4bit. |
| L4 | Chunked cross-entropy for long context (seq_len > 2K) | Stable training at 4K+ ctx | ✅ done | Landed in hf_mdm_qlora.py:_mdm_loss(). Enables 4096-seq-len training without OOM. |
| L5 | Cola DLM training + eval | ppl, tok/s vs Repr-Align baseline | 🟡 blocked (need results) | configs/pretrain/compare_50x_cola.yaml exists. Hierarchical VAE+DiT head on Repr-Align. No benchmark results yet. |
| L6 | Full 64-layer alignment on 27B (verify 60× memory ratio) | Expected: 21 MB vs 1.3 GB alignment activations | 🟡 blocked (compute) | Verified on 1.7B (zero VRAM difference). Ratio calculated, not measured. Requires 27B run. |
| L7 | VFM training convergence | ppl vs Repr-Align at equal step count | 🟡 blocked (need results) | compare_50x_vfm.yaml exists. Noise adapter approach. No convergence data yet. |
| L8 | Multi-block KV-cache | Unblock multi-block path (currently blocked) | 🔴 blocked | HF cache API incompatible with block-causal masks. Requires custom cache implementation. |
Overall Confidence: 0.75
Weakest Link: L2 (27B comparison grid). The at-scale directions (L2, L6) stay blocked until compute is available for validation. The 1.7B findings are credible but limited in scope.
To increase confidence: Run L3 (d3LLM 27B training) as the next active step — it uses existing configs and QLoRA fits on single 5090. Results would validate trajectory guidance at scale.
See docs/hardware.md for:
- Minimum / recommended / cloud hardware specs
- RAM budget breakdown for 27B ZeRO-3 (~170 GB peak during init)
- Verified working setups (1.7B Repr-Align on 5090, 27B anchor precompute across 2 GPUs)
- Known blockers (27B on 96GB RAM, 2-GPU ZeRO-3 RAM ceiling)
- Hobby RAM vs Cloud H100 cost comparison (break-even at ~65 hrs)
- DeepSpeed NVMe offload gotchas (buffer_size, async_io build, pin_memory patch)
This project builds on incredible prior work:
- Frameworks & Tooling: VeOmni, lm-eval-harness
- Open-source dLLMs: LLaDA, Dream
- Pioneering dLLMs: Gemini-Diffusion, Seed Diffusion, Mercury
- Foundational research: MD4, MDLM, DPLM
We stand on the shoulders of these projects, and hope Open-dLLM contributes back to the diffusion LLM community.
If you use Open-dLLM or Open-dCoder in your research, please cite us:
@misc{opendllm2025,
title = {Open-dLLM: Open Diffusion Large Language Models},
author = {Fred Zhangzhi Peng and Shuibai Zhang and Alex Tong and others},
year = {2025},
howpublished = {\url{https://github.com/pengzhangzhi/Open-dLLM}},
note = {Blog: \url{https://oval-shell-31c.notion.site/Open-Diffusion-Large-Language-Model-25e03bf6136480b7a4ebe3d53be9f68a?pvs=74},
Model: \url{https://huggingface.co/fredzzp/open-dcoder-0.5B}}
}