Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
78 changes: 78 additions & 0 deletions records/track_non_record_16mb/2026-04-06_Flow_Matching/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Flow Matching Language Model

This is a non-record submission that replaces the autoregressive `train_gpt.py` baseline with a flow matching language model implemented in `train_gpt.py` (this folder). The model is based on ["Flow Matching for Conditional Text Generation in a Few Sampling Steps"](https://aclanthology.org/2024.eacl-short.33.pdf) (EACL 2024).

The model keeps much of the original training stack (data loading, quantization, distributed training, Muon optimizer infrastructure), but replaces the causal next-token objective with a continuous-flow denoising objective over token embeddings, conditioned on a source context. It was trained on 8 x H100 for 10 minutes.

## What Changed

The baseline GPT is replaced by three new components:

- **`TransformerTimestepModel`** — a bidirectional (non-causal) transformer with sinusoidal timestep embeddings injected at each layer. Position embeddings are added alongside the timestep signal.
- **`TransformerEncoderLayer`** — standard bidirectional multi-head attention + FFN block with LayerNorm, GELU, and dropout=0.1.
- **`Flow`** — the top-level model wrapping the transformer. Implements the flow matching objective and variational BPB evaluation.

The optimizer is switched from Muon to AdamW (lr=1e-5), since the flow matching objective is not a classification cross-entropy and benefits from a simpler optimizer.

## Flow Matching Objective

Each training sequence of length `TRAIN_SEQ_LEN` is split in half:
- **Source**: first `TRAIN_SEQ_LEN // 2` tokens — used as clean context, never noised.
- **Target**: second `TRAIN_SEQ_LEN // 2` tokens — the tokens to predict/generate.

At each training step:
1. Sample `t ~ U[0, 1]` per batch element.
2. Interpolate the target token embeddings toward Gaussian noise: `x_t = (1 - t) * noise + t * target_embs` (linear interpolation flow path).
3. Concatenate clean source embeddings with the noisy target embeddings and pass to the transformer with timestep `t`.
4. The model predicts the velocity field `v_pred` over the full sequence; only the target portion is supervised.
5. Estimate the denoised embedding: `z1_hat = x_t + (1 - t) * v_pred_tgt`.
6. Combined loss: **MSE on the velocity field** (flow regression) + **cross-entropy on `z1_hat` projected through the embedding matrix** (token anchor loss).

## Config

- **Tokenizer/data**: reuses FineWeb SP-1024; no extra tokens needed (embedding space is continuous).
- **Layout**: `vocab_size=1024`, `model_dim=512`, `num_layers=6`, `num_heads=8`, `max_seq_len=1024`. Slightly shallower than the 9-layer baseline to fit within the 16 MB compressed limit.
- **Attention**: bidirectional (no causal mask), standard multi-head attention with dropout=0.1.
- **Timestep conditioning**: sinusoidal embedding of `t` projected through a two-layer MLP (`dims → 4*dims → dims`), broadcast-added to every sequence position before the transformer.
- **TRAIN_SEQ_LEN=1024**: sequence is split 512/512 between source and target.
- **VAR_EVAL_STEPS=32**: validation uses 32 evenly spaced timestep samples for the Monte Carlo BPB estimate.
- **Optimizer**: AdamW, lr=1e-5, β=(0.9, 0.95), weight_decay=0.01.
- **All else** (batch size, distributed setup, quantization, data loading) is identical to the baseline.

## Metrics

- **Training loss** is the sum of MSE velocity loss and CE anchor loss — not directly comparable to the baseline cross-entropy.
- **val_bpb** is a Monte Carlo estimate of average CE over noise levels, converted to bits/byte. It is **not apples-to-apples** with the autoregressive baseline BPB.

## val_bpb Computation

For each of `VAR_EVAL_STEPS=32` evenly spaced `t ∈ [0, 1]`:
1. Sample noise, compute the linearly interpolated `x_t`.
2. Run the transformer to get `v_pred`, estimate `z1_hat = x_t + (1-t)*v_pred`.
3. Compute token CE between `z1_hat` projected through the embedding matrix and the target.
4. Average CE across timesteps → nats/token → bits/byte via `tokens_per_byte`.

## Things That Didn't Work / Notes

- Flow matching on token embeddings is fundamentally harder than on continuous data (e.g., images) — the embedding space has no natural density, so the model must both move in the right direction and land near the right token.
- The MSE + CE combined loss is unstable early in training; the loss starts very high (~20) and drops slowly.
- The model only hits ~2500 steps in the 10-minute wallclock cap on 8×H100 NVL, vs ~20 000 steps for the AR baseline — significantly fewer updates per wall-clock second due to the bidirectional attention over double the sequence length.
- val_bpb is not directly comparable to AR baselines.

## Files

- `train_gpt.py` — single-file flow matching training/eval script
- `log_run1.txt`, `log_run2.txt`, `log_run3.txt` — training logs (3 seeds) on 8×H100 NVL
- `submission.json`
- `README.md`

## Metrics

- **val_bpb**: 3.6674 (mean of 3 runs, post-quant int8+zlib roundtrip)
- **step_stop**: 2534 (wallclock cap at 10 min)
- **wallclock_seconds**: 600.379
- **compressed artifact size**: 10,976,330 bytes (int8+zlib)
- **total submission size**: 44,224,625 bytes (raw model + code)
- **model parameters**: 22,063,616

Although the val_bpb is not directly comparable to autoregressive baselines, the model is clearly learning: val_bpb drops from 7.05 → 3.72 → 3.67 over the first 2000 steps (see `log_run1.txt`). However, a flow matching model would likely have to be trained for many more epochs to be a useful language model.
Loading