Flow Matching Language Model for Text Generation #1503
Open
vukadinovic936 wants to merge 3 commits into openai:main from
Conversation
Collaborator: woah what

Collaborator: variational bpb 🤣 this is great stuff

Collaborator: This is super cool. Would encourage you to keep working on this, and we'll merge it in if it gets better.
Flow Matching Language Model
This is a non-record submission that replaces the autoregressive `train_gpt.py` baseline with a flow matching language model implemented in `train_gpt.py` (this folder). The model is based on "Flow Matching for Conditional Text Generation in a Few Sampling Steps" (EACL 2024). The model keeps much of the original training stack (data loading, quantization, distributed training, Muon optimizer infrastructure), but replaces the causal next-token objective with a continuous-flow denoising objective over token embeddings, conditioned on a source context. It was trained on 8×H100 for 10 minutes.
What Changed
The baseline GPT is replaced by three new components:

- `TransformerTimestepModel` — a bidirectional (non-causal) transformer with sinusoidal timestep embeddings injected at each layer. Position embeddings are added alongside the timestep signal.
- `TransformerEncoderLayer` — standard bidirectional multi-head attention + FFN block with LayerNorm, GELU, and dropout=0.1.
- `Flow` — the top-level model wrapping the transformer. Implements the flow matching objective and variational BPB evaluation.

The optimizer is switched from Muon to AdamW (lr=1e-5), since the flow matching objective is not a classification cross-entropy and benefits from a simpler optimizer.
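A minimal sketch of the timestep-conditioned bidirectional block described above, in PyTorch. This is illustrative only, not the PR's actual code: the class and function names (`TimestepConditionedBlock`, `sinusoidal_timestep_embedding`) are assumptions, and details such as norm placement may differ from the submission.

```python
import math
import torch
import torch.nn as nn


def sinusoidal_timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Standard sinusoidal embedding of a scalar timestep t in [0, 1].

    t: (batch,) tensor of timesteps; returns (batch, dim). dim must be even.
    """
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)


class TimestepConditionedBlock(nn.Module):
    """Bidirectional attention + FFN block with the timestep signal
    broadcast-added to every sequence position (illustrative version of the
    TransformerEncoderLayer plus per-layer timestep injection)."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=0.1, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # two-layer timestep MLP as described in the PR: dim -> 4*dim -> dim
        self.t_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # (batch, 1, dim) timestep signal, added to every position
        t_emb = self.t_mlp(sinusoidal_timestep_embedding(t, x.size(-1)))[:, None, :]
        h = x + t_emb
        a = self.norm1(h)
        h = h + self.attn(a, a, a, need_weights=False)[0]  # no causal mask: bidirectional
        return h + self.ff(self.norm2(h))
```

No causal mask is passed to the attention call, which is the key difference from the autoregressive baseline: every position can attend to the whole sequence.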
Flow Matching Objective
Each training sequence of length `TRAIN_SEQ_LEN` is split in half:

- First `TRAIN_SEQ_LEN // 2` tokens — used as clean context, never noised.
- Last `TRAIN_SEQ_LEN // 2` tokens — the tokens to predict/generate.

At each training step:

- Sample `t ~ U[0, 1]` per batch element.
- Construct `x_t = (1 - t) * noise + t * target_embs` (linear interpolation flow path).
- Condition the model on `t`.
- Predict `v_pred` over the full sequence; only the target portion is supervised.
- Estimate `z1_hat = x_t + (1 - t) * v_pred_tgt`.
- Compute a cross-entropy between `z1_hat` projected through the embedding matrix and the target tokens (token anchor loss).

Config
`vocab_size=1024`, `model_dim=512`, `num_layers=6`, `num_heads=8`, `max_seq_len=1024`. Slightly shallower than the 9-layer baseline to fit within the 16 MB compressed limit. The timestep `t` is projected through a two-layer MLP (dims → 4*dims → dims) and broadcast-added to every sequence position before the transformer.

Metrics
val_bpb Computation
For each of `VAR_EVAL_STEPS=32` evenly spaced `t ∈ [0, 1]`:

- Construct `x_t`.
- Predict `v_pred` and estimate `z1_hat = x_t + (1-t)*v_pred`.
- Compute a cross-entropy between `z1_hat` projected through the embedding matrix and the target.
- Convert to bits per byte via `tokens_per_byte`.

Things That Didn't Work / Notes
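The val_bpb computation above can be sketched like this. Again a reconstruction under stated assumptions: `variational_bpb` is an assumed name, and the averaging over timesteps (a plain mean of per-`t` cross-entropies) is a guess at how the PR aggregates the bound.

```python
import math
import torch
import torch.nn.functional as F


@torch.no_grad()
def variational_bpb(model, emb, src_ids, tgt_ids, tokens_per_byte, n_steps=32):
    """Illustrative variational bits-per-byte estimate: average the token
    cross-entropy of the one-step estimate z1_hat over n_steps evenly
    spaced timesteps t in [0, 1]."""
    src_embs, target_embs = emb(src_ids), emb(tgt_ids)
    losses = []
    for t_val in torch.linspace(0.0, 1.0, n_steps):
        t = torch.full((tgt_ids.size(0),), t_val.item(), device=tgt_ids.device)
        noise = torch.randn_like(target_embs)
        x_t = (1 - t_val) * noise + t_val * target_embs
        v = model(torch.cat([src_embs, x_t], dim=1), t)[:, src_ids.size(1):]
        z1_hat = x_t + (1 - t_val) * v
        logits = z1_hat @ emb.weight.T
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt_ids.reshape(-1))
        losses.append(ce)
    # nats/token -> bits/token -> bits/byte
    return torch.stack(losses).mean().item() / math.log(2) * tokens_per_byte
```

Because each `t` gives a cross-entropy on real token targets, the result is in the same units as the baseline's bpb, even though the bound itself is not directly comparable to an autoregressive likelihood.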
Files
- `train_gpt.py` — single-file flow matching training/eval script
- `log_run1.txt`, `log_run2.txt`, `log_run3.txt` — training logs (3 seeds) on 8×H100 NVL
- `submission.json`
- `README.md`

Metrics
Although the val_bpb is not directly comparable to autoregressive baselines, the model is clearly learning: val_bpb drops from 7.05 → 3.72 → 3.67 over the first 2000 steps (see `log_run1.txt`). However, a flow matching model would likely have to be trained for many more epochs to be a useful language model.