[Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD #1502

Open — SPThole wants to merge 12 commits into openai:main
This PR presents a theoretically-grounded redesign of FOMAML meta-TTT that
addresses every identified flaw from the #1501 ablation — and demonstrates that the
TTT ceiling is architecture-limited, not initialization-limited. Three training
procedures (original FOMAML, no meta-TTT, redesigned FOMAML) all produce the same
~0.023 bpb TTT delta, proving the ceiling is set by the bank dimensionality and TTT
optimizer, not by meta-training.
See also: [PR: Position-Conditional Bigram + Ablation] #1501,
which introduces the base architecture and proves FOMAML meta-TTT adds only
+0.00036 bpb in its original formulation.
TL;DR — Key Learnings for the Community
TTT adaptation ceiling is architecture-limited. Three different meta-training
objectives — same-batch FOMAML, no meta-training, and cross-chunk FOMAML with
Δ-loss — all produce the same ~0.023 bpb TTT improvement. No meta-training
objective can move this ceiling. To raise it, you need more adaptable parameters
(more bank layers, LoRA-style correctors) or a better TTT optimizer (Adam,
more epochs, higher LR).
Three different training procedures find equidistant solutions in weight space
with identical local curvature. Bank condition numbers (1.03–1.38), effective
ranks (22 for attention, 11 for MLP), and energy distributions are identical
across all three models. The loss landscape is degenerate: many equivalent
minima exist, meta-TTT selects which one you land in, but the TTT adaptation
surface looks the same from every minimum.
MetaSGD per-layer LR learning needs a stronger signal. All 66 learned per-bank-per-layer
learning-rate scales converged to their 1.0 initialization. One meta-step every 4
training steps is too infrequent, and the meta-gradient too weak relative to the
main-task gradient, to drive per-layer differentiation.
Cross-chunk FOMAML is less disruptive than same-batch FOMAML. Subspace
overlap analysis shows the no-meta model and the cross-chunk model share 73% of their
functional subspace, vs. only 62% between the no-meta and same-batch models.
The biased same-batch meta-gradient systematically rotates the MLP input
subspace; the unbiased cross-chunk variant preserves it.
Always measure the TTT delta, not just the final score. If we'd only
compared final legal_ttt numbers, we might have concluded exp106's float-path
1.11469 was better than exp101's 1.11588. But the delta tells the real story:
exp106's better float baseline (1.1377) compensates for fewer training steps,
while the TTT improvement itself is the same.
NOTE:
- exp101: Position-conditional bigram hashing — splits the 4096×64 hash table's 4095 buckets into exclusive word-start `[0, 2047)` and within-word `[2047, 4094)` halves keyed on `has_leading_space[current_token]`, eliminating the bucket contention that forced the parent model's `word_start_boost` gate to 0.007 → legal_ttt 1.11588 (zero extra params).
- exp105a: Single-flag ablation (`META_TTT_ENABLED=1→0`, everything else byte-identical to exp101) proving FOMAML meta-TTT contributes only +0.00036 bpb at 3% compute overhead → legal_ttt 1.11624 — the meta-training was equivalent to gradient noise.

Disclaimer
Hardware: All runs use a single H100 80 GB SXM GPU with `MAX_WALLCLOCK_SECONDS=4800` (an 80-minute cap). This provides 4800 GPU-seconds of compute, matching the competition's standard 8×H100 @ 10 min budget at substantially lower cost.
Early stopping due to wallclock: exp106 completed 6686 of 7500 steps — ~11% short of
the full schedule (the ablation, exp105a, reached 7226 steps). This is because MetaSGD's
extra gradient storage (+8.6 GB peak memory) slowed each step from ~663 ms to
~718 ms, consuming the 80-minute budget faster. The model was still in the
warmdown phase when stopped.
Int6 canonical eval crashed: After GPTQ quantization, `eval_model.load_state_dict()` failed with `RuntimeError: Missing key(s): "meta_sgd_qo", "meta_sgd_kv", "meta_sgd_up", "meta_sgd_down"` because the 66 MetaSGD parameters were correctly excluded from the 16 MB export but `GPT.__init__` still registers them. This meant the in-script int6 roundtrip evaluation and canonical legal_ttt could not run. A hotfix was applied to the standalone `ttt_from_checkpoint.py` harness, which produced the float-path and partial int6 numbers reported here. Where int6 canonical values are unavailable, they are marked "—".
Non-record: This experiment is a non-record exploration (`non_record: true`). It exists to answer the question "can a better meta-TTT formulation move the TTT ceiling?"
Cost constraint: GPU time was limited. The partial int6 TTT run (80% complete) was
terminated when the trajectory showed no convergence trend different from the baseline.
Projected final value is ~1.118, consistent with the invariant ~0.023 delta.
Architecture Overview
Base Architecture
This experiment shares the same architecture as PR #1501 (exp101). We reproduce
the full specification here for self-containment.
Lots of inspiration from PR #1019.
- Tied embeddings (`tok_emb = lm_head^T`).
- The four bank families (`qo_bank`, `kv_bank`, `mlp_up_bank`, `mlp_down_bank`) are the parameters adapted during TTT.

Training Pipeline
Quantization and TTT
Same pipeline as PR #1501 : GPTQ int6 (attn+MLP) / int8 (embed) → LZMA → 16 MB.
TTT: SGD + cosine LR, momentum 0.9, 4 epochs, 947 chunks × 65K tokens.
Scoring: score-first-then-adapt (`legal_ttt`).

Innovation — What This PR Introduces
Motivation: Why Meta-TTT Needed a Redesign
PR #1501's ablation (exp105a) proved that exp101's FOMAML meta-TTT adds only +0.00036
bpb. But the concept of meta-TTT — training the model to adapt faster at test
time — is theoretically sound (MAML-style learning works in the meta-learning
literature). The failure had three identifiable structural causes:
1. Same-batch split: the inner-loop adaptation data and the outer-loop evaluation data came from the same batch, biasing the meta-gradient.
2. Absolute outer objective `L(banks'; X)`: it rewards low post-adaptation loss rather than the adaptation improvement itself.
3. A fixed `inner_lr = 0.002` shared across every bank and layer.

Innovation A: Cross-Chunk Split
Split the training batch B (shape `[batch, seq_len]`) along the batch dimension into two halves. The first half provides the inner-loop adaptation data, the second half provides the outer-loop evaluation data.

Because the dataloader draws independent random sequences from fineweb10B, `B_first_half` and `B_second_half` come from different documents. This matches the TTT deployment regime: adapt on document i, get scored on document j.

Fallback: When per-GPU batch size = 1 (not our case, but handled), falls back to a sequence-half split (first/last 1024 tokens of the same sequence).
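The split above can be sketched in a few lines. This is a hypothetical helper (the PR's actual function name and signature may differ), shown here to make the batch-dimension split and the batch-size-1 fallback concrete:

```python
def cross_chunk_split(batch, seq_fallback_len=1024):
    """Split a [batch, seq_len] list of token sequences into inner-loop
    (adaptation) and outer-loop (evaluation) halves.
    Hypothetical sketch -- the PR's actual helper may differ."""
    if len(batch) > 1:
        # Batch-dimension split: independent fineweb documents land in each half.
        half = len(batch) // 2
        return batch[:half], batch[half:]
    # Fallback for per-GPU batch size 1: first/last seq_fallback_len tokens
    # of the single sequence.
    seq = batch[0]
    return [seq[:seq_fallback_len]], [seq[-seq_fallback_len:]]

# Example: a batch of 8 sequences splits 4/4 along the batch dimension.
batch = [[0] * 2048 for _ in range(8)]
inner, outer = cross_chunk_split(batch)
```

With batch size 1, the same call returns the first and last 1024 tokens of the single sequence instead.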
Innovation B: Delta-Loss Outer Objective
Instead of optimizing the absolute post-adaptation loss, we add a term that explicitly rewards the improvement from the inner step:

L_meta = w_post · L_post + w_Δ · (L_post − L_pre), with w_post = 0.5 and w_Δ = 0.3

Expanding: L_meta = 0.8 · L_post − 0.3 · L_pre

The second term is the adaptation delta: it directly penalizes banks where the inner step makes things worse and rewards banks where it helps. A bank that starts with low loss but doesn't improve gets penalized by the −w_Δ · L_pre term.

Cost: One extra forward pass per meta-step (computing L_pre).

Innovation C: MetaSGD — Learned Per-Layer LR Scales
For each bank type k ∈ {qo, kv, up, down} and each layer l, the inner-loop update is scaled per layer:

bank'[k, l] = bank[k, l] − s[k, l] · inner_lr · ∇L

where s[k, l] ∈ R+ is a learned scalar initialized to 1.0. The scales are stored as four per-layer vectors (`meta_sgd_qo`, `meta_sgd_kv`, `meta_sgd_up`, `meta_sgd_down`; 66 scalars in total).

If meta-TTT works, different layers should learn different scales — e.g., shallow attention layers might need larger inner-loop steps than deep MLP layers. The scales are registered as `nn.Parameter` and receive gradients via the outer-loss backprop. They are excluded from the exported `final_model.pt` and `final_model.int6.ptz` to preserve the 16 MB budget.

Implementation detail: The inner-loop update is built as a differentiable non-leaf tensor so a single backward pass populates both the MetaSGD scale gradients (via leaf autograd) and the bank FOMAML gradients (via `retain_grad` + a manual copy).

Results
exp106 — Meta-TTT Redesign
(Canonical int6 numbers are marked "—" due to the `meta_sgd_*` strict-load RuntimeError described in the disclaimer.)

Int6 TTT Trajectory (partial, 80% complete)
Baseline (int6 canonical): 1.14160. Running delta at 80%: −0.02362.
The trajectory is flat in the 66–80% range. Projected final: ~1.118.
MetaSGD Scale Convergence
All 66 learned LR scales converged to values near their 1.0 initialization:
Interpretation: No per-layer differentiation was learned. The meta-training
signal (1 meta-step per 4 training steps, at ~30% of main gradient magnitude)
is too weak to push 66 scalar parameters away from their initialization over
6686 training steps.
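To make the redesigned meta-step concrete, here is a minimal toy sketch of the Δ-loss outer objective (Innovation B) combined with a MetaSGD scale (Innovation C). The scalar "bank" and quadratic loss are illustrative stand-ins, not the PR's actual code; the weights and the base inner LR are the ones stated above:

```python
# Toy scalar "bank" with a quadratic loss, illustrating the Delta-loss
# outer objective and a MetaSGD per-layer scale s. The toy loss and the
# function names are illustrative, not the PR's actual code.

W_POST, W_DELTA = 0.5, 0.3   # outer-loss weights used by the PR
INNER_LR = 0.002             # base inner-loop learning rate

def loss(w, x):
    return (w - x) ** 2

def meta_step(w, s, x_inner, x_outer):
    l_pre = loss(w, x_outer)              # the extra forward pass (L_pre)
    grad = 2.0 * (w - x_inner)            # inner gradient on the first half
    w_adapted = w - s * INNER_LR * grad   # MetaSGD-scaled inner SGD step
    l_post = loss(w_adapted, x_outer)     # evaluate on the second half
    return W_POST * l_post + W_DELTA * (l_post - l_pre)

# A helpful inner step lowers the meta-loss relative to no adaptation (s=0):
l_meta = meta_step(w=1.0, s=1.0, x_inner=0.0, x_outer=0.1)
l_none = meta_step(w=1.0, s=0.0, x_inner=0.0, x_outer=0.1)
```

With `s=0` the delta term vanishes and the objective reduces to `W_POST * L_pre`, which is the sense in which a bank that doesn't improve "gets no credit" from adaptation.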
Analysis — Complete Meta-TTT Lineage (All Three Experiments)
This section summarizes the findings across all three experiments in this series.
A reader who sees only this PR should be able to understand the full meta-TTT story.
The Three Experiments
- exp101: same-batch FOMAML (original formulation)
- exp105a: no meta-TTT (`META_TTT_ENABLED=0`)
- exp106: cross-chunk FOMAML + Δ-loss + MetaSGD*

*Float-path TTT; int6 canonical unavailable due to the strict-load crash.
The Central Finding: TTT Delta Invariance
The TTT delta is −0.023 ± 0.001 bpb across all variants. Three different
training objectives — from "no meta-signal" to "theoretically correct cross-document
generalization reward" — produce the same adaptation improvement.
Three-Way Weight-Space Analysis
We ran 8 analyses comparing all three models pairwise (script:
supporting_files/analysis_three_way.py, CPU-only, 3.6s on M2):Triangle Geometry: Equidistant Solutions
All three models are approximately the same distance from each other. Meta-TTT
doesn't push you in a consistent direction — it pushes you to a random neighboring
basin, and the specific basin depends on the meta-gradient formulation.
Subspace Overlap: Cross-Chunk Preserves the Natural Subspace
The redesigned cross-chunk FOMAML (exp106) produces a solution closer in
functional subspace to the no-meta baseline than the original same-batch
FOMAML (exp101) does. The biased same-batch meta-gradient rotates the subspace
more than the unbiased cross-chunk variant.
Most striking: the `mlp_up_bank` subspace cosine is 0.949 between exp105a and exp106 (nearly identical) but only 0.551 between exp101 and exp105a (half-rotated). Same-batch FOMAML systematically distorts the MLP input features.
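A subspace cosine of this kind can be computed from the top singular directions of two bank matrices. The following is a minimal sketch assuming an SVD / principal-angle definition — the analysis script's exact metric may differ:

```python
import numpy as np

def subspace_overlap(A, B, k):
    """Mean cosine of the principal angles between the top-k left singular
    subspaces of two weight matrices. Hypothetical recreation of the
    analysis metric; the script's actual definition may differ."""
    Ua = np.linalg.svd(A, full_matrices=False)[0][:, :k]
    Ub = np.linalg.svd(B, full_matrices=False)[0][:, :k]
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return float(cosines.mean())

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
same = subspace_overlap(A, A, k=8)       # identical subspaces -> 1.0
diff = subspace_overlap(A, B, k=8)       # unrelated subspaces -> well below 1
```

Identical matrices give a cosine of 1.0; the 0.949 vs. 0.551 values above are read on the same scale.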
Error Surface: Identical Curvature at All Three Minima
This is why the TTT delta is invariant: the local curvature of the loss
landscape — the surface that SGD navigates during TTT — is identical at all three
minima. SGD makes the same progress per step from any starting point.
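One simple way to probe this local curvature is a second finite difference of the loss along a random unit direction, which estimates d^T H d. This is an illustrative sketch, not the PR's analysis code:

```python
import random

def directional_curvature(loss_fn, w, eps=1e-3, seed=0):
    """Second finite difference of the loss along a random unit direction:
    an estimate of d^T H d, the local curvature that TTT's SGD feels.
    Illustrative sketch, not the PR's analysis code."""
    rng = random.Random(seed)
    d = [rng.gauss(0.0, 1.0) for _ in w]
    norm = sum(x * x for x in d) ** 0.5
    d = [x / norm for x in d]
    plus = [wi + eps * di for wi, di in zip(w, d)]
    minus = [wi - eps * di for wi, di in zip(w, d)]
    return (loss_fn(plus) - 2.0 * loss_fn(w) + loss_fn(minus)) / (eps * eps)

# Toy quadratic with Hessian 2*I: curvature along any unit direction is 2.
quad = lambda w: sum(x * x for x in w)
c = directional_curvature(quad, [0.0] * 16)
```

Equal curvature estimates at two minima mean a fixed-LR SGD step buys the same loss reduction from either starting point, which is the argument made above.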
Mode Connectivity: Three Distinct Basins
The three models occupy distinct local minima. exp105a and exp106 are closest
to being in the same basin (ratio 0.807, threshold ~0.8), consistent with
cross-chunk FOMAML being less disruptive than same-batch FOMAML.
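A basin test of this flavor checks the loss along the straight line between two solutions. The sketch below uses a hypothetical barrier-ratio definition (the PR's ratio, with its ~0.8 threshold, may be defined differently), demonstrated on a toy double-well:

```python
def barrier_ratio(loss_fn, w_a, w_b, n=11):
    """Max loss along the linear path between two minima, divided by the
    worse endpoint loss. Values near 1 suggest a shared basin; large values
    indicate a barrier. Hypothetical stand-in for the PR's connectivity
    metric, whose exact ratio definition may differ."""
    endpoint = max(loss_fn(w_a), loss_fn(w_b))
    path = [loss_fn(w_a + (w_b - w_a) * i / (n - 1)) for i in range(n)]
    return max(path) / endpoint

# Double-well with minima at w = +/-1 (the +0.1 offset keeps the
# endpoint loss nonzero so the ratio is well defined):
double_well = lambda w: (w * w - 1.0) ** 2 + 0.1
r_barrier = barrier_ratio(double_well, 1.0, -1.0)  # path crosses w = 0
r_convex = barrier_ratio(lambda w: w * w + 0.1, 1.0, -1.0)
```

For the convex bowl the path never rises above the endpoints (ratio 1); for the double-well the midpoint sits on the barrier and the ratio blows up, the signature of two distinct basins.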
Why Meta-TTT Cannot Move the Ceiling
The argument from curvature invariance: TTT improvement depends on (1) how
far SGD can move the banks in 4 epochs (fixed by TTT config) and (2) how much
loss reduction each step buys (determined by local curvature). We showed the
curvature is identical at all three minima. Therefore the TTT delta must be
identical — QED.
The argument from over-parameterization: The training loss surface has a
degenerate set of equivalent minima (the three models prove this). Meta-TTT
selects a different minimum but cannot escape the set. All minima in the set
have the same curvature and the same TTT potential. To escape, you'd need a
stronger perturbation: second-order MAML, many more inner steps, or a dedicated
meta-training phase after warmdown.
The argument from MetaSGD: If per-layer LR differentiation could help, the
66 MetaSGD scales should have diverged from their 1.0 initialization. They
didn't. The meta-gradient signal at 1 step per 4, with loss weight 0.5, is
too weak to drive 66 scalar parameters in 6686 training steps.
Learnings for the Community
The TTT adaptation ceiling is set by architecture, not initialization.
~0.023 bpb is invariant across three FOMAML variants (same-batch, none,
cross-chunk + Δ-loss + MetaSGD). To improve TTT, change the bank dimensionality
or the TTT optimizer — not the training-time meta-objective.
First-order MAML with 1 inner step on a well-trained model ≈ gradient noise.
After 6000+ training steps, the banks are near a local optimum. A single inner
SGD step barely perturbs them, so the FOMAML outer gradient carries near-zero
functional signal regardless of how the inner/outer data is split.
Cross-chunk FOMAML is less harmful than same-batch FOMAML (even though both
are useless for TTT). Same-batch FOMAML introduces a systematic directional
bias that rotates the MLP input subspace 45° from the natural optimum. Cross-chunk FOMAML's unbiased meta-gradient preserves the natural subspace (cos 0.95).
MetaSGD needs a stronger signal to learn meaningful per-layer differentiation.
At 1 meta-step per 4 training steps with loss weight 0.5, the effective meta-gradient energy is ~7.5% of the total gradient. This is insufficient to drive 66
scalar parameters away from their initialization over 6686 steps.
Three equivalent minima with identical local curvature — the loss landscape
of a Muon-trained 27M-param transformer has a degenerate set of solutions.
Meta-learning perturbations select among them but cannot improve them. This
is consistent with overparameterization theory and with empirical results from
lottery ticket and mode connectivity research.
Measure the delta, not the score. If we'd only compared final bpb numbers,
exp106's 1.11469 looks better than exp101's 1.11588. But the TTT delta
(architecture-level metric) is the same. The per-experiment score difference comes
from different pre-TTT baselines (1.1377 vs 1.1393), which are driven by the
number of training steps completed, not by meta-TTT quality.
Related PRs
PR #1501 introduces the base architecture (position-conditional bigram hashing, a zero-parameter trick
that improves legal_ttt by 0.001 bpb) and the controlled ablation proving same-batch
FOMAML meta-TTT contributes only +0.00036 bpb. The ablation finding is the
motivation for this PR's redesign.
Folder Structure