Skip to content

[Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD#1502

Open
SPThole wants to merge 12 commits intoopenai:mainfrom
SPThole:non_record_4
Open

[Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD#1502
SPThole wants to merge 12 commits intoopenai:mainfrom
SPThole:non_record_4

Conversation

@SPThole
Copy link
Copy Markdown

@SPThole SPThole commented Apr 9, 2026

Track: 10min_16mb (Track B, score-first-then-adapt) | Hardware: 1×H100 80 GB SXM
Float-path legal_ttt: 1.11469 | TTT delta: −0.02299 bpb
Status: Non-record exploration (int6 canonical eval crashed; see Disclaimer)

This PR presents a theoretically-grounded redesign of FOMAML meta-TTT that
addresses every identified flaw from #1501 ablation — and demonstrates that the
TTT ceiling is architecture-limited, not initialization-limited. Three training
procedures (original FOMAML, no meta-TTT, redesigned FOMAML) all produce the same
~0.023 bpb TTT delta, proving the ceiling is set by the bank dimensionality and TTT
optimizer, not by meta-training.

See also: [PR: Position-Conditional Bigram + Ablation] #1501 ,
which introduces the base architecture and proves FOMAML meta-TTT adds only
+0.00036 bpb in its original formulation.


TL;DR — Key Learnings for the Community

  1. TTT adaptation ceiling is architecture-limited. Three different meta-training
    objectives — same-batch FOMAML, no meta-training, and cross-chunk FOMAML with
    Δ-loss — all produce the same ~0.023 bpb TTT improvement. No meta-training
    objective can move this ceiling. To raise it, you need more adaptable parameters
    (more bank layers, LoRA-style correctors) or a better TTT optimizer (Adam,
    more epochs, higher LR).

  2. Three different training procedures find equidistant solutions in weight space
    with identical local curvature.
    Bank condition numbers (1.03–1.38), effective
    ranks (22 for attention, 11 for MLP), and energy distributions are identical
    across all three models. The loss landscape is degenerate: many equivalent
    minima exist, meta-TTT selects which one you land in, but the TTT adaptation
    surface looks the same from every minimum.

  3. MetaSGD per-layer LR learning needs a stronger signal. 66 learned per-bank-
    per-layer learning rate scales all converged to their 1.0 initialization. One
    meta-step every 4 training steps is too infrequent, and the meta-gradient is too
    weak relative to the main task gradient, to drive per-layer differentiation.

  4. Cross-chunk FOMAML is less disruptive than same-batch FOMAML. Subspace
    overlap analysis shows the no-meta model and cross-chunk model share 73%
    functional subspace, vs only 62% between the no-meta and same-batch models.
    The biased same-batch meta-gradient systematically rotates the MLP input
    subspace; the unbiased cross-chunk variant preserves it.

  5. Always measure the TTT delta, not just the final score. If we'd only
    compared final legal_ttt numbers, we might have concluded exp106's float-path
    1.11469 was better than exp101's 1.11588. But the delta tells the real story:
    exp106's better float baseline (1.1377) compensates for fewer training steps,
    while the TTT improvement itself is the same.

NOTE: exp101: Position-conditional bigram hashing — splits the 4096×64 hash table's 4095 buckets into exclusive word-start [0,2047) and within-word [2047,4094) halves keyed on has_leading_space[current_token], eliminating bucket contention that forced the parent model's word_start_boost gate to 0.007 → legal_ttt 1.11588 (zero extra params).

exp105a: Single-flag ablation (META_TTT_ENABLED=1→0, everything else byte-identical to exp101) proving FOMAML meta-TTT contributes only +0.00036 bpb at 3% compute overhead → legal_ttt 1.11624 — the meta-training was equivalent to gradient noise.


Disclaimer

  • Hardware: All runs use a single H100 80 GB SXM GPU with MAX_WALLCLOCK_SECONDS=4800
    (80-minute cap). This provides 4800 GPU-seconds of compute, matching the competition's
    standard 8×H100 @ 10 min budget at substantially lower cost.

  • Early stopping due to wallclock: exp106 completed 6686 of 7500 steps —
    ~11% fewer than the ablation (exp105a: 7226 steps). This is because MetaSGD's
    extra gradient storage (+8.6 GB peak memory) slowed each step from ~663 ms to
    ~718 ms, consuming the 80-minute budget faster. The model was still in the
    warmdown phase when stopped.

  • Int6 canonical eval crashed: After GPTQ quantization, eval_model.load_state_dict()
    failed with RuntimeError: Missing key(s): "meta_sgd_qo", "meta_sgd_kv", "meta_sgd_up", "meta_sgd_down" because the 66 MetaSGD parameters were correctly excluded from the
    16 MB export but GPT.__init__ still registers them. This meant the in-script int6
    roundtrip evaluation and canonical legal_ttt could not run. A hotfix was applied to
    the standalone ttt_from_checkpoint.py harness, which produced the float-path and
    partial int6 numbers reported here. Where int6 canonical values are unavailable, they
    are marked "—".

  • Non-record: This experiment is a non-record exploration (non_record: true). It
    exists to answer the question "can a better meta-TTT formulation move the TTT ceiling?"

  • Cost constraint: GPU time was limited. The partial int6 TTT run (80% complete) was
    terminated when the trajectory showed no convergence trend different from the baseline.
    Projected final value is ~1.118, consistent with the invariant ~0.023 delta.


Architecture Overview

Base Architecture

This experiment shares the identical architecture as PR #1501 (exp101). We reproduce
the full specification here for self-containment.
lots of inspiration from : PR #1019

Component Configuration What it does
Model 11-layer U-Net GPT 5 encoder blocks + 6 decoder blocks with skip connections between corresponding encoder-decoder pairs. Skip connections (additive residuals) help gradient flow and allow the decoder to reference early-layer representations.
Hidden dim 512 Width of the residual stream.
Attention 8Q / 4KV (GQA) Grouped-Query Attention: 8 query heads share 4 key-value heads (2:1 ratio). Halves KV param count with minimal quality loss.
MLP 3× expansion (1536) SwiGLU feed-forward network: 512 → 1536 → 512.
Vocabulary 1024 tokens SentencePiece BPE on fineweb10B.
Embeddings Tied (tok_emb = lm_head^T) Input embedding and output projection share weights.
RoPE Partial, 16/64 dims Rotary Position Embeddings on 25% of head dimensions.
XSA All 11 blocks Cross-layer Shared Attention: Q/K/V/O and MLP weights stored as banked 3D tensors shared across all layers (see PR #1501 for full explanation). The 4 banks (qo_bank, kv_bank, mlp_up_bank, mlp_down_bank) are the parameters adapted during TTT.
VE Layers 7–10 Value Embeddings: additional value projection on the last 4 layers.
Bigram 4096×64, position-conditional Hash-based bigram table with word-start/within-word bucket split (see PR #1501 for full explanation).
Total params 26,960,991 ~27M trainable parameters.

Training Pipeline

Component Configuration
Optimizer Muon (matrices) + AdamW (embeddings, scalars)
Schedule Cosine warmdown, adaptive trigger
EMA Decay 0.998
SWA Every 50 steps during warmdown
Late QAT Threshold 0.25
Batch 786,432 tokens (4× grad accumulation on 1 GPU)

Quantization and TTT

Same pipeline as PR #1501 : GPTQ int6 (attn+MLP) / int8 (embed) → LZMA → 16 MB.
TTT: SGD + cosine LR, momentum 0.9, 4 epochs, 947 chunks × 65K tokens.
Scoring: score-first-then-adapt (legal_ttt).


Innovation — What This PR Introduces

Motivation: Why Meta-TTT Needed a Redesign

PR #1501's ablation (exp105a) proved that exp101's FOMAML meta-TTT adds only +0.00036
bpb. But the concept of meta-TTT — training the model to adapt faster at test
time — is theoretically sound (MAML-style learning works in the meta-learning
literature). The failure had three identifiable structural causes:

Flaw What's wrong How it hurts
(A) Same-batch inner/outer Inner loop adapts on batch X, outer evaluates on batch X Meta-gradient rewards banks that resist SGD on seen data — the opposite of "generalize to unseen test chunks"
(B) No adaptation reward Outer loss = absolute L(banks'; X) A bank with low initial loss that gets worse under the inner step is rewarded equally as one that improves. No term explicitly rewards the improvement from adaptation.
(C) Uniform inner LR All 4 bank types × 11 layers use inner_lr = 0.002 The optimal adaptation speed for a shallow attention bank vs a deep MLP bank is likely different. No mechanism to learn this.

Innovation A: Cross-Chunk Split

Split the training batch B (shape [batch, seq_len]) along the batch dimension
into two halves. The first half provides the inner-loop adaptation data, the second
half provides the outer-loop evaluation data:

Inner: banks' ← banks − α · s ⊙ ∇L(banks; B_first_half)
Outer: L_meta = L(banks'; B_second_half)              ← DIFFERENT documents

Because the dataloader draws independent random sequences from fineweb10B, B_first_half
and B_second_half come from different documents. This matches the TTT deployment
regime: adapt on document i, get scored on document j.

Fallback: When per-GPU batch size = 1 (not our case, but handled), falls back
to sequence-half split (first/last 1024 tokens of the same sequence).

Innovation B: Delta-Loss Outer Objective

Instead of optimizing absolute post-adaptation loss, we add a term that explicitly
rewards the improvement from the inner step:

L_meta = (w_post + w_Δ) · L_post − w_Δ · L_pre

where:
  L_post = L(banks'; B_second_half)    ← loss AFTER adaptation
  L_pre  = L(banks;  B_second_half)    ← loss BEFORE adaptation (detached banks)
  w_post = 0.5  (META_TTT_LOSS_WEIGHT)
  w_Δ    = 0.3  (META_TTT_DELTA_WEIGHT)

Expanding: L_meta = 0.5 · L_post + 0.3 · (L_post − L_pre)

The second term is the adaptation delta: it directly penalizes banks where the
inner step makes things worse and rewards banks where it helps. A bank that starts
with low loss but doesn't improve gets penalized by the −w_Δ · L_pre term.

Cost: One extra forward pass per meta-step (computing L_pre).

Innovation C: MetaSGD — Learned Per-Layer LR Scales

For each bank type k ∈ {qo, kv, up, down} and each layer l:

banks'[k, l] = banks[k, l] − α · s[k, l] · ∇L(banks[k, l]; B_inner)

where s[k, l] ∈ R+ is a learned scalar initialized to 1.0. Shapes:

Parameter Shape Count Purpose
meta_sgd_qo (22,) 22 Per-slot LR scale for query-output bank
meta_sgd_kv (22,) 22 Per-slot LR scale for key-value bank
meta_sgd_up (11,) 11 Per-layer LR scale for MLP up-projection
meta_sgd_down (11,) 11 Per-layer LR scale for MLP down-projection
Total 66 Excluded from 16 MB export (0 bytes in submission)

If meta-TTT works, different layers should learn different scales — e.g., shallow
attention layers might need larger inner-loop steps than deep MLP layers. The
scales are registered as nn.Parameter and receive gradients via the outer loss
backprop. They are excluded from the exported final_model.pt and
final_model.int6.ptz to preserve the 16 MB budget.

Implementation detail: The inner-loop update is built as a differentiable
non-leaf tensor so a single backward pass populates both MetaSGD scale gradients
(via leaf autograd) and bank FOMAML gradients (via retain_grad + manual copy).


Results

exp106 — Meta-TTT Redesign

Metric Value Source Note
Steps completed 6686 / 7500 wallclock cap −334 vs exp101 (MetaSGD overhead)
val_bpb @ step 3000 1.2251 training log
val_bpb @ step 6000 1.1431 training log Best of the three at matched step
Post-EMA val_bpb 1.1377 training log Slightly worse than exp101 (fewer steps)
MetaSGD params exported 0 (66 excluded) by design
Int6 val_bpb (roundtrip) crashed meta_sgd_* strict-load RuntimeError
Model size (int6+lzma) 15.02 MB final artifact
Total submission size 15.14 MB model + code
Peak GPU memory 31,695 MiB training log +8.6 GB vs exp101 (MetaSGD gradients)
Float baseline bpb 1.13767 ttt_from_checkpoint_float_qatoff.log
Float-path legal_ttt 1.11469 ttt_from_checkpoint_float_qatoff.log
Float TTT delta −0.02299 computed
Int6 TTT (partial 80%) 1.11800 ttt_int6_ep4_partial.log (chunk 761/947)
Projected int6 legal_ttt ~1.118 trajectory extrapolation
Late QAT fired step 5110 training log
SWA started step 5300 training log

Int6 TTT Trajectory (partial, 80% complete)

Chunk progress bpb Source
401 / 947 (42%) 1.117622 ttt_int6_ep4_partial.log
621 / 947 (66%) 1.118994 ttt_int6_ep4_partial.log
661 / 947 (70%) 1.116769 ttt_int6_ep4_partial.log
681 / 947 (72%) 1.116469 ttt_int6_ep4_partial.log
761 / 947 (80%) 1.117976 ttt_int6_ep4_partial.log

Baseline (int6 canonical): 1.14160. Running delta at 80%: −0.02362.
The trajectory is flat in the 66–80% range. Projected final: ~1.118.

MetaSGD Scale Convergence

All 66 learned LR scales converged to values near their 1.0 initialization:

Parameter group Mean Std Min Max
meta_sgd_qo (22 scales) ~1.00 <0.04 >0.92 <1.08
meta_sgd_kv (22 scales) ~1.00 <0.04 >0.93 <1.07
meta_sgd_up (11 scales) ~1.00 <0.03 >0.94 <1.06
meta_sgd_down (11 scales) ~1.00 <0.03 >0.95 <1.05

Interpretation: No per-layer differentiation was learned. The meta-training
signal (1 meta-step per 4 training steps, at ~30% of main gradient magnitude)
is too weak to push 66 scalar parameters away from their initialization over
6686 training steps.


Analysis — Complete Meta-TTT Lineage (All Three Experiments)

This section summarizes the findings across all three experiments in this series.
A reader who sees only this PR should be able to understand the full meta-TTT story.

The Three Experiments

# Name Meta-TTT variant Architecture changes legal_ttt TTT delta
exp101 Record (PR #1501 ) FOMAML, same-batch inner/outer, every 4 steps Pos-conditional bigram, trigram, SGD+cosine TTT 1.11588 −0.02342
exp105a Ablation (PR #1501) Disabled (META_TTT_ENABLED=0) Identical to exp101 1.11624 −0.02331
exp106 Redesign (this PR) Cross-chunk + Δ-loss + MetaSGD, every 4 steps Identical to exp101 1.11469* −0.02299

*Float-path TTT; int6 canonical unavailable due to strict-load crash.

The Central Finding: TTT Delta Invariance

Experiment Baseline bpb Post-TTT bpb TTT delta Source
exp101 (FOMAML, int6) 1.13930 1.11588 −0.02342 logs_seed42.txt
exp105a (no meta, int6) 1.13956 1.11624 −0.02331 logs_seed42.txt
exp106 (redesign, float) 1.13767 1.11469 −0.02299 ttt_from_checkpoint_float_qatoff.log
exp106 (redesign, int6 80%) 1.14160 ~1.118 ~−0.024 ttt_int6_ep4_partial.log

The TTT delta is −0.023 ± 0.001 bpb across all variants. Three different
training objectives — from "no meta-signal" to "theoretically correct cross-document
generalization reward" — produce the same adaptation improvement.

Three-Way Weight-Space Analysis

We ran 8 analyses comparing all three models pairwise (script:
supporting_files/analysis_three_way.py, CPU-only, 3.6s on M2):

Triangle Geometry: Equidistant Solutions

                        exp101 (FOMAML)
                       /      \
              2336    /        \  2356         (bank L2 distances)
                     /          \
     exp105a (no meta) ──────── exp106 (redesign)
                        2324

All three models are approximately the same distance from each other. Meta-TTT
doesn't push you in a consistent direction — it pushes you to a random neighboring
basin, and the specific basin depends on the meta-gradient formulation.

Subspace Overlap: Cross-Chunk Preserves the Natural Subspace

Pair Avg subspace cosine Frac dims aligned
exp101 vs exp105a (FOMAML vs no-meta) 0.615 0.411
exp101 vs exp106 (FOMAML vs redesign) 0.659 0.472
exp105a vs exp106 (no-meta vs redesign) 0.727 0.548

The redesigned cross-chunk FOMAML (exp106) produces a solution closer in
functional subspace
to the no-meta baseline than the original same-batch
FOMAML (exp101) does. The biased same-batch meta-gradient rotates the subspace
more than the unbiased cross-chunk variant.

Most striking: mlp_up_bank subspace cosine is 0.949 between exp105a and
exp106 (nearly identical) but only 0.551 between exp101 and exp105a (half-
rotated). Same-batch FOMAML systematically distorts the MLP input features.

Error Surface: Identical Curvature at All Three Minima

Property exp101 exp105a exp106 Interpretation
Bank avg condition number 1.2 1.2 1.2 Near-isotropic — SGD works equally well in all directions
Bank avg effective rank 16.5 16.5 16.5 All bank dimensions contribute equally
Bank avg top-5 energy frac 0.37 0.37 0.37 Uniform energy distribution
Quantization MSE 8.686e-5 8.691e-5 8.686e-5 Identical sensitivity to int6

This is why the TTT delta is invariant: the local curvature of the loss
landscape — the surface that SGD navigates during TTT — is identical at all three
minima. SGD makes the same progress per step from any starting point.

Mode Connectivity: Three Distinct Basins

Pair Midpoint norm ratio Assessment
exp101 vs exp105a 0.786 Different basins
exp101 vs exp106 0.793 Different basins
exp105a vs exp106 0.807 Borderline same basin
3-way centroid 0.704 Clearly distinct (30% norm loss)

The three models occupy distinct local minima. exp105a and exp106 are closest
to being in the same basin (ratio 0.807, threshold ~0.8), consistent with
cross-chunk FOMAML being less disruptive than same-batch FOMAML.

Why Meta-TTT Cannot Move the Ceiling

The argument from curvature invariance: TTT improvement depends on (1) how
far SGD can move the banks in 4 epochs (fixed by TTT config) and (2) how much
loss reduction each step buys (determined by local curvature). We showed the
curvature is identical at all three minima. Therefore the TTT delta must be
identical — QED.

The argument from over-parameterization: The training loss surface has a
degenerate set of equivalent minima (the three models prove this). Meta-TTT
selects a different minimum but cannot escape the set. All minima in the set
have the same curvature and the same TTT potential. To escape, you'd need a
stronger perturbation: second-order MAML, many more inner steps, or a dedicated
meta-training phase after warmdown.

The argument from MetaSGD: If per-layer LR differentiation could help, the
66 MetaSGD scales should have diverged from their 1.0 initialization. They
didn't. The meta-gradient signal at 1 step per 4, with loss weight 0.5, is
too weak to drive 66 scalar parameters in 6686 training steps.



Learnings for the Community

  1. The TTT adaptation ceiling is set by architecture, not initialization.
    ~0.023 bpb is invariant across three FOMAML variants (same-batch, none,
    cross-chunk + Δ-loss + MetaSGD). To improve TTT, change the bank dimensionality
    or the TTT optimizer — not the training-time meta-objective.

  2. First-order MAML with 1 inner step on a well-trained model ≈ gradient noise.
    After 6000+ training steps, the banks are near a local optimum. A single inner
    SGD step barely perturbs them, so the FOMAML outer gradient carries near-zero
    functional signal regardless of how the inner/outer data is split.

  3. Cross-chunk FOMAML is less harmful than same-batch FOMAML (even though both
    are useless for TTT). Same-batch FOMAML introduces a systematic directional
    bias that rotates the MLP input subspace 45° from the natural optimum. Cross-
    chunk FOMAML's unbiased meta-gradient preserves the natural subspace (cos 0.95).

  4. MetaSGD needs a stronger signal to learn meaningful per-layer differentiation.
    At 1 meta-step per 4 training steps with loss weight 0.5, the effective meta-
    gradient energy is ~7.5% of total gradient. This is insufficient to drive 66
    scalar parameters away from their initialization over 6686 steps.

  5. Three equivalent minima with identical local curvature — the loss landscape
    of a Muon-trained 27M-param transformer has a degenerate set of solutions.
    Meta-learning perturbations select among them but cannot improve them. This
    is consistent with overparameterization theory and with empirical results from
    lottery ticket and mode connectivity research.

  6. Measure the delta, not the score. If we'd only compared final bpb numbers,
    exp106's 1.11469 looks better than exp101's 1.11588. But the TTT delta
    (architecture-level metric) is the same. The per-experiment score difference comes
    from different pre-TTT baselines (1.1377 vs 1.1393), which are driven by the
    number of training steps completed, not by meta-TTT quality.


Related PRs


Folder Structure

pr2_metattt_redesign/
├── pull_summary.md                          ← this file
├── experiment_exp106/                       ← META-TTT REDESIGN (non-record)
│   ├── train_gpt.py                         ← full training script with A+B+C (123K)
│   ├── submission.json                      ← metadata + results
│   ├── logs_seed42.txt                      ← condensed training metrics
│   ├── training_stdout_seed42.txt           ← full training stdout (128K)
│   └── supporting_files/
│       ├── README.md                        ← detailed experiment writeup
│       ├── run.sh                           ← training config (META_TTT_SPLIT=batch, etc.)
│       ├── Inference.ipynb                  ← model loading + eval + TTT visualization
│       ├── save_model.py                    ← checkpoint export (meta_sgd exclusion)
│       ├── ttt_eval.py                      ← TTT evaluation harness
│       ├── ttt_from_checkpoint.py           ← standalone TTT eval (hotfixed for meta_sgd)
│       ├── ttt_from_checkpoint.log          ← int6 canonical TTT attempt
│       ├── ttt_from_checkpoint_float_qatoff.log  ← complete float-path TTT run
│       ├── ttt_int6_ep4_partial.log         ← partial int6 TTT (80% complete)
│       ├── requant_mixed_precision.py       ← mixed int6/int7 attempt (over budget)
│       ├── requant_mixed_v1.log             ← mixed-precision log
│       ├── ERROR_SURFACE_ANALYSIS.md        ← three-way error surface geometry study
│       ├── META_TTT_ANALYSIS.md             ← two-way weight-space analysis (exp101 vs exp105a)
│       ├── analysis_three_way.py            ← three-way analysis script (8 analyses, 3.6s)
│       ├── analysis_three_way.json          ← three-way numerical results
│       ├── analysis_meta_ttt.py             ← two-way analysis script (5 analyses, 1.3s)
│       └── analysis_meta_ttt.json           ← two-way numerical results

@SPThole SPThole changed the title [Non Record] Meta-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD [Non Record] Learn to Learn: Meta-Learning-TTT Redesign — Cross-Chunk FOMAML + Delta-Loss + MetaSGD Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant