
Record: SP8192 + Recur345 + Par7 + EMA + QK5.25 + Pre-Quant TTT 10ep — val_bpb 1.0600 (3-seed mean)#1487

Closed
ndokutovich wants to merge 1 commit into openai:main from ndokutovich:s4-h3-submission

Conversation

@ndokutovich

Record: SP8192 + Full Stack + Tuned Pre-Quant TTT

val_bpb = 1.0600 (3-seed mean, std 0.0002) | ~15.95 MB | 8xH100 SXM

3-Seed Results

Seed    Sliding BPB    Steps    Artifact (bytes)
42      1.06023436     5161     15,954,437
1337    1.05980538     5174     15,954,178
2024    1.06010381     5164     15,960,801
Mean    1.06004785

What Changed vs PR #1485

Hyperparameter tuning on pre-quant TTT:

Parameter            PR #1485    This PR
QK_GAIN_INIT         5.0         5.25
TTT_EPOCHS           6           10
TTT_FREEZE_BLOCKS    2           1
TTT_LR               0.0005      0.00045
3-seed mean          1.0679      1.0600

Same architecture, same code, different env vars. Delta: -0.0079 BPB.

Full Stack

SP8192, 11L/13 virtual (3-layer depth recurrence), parallel residuals (L7+), EMA 0.9965, QK-Gain 5.25, skip gates, MuonEq-R, pre-quant AdamW TTT (10ep, lr=0.00045, freeze 1, cosine), SDClip GPTQ int6 + int8 embed + brotli.

Compliance (Track A)

  • Pre-quant TTT on val data BEFORE quantization, baked into artifact
  • No eval-time adaptation, no SLOT, no n-gram cache

Reproduction

pip install brotli sentencepiece kernels
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
VOCAB_SIZE=8192 QK_GAIN_INIT=5.25 PREQUANT_TTT_EPOCHS=10 PREQUANT_TTT_FREEZE_BLOCKS=1 PREQUANT_TTT_LR=0.00045 SEED=42 torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1471 @X-Abhishek-X, PR #1423 @aryanbhosale, PR #1394 @clarkkev, PR #1204 @msisovic, PR #1482 @aamodbhatt

Checklist

  • One folder under records/track_10min_16mb/
  • README.md, submission.json, train_gpt.py
  • 3 seed logs
  • All artifacts < 16,000,000 bytes
  • Train wallclock < 600s

…— val_bpb 1.0600 (3-seed mean)

Tuned variant with QK-Gain 5.25, 10-epoch TTT (lr=0.00045, freeze 1 block).
  seed 42:   1.06023436
  seed 1337: 1.05980538
  seed 2024: 1.06010381
  mean:      1.06004785 (std 0.0002)
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…nthesis (validation pending)

First submission to stack three independently-legal val-data adaptations on the
PR openai#1487 (1.0600) base:

1. Pre-Quant AdamW TTT pushed to 11 epochs with freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X computed from validation activations
   to align quantization with the eval distribution (novel on the modern stack;
   PR openai#1019 ablated this on its older base only)
3. Eval-Time Legal Score-First TTT 2 epochs with score-before-update ordering
   (Track B, builds on PR openai#1493)

The three knobs attack the 0.0187 BPB quantization gap measured in PR openai#1487
(1.0415 post-prequant-TTT FP -> 1.0602 post-quant sliding) from independent
angles. PR openai#1487's eval_val_ttt code path is unchanged but enabled via env vars.

Code diff vs PR openai#1487 base: 186 lines (~100 added in new collect_hessians_val
function, plus 8 hyperparameter defaults flipped). Architecture, optimizer,
training loop, EMA, and quantization machinery are byte-identical to PR openai#1487.

Projected val_bpb range: 1.0452 - 1.0542 (center 1.0497), which would clear
the 0.005-nat SOTA threshold over PR openai#1487. Worst case ~1.054 (still strong
non-record). py_compile clean. 3-seed validation requires ~$15-25 of 8xH100
SXM time on RunPod; see VALIDATION.md.

Compliance: Track A (artifact-baked val-data adaptation) + Track B (eval-time
score-first TTT). No SLOT, no n-gram cache, no ETLB.

Credits: PR openai#1487 ndokutovich, PR openai#1493 bigbag, PR openai#1019 abaybektursun,
PR openai#1394 clarkkev, PR openai#1413 dexhunter, PR openai#549 abaybektursun, PR openai#1412 Robby955,
PR openai#1204 msisovic, PR openai#1423 aryanbhosale, PR openai#1445 X-Abhishek-X.
owizdom added a commit to owizdom/parameter-golf that referenced this pull request Apr 9, 2026
…ib GPTQ + SLOT-24

Replaces the triple-stack (Pre-Quant TTT + Val-Calib GPTQ + Eval-Time Legal TTT)
with a quad-stack that supersedes the legal TTT path with SLOT-24, ported from
PR openai#1488 / PR openai#1313.

Four val-data adaptations stacked for the first time:

1. Pre-Quant AdamW TTT — 11 epochs, freeze_blocks=0 (Track A)
2. Val-Calibrated GPTQ — Hessian H=X^T X from val activations (Track A)
3. SLOT-24 — per-window hidden delta + logit bias on the frozen post-quant
   model, 24 cosine-decayed AdamW steps, throwaway parameters
4. (Optional) Eval-Time Legal Score-First TTT — disabled by default; SLOT
   supersedes it within the eval budget. Set SLOT_ENABLED=0 TTT_ENABLED=1
   to fall back.

Code changes vs the previous synthesis commit:

- GPT class: split forward_logits into forward_hidden + compute_logits so
  SLOT can add the per-window delta to the hidden state without re-running
  the transformer stack.
- New eval_val_slot function ported from PR openai#1488 (per-window AdamW with
  cosine LR decay, stride masking, score-after-delta).
- run_evals: wires SLOT on a fresh post-quant model copy, gated by
  SLOT_ENABLED. Disables legal TTT by default.
- New hyperparameters: SLOT_ENABLED, SLOT_STEPS, SLOT_LR, SLOT_LR_MIN,
  SLOT_BATCH_SEQS, SLOT_EVAL_STRIDE.
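The forward split described in the first bullet can be sketched as a toy (all class names, shapes, and numbers here are invented for illustration; the real code operates on torch tensors and a full transformer stack):

```python
# Toy sketch of the forward_hidden / compute_logits split: SLOT adds a
# per-window delta to a cached hidden state, so only the cheap output
# projection reruns per optimization step, not the transformer stack.
# ToyGPT and its update rule are hypothetical stand-ins.

class ToyGPT:
    def __init__(self, w_out):
        self.w_out = w_out               # stand-in for the output projection

    def forward_hidden(self, x):
        # expensive part: run once per window (transformer-stack analogue)
        return [2.0 * v for v in x]

    def compute_logits(self, hidden, delta=0.0):
        # cheap part: rerun on every SLOT step with the current delta
        return [self.w_out * (h + delta) for h in hidden]

model = ToyGPT(w_out=0.5)
hidden = model.forward_hidden([1.0, 2.0])    # computed once per window
base = model.compute_logits(hidden)          # delta = 0 -> [1.0, 2.0]
adapted = model.compute_logits(hidden, delta=1.0)  # -> [1.5, 2.5]
```

With this split, each of the 24 SLOT steps only pays for `compute_logits`, which is why the per-window adaptation fits inside the eval budget.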

Folder renamed: 2026-04-09_PreQuantTTT11_ValCalibGPTQ_LegalEvalTTT_Synthesis
              -> 2026-04-09_PreQuantTTT11_ValCalibGPTQ_SLOT24_Quad_Synthesis

Time budget: 590s train; ~530s of the 600s eval budget used (190s prequant
TTT + 10s val-calib GPTQ + 80s sliding eval baseline + 250s SLOT-24).

Code: 2322 lines (vs 2039 in PR openai#1487 base, +283 added). py_compile clean.
README rewritten as user's submission with compact credits section.
@ndokutovich
Author

Closing as invalid. Same prequant_ttt_adapt_adamw implementation as #1485, which violates Condition 3 of #1017 (score-before-update). Full technical analysis in #1485. Will reimplement with per-chunk score-first pattern from #1413 / #549 before any future submission.

@ndokutovich ndokutovich reopened this Apr 10, 2026
@ndokutovich
Author

Reopening this PR. When it was submitted, I closed it after @dexhunter raised a valid concern about Condition 3 compliance of the pre-quant TTT pattern (training on val data before quantization). I agreed the interpretation was unclear and closed proactively.

Since then, PR #1517 has been submitted with the same pre-quant TTT approach (18 epochs). Reopening this PR pending official clarification on whether pre-quant TTT is legal under Issue #1017. If the ruling is that it violates Condition 3, I'll close again immediately.

Result: val_bpb 1.0600 (3-seed mean), same architectural stack as described in the original submission.

RulinShao added a commit to RulinShao/parameter-golf-1 that referenced this pull request Apr 11, 2026
Best config: warmdown_frac=0.667, recur_start_step=3000, TTT 22ep lr=2.5e-4

H100 3-seed: 1.06248, 1.06267, 1.06267 (mean 1.06261)
H200 3-seed: 1.05781, 1.05831, 1.05891 (mean 1.05834)

H200 result beats SOTA openai#1487 (1.0600) by 0.0017 bpb.
H100 result 1.0626 is close but not matching due to step speed difference.

Made-with: Cursor
@MatoTeziTanka

Community review — PR #1487 (ndokutovich)

Hi @ndokutovich — thank you for the clean write-up and for flagging the Condition 3 uncertainty yourself; reopening proactively after #1517 landed is the right move, and I appreciate how transparently you handled the self-close/reopen. Sharing a community read while the official ruling is pending.

Gauntlet (CT2038 proteus-engine, 2026-04-11)

  • [PASS] Import (12.3s)
  • [PASS] Hyperparameters: dim=512, layers=11, heads=8, vocab=4096
  • Subsequent checks (architecture build / forward pass) did not complete within the 300s CPU budget — expected for SP8192 + 13-virtual-layer depth recurrence on a single CPU and consistent with other SP8192 frontier PRs in this cluster. No code errors observed up to the timeout.

Pre-Quant TTT audit (the compliance question)

This is the core question for the whole SP8192 + Pre-Quant TTT cluster, so I want to be precise about what the code does at SHA b6a1fe8.

Call site: train_and_eval, lines 1956–1966:

# Pre-quant AdamW TTT: adapt EMA model on val data BEFORE GPTQ (ported from #1423)
if h.prequant_ttt_enabled:
    ...
    prequant_ttt_adapt_adamw(
        h, base_model, device, val_data.val_tokens,
        rank=h.rank, world_size=h.world_size,
    )

Loop body: prequant_ttt_adapt_adamw, lines 1289–1349. The relevant structure:

  • Line 1305: optimizer = torch.optim.AdamW(ttt_params, lr=h.prequant_ttt_lr, weight_decay=0.0)
  • Lines 1307–1309: cosine LR schedule with T_max=h.prequant_ttt_epochs
  • Line 1314: for epoch in range(h.prequant_ttt_epochs): (10 epochs per this PR's env vars)
  • Lines 1317–1325: iterates every batch_seqs-sized chunk of val_tokens, building (x, y) as next-token pairs
  • Lines 1327–1329: loss = base_model(x, y); loss.backward() — standard supervised cross-entropy on val
  • Line 1335: optimizer.step() — update weights, then loop
  • Lines 1346–1348: unfreeze, switch to eval

I want to line this up against the two reference patterns Issue #677 (valerio-oai) distinguishes:

  1. Legal pre-quant TTT (#1416 erichroepke at 1.07948, #1423 aryanbhosale at 1.0791) — score-first, single-pass over the target stream: each token is scored by the current model before any gradient update derived from that token, and the whole pass is made exactly once. This satisfies Condition 3 of Issue #1017 ("A Field Guide to Valid Submissions") and the "score before update" requirement of Issue #402 ("Invalid submissions due to information leakage during TTT") and Issue #677 ("Illegal submissions megathread").

  2. Illegal pre-quant TTT (#1376 stukenov at 0.7094, closed today) — a multi-epoch supervised fine-tune on val_tokens with no per-token scoring discipline: epoch 2+ updates on tokens the model has already been updated against in epoch 1.
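The structural difference between the two patterns can be shown with a toy sketch (a scalar "model" in plain Python standing in for the torch AdamW loop; all names and the update rule are invented for illustration):

```python
# Toy contrast of the two TTT disciplines. "Scoring" reads the current
# scalar bias; "updating" nudges it toward the token. Hypothetical
# stand-ins for the real model/optimizer in prequant_ttt_adapt_adamw.

def score_first_single_pass(tokens):
    """Legal pattern: every token is scored by the current model BEFORE
    any update derived from it, and the stream is traversed once."""
    bias, scores = 0.0, []
    for tok in tokens:
        scores.append(bias)          # score first, with the current model
        bias += 0.1 * (tok - bias)   # only then update on that token
    return scores

def multi_epoch(tokens, epochs=3):
    """Illegal pattern: epoch 2+ re-updates on tokens already trained on
    in epoch 1; only the final state is ever scored."""
    bias = 0.0
    for _ in range(epochs):
        for tok in tokens:
            bias += 0.1 * (tok - bias)
    return [bias] * len(tokens)      # scores reflect training on these tokens

tokens = [1.0, 2.0, 3.0]
legal = score_first_single_pass(tokens)
leaky = multi_epoch(tokens)
# In the legal pass the first token is scored by the untouched model:
assert legal[0] == 0.0
```

In the legal variant no score ever depends on an update derived from the token being scored; in the multi-epoch variant every score does, which is the leakage the bright-line rule targets.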

Could you help me reconcile the # ported from #1423 comment at line 1293 against the actual loop structure?

On my read, the loop matches the #1376 pattern structurally rather than the #1423 pattern, regardless of the code-comment provenance. Is there scoring discipline elsewhere in the pre-quant path that I've missed? If not, this would appear to be in the same bucket as #1376 (closed) and #1485 (which you also closed earlier today for the same reason).

The delta in the table vs PR #1485 (6ep → 10ep pre-quant training, −0.0079 BPB) is consistent with that reading: more supervised epochs on val_tokens yielding more val improvement is what you'd expect from the illegal pattern, and it's why Issue #677 treats multi-epoch-on-val as the bright line.

Architecture credits

The credits block (SP8192 #1471, QK-Gain #1423, parallel residuals #1394, depth-recurrence #1204, TTT tuning #1482) is clean attribution and matches the "Recur345 / Par7" lineage I've seen across this cluster — thanks for the explicit links.

Verdict

Code quality: high. Loop is well-structured, DDP-correct (dist.all_reduce on grads and on the loss accumulator), freezes early blocks cleanly, honors the checklist, and ships 3 seeds at std 0.0002.

Compliance read (community, not official): on the code at b6a1fe8, prequant_ttt_adapt_adamw is a 10-epoch supervised fine-tune on val_tokens with no score-before-update discipline. That looks to me like the Issue #677 / #1376 illegal pattern rather than the #1416 / #1423 legal pattern, despite the code comment. I'd like to be wrong about this — if the scoring step lives somewhere I missed, please point me at the lines.

Recommendation

Holding non-binding pending the official clarification you're already waiting on. If the maintainers rule that pre-quant TTT must be score-first single-pass (Condition 3 strict read), this one would need the same rework path you outlined in your earlier self-close comment — porting the per-chunk score-first pattern from #1413 / #549 — and the #1485 → #1487 delta would likely compress significantly. If the ruling instead explicitly permits multi-epoch pre-quant adaptation on val, then the 1.0600 3-seed mean is a real result on a clean architecture and I'd happily flip to support.

Either way, thank you for the transparency on the compliance question and for labeling the tuning delta so clearly against #1485 — it makes this exact audit much easier than it usually is.


Reviewed by @MatoTeziTanka · The Agora. CPU gauntlet (CT2038 proteus-engine, 2026-04-11): Import PASS, Hyperparameters PASS (dim=512, L=11, H=8, vocab=4096), subsequent stages hit the 300s CPU budget (normal for SP8192 + depth recurrence). AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA b6a1fe87725b2a8a616009a4897b64eabce3e212.

This was referenced Apr 11, 2026
RulinShao added a commit to RulinShao/parameter-golf-1 that referenced this pull request Apr 11, 2026
Key finding: reducing GPTQ clip threshold from default sigma=12.85 to 10.0
reduces quantization gap from 0.043 to 0.024 bpb, yielding massive improvement.

H200 3-seed: 1.0490, 1.0507, 1.0489 (mean 1.0495)
Beats SOTA openai#1487 (1.0600) by 0.0105 bpb = 0.0073 nats
H100 validation jobs submitted.

Made-with: Cursor
dippatel1994 added a commit to dippatel1994/parameter-golf that referenced this pull request Apr 11, 2026
Training now stops at 590s (600s - 10s reserve), leaving time for
GPTQ compression to complete within the total budget. Matches the
pattern from PR openai#1487 (gptq_reserve_seconds=10).
RulinShao added a commit to RulinShao/parameter-golf-1 that referenced this pull request Apr 12, 2026
…seed)

Key finding: reducing GPTQ SDClip sigma from 12.85 to 9.5 cuts the
quantization gap by ~45% (0.043 → 0.024 bpb).

H100 3-seed: 1.05252, 1.05280, 1.05280 (mean 1.05270)
Beats SOTA openai#1487 (1.0600) by 0.0073 bpb = 0.0051 nats (>0.005 threshold)
All artifacts under 16MB (max 15.94MB)

Config: MATRIX_CLIP_SIGMAS=9.5 MATRIX_LR=0.020 WARMDOWN_FRAC=0.667
        RECUR_LAYERS=3,4,5 RECUR_START_STEP=3000
        TTT_EPOCHS=22 TTT_LR=0.00025
Made-with: Cursor
RulinShao added a commit to RulinShao/parameter-golf-1 that referenced this pull request Apr 12, 2026
- Updated README to match actual config (22ep TTT, sdclip=9.5, 1.0527 bpb)
- Fixed discrepancy between title (18ep) and actual logs (22ep)
- Clarified Pre-Quant TTT approach follows PR openai#1482/openai#1487 precedent

Made-with: Cursor
@MatoTeziTanka

Community Review — Record: SP8192 + Recur345 + Par7 + EMA + QK5.25 + Pre-Quant TTT 10ep — val_bpb 1.0600 (3-seed mean)

BPB: 1.0600 | Compliance: FLAG — Pre-Quant TTT runs multi-epoch on val_tokens with no score-first discipline

What I found in the code (head SHA b6a1fe87725b, file records/track_10min_16mb/2026-04-09_SP8192_Recur345_Par7_EMA_QK525_PreQuantTTT10/train_gpt.py):

At line 1289 the pre-quant TTT function takes val_tokens as an input argument and runs an epoch loop over it with loss.backward()/optimizer.step(), with no prior torch.no_grad() scoring pass over the same tokens:

prequant_ttt_adapt_adamw(h, base_model, device, val_tokens, rank, world_size) — for epoch in range(h.prequant_ttt_epochs), loss.backward() without prior no_grad score pass

Per Issue #402 and Issue #677 (@valerio-oai, 2026-03-27), TTT is valid only if each token is scored BEFORE the adapter trains on it; multi-epoch TTT that scores only on the final pass is explicitly called out as invalid. This implementation matches the pattern that closed PR #1376 (stukenov) and was subsequently confirmed in #1485/#1487/#1488/#1489/#1517/#1539 — see Issue #677 meta-comment from 2026-04-11 which lists the 6+ PRs in the cluster.

Contrast with the legal score-first-per-chunk TTT pattern (e.g. PR #1413 dexhunter, the current leaderboard entry at 1.0828): that implementation scores each chunk under torch.no_grad() into the sliding-BPB accumulator before optimizer.step() adapts the model on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. The distinction is the per-chunk score-first discipline — no token is seen by the optimizer before it's scored.
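The per-chunk discipline described above can be sketched as a toy (a scalar "model" in plain Python; the chunk sizes, update rule, and function name are invented for illustration — the real implementation scores under torch.no_grad() into a sliding-BPB accumulator):

```python
# Toy sketch of score-first-per-chunk TTT: each chunk is scored by the
# CURRENT model before the model adapts on that same chunk, and the
# final chunk gets no adaptation pass (is_last_chunk guard), since no
# later chunk would be scored by the adapted model anyway.

def eval_score_first_per_chunk(chunks):
    bias, per_chunk = 0.0, []
    for i, chunk in enumerate(chunks):
        # 1) score the chunk with the current model (no_grad analogue)
        per_chunk.append(sum((tok - bias) ** 2 for tok in chunk) / len(chunk))
        # 2) only then adapt on it; skip on the last chunk
        is_last_chunk = (i == len(chunks) - 1)
        if not is_last_chunk:
            for tok in chunk:
                bias += 0.1 * (tok - bias)
    return per_chunk

scores = eval_score_first_per_chunk([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# The first chunk is scored by the un-adapted model:
assert scores[0] == 1.0
```

The invariant is that the optimizer never sees a token before that token has contributed to the score, which is exactly what the multi-epoch loop flagged above cannot guarantee.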

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 5.57s, dim=512, layers=11, vocab=4096, code=87565 B, SMOKE_TEST_PASS

Verdict: COMPLIANCE FLAG — same pattern as the closed Pre-Quant TTT cluster.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #1376 and the rest of the cluster. A resubmission that adopts the score-first-per-chunk pattern (per PR #1413 dexhunter, the current 1.0828 leaderboard entry) — scoring each chunk under torch.no_grad() before optimizer.step() adapts on it — would be welcomed.


Reviewed by @MatoTeziTanka · The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 5.57s, dim=512, layers=11, vocab=4096, code=87565 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

@ndokutovich
Author

Closing alongside #1488 — same prequant_ttt_adapt_adamw pattern, same C3 violation of Issue #1017, same #1376 cluster ruling. Confirmed by @MatoTeziTanka static audit and @Bortlesboat AST checker independently. Thanks for the reviews.
