6L Depth Minimalism + U-Net + Sliding Window — val_bpb 1.2026 by alphastar1111 · Pull Request #1527 · openai/parameter-golf

alphastar1111 · 2026-04-10T16:09:59Z

Summary

6-layer transformer (3 encoder + 3 decoder) matches the 9L baseline with 33% fewer layers and fewer parameters
Sliding window val_bpb: 1.2026 (post-quant standard: 1.2246 vs baseline 1.2244)
Artifact size: 15.84 MB (under 16MB)
Estimated 8xH100 runtime: ~8.2 min (trained/validated on 1xA100, calibrated via baseline step_avg ratio 705.97ms/43.54ms = 16.2x)
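
The calibration arithmetic in the runtime estimate can be sketched as follows. Only the two baseline step_avg figures and the ~8.2 min estimate come from this summary; the function names are hypothetical, and the A100 wallclock printed at the end is merely what those numbers imply, not a figure reported in the PR.

```python
# Calibration sketch: scale a single-A100 wallclock to an 8xH100 estimate
# using the ratio of baseline step times measured on both setups.
A100_BASELINE_STEP_MS = 705.97  # baseline step_avg on 1xA100 (from the summary)
H100_BASELINE_STEP_MS = 43.54   # baseline step_avg on 8xH100 (from the summary)


def speedup_ratio(a100_ms: float = A100_BASELINE_STEP_MS,
                  h100_ms: float = H100_BASELINE_STEP_MS) -> float:
    """How many times faster the 8xH100 setup runs the baseline step."""
    return a100_ms / h100_ms


def estimate_h100_minutes(a100_wallclock_s: float, ratio: float) -> float:
    """Divide an A100 wallclock by the calibration ratio and convert to minutes."""
    return a100_wallclock_s / ratio / 60.0


ratio = speedup_ratio()
print(f"calibration ratio: {ratio:.1f}x")

# Inverse check: the A100 wallclock implied by the ~8.2 min estimate.
implied_a100_s = 8.2 * 60.0 * ratio
print(f"implied A100 wallclock: {implied_a100_s / 3600.0:.2f} h")
```

The estimate assumes the 6L model scales between the two setups the same way the baseline does, which is the caveat the README is said to discuss.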

Approach

Five architectural bets compensate for extreme depth reduction:

Full attention (4H/4KV, no GQA)
Untied embeddings
Half batch (262K tokens, grad_accum=2) — 2x more optimizer steps per token
Tight logit softcap (12.0 vs 30.0) — implicit regularization for shallow models
Long context (seq_len=2048) — double the baseline's 1024
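
To illustrate the logit softcap item, here is a minimal sketch that assumes the cap is applied in the tanh form common to modded-nanogpt-style training scripts, `cap * tanh(logit / cap)`. The PR text does not show the formula, so treat the function below as an assumption rather than the submission's code.

```python
import math


def softcap(logit: float, cap: float) -> float:
    """Smoothly squash a logit into [-cap, cap] (assumed tanh form)."""
    return cap * math.tanh(logit / cap)


# Compare the tight cap (12.0) with the baseline cap (30.0) on a few raw logits.
for raw in (1.0, 10.0, 50.0):
    tight = softcap(raw, 12.0)
    loose = softcap(raw, 30.0)
    print(f"raw={raw:5.1f}  cap12={tight:7.3f}  cap30={loose:7.3f}")
```

Small logits pass through almost unchanged under either cap, while large logits saturate sooner under the tighter cap. That bound on how confident the model can be is one plausible reading of the "implicit regularization" claim.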

Credits

LeakyReLU^2 (slope=0.5) — from PR #549 (Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon — val_bpb 1.1194 (3-seed mean)) by @abaybektursun
Sliding window eval (stride=64) — from @mattqlf's records/track_10min_16mb/2026-03-19_SlidingWindowEval/README.md submission
seq_len=2048 — first explored by @spokane Way in records/track_10min_16mb/2026-03-18_LongContextSeq2048/README.md
U-Net skip connections — widely adopted across leaderboard (PR #198, 11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318), and derivatives)
MLP 3x expansion — standard from PR #198 (11-Layer Int6 + WD=0.04 + SWA + FA3 (val_bpb: 1.1318)) / @raahil Shah's Int6 MLP3x submission
Muon optimizer — from the modded-nanogpt baseline
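
For readers unfamiliar with the U-Net skip pattern credited above, the routing for a 3 encoder + 3 decoder stack can be sketched in plain Python. This is an assumption-laden illustration, not the submission's code: `unet_forward`, the toy layers, and the scalar `skip_weights` are hypothetical, and real implementations operate on tensors with learned weights.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def unet_forward(x: Vector,
                 encoder: Sequence[Layer],
                 decoder: Sequence[Layer],
                 skip_weights: Sequence[float]) -> Vector:
    """Run encoder layers, then decoder layers fed by encoder outputs in LIFO order."""
    skips: List[Vector] = []
    for layer in encoder:
        x = layer(x)
        skips.append(x)  # remember each encoder output
    for layer, weight in zip(decoder, skip_weights):
        skip = skips.pop()  # the last encoder output pairs with the first decoder layer
        x = [xi + weight * si for xi, si in zip(x, skip)]
        x = layer(x)
    return x


def add(constant: float) -> Layer:
    """Toy layer that shifts every element, so the skip routing is easy to trace."""
    return lambda v: [vi + constant for vi in v]


def identity(v: Vector) -> Vector:
    return list(v)


out = unet_forward([0.0], [add(1.0), add(10.0), add(100.0)],
                   [identity, identity, identity], [1.0, 1.0, 1.0])
print(out)
```

With the toy layers, the encoder outputs are 1, 11 and 111, and the decoder consumes them in reverse order, so the deepest encoder state is mixed in first.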

Hardware details

Trained on 1xA100 80GB (not 8xH100). Calibration math and caveats about small-batch GPU inefficiency are detailed in the README. The MAX_WALLCLOCK_SECONDS=600 warmdown handles graceful cutoff if the timing estimate is off.
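
One way a wallclock-driven warmdown could behave is sketched below. The training script's actual schedule is not shown in this PR text, so everything here is hypothetical: the function names, the linear decay shape, and the `warmdown_frac` parameter are assumptions chosen for illustration. Only the 600-second budget comes from the description.

```python
def lr_scale(elapsed_s: float,
             max_wallclock_s: float = 600.0,
             warmdown_frac: float = 0.2) -> float:
    """Learning-rate multiplier: 1.0 until warmdown starts, then linear decay to 0.0.

    warmdown_frac is a hypothetical knob: the share of the budget spent decaying.
    """
    if not 0.0 < warmdown_frac <= 1.0:
        raise ValueError("warmdown_frac must be in (0, 1]")
    warmdown_start = max_wallclock_s * (1.0 - warmdown_frac)
    if elapsed_s <= warmdown_start:
        return 1.0
    remaining = max(max_wallclock_s - elapsed_s, 0.0)
    return remaining / (max_wallclock_s - warmdown_start)


def should_stop(elapsed_s: float, max_wallclock_s: float = 600.0) -> bool:
    """Stop training once the wallclock budget is spent."""
    return elapsed_s >= max_wallclock_s


for t in (0.0, 480.0, 540.0, 600.0):
    print(t, lr_scale(t), should_stop(t))
```

Keying the schedule to elapsed time rather than step count means a slower-than-estimated run still ends with a decayed learning rate instead of being cut off mid-schedule.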

Test plan

Full 20K step training run on 1xA100
Int8+zlib roundtrip validation passes
Sliding window eval (stride=64) completes
Artifact under 16MB (15,841,206 bytes)
Code under 1500 lines (1,325 lines)
Verify on 8xH100 (extrapolated from A100 calibration)
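
The roundtrip and size checks in this plan can be illustrated with a standard-library sketch. It assumes simple per-tensor symmetric int8 quantization and a decimal 16,000,000-byte cap; the submission's actual quantization scheme, file layout, and cap definition are not shown here, and all helper names are hypothetical.

```python
import struct
import zlib

HEADER = "<dI"  # float64 scale followed by a uint32 element count


def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale


def pack_artifact(values) -> bytes:
    """Quantize, serialize, and zlib-compress a flat list of weights."""
    q, scale = quantize_int8(values)
    raw = struct.pack(HEADER, scale, len(q)) + struct.pack(f"<{len(q)}b", *q)
    return zlib.compress(raw, 9)


def unpack_artifact(blob: bytes):
    """Invert pack_artifact: decompress, deserialize, and dequantize."""
    raw = zlib.decompress(blob)
    scale, n = struct.unpack_from(HEADER, raw, 0)
    q = struct.unpack_from(f"<{n}b", raw, struct.calcsize(HEADER))
    return [qi * scale for qi in q], scale


def fits_budget(n_bytes: int, cap_bytes: int = 16_000_000) -> bool:
    """Assumed cap: 16 MB counted as 16,000,000 bytes."""
    return n_bytes < cap_bytes


weights = [0.5, -1.27, 0.0, 0.33]
restored, scale = unpack_artifact(pack_artifact(weights))
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(max_err <= scale / 2 + 1e-12, fits_budget(15_841_206))
```

A roundtrip "passes" in this sketch when every restored weight lies within half a quantization step of the original, which is the best that rounding to the nearest int8 level can guarantee.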

MatoTeziTanka · 2026-04-11T20:11:49Z

Community Review — 6L Depth Minimalism + U-Net + Sliding Window — val_bpb 1.2026

BPB: 1.2026 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 27c66e4125ee, file records/track_10min_16mb/2026-04-10_6L_DepthMinimalism_UNet_SlidingWindow/train_gpt_26e6b4a.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
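
As background for the "standard sliding-window stride-64 pattern", the sketch below plans evaluation windows so that every token is scored exactly once, and converts a summed loss to bits per byte. It is an illustration under assumptions, not the submission's eval code: the function names are hypothetical, and the first window is assumed to score all of its tokens.

```python
import math


def sliding_windows(n_tokens: int, seq_len: int = 2048, stride: int = 64):
    """Yield (start, end, score_from) triples.

    The model sees tokens[start:end]; only positions score_from..end-1 count
    toward the loss, so later windows score just their newest tokens.
    """
    if n_tokens <= 0:
        return
    end = min(seq_len, n_tokens)
    yield 0, end, 0  # first window: every position is scored
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        yield max(new_end - seq_len, 0), new_end, end
        end = new_end


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


for window in sliding_windows(10, seq_len=4, stride=2):
    print(window)
```

After the first window, each scored token is predicted with at least `seq_len - stride` tokens of preceding context, which is why sliding-window numbers tend to be better than scores from non-overlapping chunks.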

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=6, vocab=1024, code=56975 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=6, vocab=1024, code=56975 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
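
A name-based AST scan of the kind described in this review might look roughly like the sketch below. The real classify_prs.py and its pattern bank are not shown in this thread, so the labels, regular expressions, and function name here are hypothetical. Only Python's standard `ast` and `re` modules are used.

```python
import ast
import re

# Hypothetical pattern bank: label -> regex matched against def/class names.
PATTERN_BANK = {
    "ttt": re.compile(r"ttt|test_time", re.IGNORECASE),
    "slot": re.compile(r"slot", re.IGNORECASE),
    "ngram_cache": re.compile(r"n_?gram", re.IGNORECASE),
}


def classify_source(source: str) -> dict:
    """Return, per label, the function and class names that match its pattern."""
    tree = ast.parse(source)
    hits = {label: [] for label in PATTERN_BANK}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            for label, pattern in PATTERN_BANK.items():
                if pattern.search(node.name):
                    hits[label].append(node.name)
    return hits


clean = "def forward(x):\n    return x\n"
flagged = "class NGramCache:\n    pass\n\ndef ttt_adapt(model):\n    pass\n"
print(classify_source(clean))
print(classify_source(flagged))
```

Because the scan keys on names alone, a mechanism hidden behind an innocuous function name or a separate helper file would pass unnoticed, which is the limitation the auto-classification caveat flags.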

alphastar1111 added 2 commits April 10, 2026 14:15

MAINT: Initial changes for parameter-golf 6L model

74e6451

MAINT: Version changes

27c66e4

alphastar1111 changed the title ~~Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2025~~ Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2026 Apr 11, 2026

alphastar1111 changed the title ~~Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2026~~ 6L Depth Minimalism + U-Net + Sliding Window — val_bpb 1.2026 Apr 11, 2026

alphastar1111 wants to merge 2 commits into openai:main from alphastar1111:alphastar1111/record_6L_depthMinimalism_UNet_slidingWindow
