
6L Depth Minimalism + U-Net + Sliding Window — val_bpb 1.2026 #1527

Open

alphastar1111 wants to merge 2 commits into openai:main from
alphastar1111:alphastar1111/record_6L_depthMinimalism_UNet_slidingWindow

Conversation


@alphastar1111 alphastar1111 commented Apr 10, 2026

Summary

  • 6-layer transformer (3 encoder + 3 decoder) matches the 9L baseline with 33% fewer layers and fewer
    parameters
  • Sliding window val_bpb: 1.2026 (post-quant standard: 1.2246 vs baseline 1.2244)
  • Artifact size: 15.84 MB (under 16MB)
  • Estimated 8xH100 runtime: ~8.2 min (trained/validated on 1xA100, calibrated via baseline step_avg ratio
    705.97ms/43.54ms = 16.2x)
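The runtime calibration above is simple arithmetic; a sketch (the two step_avg values are quoted from the summary, while measured_a100_min is a hypothetical placeholder for this run's own 1xA100 wall-clock time, which the PR does not state directly):

```python
# Calibration sketch: estimate 8xH100 runtime from a 1xA100 run using the
# baseline's step-time ratio on the two machines.
a100_step_ms = 705.97   # baseline step_avg on 1xA100 (quoted in the PR)
h100_step_ms = 43.54    # baseline step_avg on 8xH100 (quoted in the PR)
speedup = a100_step_ms / h100_step_ms   # ~16.2x

measured_a100_min = 133.0   # HYPOTHETICAL: this run's 1xA100 wall-clock minutes
est_h100_min = measured_a100_min / speedup
print(f"speedup {speedup:.1f}x -> estimated 8xH100 runtime {est_h100_min:.1f} min")
```

The caveat flagged in the hardware section applies: small-batch runs underutilize a single A100, so the ratio measured on the baseline may not transfer exactly.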

Approach

Five architectural bets compensate for extreme depth reduction:

  1. Full attention (4H/4KV, no GQA)
  2. Untied embeddings
  3. Half batch (262K tokens, grad_accum=2) — twice as many optimizer steps for the same token budget
  4. Tight logit softcap (12.0 vs 30.0) — implicit regularization for shallow models
  5. Long context (seq_len=2048) — double the baseline's 1024
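Bet 4 refers to tanh-based logit softcapping; a minimal sketch, assuming the usual `cap * tanh(logits / cap)` form (the cap values are from the list above, the function name is illustrative):

```python
import math

def softcap(logit: float, cap: float = 12.0) -> float:
    # Smoothly squashes a logit into (-cap, cap): near-linear for small
    # values, saturating toward +/-cap for large ones. The tighter cap
    # (12.0 vs the baseline's 30.0) acts as implicit regularization by
    # limiting how confident the shallow model's predictions can get.
    return cap * math.tanh(logit / cap)
```

In the training script this would be applied elementwise to the final logits tensor before the cross-entropy loss.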

Credits

Hardware details

Trained on 1xA100 80GB (not 8xH100). The calibration math and caveats about small-batch GPU inefficiency are detailed in the README. A MAX_WALLCLOCK_SECONDS=600 warmdown handles a graceful cutoff if the timing estimate is off.
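The warmdown cutoff could look roughly like this (a sketch; the function and variable names are illustrative, not the submission's actual code):

```python
import time

MAX_WALLCLOCK_SECONDS = 600

def should_stop(start_time: float, step: int, total_steps: int,
                budget: float = MAX_WALLCLOCK_SECONDS) -> bool:
    # Graceful cutoff: end training early when the wall-clock budget is
    # spent, so the final checkpoint and eval still fit inside the limit
    # even if the A100->H100 timing extrapolation was optimistic.
    return step >= total_steps or (time.time() - start_time) >= budget

# In the training loop this would gate each step, e.g.:
#   while not should_stop(run_start, step, total_steps):
#       step = train_step(step)
```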

Test plan

  • Full 20K step training run on 1xA100
  • Int8+zlib roundtrip validation passes
  • Sliding window eval (stride=64) completes
  • Artifact under 16MB (15,841,206 bytes)
  • Code under 1500 lines (1,325 lines)
  • Verify on 8xH100 (extrapolated from A100 calibration)
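The int8+zlib roundtrip check in the test plan can be sketched as follows (symmetric per-tensor quantization is an assumption about the scheme; the submission's actual packing code may differ):

```python
import zlib
import numpy as np

def pack(weights: np.ndarray) -> tuple[bytes, float]:
    # Symmetric per-tensor int8 quantization, then zlib compression.
    scale = float(np.abs(weights).max()) / 127.0 or 1.0
    q = np.round(weights / scale).astype(np.int8)
    return zlib.compress(q.tobytes(), level=9), scale

def unpack(blob: bytes, scale: float, shape: tuple) -> np.ndarray:
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

# Roundtrip validation: dequantized weights differ from the originals by
# at most one quantization step.
np.random.seed(0)
w = np.random.randn(64, 64).astype(np.float32)
blob, scale = pack(w)
w2 = unpack(blob, scale, w.shape)
assert np.max(np.abs(w - w2)) <= scale
```

The artifact-size check then amounts to summing the compressed blob sizes across all tensors and asserting the total stays under the 16MB cap.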

@alphastar1111 alphastar1111 changed the title Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2025 Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2026 Apr 11, 2026
@alphastar1111 alphastar1111 changed the title Record: 6L depth minimalism U-Net sliding window - val_bpb 1.2026 6L Depth Minimalism + U-Net + Sliding Window — val_bpb 1.2026 Apr 11, 2026
@MatoTeziTanka

Community Review — 6L Depth Minimalism + U-Net + Sliding Window — val_bpb 1.2026

BPB: 1.2026 | Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

What I found in the code (head SHA 27c66e4125ee, file records/track_10min_16mb/2026-04-10_6L_DepthMinimalism_UNet_SlidingWindow/train_gpt_26e6b4a.py):

Static code review found no TTT adaptation function, no SLOT optimization loop, no n-gram-cache class, and no pre-quant val-token fine-tune. The eval path uses the standard sliding-window stride-64 pattern. The submission is a pure-neural architecture iteration on the standard SP1024/SP4096/SP8192 baseline.
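The stride-64 sliding-window pattern referenced here, in rough sketch form (window and stride values are from the PR; the scoring interface and edge handling are assumptions):

```python
def sliding_window_nll(score_fn, tokens, window=2048, stride=64):
    # score_fn(ctx) returns per-token negative log-likelihoods (nats) for
    # ctx. Each window after the first only counts its last `stride`
    # tokens, so every counted token is scored with up to window - stride
    # tokens of left context, at the cost of ~window/stride forward
    # passes per window of text. Divide the result by log(2) and by
    # bytes-per-token to get bits per byte (bpb).
    total_nll, total_tokens = 0.0, 0
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        nlls = score_fn(tokens[start:start + window])
        new = nlls if start == 0 else nlls[-stride:]  # only fresh tokens
        total_nll += sum(new)
        total_tokens += len(new)
    return total_nll / total_tokens

# Toy check: a scorer charging 1 nat/token gives a mean of exactly 1.0.
flat = lambda ctx: [1.0] * len(ctx)
avg = sliding_window_nll(flat, list(range(10)), window=4, stride=2)
```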

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.03s, dim=512, layers=6, vocab=1024, code=56975 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the classification pass — this looks like a clean pure-neural iteration on the standard baseline.

Auto-classification caveat: this review was drafted by the AST-based classifier. If there's a non-standard eval mechanism (logit postprocessing, hedge mixing, etc.) that I missed because it's factored into a helper file or a non-standard function name, please flag it and I'll re-run the audit manually.


Reviewed by @MatoTeziTanka, The Agora. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.
