
Add non-record submission: 12L 24min Vocab1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ MixedBits#1495

Open
shram86 wants to merge 3 commits into openai:main from shram86:my-submission-2026-04-09-clean

Conversation

@shram86 commented Apr 9, 2026

This PR adds a non-record submission under:

records/track_non_record_16mb/2026-04-09_12L_24min_Vocab1792_FlashMuon_LinearScaleInit_XSA5LastGated_RReLU2_Int6AWQMixedBits

Final result:

  • val_bpb: 1.10768987
  • val_loss: 2.15009312
  • total submission size: 18,629,446 bytes

Submission summary:

  • 12-layer practice branch
  • 24 minutes of training
  • large-vocabulary configuration
  • Flash Muon
  • XSA enabled on the last 5 layers, with only the final XSA layer gated
  • linear-by-depth initialization for attn_scale and mlp_scale
  • RReLU2 MLP
  • late EMA plus post-train best-choice selection
  • mixed-bit int6 AWQ + lzma export
  • val-tail calibration
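The "late EMA plus post-train best-choice selection" item can be sketched as follows. This is a minimal, torch-free illustration of the general technique (maintain an exponential moving average of the weights, then keep whichever candidate scores best after training); the names `ema_update`, `pick_best`, and the toy `eval_fn` are hypothetical, not the submission's actual code.

```python
# Sketch of "late EMA + post-train best-choice selection".
# ema_update / pick_best are illustrative names, not the submission's API.

def ema_update(ema, params, decay=0.99):
    """Blend current params into the EMA copy: ema <- decay*ema + (1-decay)*params."""
    for k, v in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * v

def pick_best(candidates, eval_fn):
    """Score each candidate state dict and keep the one with the lowest score."""
    return min(candidates, key=eval_fn)

# Toy usage: a single scalar "weight", EMA started late in training.
params = {"w": 1.0}
ema = dict(params)
for step_value in [1.2, 0.8, 1.1]:   # pretend training updates
    params["w"] = step_value
    ema_update(ema, params, decay=0.9)

# Stand-in for evaluating val_bpb: distance from an ideal weight of 1.0.
best = pick_best([params, ema], eval_fn=lambda p: abs(p["w"] - 1.0))
```

In the real pipeline the candidates would be full checkpoints and `eval_fn` a validation pass; the selection logic itself is just the `min` over scored candidates shown here.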

Mixed-bit quantization details:

  • default export stays int6 for most tensors
  • selected sensitive tensors are promoted to int8 through QUANT_INT8_NAME_PATTERNS
  • default sensitive tensor list:
    • tok_emb.weight
    • lm_head.weight
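The name-pattern promotion described above reduces to a per-tensor bit-width lookup followed by compression of the quantized payload. A minimal sketch, assuming regex patterns and an lzma-compressed export as the PR describes; the actual quantization math is elided, and `bits_for`/`export` are hypothetical helper names.

```python
import lzma
import re

# Tensors whose names match these patterns are promoted from int6 to int8.
# The pattern list mirrors the defaults named in the PR description.
QUANT_INT8_NAME_PATTERNS = [r"tok_emb\.weight", r"lm_head\.weight"]

def bits_for(name):
    """Return the export bit width for the named tensor."""
    if any(re.fullmatch(p, name) for p in QUANT_INT8_NAME_PATTERNS):
        return 8      # sensitive tensors keep more precision
    return 6          # default export width

def export(tensors):
    """Tag each tensor with its chosen width, then lzma-compress the payload.

    Real code would serialize quantized integer data here; this sketch only
    records the name/width decisions to show the selection logic.
    """
    payload = b"\n".join(
        f"{name}:int{bits_for(name)}".encode() for name in sorted(tensors)
    )
    return lzma.compress(payload)
```

For example, `bits_for("tok_emb.weight")` yields 8 while a hypothetical `blocks.0.attn.qkv.weight` stays at the int6 default.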

Notes:

  • this is a practice/non-record submission intended to demonstrate that the branch continues to improve with more depth and longer training, not only under the strict challenge budget
  • the measured result reaches about 1.10 bpb in this larger and longer setting

@MatoTeziTanka

Community Review — Add non-record submission: 12L 24min Vocab1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ MixedBits

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Both train_gpt.py files in this PR (Apr 7 and Apr 9 entries by shram86) are clean.

N-gram family bug check — CLEAR
No usage of ctx_hash, full_key, primes, or any target-XOR hash lookup was found in either file. The only bitwise operators (& / ~) appear in byte-count LUT lookups for computing bits-per-byte metrics (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut), which are purely metrological and involve no target leakage into any input hash key. LEGAL.
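For reference, the metrological role of a byte-count LUT is just the standard bits-per-byte normalization: per-token cross-entropy in nats is converted to bits and divided by the number of raw bytes the scored tokens cover. A sketch of that arithmetic (the function name and the toy LUT are illustrative; `base_bytes_lut` is the identifier cited above, not reproduced code):

```python
import math

def bits_per_byte(token_ids, losses_nats, base_bytes_lut):
    """val_bpb = total loss in bits / total bytes covered by the scored tokens."""
    total_bits = sum(loss / math.log(2) for loss in losses_nats)
    total_bytes = sum(base_bytes_lut[t] for t in token_ids)
    return total_bits / total_bytes

# Toy example: two tokens covering 3 and 5 bytes, each with loss ln(2) nats
# (exactly 1 bit each), so bpb = 2 bits / 8 bytes.
lut = {7: 3, 9: 5}
bpb = bits_per_byte([7, 9], [math.log(2), math.log(2)], lut)
```

Note that only token identities flow into the LUT lookup, never the targets being scored, which is why the reviewer classifies this usage as leakage-free.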

Pre-Quant TTT / val_tokens optimizer step check — CLEAR
val_tokens is used in exactly three contexts:

  1. eval_val() (line 413) — wrapped in torch.inference_mode(), no gradients, read-only scoring. LEGAL.
  2. sliding_window_evaluation() (line 475) — wrapped in torch.inference_mode(), no gradients. LEGAL.
  3. sample_validation_calibration_sequences_tail() (line 642) → feeds collect_activation_stats() (line 652) — wrapped in torch.inference_mode(), used only to collect forward-pass activation statistics for AWQ column scaling during post-training quantization. No backward() call, no optimizer.step(). LEGAL.
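Context 3 above, read-only activation statistics collected under torch.inference_mode(), follows a common calibration-loop shape. Below is a torch-free sketch of that shape (a stand-in `inference_mode` context, a stub model, and a per-channel max-abs statistic are all assumptions for illustration; real AWQ code would hook module inputs and use torch tensors):

```python
import contextlib

# Placeholder for torch.inference_mode(): the point is that the whole loop
# runs inside a no-grad context, with no backward() and no optimizer.step().
inference_mode = contextlib.nullcontext

def collect_activation_stats(model, calibration_batches):
    """Accumulate per-channel running max of |activation| over forward passes."""
    stats = None
    with inference_mode():
        for batch in calibration_batches:
            acts = model(batch)                  # forward pass only
            if stats is None:
                stats = [0.0] * len(acts)
            for channel, a in enumerate(acts):
                stats[channel] = max(stats[channel], abs(a))
    return stats

# Stub model: "activations" are the scalar batch scaled per channel.
model = lambda batch: [batch * scale for scale in (1.0, -2.0, 0.5)]
stats = collect_activation_stats(model, [1.0, -3.0, 2.0])
```

The resulting per-channel maxima would then drive AWQ column scaling at quantization time; nothing in the loop writes back into the model, which is what makes this usage of validation data read-only.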

All optimizer steps (opt.step() at lines 1485, 1588) operate exclusively on train-split data from train_loader.next_batch(). No multi-epoch AdamW loop on val_tokens exists anywhere.

Scored-region SLOT check — CLEAR
No masking + optimizing + scoring the same region was found.

Architecture
Both submissions are pure transformer variants (XSA attention on last 5 layers, RReLU2 MLP, Flash Muon optimizer, EMA + best-checkpoint selection, int6 AWQ+lzma post-training quantization). The Apr 9 entry extends this with mixed-bit quantization (int8 for embeddings/LM head via QUANT_INT8_NAME_PATTERNS). No exotic test-time adaptation of any kind.

Files reviewed: Both train_gpt.py at HEAD SHA 4eb5d777f74930c39ad9924c2df152b81c0f9f43.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
