
Add non-record submission: 12L 24min Vocab1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ MixedBits#1495

Open
shram86 wants to merge 3 commits into openai:main from shram86:my-submission-2026-04-09-clean

Conversation

@shram86 commented Apr 9, 2026

This PR adds a non-record submission under:

records/track_non_record_16mb/2026-04-09_12L_24min_Vocab1792_FlashMuon_LinearScaleInit_XSA5LastGated_RReLU2_Int6AWQMixedBits

Final result:

  • val_bpb: 1.10768987
  • val_loss: 2.15009312
  • total submission size: 18,629,446 bytes

Submission summary:

  • 12-layer practice branch
  • 24 minutes of training
  • large-vocabulary configuration
  • Flash Muon
  • XSA enabled on the last 5 layers, with only the final XSA layer gated
  • linear-by-depth initialization for attn_scale and mlp_scale
  • RReLU2 MLP
  • late EMA plus post-train best-choice selection
  • mixed-bit int6 AWQ + lzma export
  • val-tail calibration
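The "late EMA plus post-train best-choice selection" item can be sketched as follows. This is a minimal, torch-free illustration of the general technique (maintain an exponential moving average of the weights, then keep whichever candidate scores best after training); the names `ema_update`, `pick_best`, and the toy `eval_fn` are hypothetical, not the submission's actual code.

```python
# Sketch of "late EMA + post-train best-choice selection".
# ema_update / pick_best are illustrative names, not the submission's API.

def ema_update(ema, params, decay=0.99):
    """Blend current params into the EMA copy: ema <- decay*ema + (1-decay)*params."""
    for k, v in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * v

def pick_best(candidates, eval_fn):
    """Score each candidate state dict and keep the one with the lowest score."""
    return min(candidates, key=eval_fn)

# Toy usage: a single scalar "weight", EMA started late in training.
params = {"w": 1.0}
ema = dict(params)
for step_value in [1.2, 0.8, 1.1]:   # pretend training updates
    params["w"] = step_value
    ema_update(ema, params, decay=0.9)

# Stand-in for evaluating val_bpb: distance from an ideal weight of 1.0.
best = pick_best([params, ema], eval_fn=lambda p: abs(p["w"] - 1.0))
```

In the real pipeline the candidates would be full checkpoints and `eval_fn` a validation pass; the selection logic itself is just the `min` over scored candidates shown here.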

Mixed-bit quantization details:

  • default export stays int6 for most tensors
  • selected sensitive tensors are promoted to int8 through QUANT_INT8_NAME_PATTERNS
  • default sensitive tensor list:
    • tok_emb.weight
    • lm_head.weight
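The name-pattern promotion described above reduces to a per-tensor bit-width lookup followed by compression of the quantized payload. A minimal sketch, assuming regex patterns and an lzma-compressed export as the PR describes; the actual quantization math is elided, and `bits_for`/`export` are hypothetical helper names.

```python
import lzma
import re

# Tensors whose names match these patterns are promoted from int6 to int8.
# The pattern list mirrors the defaults named in the PR description.
QUANT_INT8_NAME_PATTERNS = [r"tok_emb\.weight", r"lm_head\.weight"]

def bits_for(name):
    """Return the export bit width for the named tensor."""
    if any(re.fullmatch(p, name) for p in QUANT_INT8_NAME_PATTERNS):
        return 8      # sensitive tensors keep more precision
    return 6          # default export width

def export(tensors):
    """Tag each tensor with its chosen width, then lzma-compress the payload.

    Real code would serialize quantized integer data here; this sketch only
    records the name/width decisions to show the selection logic.
    """
    payload = b"\n".join(
        f"{name}:int{bits_for(name)}".encode() for name in sorted(tensors)
    )
    return lzma.compress(payload)
```

For example, `bits_for("tok_emb.weight")` yields 8 while a hypothetical `blocks.0.attn.qkv.weight` stays at the int6 default.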

Notes:

  • this is a practice/non-record submission intended to demonstrate that the branch continues to improve with more depth and longer training, not only under the strict challenge budget
  • the measured result reaches about 1.10 bpb in this larger and longer setting

@MatoTeziTanka

Community Review — Add non-record submission: 12L 24min Vocab1792 FlashMuon LinearScaleInit XSA5LastGated RReLU2 Int6AWQ MixedBits

Compliance: LOOKS CLEAN — pure-neural submission, no TTT/SLOT/n-gram-cache

Both train_gpt.py files in this PR (Apr 7 and Apr 9 entries by shram86) are clean.

N-gram family bug check — CLEAR
No usage of ctx_hash, full_key, primes, or any target-XOR hash lookup was found in either file. The only bitwise operators (& / ~) appear in byte-count LUT lookups for computing bits-per-byte metrics (base_bytes_lut, has_leading_space_lut, is_boundary_token_lut), which are purely metrological and involve no target leakage into any input hash key. LEGAL.
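For reference, the metrological role of a byte-count LUT is just the standard bits-per-byte normalization: per-token cross-entropy in nats is converted to bits and divided by the number of raw bytes the scored tokens cover. A sketch of that arithmetic (the function name and the toy LUT are illustrative; `base_bytes_lut` is the identifier cited above, not reproduced code):

```python
import math

def bits_per_byte(token_ids, losses_nats, base_bytes_lut):
    """val_bpb = total loss in bits / total bytes covered by the scored tokens."""
    total_bits = sum(loss / math.log(2) for loss in losses_nats)
    total_bytes = sum(base_bytes_lut[t] for t in token_ids)
    return total_bits / total_bytes

# Toy example: two tokens covering 3 and 5 bytes, each with loss ln(2) nats
# (exactly 1 bit each), so bpb = 2 bits / 8 bytes.
lut = {7: 3, 9: 5}
bpb = bits_per_byte([7, 9], [math.log(2), math.log(2)], lut)
```

Note that only token identities flow into the LUT lookup, never the targets being scored, which is why the reviewer classifies this usage as leakage-free.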

Pre-Quant TTT / val_tokens optimizer step check — CLEAR
val_tokens is used in exactly three contexts:

  1. eval_val() (line 413) — wrapped in torch.inference_mode(), no gradients, read-only scoring. LEGAL.
  2. sliding_window_evaluation() (line 475) — wrapped in torch.inference_mode(), no gradients. LEGAL.
  3. sample_validation_calibration_sequences_tail() (line 642) → feeds collect_activation_stats() (line 652) — wrapped in torch.inference_mode(), used only to collect forward-pass activation statistics for AWQ column scaling during post-training quantization. No backward() call, no optimizer.step(). LEGAL.
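Context 3 above, read-only activation statistics collected under torch.inference_mode(), follows a common calibration-loop shape. Below is a torch-free sketch of that shape (a stand-in `inference_mode` context, a stub model, and a per-channel max-abs statistic are all assumptions for illustration; real AWQ code would hook module inputs and use torch tensors):

```python
import contextlib

# Placeholder for torch.inference_mode(): the point is that the whole loop
# runs inside a no-grad context, with no backward() and no optimizer.step().
inference_mode = contextlib.nullcontext

def collect_activation_stats(model, calibration_batches):
    """Accumulate per-channel running max of |activation| over forward passes."""
    stats = None
    with inference_mode():
        for batch in calibration_batches:
            acts = model(batch)                  # forward pass only
            if stats is None:
                stats = [0.0] * len(acts)
            for channel, a in enumerate(acts):
                stats[channel] = max(stats[channel], abs(a))
    return stats

# Stub model: "activations" are the scalar batch scaled per channel.
model = lambda batch: [batch * scale for scale in (1.0, -2.0, 0.5)]
stats = collect_activation_stats(model, [1.0, -3.0, 2.0])
```

The resulting per-channel maxima would then drive AWQ column scaling at quantization time; nothing in the loop writes back into the model, which is what makes this usage of validation data read-only.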

All optimizer steps (opt.step() at lines 1485, 1588) operate exclusively on train-split data from train_loader.next_batch(). No multi-epoch AdamW loop on val_tokens exists anywhere.

Scored-region SLOT check — CLEAR
No masking + optimizing + scoring the same region was found.

Architecture
Both submissions are pure transformer variants (XSA attention on last 5 layers, RReLU2 MLP, Flash Muon optimizer, EMA + best-checkpoint selection, int6 AWQ+lzma post-training quantization). The Apr 9 entry extends this with mixed-bit quantization (int8 for embeddings/LM head via QUANT_INT8_NAME_PATTERNS). No exotic test-time adaptation of any kind.

Files reviewed: Both train_gpt.py at HEAD SHA 4eb5d777f74930c39ad9924c2df152b81c0f9f43.

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the audit — this looks like a clean pure-neural submission.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
