
Commit 44d290d

Octavianclaude and claude committed
Add 3 SOTA improvement experiments: MTP, SwiGLU, Vocab1536
exp_a: Multi-Token Prediction (MTP_NUM_HEADS=2, excluded from export)
exp_b: SwiGLU MLP replacing ReLU² (hidden=1024, same param count)
exp_c: Vocab 1536 tokenizer for better bytes-per-token ratio

All based on PR #254 SOTA clone (1.1303 BPB). Priority: exp_c first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 4e4cc7f commit 44d290d

18 files changed

Lines changed: 5795 additions & 0 deletions
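
Of the three experiments, exp_b is the only architectural swap: the ReLU² MLP (two 512×1536 matrices) is replaced by a SwiGLU MLP with hidden=1024, which leaves the parameter count unchanged since 3·512·1024 = 2·512·1536 = 1,572,864. A minimal PyTorch sketch of such a drop-in module follows; the class name and bias-free layout are assumptions, not the actual exp_b code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU MLP: down(silu(gate(x)) * up(x)). At dim=512, hidden=1024 the
    three 512x1024 matrices total 1,572,864 params, exactly matching the
    ReLU² MLP's two 512x1536 matrices."""
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```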

exp_a/README.md

Lines changed: 72 additions & 0 deletions
# FarnsworthEngine v1: TTT + 11L Int6 MLP3x

**Author:** Farnsworth Tech
**Date:** 2026-03-20
**Score:** val_bpb = 1.1303 (seed 1337); 3-seed mean 1.1313 (see Results)

## Summary

FarnsworthEngine stacks **Test-Time Training (TTT)** on top of an optimized 11-layer MLP3x Int6 architecture. TTT adapts all model weights to the validation distribution via full-weight SGD before scoring, providing a consistent ~0.02 BPB improvement on top of sliding window evaluation.

## Architecture & Techniques

| Component | Details |
|-----------|---------|
| **Layers** | 11 transformer layers, 512 dim, 8 heads, 4 KV heads (GQA) |
| **MLP** | 3x expansion (hidden=1536), ReLU² activation |
| **Quantization** | Int6 mixed precision (MLP + attention), Int8 (embeddings), FP16 tied embeddings |
| **Compression** | zstd-22, artifact 15.88 MB |
| **SmearGate** | Learned sigmoid token-blending gate (~512 params; sketched below) |
| **BigramHash** | 2048-bucket hash embedding for token-pair features (dim 128; sketched below) |
| **Initialization** | Orthogonal + muP (maximal update parameterization) |
| **Optimizer** | Muon (WD=0.04, momentum=0.99, warmup 1500 steps, warmdown 3000) |
| **SWA** | Stochastic Weight Averaging, 7-checkpoint average during warmdown |
| **Attention** | FlashAttention 3 (Hopper native kernel) |
| **Position** | NTK-RoPE (base=50000) for long-context extrapolation |
| **Sequence** | Train @ 2048, eval @ 2048 |
| **TTT** | Full-weight SGD adaptation on val data (lr=0.002, momentum=0.9, 3 epochs) |
| **Eval** | Sliding window, stride=64, with TTT-adapted weights |
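
SmearGate and BigramHash are the two nonstandard components in the table above. A minimal sketch of both, assuming the shapes listed (512-dim gate, 2048 buckets, dim-128 features); the module names match the table, but the exact wiring and hash function are assumptions:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    """Blend each token's embedding with its predecessor via a learned
    per-channel sigmoid gate (~512 params at dim=512)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim); shift right so position t sees position t-1
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0, :] = 0.0  # the first token has no predecessor
        return x + torch.sigmoid(self.gate) * prev

class BigramHash(nn.Module):
    """Hash each (previous, current) token pair into a fixed number of
    buckets and look up a learned feature vector for the pair."""
    def __init__(self, num_buckets: int = 2048, dim: int = 128):
        super().__init__()
        self.num_buckets = num_buckets
        self.emb = nn.Embedding(num_buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) integer ids
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0
        # simple multiplicative mixing; the real hash is an assumption
        bucket = (prev * 1000003 + tokens) % self.num_buckets
        return self.emb(bucket)
```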

## TTT: Test-Time Training

The key innovation is adapting model weights to the validation distribution before scoring:

1. **TTT Adaptation (~43s on 8xH100):** SGD with momentum over val data, 3 epochs, freezing the first 2 blocks for stability (sketched below)
2. **Sliding Window Scoring (~86s on 8xH100):** Standard stride-64 eval using the adapted weights (sketched after the Timing Budget table)

TTT is effectively adaptive compression: much as Lempel-Ziv adapts to the stream it compresses, the model learns the test distribution online before being evaluated on it.
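
A minimal sketch of the adaptation step, assuming a standard PyTorch loop; `model.blocks` and a loss-returning forward are assumptions, and only the lr, momentum, epoch count, and 2 frozen blocks come from this README:

```python
import torch

def ttt_adapt(model, val_loader, lr=0.002, momentum=0.9, epochs=3, n_freeze=2):
    """Adapt all unfrozen weights to the validation stream with plain SGD
    before scoring (hypothetical helper)."""
    for block in model.blocks[:n_freeze]:  # freeze the first blocks for stability
        for p in block.parameters():
            p.requires_grad_(False)
    opt = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad),
        lr=lr, momentum=momentum,
    )
    model.train()
    for _ in range(epochs):
        for inputs, targets in val_loader:
            loss = model(inputs, targets)  # assumed to return the LM loss
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
    model.eval()
    return model
```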

## Results

| Seed | Steps | Step Avg | Pre-TTT BPB | Post-TTT BPB | Sliding BPB |
|------|-------|----------|-------------|--------------|-------------|
| 1337 | 7,248 | 81.5ms | 1.1447 | 1.1528 | **1.1303** |
| 42 | 7,248 | 81.6ms | 1.1449 | 1.1535 | **1.1312** |
| 7 | 7,353 | 81.6ms | 1.1453 | 1.1547 | **1.1323** |
| **Mean** | | | | | **1.1313** |

- Artifact size: 15,700,261 bytes (under the 16,000,000-byte limit)
- Training time: 600s (wallclock cap)
- Eval time: ~129s (43s TTT + 86s sliding window)

## Reproduction

```bash
SEED=1337 NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=2048 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3000 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_MOMENTUM=0.9 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Timing Budget

| Phase | Time | Budget |
|-------|------|--------|
| Training | 600s | 600s |
| TTT | 43s | |
| Sliding eval | 86s | |
| **Total eval** | **129s** | **600s** |
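
For reference, a minimal sketch of the stride-64 sliding-window scoring: each overlapping window contributes only its newest targets, so every token after the first window is predicted with near-full left context. The helper name and the logits-returning `model(...)` interface are assumptions; note that BPB is bits per token divided by the dataset's bytes-per-token ratio, which is the quantity exp_c's larger vocab aims to improve.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bits_per_token(model, tokens, window=2048, stride=64):
    """Score a long 1-D token stream with overlapping windows. Returns bits
    per token; divide by bytes-per-token to get BPB (hypothetical helper)."""
    n = tokens.numel()
    nll_sum, count, scored_until = 0.0, 0, 1  # token 0 is never a target
    for start in range(0, n - 1, stride):
        chunk = tokens[start : start + window + 1]
        inputs, targets = chunk[:-1], chunk[1:].clone()
        already = scored_until - start - 1  # targets covered by earlier windows
        if already > 0:
            targets[:already] = -100        # ignored by cross_entropy
        logits = model(inputs.unsqueeze(0)).squeeze(0)  # (T, vocab) assumed
        nll_sum += F.cross_entropy(
            logits, targets, ignore_index=-100, reduction="sum"
        ).item()
        count += (targets != -100).sum().item()
        scored_until = start + 1 + targets.numel()
        if scored_until >= n:
            break  # the stream is fully covered
    return nll_sum / count / math.log(2)  # nats per token -> bits per token
```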

exp_a/run.sh

Lines changed: 52 additions & 0 deletions
```bash
#!/usr/bin/env bash
set -euo pipefail

# EXP A: Multi-Token Prediction (MTP)
# Same SOTA base but with MTP_NUM_HEADS=2 during training.
# MTP heads are excluded from export → zero artifact size cost.
# Hypothesis: auxiliary future-token prediction loss improves internal representations.

LOGDIR="logs/exp_a_mtp_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$LOGDIR"

echo "============================================"
echo " EXP A: MTP-2 heads on SOTA 254 base"
echo " Logs: $LOGDIR"
echo "============================================"

SEED="${SEED:-1337}" \
NUM_LAYERS=11 \
BIGRAM_VOCAB_SIZE=2048 \
MUON_WD=0.04 \
ADAM_WD=0.04 \
MATRIX_LR=0.025 \
SCALAR_LR=0.025 \
TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 \
MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 \
WARMDOWN_ITERS=3000 \
ITERATIONS=9000 \
MAX_WALLCLOCK_SECONDS=600 \
EVAL_STRIDE=64 \
TTT_ENABLED=1 \
TTT_LR=0.002 \
TTT_EPOCHS=3 \
TTT_MOMENTUM=0.9 \
MTP_NUM_HEADS=2 \
MTP_LOSS_WEIGHT=0.15 \
NCCL_IB_DISABLE=1 \
RUN_ID="exp_a_mtp_s${SEED:-1337}" \
torchrun --standalone --nproc_per_node="${NPROC:-8}" \
  exp_a/train_gpt.py \
  2>&1 | tee "$LOGDIR/run_s${SEED:-1337}.log"

echo ""
echo "============================================"
echo " EXP A Complete."
echo "============================================"
f="$LOGDIR/run_s${SEED:-1337}.log"
for label in int6_roundtrip int6_sliding_window; do
  bpb=$(grep -oP "final_${label}\S* val_loss:\S+ val_bpb:\K\S+" "$f" 2>/dev/null | tail -1)
  [ -n "$bpb" ] && echo " ${label}: $bpb" || true
done
```
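
The MTP heads this script enables are auxiliary output projections trained to predict tokens further ahead than the usual next-token target, then dropped at export. A minimal sketch under assumptions: the class name, the head-to-offset convention, and the `vocab_size` wiring are illustrative, not the actual exp_a code; only MTP_NUM_HEADS=2 and MTP_LOSS_WEIGHT=0.15 come from the script.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Auxiliary linear heads predicting tokens beyond the next-token target.
    Training-only: excluded from the exported artifact, so zero size cost."""
    def __init__(self, dim: int, vocab_size: int, num_heads: int = 2):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(dim, vocab_size, bias=False) for _ in range(num_heads)
        )

    def forward(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, dim) final-layer states; tokens: (B, T) input ids.
        # Head i predicts the token (i + 2) positions ahead; the main LM head
        # already covers offset 1 (offset convention is an assumption).
        loss = hidden.new_zeros(())
        for i, head in enumerate(self.heads):
            offset = i + 2
            logits = head(hidden[:, :-offset, :])  # (B, T-offset, vocab)
            targets = tokens[:, offset:]           # (B, T-offset)
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return loss / len(self.heads)
```

The total training objective would then be `lm_loss + MTP_LOSS_WEIGHT * mtp_loss` with the 0.15 weight set above.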

exp_a/run_sota254.sh

Lines changed: 54 additions & 0 deletions
```bash
#!/usr/bin/env bash
set -euo pipefail

# EXACT CLONE of PR #254 — Current best pending SOTA (1.1313 BPB)
# 11L Int6 MLP3x + SmearGate + BigramHash + TTT SGD 3 epochs
# Just run it. No modifications.

LOGDIR="logs/sota254_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$LOGDIR"

echo "============================================"
echo " PR #254 EXACT CLONE — 1.1313 BPB target"
echo " 11L + TTT + SmearGate + BigramHash"
echo " Logs: $LOGDIR"
echo "============================================"

SEED="${SEED:-1337}" \
NUM_LAYERS=11 \
BIGRAM_VOCAB_SIZE=2048 \
MUON_WD=0.04 \
ADAM_WD=0.04 \
MATRIX_LR=0.025 \
SCALAR_LR=0.025 \
TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 \
MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 \
WARMDOWN_ITERS=3000 \
ITERATIONS=9000 \
MAX_WALLCLOCK_SECONDS=600 \
EVAL_STRIDE=64 \
TTT_ENABLED=1 \
TTT_LR=0.002 \
TTT_EPOCHS=3 \
TTT_MOMENTUM=0.9 \
NCCL_IB_DISABLE=1 \
RUN_ID="sota254_s${SEED:-1337}" \
torchrun --standalone --nproc_per_node="${NPROC:-8}" \
  sota254/train_gpt.py \
  2>&1 | tee "$LOGDIR/run_s${SEED:-1337}.log"

echo ""
echo "============================================"
echo " PR #254 Clone Complete."
echo "============================================"
echo " Target: 1.1313 BPB (3-seed mean)"
f="$LOGDIR/run_s${SEED:-1337}.log"
for label in ttt_sliding sliding_window int8_zlib_roundtrip; do
  bpb=$(grep -oP "final_${label}\S* val_loss:\S+ val_bpb:\K\S+" "$f" 2>/dev/null | tail -1)
  [ -n "$bpb" ] && echo " ${label}: $bpb" || true
done
steps=$(grep -oP 'stopping_early.*step:\K\d+' "$f" 2>/dev/null | tail -1)
size=$(grep -oP 'Total submission size\S*: \K\d+' "$f" 2>/dev/null | tail -1)
echo " steps=${steps:-N/A} bytes=${size:-N/A}"
```

exp_a/submission.json

Lines changed: 11 additions & 0 deletions
```json
{
  "author": "Farnsworth Tech",
  "github_id": "timowhite88",
  "name": "FarnsworthEngine v1: TTT + 11L Int6 MLP3x",
  "blurb": "Test-Time Training (full-weight SGD on val data) stacked on 11L MLP3x Int6 with SmearGate, BigramHash, OrthoInit, Muon WD=0.04, SWA, FA3, NTK-RoPE, FP16 tied embeddings, sliding window eval stride=64.",
  "date": "2026-03-20",
  "val_loss": 1.90846763,
  "val_bpb": 1.13030502,
  "bytes_total": 15877181,
  "bytes_code": 68212
}
```
