Non-Record: Polar Express Muon negative result (1.0805 BPB, +0.0004 vs standard NS5)#1516

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:polar-express-nonrecord

@dexhunter
Contributor

Non-Record Submission: Polar Express Muon (Negative Result)

val_bpb: 1.08049 (single seed s42) | 16.00 MB | 8×H100 SXM

This is a non-record submission documenting a negative result for Polar Express Muon, an alternative Newton-Schulz orthogonalization variant in the Muon optimizer.

What is Polar Express Muon?

Standard Muon uses a 5-step Newton-Schulz iteration to approximate the matrix polar decomposition for preconditioning. Polar Express replaces this with an alternative spectral approximation that has different convergence properties.
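
For readers unfamiliar with the baseline, here is a minimal NumPy sketch of a Newton-Schulz iteration that approximates the polar factor of a gradient matrix. It is an illustration, not the code in this PR: the function name `newton_schulz` and the `eps` parameter are hypothetical, and the quintic coefficients are the ones I recall from the public Muon reference implementation, which may differ from what this stack uses. It does not attempt to reproduce Polar Express, whose details are not spelled out here.

```python
import numpy as np

# Quintic coefficients recalled from the public Muon reference implementation
# (assumption: not confirmed to be the exact values used in this PR).
MUON_NS5_COEFFS = (3.4445, -4.7750, 2.0315)

# Classical cubic Newton-Schulz: X <- 1.5 X - 0.5 X X^T X.
# Converges to the exact polar factor when all singular values lie in (0, sqrt(3)).
CUBIC_COEFFS = (1.5, -0.5, 0.0)


def newton_schulz(G, coeffs=MUON_NS5_COEFFS, steps=5, eps=1e-7):
    """Approximate the polar factor U V^T of G = U S V^T.

    Each step applies X <- a X + (b A + c A^2) X with A = X X^T, which acts
    on the singular values of X only and leaves the singular vectors alone.
    """
    a, b, c = coeffs
    X = np.asarray(G, dtype=np.float64)
    # Work on the wide orientation so that A = X X^T is the smaller Gram matrix.
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    # Frobenius normalisation bounds every singular value by 1.
    X = X / (np.linalg.norm(X) + eps)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((4, 8))
    # With the cubic coefficients and enough steps, the result approaches U V^T.
    X = newton_schulz(G, coeffs=CUBIC_COEFFS, steps=50)
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    print("max abs deviation from SVD polar factor:", np.abs(X - U @ Vt).max())
```

The cubic coefficients are used in the demo because their convergence to the exact polar factor is easy to verify. The quintic default trades exactness for speed: it only needs to push singular values close to 1 within a handful of steps.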

Result

| Variant | val_bpb | Delta |
|---|---|---|
| Standard Muon NS5 (baseline) | 1.08006 | |
| Polar Express Muon | 1.08049 | +0.00043 |

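Polar Express is slightly worse (+0.00043 BPB) than standard NS5 on this stack. This suggests the standard Newton-Schulz iteration is already near-optimal for this architecture at ~4900 steps, though with a single seed (s42) a gap this small may sit within seed-to-seed variance.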
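
To put the gap in proportion, the reported numbers can be checked with a few lines of arithmetic. The variable names below are illustrative; the two scores are the ones reported in this PR.

```python
# val_bpb values reported in this PR (single seed, s42).
baseline_bpb = 1.08006        # Standard Muon NS5
polar_express_bpb = 1.08049   # Polar Express Muon

# Positive delta means Polar Express is worse (lower BPB is better).
delta = polar_express_bpb - baseline_bpb
relative = delta / baseline_bpb

print(f"delta = {delta:+.5f} BPB ({relative:+.3%} relative to baseline)")
```

The absolute delta rounds to the +0.0004 quoted in the title, which is roughly 0.04% of the baseline score.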

Why this is interesting

  • Suggests that NS orthogonalization variants offer little remaining headroom at this model size
  • Saves future researchers from re-testing this direction on the sp8192 stack
  • The delta (+0.0004 BPB) is small enough that Polar Express might win on different architectures or step counts

Credits

Base stack: @clarkkev PR #1394.

Test plan

  • Single seed (s42) verification
  • Artifact under 16 MB (15,998,547 bytes)
  • Training under 600s (588s)
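
The two budget checks above can be expressed as a small script. This is a sketch under stated assumptions: the 16 MB limit is taken to mean 16,000,000 bytes (decimal megabytes, consistent with the "15.998 MB" figure quoted for this artifact), and `within_budget` is a hypothetical helper, not part of the submission tooling.

```python
# Assumption: "16 MB" means 16 * 10^6 bytes, not 16 MiB.
ARTIFACT_LIMIT_BYTES = 16_000_000
TRAIN_LIMIT_SECONDS = 600

# Figures reported in the test plan.
artifact_bytes = 15_998_547
train_seconds = 588


def within_budget(size_bytes, seconds,
                  size_limit=ARTIFACT_LIMIT_BYTES,
                  time_limit=TRAIN_LIMIT_SECONDS):
    """Return True if both size and wall-clock time are strictly under their limits."""
    return size_bytes < size_limit and seconds < time_limit


byte_headroom = ARTIFACT_LIMIT_BYTES - artifact_bytes
time_headroom = TRAIN_LIMIT_SECONDS - train_seconds

print("within budget:", within_budget(artifact_bytes, train_seconds))
print(f"headroom: {byte_headroom} bytes, {time_headroom} s")
```

Under these assumptions the submission passes both limits with narrow margins, so any change that grows the artifact or slows training would need to be re-checked.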

…pb 1.08049

Non-record submission documenting a negative result for Polar Express
Muon, an alternative Newton-Schulz orthogonalization variant.

Result: val_bpb 1.08049 (+0.00043 worse than standard Muon NS5 at
1.08006 on the same stack). The standard NS iteration is already
near-optimal for this architecture at this step count.

Single seed (s42), 15.998 MB, 588s train, legal score-first TTT.

Credits: @clarkkev (PR openai#1394 baseline).