Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean) by dexhunter · Pull Request #1514 · openai/parameter-golf

dexhunter · 2026-04-09T23:37:30Z

Summary

val_bpb: 1.07983 (3-seed mean, std 0.00050) / 2.78932 nats/token
Artifact: ~15.99 MB (under 16 MB on all 3 seeds)
Improvement over current merged SOTA #1493 (1.0810): 0.00117 bpb / 0.00302 nats/token
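The bpb and nats/token figures above are related by nats/token = bpb × ln 2 × (bytes per token). The following is a minimal sketch of that arithmetic, not code from the PR: the bytes-per-token value is derived from the reported pair of means rather than stated anywhere in this PR, and the names `bytes_per_token` and `bpb_to_nats_per_token` are hypothetical.

```python
import math

# Reported 3-seed means from the summary.
mean_bpb = 1.07983
mean_nats_per_token = 2.78932

# nats/token = bpb * ln(2) * bytes_per_token, so the reported pair implies an
# average bytes-per-token figure (derived here, not stated in the PR).
bytes_per_token = mean_nats_per_token / (mean_bpb * math.log(2))


def bpb_to_nats_per_token(bpb: float) -> float:
    """Convert bits-per-byte to nats-per-token using the implied ratio."""
    return bpb * math.log(2) * bytes_per_token


# Gap to the merged SOTA value quoted in the summary (1.0810 bpb).
delta_bpb = 1.0810 - mean_bpb
delta_nats = bpb_to_nats_per_token(delta_bpb)

print(f"implied bytes/token: {bytes_per_token:.4f}")
print(f"delta: {delta_bpb:.5f} bpb = {delta_nats:.5f} nats/token")
```

Under this assumption the 0.00117 bpb gap converts to roughly 0.00302 nats/token, matching the figure quoted above.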

Builds on @clarkkev's PR #1394 sp8192 stack and our own PR #1413 legal score-first TTT, adding:

Muon momentum = 0.97 (vs 0.99 default) — single-knob hyperparameter sweep
Causal token n-gram tilt — prefix-only token expert from @abaybektursun's PR #1420 kernel (base_beta=2.0, agree_bonus=0.1); within-word and word-start experts explicitly disabled (within_beta=0, word_beta=0) because they cannot be made fully causal without losing most of the benefit.
Legal score-first TTT — already present in our PR #1413 (SP8192 + QK-Gain 5 + Legal Score-First TTT, val_bpb 1.08279, 3-seed mean)
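To illustrate what a prefix-only token tilt could look like, here is a toy sketch. It is not the PR #1420 kernel: it assumes the tilt adds `base_beta` to the logit of a single hinted token and renormalizes, and it uses a hypothetical bigram lookup over the already-seen prefix as the "expert". The `agree_bonus` term is left out because its semantics are not described here.

```python
import math
from collections import Counter

BASE_BETA = 2.0  # value quoted in the PR; its exact role in the kernel is assumed


def tilt_log_probs(log_probs, hint_token, beta=BASE_BETA):
    """Add beta to the hinted token's log-probability, then renormalize."""
    tilted = list(log_probs)
    tilted[hint_token] += beta
    log_z = math.log(sum(math.exp(v) for v in tilted))
    return [v - log_z for v in tilted]


def prefix_bigram_hint(prefix):
    """Hypothetical expert: most frequent follower of the last prefix token.

    Only tokens strictly before the position being predicted are read,
    which is what keeps the hint causal.
    """
    if len(prefix) < 2:
        return None
    followers = Counter(
        nxt for prev, nxt in zip(prefix, prefix[1:]) if prev == prefix[-1]
    )
    return followers.most_common(1)[0][0] if followers else None


# Usage: a uniform 4-token model tilted toward the hinted token.
uniform = [math.log(0.25)] * 4
hint = prefix_bigram_hint([1, 2, 1, 2, 1])  # token 2 followed token 1 twice
tilted = tilt_log_probs(uniform, hint)
print(hint, [round(math.exp(v), 4) for v in tilted])
```

The point of the sketch is the data dependency: the hint is a function of the prefix alone, so the tilted distribution for a position never depends on the token being scored at that position.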

Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)

Seed	Pre-TTT sliding bpb	Post-TTT bpb	val_loss (nats/token)	Artifact (bytes)
0	1.08102	1.07928	2.78790	15,993,346
42	1.08167	1.07997	2.78967	15,992,995
1234	1.08194	1.08025	2.79039	15,994,604
mean	1.08154	1.07983	2.78932	15,993,648

std_bpb = 0.00050, std_nats = 0.00128. All 3 seeds fit the 16 MB artifact cap and complete under 600s train + 600s eval.
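The summary statistics can be rechecked from the per-seed rows. The sketch below is not part of the submission; it assumes the reported std is the sample standard deviation (n − 1 denominator) and that the 16 MB cap means 16,000,000 bytes, neither of which is stated explicitly here.

```python
import statistics

# Per-seed values copied from the results table.
post_ttt_bpb = {0: 1.07928, 42: 1.07997, 1234: 1.08025}
val_loss_nats = {0: 2.78790, 42: 2.78967, 1234: 2.79039}
artifact_bytes = {0: 15_993_346, 42: 15_992_995, 1234: 15_994_604}

ARTIFACT_CAP_BYTES = 16_000_000  # assumption: decimal megabytes

mean_bpb = statistics.mean(list(post_ttt_bpb.values()))
std_bpb = statistics.stdev(list(post_ttt_bpb.values()))    # sample std
mean_nats = statistics.mean(list(val_loss_nats.values()))
std_nats = statistics.stdev(list(val_loss_nats.values()))  # sample std
all_under_cap = all(size < ARTIFACT_CAP_BYTES for size in artifact_bytes.values())

print(f"bpb:  mean={mean_bpb:.5f} std={std_bpb:.5f}")
print(f"nats: mean={mean_nats:.5f} std={std_nats:.5f}")
print(f"all artifacts under cap: {all_under_cap}")
```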

Legality

Score-first TTT only — every sliding-window chunk is scored under inference_mode() before any gradient update. No chunk is trained on before scoring.
Causal n-gram tilt — only the prefix-only token expert is active. The within-word and word-start experts from PR #1420 (Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08309, 5-seed mean) are explicitly zeroed out. The kernel causality fix from the PR #1420 author thread is applied.
No SLOT, no pre-quant TTT on val data, no n-gram cache, no ETLB.
Full 50k-doc val split, canonical ordering, single left-to-right pass.
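The score-first ordering described above can be sketched with a toy model. This is an illustration of the protocol only: it swaps the real network and its gradient updates for a hypothetical Laplace-smoothed unigram counter (`ToyUnigram`), so that the example runs with the standard library alone. In the actual setup the scoring step is what runs under `inference_mode()`.

```python
import math
from collections import Counter


class ToyUnigram:
    """Stand-in for the model: Laplace-smoothed unigram counts."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.counts = Counter()
        self.total = 0

    def nll(self, token):
        p = (self.counts[token] + 1) / (self.total + self.vocab_size)
        return -math.log(p)

    def update(self, chunk):
        # Stand-in for the gradient step taken on an already-scored chunk.
        self.counts.update(chunk)
        self.total += len(chunk)


def score_first_eval(model, tokens, chunk_size):
    """Single left-to-right pass: score each chunk, then adapt on it."""
    total_nll = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # 1) Score with the model state from before this chunk was seen.
        total_nll += sum(model.nll(t) for t in chunk)
        # 2) Only after scoring, adapt on the chunk.
        model.update(chunk)
    return total_nll / len(tokens)


mean_nll = score_first_eval(ToyUnigram(vocab_size=2), [0, 0, 0, 0], chunk_size=2)
print(f"mean NLL: {mean_nll:.4f} nats/token")
```

Because the update for a chunk happens strictly after that chunk's loss is accumulated, no token contributes to its own score, which is the property the legality section relies on.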

Test plan

3-seed verification (seeds 0/42/1234)
Artifact under 16 MB on all seeds
Train under 600s on all seeds (~588s)
Eval under 600s on all seeds (<437s)
No val-data leakage in training
Score-first TTT ordering verified
Causal n-gram tilt verified (prefix-only metadata)

@clarkkev

…val_bpb 1.07983

3-seed mean val_bpb 1.07983 (std 0.00050) on the PR openai#1394 sp8192 stack.

Changes from PR openai#1394 + PR openai#1413 baseline:
- Muon momentum = 0.97 (vs 0.99 default), warmup 0.92→0.97 unchanged
- Causal token n-gram tilt (base_beta=2.0, agree_bonus=0.1) on top of legal score-first TTT; within-word and word-start experts explicitly disabled (within_beta=0, word_beta=0) because they cannot be made fully causal.
- 3-seed verification (seeds 0/42/1234)

Seeds:
- seed 0 → 1.07928 bpb / 2.78790 nats / 15,993,346 bytes
- seed 42 → 1.07997 bpb / 2.78967 nats / 15,992,995 bytes
- seed 1234 → 1.08025 bpb / 2.79039 nats / 15,994,604 bytes
- mean → 1.07983 bpb / 2.78932 nats / 15,993,648 bytes

Delta vs current merged SOTA PR openai#1493 (1.0810): 0.00117 bpb / 0.00302 nats per token

Credits: @clarkkev (base PR openai#1394 sp8192 stack), @abaybektursun (n-gram tilt kernel PR openai#1420, causal fix applied), prior legal-TTT precedent PR openai#549 / PR openai#461.

Platform: 8xH100 80GB SXM, PyTorch 2.9.1+cu128. Training 588s, eval <437s per seed, both under the 600s budget. Artifact under 16 MB on all 3 seeds.

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026

R19: openai#1514 base + logit-space hash (8K buckets, bigram key, bf1…

2f474c2

…6, zero-init)

dexhunter changed the title ~~Record: SP8192 + Muon 0.97 + Legal TTT + Causal N-gram Tilt — val_bpb 1.07983 (3-seed)~~ Record: SP8192 + Muon 0.97 + Legal Score-First TTT — val_bpb 1.07983 (3-seed mean) Apr 10, 2026

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026

R20: Deploy openai#1514 base (Muon 0.97 + Tilt + TTT + QK5.0, clean, …

3d46e5e

…no hash)

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026

R20: openai#1514 packed with FORMAT_RAW+FILTER_LZMA2 (same as origina…

33d00f5

…l PR)

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026

R20: use EXACT original openai#1514 packed code (their binary, not ou…

334c086

…r repack)

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 10, 2026

R20: openai#1514 + hidden-space hash embedding (16K buckets, 512d, pa…

5f6fbd4

…cked)

aryanbhosale mentioned this pull request Apr 10, 2026

Record: SP8192 + Muon 0.97 + 3-Layer Recurrence + Parallel Residuals + TTT — val_bpb 1.0802 (3-seed mean) #1521

Open

EthanYangTW mentioned this pull request Apr 10, 2026

Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) #1523

Open

dexhunter wants to merge 1 commit into openai:main from dexhunter:a2-muon097-3seed
