Record: SP8192 + Pre-Quant TTT — val_bpb 1.07948 (3-seed mean) #1416
erichroepke wants to merge 1 commit into openai:main
Conversation
…ed mean) Merges @clarkkev's openai#1394 (SP8192, SDClip, GPTQ embeddings, skip gates) with @stukenov's openai#1364 (pre-quant AdamW TTT). First combination of these techniques. 3-seed mean: 1.07948 BPB (std=0.00043), artifact 15.12 MB. Built with Claude Opus 4.6 as AI co-author. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for the detailed writeup. I think the main question for reviewers is not the SP8192 / SDClip side of the stack, but how the pre-quant AdamW TTT step fits the community guidance in #1017. For readers who have not followed that thread, the four conditions in #1017 are roughly:
On my reading, conditions 1, 2, and 4 are not the hard part here. The part I'm struggling to reconcile is condition 3, score-before-update. The PR README describes this step as pre-quant AdamW TTT on validation data before compression. Could you add a short compliance note explaining how this step satisfies the #1017 score-before-update rule, and in particular how the TTT objective is restricted to tokens that have already been scored before they influence later scored tokens? Issue for reference: #1017
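For readers unfamiliar with the rule, here is a minimal sketch of what score-before-update means. The names (`score_then_update`, the set-based toy model) are hypothetical and purely illustrative, not code from the PR or the harness:

```python
# Toy illustration of the score-before-update rule: every token must be
# scored with model state that has NOT yet been updated on that token.

def score_then_update(tokens, score_fn, update_fn, state):
    """Record each token's score BEFORE the model may adapt to it."""
    scores = []
    for tok in tokens:
        scores.append(score_fn(state, tok))  # score under current state
        state = update_fn(state, tok)        # only now learn from it
    return scores, state

# Toy "model": state is the set of token values seen so far;
# score is 1 if the token was already seen, else 0.
seen_score = lambda state, tok: 1 if tok in state else 0
seen_update = lambda state, tok: state | {tok}

scores, final_state = score_then_update([1, 1, 2], seen_score, seen_update, set())
# scores == [0, 1, 0]: the second 1 benefits from adaptation, but only
# because the first 1 was scored before the model updated on it.
```

The contrast is with optimizing on validation tokens before they are scored, which is what the compliance note would need to rule out.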
You're totally right — my apologies, I didn't catch that rule. I'm stripping the TTT step, which does set the result back. Going back to the drawing board on this one. Thanks for the detailed review.
…ctions

- N-gram Tilt bug: PR openai#1420 kernel is non-causal; PR openai#1437 (dexhunter) found and fixed it (pre-fix 1.07807 → post-fix 1.08091). Updated primary reference to the PR openai#1437 kernel.
- PR openai#1423 flagged illegal (pre-quant TTT, same as openai#1351/openai#1408/openai#1416)
- Added full PR openai#1421–1444 scan results
- Updated best open legal PR: ~1.08091 (PR openai#1437), not 1.08014 (openai#1420)
- Session 8 lessons learned added to CLAUDE.md

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
Summary
val_bpb: 1.07948 (3-seed mean, std=0.00043) | Artifact: 15.12 MB
What This Is
Simple combination of two existing PRs:

- @clarkkev's #1394 (SP8192, SDClip, GPTQ embeddings, skip gates)
- @stukenov's #1364 (pre-quant AdamW TTT)
That's basically it. Turns out you can apply pre-quant TTT to the SP8192 base and the two techniques don't interfere. TTT adapts the full-precision model before quantization, then SDClip + GPTQ compresses the adapted weights cleanly.
TTT gives about -0.034 BPB on this base (post-EMA 1.1019 → post-TTT 1.0682).
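The ordering described above can be sketched as follows. `ttt_adapt` and `compress` are hypothetical placeholders standing in for the actual AdamW TTT and SDClip + GPTQ steps, not code from either PR:

```python
# Pre-quant TTT ordering: adapt in full precision FIRST, then quantize
# once, so compression sees the already-adapted weights.

def ttt_adapt(weights, lr=0.1):
    # Placeholder for AdamW test-time-training steps on the fp model.
    return {k: v - lr * v for k, v in weights.items()}

def compress(weights):
    # Placeholder for SDClip + GPTQ: rounding mimics lossy quantization.
    return {k: round(v, 2) for k, v in weights.items()}

fp_weights = {"w0": 1.2345, "w1": -0.5678}

adapted = ttt_adapt(fp_weights)   # full-precision TTT
artifact = compress(adapted)      # quantize the adapted weights
```

The point of the order is that TTT never has to fight quantization error; compression is applied exactly once, to the adapted weights.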
Supersedes my earlier PR #1396 (1.1067 BPB).
Credits
Nearly everything here is other people's work.
I'm a filmmaker, not an ML engineer. Built with Claude Opus 4.6 as AI co-author.
How to Run
```shell
pip install brotli
# SP8192 dataset from @clarkkev's HF: huggingface.co/datasets/kevclark/parameter-golf
DATA_DIR=./data/ SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
```