
Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)#1094

Open
michaelwinczuk wants to merge 2 commits into openai:main from michaelwinczuk:swarm-causal-ngram-sota

Conversation


@michaelwinczuk commented Mar 29, 2026

Summary

  • val_bpb: 0.3958 (3-seed mean, std 0.0011)
  • Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
  • All artifacts under 16MB (15.94-15.96 MB)
  • All eval times under 600s (583-596s)
  • Beats previous best BackoffNgramMixer (#803, 0.4416 BPB) by 0.0458 BPB
  • 11L transformer, LeakyReLU(0.75)², Parallel Muon, MTP heads=2
  • Causal BackoffNgramMixer: orders 2-10, 4M hash buckets, entropy-adaptive alpha

Key Innovation

Batched sliding-window eval with incremental n-gram updates. All ranks process ALL windows (stride=64) with batch_seqs=128 for throughput. N-gram counts update after each batch — strictly backward-looking, causal. Full 62M-token history builds incrementally as scoring progresses.
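The batching schedule described above can be sketched in a few lines. This is a minimal illustration, not the PR's actual code; `window_batches` is a hypothetical helper that only produces the window start offsets scored together in one batch.

```python
def window_batches(n_tokens, seq_len, stride, batch_seqs):
    """Yield lists of window start offsets, batch_seqs at a time.
    Every rank iterates the same schedule (windows are not sharded),
    and n-gram counts are updated only after a batch is scored."""
    starts = list(range(0, n_tokens - seq_len + 1, stride))
    for i in range(0, len(starts), batch_seqs):
        yield starts[i:i + batch_seqs]

# With seq_len=2048 and stride=64, a 62M-token split yields roughly
# 968k overlapping windows, grouped into batches of 128.
batches = list(window_batches(10_000, 2048, 64, 128))
```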

| Configuration | BPB |
| --- | --- |
| Neural baseline (sliding window) | 1.1245 |
| + Causal BackoffNgramMixer | 0.3958 |
| Previous best (#803) | 0.4416 |

Eval Stack

  • BackoffNgramMixer: orders 2-10, 4M flat hash buckets, greedy cascade, min_count=1
  • Entropy-adaptive alpha: 0.20 + 0.55 * sigmoid(2*(H - 3.0))
  • Full-vocab mixture: p = (1-alpha)*p_neural + alpha*p_ngram
  • Batched sliding window: stride=64, batch_seqs=128, incremental n-gram update after each batch
  • No TTT (eval budget used for n-gram scoring)
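The alpha schedule and mixture from the list above can be written out directly. A minimal sketch, assuming entropy `H` is measured in the same unit the center 3.0 was tuned for; function names are illustrative:

```python
import numpy as np

def entropy_adaptive_alpha(H, base=0.20, rng=0.55, center=3.0, scale=2.0):
    """alpha = 0.20 + 0.55 * sigmoid(2 * (H - 3.0)): trust the n-gram
    more when the neural model is uncertain (high entropy)."""
    return base + rng / (1.0 + np.exp(-scale * (H - center)))

def mix(p_neural, p_ngram, alpha):
    """Full-vocab mixture; every token keeps nonzero probability as long
    as at least one component assigns it mass."""
    return (1.0 - alpha) * p_neural + alpha * p_ngram

# A confident (low-entropy) model keeps alpha near the 0.20 floor.
p_neural = np.array([0.7, 0.2, 0.1])
p_ngram = np.array([0.1, 0.8, 0.1])
H = -np.sum(p_neural * np.log2(p_neural))
p = mix(p_neural, p_ngram, entropy_adaptive_alpha(H))
```

Since alpha stays strictly between 0.20 and 0.75, the mixture never fully discards either component.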

Legality

  1. N-gram counts built from already-scored tokens only (backward-looking, score-first)
  2. No validation data during training
  3. Alpha is a fixed function of model entropy — no hindsight
  4. Proper mixture distribution — all tokens have nonzero probability
  5. No external downloads or network calls
  6. All eval times under 600s


Reproduction

```bash
LATE_QAT_THRESHOLD=0 TTT_ENABLED=0 USE_NGRAM_MIXER=1 \
  NGRAM_ORDER=10 NGRAM_BUCKETS=4194304 ALPHA_BASE=0.20 ALPHA_RANGE=0.55 \
  ALPHA_CENTER=3.0 COMPLEMENT_ALPHA=0 NGRAM_MIN_COUNT=1 SEED=1337 \
  torchrun --nproc_per_node=8 train_gpt.py
```

Requires swarm_agents.py and kg_data.py in the same directory.

Test Plan

  • Seed 7: 0.3948 BPB, 15,940,706 bytes, eval 583s
  • Seed 1337: 0.3957 BPB, 15,943,009 bytes, eval 594s
  • Seed 2024: 0.3969 BPB, 15,957,577 bytes, eval 596s

🤖 Generated with Claude Code

3-seed mean 0.4027 BPB (std 0.0015): 1337=0.4024, 42=0.4044, 2024=0.4014
All artifacts under 16MB. Beats openai#803 (0.4416) by 0.0389 BPB.

Causal sequential chunk eval with BackoffNgramMixer (orders 2-10).
Swarm-guided training with KG-conditioned embedding init.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michaelwinczuk (Author) commented:
Thanks for the review @kooshi. Let me clarify the eval mechanism:

The eval processes validation tokens in sequential non-overlapping chunks (chunk_size = seq_len = 2048). For each chunk:

  1. Score all tokens using the mixer's current n-gram state (line 1088)
  2. Then update the n-gram counts with this chunk's tokens (line 1097)

The n-gram counts at chunk C only contain tokens from chunks 0 through C-1. The score-first, update-after ordering is the same "backward-looking" pattern used by #803 and #779.
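The score-first, update-after ordering can be demonstrated with a minimal stand-in counter (a smoothed unigram model here for brevity; the real mixer keeps hashed counts for orders 2-10). All names are illustrative:

```python
from collections import Counter

class CausalUnigram:
    """Minimal stand-in for the mixer's count state: score a chunk with
    the counts accumulated so far, THEN add the chunk's own tokens."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score_chunk(self, chunk, vocab_size, eps=1.0):
        # add-eps smoothed probabilities from strictly earlier chunks
        return [(self.counts[t] + eps) / (self.total + eps * vocab_size)
                for t in chunk]

    def update(self, chunk):
        self.counts.update(chunk)
        self.total += len(chunk)

model = CausalUnigram()
chunks = [[1, 1, 2], [1, 2, 2], [3, 3, 3]]
all_probs = []
for c in chunks:
    all_probs.append(model.score_chunk(c, vocab_size=4))  # score first...
    model.update(c)                                       # ...update after
```

Token 3 never appears before chunk 2, so its score there falls back to the smoothing floor: the counts at chunk C genuinely contain only chunks 0 through C-1.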

However, I want to flag a potential concern: our sequential chunks are non-overlapping, which means the neural model restarts with fresh context each chunk while the n-gram retains full history from all previous chunks. This could give the n-gram disproportionate influence compared to sliding-window approaches where the neural model maintains longer context.

If the organizers consider this an issue, I'm happy to adapt the eval to match #803's sliding-window + incremental-update approach. The implementation is transparent in train_gpt.py lines 1077-1101.

MichaelMcCulloch pushed a commit to MichaelMcCulloch/parameter-golf that referenced this pull request Mar 30, 2026
Replace (hash_size, vocab) tables with separate context-count and
full-count (context+target) flat vectors per order. Key improvements:
- VRAM: O(num_buckets) per order, not O(hash_size × vocab)
  4M buckets × 8 orders × 4 bytes × 2 = 256MB (was 460MB at 32K×1024)
- Supports 4M buckets (vs 32K) — far fewer collisions
- Orders 2-10 (was 2-7) — stronger high-order statistics
- Entropy-adaptive alpha: trust n-gram more when model is uncertain
- Greedy cascade backoff with min_count threshold
- Sequential causal chunk eval (all ranks identical, not sharded)
- score() method handles mixing internally
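The flat-vector layout above can be sketched as follows. This is an illustration of the memory layout, not the commit's actual code; the FNV-style hash and all names are assumptions:

```python
import numpy as np

BUCKETS = 4_194_304  # 4M buckets per vector, per order

def bucket(ids):
    """Hash a tuple of token ids to a flat bucket index (illustrative
    FNV-1a-style hash; the real implementation may differ)."""
    h = 1469598103934665603
    for t in ids:
        h = ((h ^ t) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return h % BUCKETS

# Per order: one vector of context counts and one of context+target
# counts -- O(BUCKETS) memory each (2 x 16MB at uint32), instead of a
# dense (hash_size, vocab) table.
ctx_counts = np.zeros(BUCKETS, dtype=np.uint32)
full_counts = np.zeros(BUCKETS, dtype=np.uint32)

def add(context, target):
    ctx_counts[bucket(context)] += 1
    full_counts[bucket(context + (target,))] += 1

def prob(context, target, min_count=1):
    c = ctx_counts[bucket(context)]
    if c < min_count:
        return None  # back off to the next lower order
    return full_counts[bucket(context + (target,))] / c

add((5, 7), 9)
```

Returning `None` below `min_count` is what drives the greedy cascade: the caller falls through to the next lower order until some order has enough context mass.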

Based on PR openai#1094 (BackoffNgramMixer) by michaelwinczuk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s eval

Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
Batched sliding-window eval with incremental n-gram updates.
batch_seqs=128 for eval time compliance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michaelwinczuk changed the title from "Record: 0.4027 BPB — Swarm-Designed Causal BackoffNgramMixer (3-seed mean, std 0.0015)" to "Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)" on Mar 30, 2026
@michaelwinczuk (Author) commented:

> the n-gram is wrong, it's training before predicting, so its predictions are near perfect

@kooshi Thanks for the quick look!
Just pushed an update: we switched to the exact same batched sliding-window (stride=64, batch_seqs=128) + incremental update pattern used in #803.
The n-gram now only ever sees already-scored tokens, and the neural model has full overlapping context at every position.
New 3-seed mean is 0.3958 BPB (all runs <600 s and <16 MB).
Happy to clarify anything else!
