
Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)#1094

Open
michaelwinczuk wants to merge 2 commits into openai:main from michaelwinczuk:swarm-causal-ngram-sota

Conversation


@michaelwinczuk commented Mar 29, 2026

Summary

  • val_bpb: 0.3958 (3-seed mean, std 0.0011)
  • Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
  • All artifacts under 16MB (15.94-15.96 MB)
  • All eval times under 600s (583-596s)
  • Beats previous best BackoffNgramMixer (#803, 0.4416 BPB) by 0.0458 BPB
  • 11L transformer, LeakyReLU(0.75)², Parallel Muon, MTP heads=2
  • Causal BackoffNgramMixer: orders 2-10, 4M hash buckets, entropy-adaptive alpha

Key Innovation

Batched sliding-window eval with incremental n-gram updates. All ranks process ALL windows (stride=64) with batch_seqs=128 for throughput. N-gram counts update after each batch — strictly backward-looking, causal. Full 62M-token history builds incrementally as scoring progresses.
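The batching schedule described above can be sketched in a few lines. This is a minimal illustration, not the PR's actual code; `window_batches` is a hypothetical helper that only produces the window start offsets scored together in one batch.

```python
def window_batches(n_tokens, seq_len, stride, batch_seqs):
    """Yield lists of window start offsets, batch_seqs at a time.
    Every rank iterates the same schedule (windows are not sharded),
    and n-gram counts are updated only after a batch is scored."""
    starts = list(range(0, n_tokens - seq_len + 1, stride))
    for i in range(0, len(starts), batch_seqs):
        yield starts[i:i + batch_seqs]

# With seq_len=2048 and stride=64, a 62M-token split yields roughly
# 968k overlapping windows, grouped into batches of 128.
batches = list(window_batches(10_000, 2048, 64, 128))
```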

| Configuration | BPB |
| --- | --- |
| Neural baseline (sliding window) | 1.1245 |
| + Causal BackoffNgramMixer | 0.3958 |
| Previous best (#803) | 0.4416 |

Eval Stack

  • BackoffNgramMixer: orders 2-10, 4M flat hash buckets, greedy cascade, min_count=1
  • Entropy-adaptive alpha: 0.20 + 0.55 * sigmoid(2*(H - 3.0))
  • Full-vocab mixture: p = (1-alpha)*p_neural + alpha*p_ngram
  • Batched sliding window: stride=64, batch_seqs=128, incremental n-gram update after each batch
  • No TTT (eval budget used for n-gram scoring)
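The alpha schedule and mixture from the list above can be written out directly. A minimal sketch, assuming entropy `H` is measured in the same unit the center 3.0 was tuned for; function names are illustrative:

```python
import numpy as np

def entropy_adaptive_alpha(H, base=0.20, rng=0.55, center=3.0, scale=2.0):
    """alpha = 0.20 + 0.55 * sigmoid(2 * (H - 3.0)): trust the n-gram
    more when the neural model is uncertain (high entropy)."""
    return base + rng / (1.0 + np.exp(-scale * (H - center)))

def mix(p_neural, p_ngram, alpha):
    """Full-vocab mixture; every token keeps nonzero probability as long
    as at least one component assigns it mass."""
    return (1.0 - alpha) * p_neural + alpha * p_ngram

# A confident (low-entropy) model keeps alpha near the 0.20 floor.
p_neural = np.array([0.7, 0.2, 0.1])
p_ngram = np.array([0.1, 0.8, 0.1])
H = -np.sum(p_neural * np.log2(p_neural))
p = mix(p_neural, p_ngram, entropy_adaptive_alpha(H))
```

Since alpha stays strictly between 0.20 and 0.75, the mixture never fully discards either component.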

Legality

  1. N-gram counts built from already-scored tokens only (backward-looking, score-first)
  2. No validation data during training
  3. Alpha is a fixed function of model entropy — no hindsight
  4. Proper mixture distribution — all tokens have nonzero probability
  5. No external downloads or network calls
  6. All eval times under 600s


Reproduction

```bash
LATE_QAT_THRESHOLD=0 TTT_ENABLED=0 USE_NGRAM_MIXER=1 \
  NGRAM_ORDER=10 NGRAM_BUCKETS=4194304 ALPHA_BASE=0.20 ALPHA_RANGE=0.55 \
  ALPHA_CENTER=3.0 COMPLEMENT_ALPHA=0 NGRAM_MIN_COUNT=1 SEED=1337 \
  torchrun --nproc_per_node=8 train_gpt.py
```

Requires swarm_agents.py and kg_data.py in the same directory.

Test Plan

  • Seed 7: 0.3948 BPB, 15,940,706 bytes, eval 583s
  • Seed 1337: 0.3957 BPB, 15,943,009 bytes, eval 594s
  • Seed 2024: 0.3969 BPB, 15,957,577 bytes, eval 596s

🤖 Generated with Claude Code

3-seed mean 0.4027 BPB (std 0.0015): 1337=0.4024, 42=0.4044, 2024=0.4014
All artifacts under 16MB. Beats openai#803 (0.4416) by 0.0389 BPB.

Causal sequential chunk eval with BackoffNgramMixer (orders 2-10).
Swarm-guided training with KG-conditioned embedding init.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michaelwinczuk (Author) commented:
Thanks for the review @kooshi. Let me clarify the eval mechanism:

The eval processes validation tokens in sequential non-overlapping chunks (chunk_size = seq_len = 2048). For each chunk:

  1. Score all tokens using the mixer's current n-gram state (line 1088)
  2. Then update the n-gram counts with this chunk's tokens (line 1097)

The n-gram counts at chunk C only contain tokens from chunks 0 through C-1. The score-first, update-after ordering is the same "backward-looking" pattern used by #803 and #779.
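The score-first, update-after ordering can be demonstrated with a minimal stand-in counter (a smoothed unigram model here for brevity; the real mixer keeps hashed counts for orders 2-10). All names are illustrative:

```python
from collections import Counter

class CausalUnigram:
    """Minimal stand-in for the mixer's count state: score a chunk with
    the counts accumulated so far, THEN add the chunk's own tokens."""
    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def score_chunk(self, chunk, vocab_size, eps=1.0):
        # add-eps smoothed probabilities from strictly earlier chunks
        return [(self.counts[t] + eps) / (self.total + eps * vocab_size)
                for t in chunk]

    def update(self, chunk):
        self.counts.update(chunk)
        self.total += len(chunk)

model = CausalUnigram()
chunks = [[1, 1, 2], [1, 2, 2], [3, 3, 3]]
all_probs = []
for c in chunks:
    all_probs.append(model.score_chunk(c, vocab_size=4))  # score first...
    model.update(c)                                       # ...update after
```

Token 3 never appears before chunk 2, so its score there falls back to the smoothing floor: the counts at chunk C genuinely contain only chunks 0 through C-1.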

However, I want to flag a potential concern: our sequential chunks are non-overlapping, which means the neural model restarts with fresh context each chunk while the n-gram retains full history from all previous chunks. This could give the n-gram disproportionate influence compared to sliding-window approaches where the neural model maintains longer context.

If the organizers consider this an issue, I'm happy to adapt the eval to match #803's sliding-window + incremental-update approach. The implementation is transparent in train_gpt.py lines 1077-1101.

MichaelMcCulloch pushed a commit to MichaelMcCulloch/parameter-golf that referenced this pull request Mar 30, 2026
Replace (hash_size, vocab) tables with separate context-count and
full-count (context+target) flat vectors per order. Key improvements:
- VRAM: O(num_buckets) per order, not O(hash_size × vocab)
  4M buckets × 8 orders × 4 bytes × 2 = 256MB (was 460MB at 32K×1024)
- Supports 4M buckets (vs 32K) — far fewer collisions
- Orders 2-10 (was 2-7) — stronger high-order statistics
- Entropy-adaptive alpha: trust n-gram more when model is uncertain
- Greedy cascade backoff with min_count threshold
- Sequential causal chunk eval (all ranks identical, not sharded)
- score() method handles mixing internally
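The flat-vector layout above can be sketched as follows. This is an illustration of the memory layout, not the commit's actual code; the FNV-style hash and all names are assumptions:

```python
import numpy as np

BUCKETS = 4_194_304  # 4M buckets per vector, per order

def bucket(ids):
    """Hash a tuple of token ids to a flat bucket index (illustrative
    FNV-1a-style hash; the real implementation may differ)."""
    h = 1469598103934665603
    for t in ids:
        h = ((h ^ t) * 1099511628211) & 0xFFFFFFFFFFFFFFFF
    return h % BUCKETS

# Per order: one vector of context counts and one of context+target
# counts -- O(BUCKETS) memory each (2 x 16MB at uint32), instead of a
# dense (hash_size, vocab) table.
ctx_counts = np.zeros(BUCKETS, dtype=np.uint32)
full_counts = np.zeros(BUCKETS, dtype=np.uint32)

def add(context, target):
    ctx_counts[bucket(context)] += 1
    full_counts[bucket(context + (target,))] += 1

def prob(context, target, min_count=1):
    c = ctx_counts[bucket(context)]
    if c < min_count:
        return None  # back off to the next lower order
    return full_counts[bucket(context + (target,))] / c

add((5, 7), 9)
```

Returning `None` below `min_count` is what drives the greedy cascade: the caller falls through to the next lower order until some order has enough context mass.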

Based on PR openai#1094 (BackoffNgramMixer) by michaelwinczuk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s eval

Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
Batched sliding-window eval with incremental n-gram updates.
batch_seqs=128 for eval time compliance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@michaelwinczuk changed the title from "Record: 0.4027 BPB — Swarm-Designed Causal BackoffNgramMixer (3-seed mean, std 0.0015)" to "Record: 0.3958 BPB — Causal BackoffNgramMixer (3-seed, std 0.0011)" on Mar 30, 2026
@michaelwinczuk (Author) commented:

> the n-gram is wrong, it's training before predicting, so its predictions are near perfect

@kooshi Thanks for the quick look!
Just pushed an update: we switched to the exact same batched sliding-window (stride=64, batch_seqs=128) + incremental update pattern used in #803.
The n-gram now only ever sees already-scored tokens, and the neural model has full overlapping context at every position.
New 3-seed mean is 0.3958 BPB (all runs <600 s and <16 MB).
Happy to clarify anything else!
