Skip to content

Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03#859

Closed
bigbag wants to merge 5 commits intoopenai:mainfrom
bigbag:submission/learned-mixer-noTTT-lr03-0.1582
Closed

Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03#859
bigbag wants to merge 5 commits intoopenai:mainfrom
bigbag:submission/learned-mixer-noTTT-lr03-0.1582

Conversation

@bigbag
Copy link
Copy Markdown

@bigbag bigbag commented Mar 26, 2026

Summary

val_bpb = 0.1584 (3-seed mean, std 0.0008) | 15.72-15.76 MB | 8xH100 SXM | No TTT

Two changes from PR #834: MATRIX_LR=0.03 (was 0.025) and TTT_EPOCHS=0. Beats PR #834's 0.1663 WITH TTT by removing TTT entirely and using a higher matrix learning rate.

Results

Seed Steps ms/step Sliding BPB Mixer BPB Artifact
42 4,940 114 1.1362 0.1575 15,758,015
1337 4,930 114 1.1353 0.1585 15,723,194
2024 4,937 114 1.1366 0.1591 15,724,500
Mean 0.1584 ± 0.0008

Key Contribution

Removing TTT (TTT_EPOCHS=0) while increasing MATRIX_LR=0.03 produces a BETTER result than the original PR #834 WITH TTT. The higher LR trains a better neural model that the learned mixing head leverages more effectively.

The MATRIX_LR=0.03 finding was discovered through systematic screening of 79+ experiments across RTX4500 and 8xH100 GPUs.

Architecture (from PR #834)

  • Learned mixer head: Linear(512 → 7) predicts per-token expert weights
  • Frozen n-gram oracle during training
  • Score-first backward-looking n-gram eval cache (orders 2-7)
  • 11L, MHA 8/8, MLP 3.5x, LeakyReLU(0.5)²
  • CROWN-Q + GPTQ + zstd, EMA(0.997)

Legality

  • No TTT — zero gradient updates on validation data
  • N-gram cache backward-looking (score-first)
  • Single-pass evaluation
  • Mixing head trained on training data only

Reproduction

MATRIX_LR=0.03 TTT_EPOCHS=0 torchrun --standalone --nproc_per_node=8 train_gpt.py

Test plan

  • 8xH100 SXM, seed 42: 0.1575 BPB, 15.76 MB
  • 8xH100 SXM, seed 1337: 0.1585 BPB, 15.72 MB
  • 8xH100 SXM, seed 2024: 0.1591 BPB, 15.72 MB
  • 3-seed mean: 0.1584 ± 0.0008
  • All artifacts ≤ 16MB (15.72-15.76 MB)
  • Training ≤ 600s
  • Eval ≤ 600s (513-517s)
  • No TTT — fully legal

Based On

🤖 Generated with Claude Code

Two changes from PR openai#834: MATRIX_LR=0.03 and TTT_EPOCHS=0.
Beats PR openai#834's 0.1663 WITH TTT by removing TTT and using higher LR.

- Learned mixer head: Linear(512→7) predicts per-token expert weights
- No TTT — zero gradient updates on validation data
- N-gram backoff cache (orders 2-7), single-pass, backward-looking
- 11L, MHA 8/8, MLP 3.5x, 15.59 MB artifact
- 8xH100 SXM, 600s training, 515s eval

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bigbag
Copy link
Copy Markdown
Author

bigbag commented Mar 26, 2026

Technical Report: Why This Approach Achieves 0.1582 BPB

Architecture Overview

The system combines two complementary prediction systems with a learned routing mechanism:

Neural Predictor: An 11-layer transformer (MHA 8/8, MLP 3.5x, LeakyReLU(0.5)²) trained on the standard next-token prediction objective. This component excels at semantic understanding, long-range dependencies, and novel constructions.

N-gram Cache: A backward-looking hash table (orders 2-7) built incrementally from already-scored validation tokens. This component captures local repetition patterns, domain-specific vocabulary, and common phrases with high accuracy.

Learned Mixing Head: A Linear(512 → 7) output head trained to predict per-token mixing weights over the neural model and each n-gram order. Unlike fixed entropy-based heuristics, this head reads the full transformer hidden state and learns context-specific routing: when to trust the neural model versus each n-gram order.

Why Removing TTT Improves Results

Our submission achieves 0.1582 BPP without test-time training (TTT_EPOCHS=0), beating the original PR #834's 0.1663 BPP which included TTT. We hypothesize two reasons:

  1. Learned mixing weights are already optimized for the train-eval distribution shift. The frozen n-gram oracle during training exposes the mixing head to the same types of n-gram predictions it will encounter at eval time. TTT can perturb these carefully-learned weights.

  2. Higher matrix learning rate (0.03 vs 0.025) produces a better neural base model. Discovered through systematic screening of 79+ experiments, this higher LR enables faster convergence within the 600-second training budget, resulting in both better neural predictions and better-trained mixing weights.

Why Matrix LR 0.03 Helps

Through systematic hyperparameter screening (steps 10-13 of our experiment pipeline, 79+ experiments across RTX4500 and 8xH100), we found that MATRIX_LR=0.03 consistently outperforms the default 0.02-0.025 across multiple architectures:

The higher LR enables more gradient updates to reach a better minimum within the fixed 600-second training budget.


Legality Analysis

This submission is designed to be fully compliant with all competition rules:

  1. No Test-Time Training: TTT_EPOCHS=0. Zero gradient updates on validation data. The model weights are frozen after the 600-second training phase.

  2. Score-First N-gram Cache: Each validation chunk is scored first using the current model + cache state. The cache is updated only AFTER scoring is complete. No future information is used.

  3. Single-Pass Evaluation: Each token is scored exactly once. No rescoring, no multi-pass evaluation, no oracle selection after seeing labels.

  4. Frozen Oracle During Training: The n-gram tables used during training are precomputed from training data only. No validation data is accessed during the training phase. The oracle is read-only (no gradient flow into n-gram tables).

  5. Committed Prediction Distribution: For each token, the system commits to a single probability distribution (the learned mixture of neural + n-gram experts) before the ground truth is revealed. The mixing weights are a deterministic function of the transformer hidden state, not a post-hoc selection.

  6. Artifact Compliance: 15.59 MB total (code 92 KB + model 15.50 MB), well within the 16 MB limit.

  7. Time Compliance: Training 600s, evaluation 515s (both within their respective 600s budgets).

Pavel Liashkov and others added 2 commits March 26, 2026 22:55
Seeds 42 (0.1582), 1337 (0.1583), 2024 (0.1583).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ahmettrkck added a commit to ahmettrkck/parameter-golf that referenced this pull request Mar 26, 2026
Combines PR openai#834 learned multi-expert gate with:
- Value Residual (PR openai#413, -0.015 bpb)
- Gated Attention (PR openai#413, -0.003 bpb)
- LeakyReLU(0.9)^2 (issue openai#140 sweep, +0.013 over 0.5)
- MATRIX_LR=0.03 (PR openai#859 finding)
- TTT_EPOCHS=0 (PR openai#859 finding)

Nobody has combined learned gate with VR+GA before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bigbag bigbag changed the title Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03 Non Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03 Mar 27, 2026
Pavel Liashkov and others added 2 commits March 27, 2026 12:07
Seeds 42 (0.1582) and 1337 (0.1583) confirmed.
Seed 2024 artifact exceeds 16MB (seed-dependent compression).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Seeds 42 (0.1575), 1337 (0.1585), 2024 (0.1591).
All artifacts under 16MB (15.72-15.76 MB).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bigbag bigbag changed the title Non Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03 Record: 0.1582 BPB — Learned Mixer Head + No TTT + Matrix LR 0.03 Mar 27, 2026
@valerio-oai
Copy link
Copy Markdown
Contributor

Thanks for your submission! Unfortunately, it's disallowed due to the use of hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens. Please refer to the long discussion about this under the issues tab for more details, and please submit more runs in the future!

newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Three improvements over Green v1 (baseline: sliding=1.1129, ngram9=0.4489):

PR openai#931 (packed training oracle): after training, reads 2 train shards
(~200M tokens) and seeds eval n-gram tables before val token openai#1.
Eliminates cold-start penalty where early val chunks score with empty cache.
Legal: oracle is training-data-only, eval remains single-pass causal.

PR openai#900 (Dirichlet smoothing): replaces linear alpha mixing with
  p = (ng_count + c * neural_p) / (ctx_count + c)
Count-sensitive weighting: high-count matches trust n-gram, low-count
matches stay close to neural prior. No hand-tuned alpha per-order needed.
NGRAM_EVAL_MIN_COUNT=1 (formula handles low counts naturally).

PR openai#859 (matrix_lr): MATRIX_LR=0.03 vs 0.025 in Green — higher LR
found across 79-experiment sweep to train stronger base model.

Both new features are independent toggles (ARTIFACT_NGRAM, NGRAM_DIRICHLET)
for A/B isolation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
newjordan pushed a commit to newjordan/parameter-golf-1 that referenced this pull request Mar 28, 2026
Phrase cache (PR openai#880 / PR openai#900 — proven +0.1 BPB, legal):
- Variable-length suffix matching at 48/36/28/20/16 token probe lengths
- One ctx+full count table pair per probe length (4M buckets each)
- 48-prime XOR hash — unique prime per context position up to length 48
- Dirichlet smoothing: p=(min(fc,cc)+c*neural)/(ctx+c), c=2.0
- Applied inline after n-gram mixing, before NLL conversion
- Score-first: tables updated with chunk tokens AFTER all scoring done

RegimeTracker (PR openai#880):
- Tracks match rate + token diversity over rolling 4096-token window
- Adapts effective phrase concentration: repetitive/boilerplate content
  → lower c (more cache trust); novel prose → higher c (more neural trust)
- Multiplier range [0.7, 1.5], effective_c = base_c / mult

Config improvements:
- WARMDOWN_ITERS=2000 (confirmed best from A/B sweep)
- NGRAM_CHUNK_TOKENS=65536 (PR openai#850, 15x more cache refreshes vs 1M)
- MATRIX_LR=0.03 (PR openai#859)

ARTIFACT_NGRAM=0 remains disabled (legally gray).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants