
Record: Order-Adaptive BackoffMixer (mean val_bpb=0.5440)#825

Open
hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-26_final_champion

Conversation

@hypery11

Results

Seed    val_bpb   Eval time
42      0.5437    ~391s
1337    0.5450    ~391s
2024    0.5434    ~391s
Mean    0.5440
Std     0.0008
  • Artifact: ~16.0 MB
  • Train: 600s on 8xH100 SXM
  • Eval: ~391s (well under 600s)

Method

11-layer transformer (512d, 8/8 full MHA, XSA-all, LeakyReLU(0.5)^2, 3.5x MLP). Order-adaptive entropy-gated BackoffNgramMixer with per-order entropy thresholds. Score-first, backward-looking, deterministic.
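The entropy-gating idea can be sketched as follows. This is a hypothetical reconstruction from the one-line description above, not the PR's code: the sigmoid gate, the `sharpness` constant, and the function names are all my assumptions; the PR's actual per-order thresholds and mixer internals are not shown here.

```python
import numpy as np

def entropy_gate(probs, threshold, sharpness=4.0):
    """Sigmoid gate on the model's predictive entropy (hypothetical sketch).

    The gate opens (-> 1) when entropy exceeds the per-order threshold,
    i.e. when the neural model is uncertain, and closes when it is
    confident, keeping the base model's distribution mostly intact.
    """
    ent = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)), axis=-1)
    return 1.0 / (1.0 + np.exp(-sharpness * (ent - threshold)))

def mix(model_probs, ngram_probs, threshold):
    """Blend an n-gram distribution into the model distribution,
    weighted by the entropy gate. Both inputs are full distributions
    over the vocabulary, so the mixture stays normalized."""
    g = entropy_gate(model_probs, threshold)[..., None]
    return (1.0 - g) * model_probs + g * ngram_probs
```

Under "order-adaptive" gating as described, `threshold` would itself be a function of the best matching n-gram order; the sketch keeps it a scalar for clarity.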

Acknowledgments

Huge thanks to the incredible community that made this possible.

This competition has been an amazing collaborative experience. Every improvement here builds on ideas shared openly.

  • 8xH100 SXM, train <=600s
  • Eval <=600s (391s)
  • Artifact <=16MB
  • 3-seed validation (std 0.0008)

Seeds: 0.5437 / 0.5450 / 0.5434 (std 0.0008).
Order-adaptive entropy gating + BackoffNgramMixer.
~16MB artifact. Train 600s, eval 391s.
@MatoTeziTanka

MatoTeziTanka commented Mar 26, 2026

Really impressive work — the order-adaptive entropy gating with per-order thresholds is a thoughtful design, and the 3-seed consistency (std 0.0008) is excellent. The acknowledgments section is also great to see — this competition has been genuinely collaborative.

One thing to flag: checking the log output, it looks like seeds 42 and 2024 may exceed the 16,000,000 byte artifact cap:

  • Seed 1337: 15,948,371 bytes ✅
  • Seed 42: ~16,022,243 bytes (over by ~22K)
  • Seed 2024: ~16,030,231 bytes (over by ~30K)

We ran into the exact same issue on our PR #769 seed 42 (over by 25,731 bytes) and had to rerun with tighter quantization. It's a subtle one — the submission.json may not reflect the per-seed sizes accurately.

Might be worth double-checking the individual seed artifact sizes against the 16,000,000 limit before the maintainers review. The fix for us was minor — just tightening the compression/quantization slightly to get the headroom.
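For anyone reproducing this check, a minimal per-seed size audit looks like the following. The artifact paths in the example are placeholders, not the repo's actual layout:

```python
import os

CAP = 16_000_000  # competition artifact cap, in bytes (decimal 16 MB)

def audit_artifacts(paths, cap=CAP):
    """Return {path: (size_bytes, over_by)} for each artifact.

    Checking each seed's file individually avoids relying on an
    aggregate submission.json that may not reflect per-seed sizes.
    """
    report = {}
    for p in paths:
        size = os.path.getsize(p)
        report[p] = (size, max(0, size - cap))
    return report

# Example (hypothetical paths):
# audit_artifacts(["artifacts/seed42.bin", "artifacts/seed2024.bin"])
```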


Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@MatoTeziTanka

Circling back on this one with an updated finding, since @valerio-oai ruled on the underlying mechanism after my first comment.

Compliance flag — same disallowed pattern as PR #779.

@valerio-oai disallowed PR #779 (deanbrr) on 2026-03-27 (comment 4145781641) specifically for "hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens." The mechanism is spelled out in the follow-up comment 4146407380: hashing the ground-truth token into the lookup key only reweights the correct token, and in the hash-collision limit drives P(correct) toward 1 regardless of the data, giving arbitrarily low BPB without real compression.

Looking at records/track_10min_16mb/2026-03-26_OrderAdaptive_BackoffMixer/train_gpt.py, the BackoffNgramMixer (L39–145) is a port of #779's mixer with an entropy-gating delta on top, and it uses the same target-in-key hashing pattern at:

  • L76 (update): full_key = ((ctx_hash ^ (tgt * self.primes[cw])) & mask).astype(np.int64) — hashes target tgt into the bucket
  • L78: np.add.at(self.full_counts[oi], full_key, 1) — increments the target-conditioned count
  • L119 (mix_and_score): full_key = ((ctx_hash ^ (y_np.astype(np.uint64) * self.primes[cw])) & mask).astype(np.int64) — same hash with y_np as the target
  • L121: full_c = self.full_counts[oi_rev][full_key.reshape(-1)] — looks up the target-conditioned count
  • L1091: mixer.update(val_tokens[chunk_start_tok:chunk_end_tok + 1]) — also still has the +1 boundary leak I flagged on #779 (fixed there in commit c58742a after my review; this PR branched from pre-fix code).

Under @valerio-oai's #779 ruling, this is the same Rule 1 violation (Issue #1017 condition 1 — p_t may depend only on the artifact and x_1...x_{t-1}). The 0.5440 BPB number is the predictable outcome of the mechanism, not a true compression result.
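For readers following along, the disallowed mechanism reduces to a few lines. This is an illustrative reconstruction of the pattern described in the #779 ruling, not the PR's exact code (constants and names are mine): because the candidate token is folded into the hash key, only the ground-truth continuation's bucket ever accumulates counts, so at scoring time the cache reweights exactly the token it is supposed to be predicting.

```python
import numpy as np

PRIME = np.uint64(0x9E3779B1)
BUCKETS = 1 << 12
MASK = np.uint64(BUCKETS - 1)

def key(ctx_hash, tok):
    # The pattern at issue: the target/candidate token is part of the key.
    return int((np.uint64(ctx_hash) ^ (np.uint64(tok) * PRIME)) & MASK)

counts = np.zeros(BUCKETS, dtype=np.int64)

# "Update" with an eval pair (context, true next token): exactly one
# bucket is bumped -- the one keyed by the ground-truth continuation.
ctx_hash, target = 123456789, 42
counts[key(ctx_hash, target)] += 1

# "Score": each candidate's bucket is looked up, but only the true
# target's bucket (plus occasional hash collisions) carries mass, so
# renormalizing over candidates boosts the correct token by construction.
cands = np.arange(256)
scores = np.array([counts[key(ctx_hash, c)] for c in cands])
```

In the hash-collision limit described in comment 4146407380, more and more (context, target) pairs land in each bucket, driving the looked-up count for the true target up regardless of the data.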

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #779. The order-adaptive entropy gating (per-order sigmoid centers as a function of best_order) is a clean, well-ablated idea on its own — if @hypery11 wants to resubmit with the n-gram cache replaced by either a full-vocab reweighting (per @valerio-oai's suggested legal path on #779) or with the mixer dropped entirely and just the neural base + Drift-Free TTT, the entropy-gating mechanism should port cleanly.

@hypery11 — please let me know if I've misread the code, especially the full_key lookup at L119; if there's a renormalization step over the full vocabulary that I'm missing, I'd want to retract this. Separately, the seed 42 / seed 2024 artifact-size question from my first comment (~22-30K over the 16MB cap) is still open — would appreciate an update on that one regardless of how the n-gram question lands. The acknowledgments section is also one of the most generous in the queue, and that doesn't go unnoticed.


Reviewed by @MatoTeziTanka (The Agora). Static code review against train_gpt.py at SHA 79ae889a. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Bortlesboat

Positive compliance note from parameter-golf-checker — running across open Record-claiming PRs to help with triage (#1603).

Manually traced the chunked eval flow here and wanted to leave a clean note so it doesn't get bucketed with the TTT/SLOT cluster even though the n-gram trigger can look similar at a glance.

What I verified (eval_val_sliding_ttt around lines 912–1127):

The chunk loop at line 1020 does exactly what issue #1017 requires:

# line 1020
for ci in range(num_chunks):
    ...
    # line 1025 — Phase 1: SCORE this chunk (inference_mode, no grad)
    base_model.eval()
    with torch.inference_mode():
        ...
        if mixer is not None:
            nll, expert_nll = mixer.mix_and_score(logits_scaled, x_batch, y_batch, wlens)
        ...
        # scoring accumulates into loss_sum / token_count / byte_count

    # line 1087 — Update context mixer with scored chunk tokens
    if mixer is not None:
        mixer.update(val_tokens[chunk_start_tok:chunk_end_tok + 1])

    # line 1098 — Phase 2: TRAIN on this chunk (already scored = legal)
    if not is_last_chunk and ttt_epochs > 0:
        ...
        optimizer.step()

For any token in chunk k:

  1. base_model state is from training on chunks [0, k-1] only (the training step for chunk k runs at line 1098, after the score at line 1025).
  2. mixer n-gram counts are from chunks [0, k-1] only (the mixer.update at line 1091 runs after the score).
  3. Within chunk k, the per-position n-gram context at line 114–119 (x_np[:, :slen - shift]) uses strictly-prior tokens inside x_batch, which is the standard causal context the neural model also sees — no future-token leak.
  4. Scoring happens under torch.inference_mode() at line 1038, so no gradients flow back into any parameter used to score these tokens.

The in-code comment at line 1098 (# --- Phase 2: TRAIN on this chunk (already scored = legal) ---) matches what the control flow actually does. The Polyak swap at 1030–1035 / 1093–1096 is for scoring stability only and does not change which chunks' data touched which state.

The 0.5440 BPB is striking but I don't think it implies a violation — a 7-gram backoff cache that grows over ~47M in-distribution val tokens is a legitimately strong mixer, and "score chunk k → update mixer with chunk k → score chunk k+1" respects the causality constraint at the chunk granularity.
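The chunk-granularity causality argument above can be distilled into a sketch. This is a hypothetical structure mirroring the score-then-update order traced in the excerpt; the function names are mine, not the PR's:

```python
def chunked_eval(chunks, score, update, state):
    """Chunk-causal eval loop (illustrative): every chunk is scored with
    state built only from strictly earlier chunks, then that chunk's
    tokens are folded into the state for use on later chunks."""
    total_nll, total_tokens = 0.0, 0
    for chunk in chunks:
        total_nll += score(state, chunk)   # Phase 1: past-only state
        total_tokens += len(chunk)
        update(state, chunk)               # Phase 2: adapt on scored tokens
    return total_nll / max(total_tokens, 1)
```

The legality question then hinges entirely on what `score` does internally: if it consults only `state` and the chunk's strictly-prior tokens, the loop satisfies the "p_t depends only on the artifact and x_1...x_{t-1}" condition at chunk granularity.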

N-gram flag in my tool is a WARN, not a FAIL — just want to flag that clearly in case the C3/N-gram warnings get batched with the actually-illegal cluster.

No action needed on this PR from me. Nice submission.

(I've been wrong before — if I'm misreading something please push back.)

@deanbrr

deanbrr commented Apr 15, 2026

Hey, I started the entire n-gram hash thread, and while we can debate what 'learning' means in this contest, all of these approaches have been ruled illegal: "The n-gram cache builds state from evaluation tokens and uses it to predict subsequent tokens. That's eval-time adaptation regardless of whether it's causal."

I do believe this is an important topic, because the ruling presumes the approach is categorically wrong when in fact the transformer itself does the same thing:

The standard argument says: "the model should be fixed at eval time, any adaptation is cheating." That draws an arbitrary line.

Consider what a transformer does within its context window: it attends to prior tokens, builds key/value representations, and uses them to predict the next token. That is learning from the eval data. It's building an internal model of the local distribution in real time. Nobody calls that illegal. We call it "in-context learning" and use it.

The n-gram cache does the same thing with a longer window. A transformer with a 47M token context window would achieve similar benefits and nobody would call that illegal. The cache is just a more parameter efficient implementation of long-range context conditioning.

So the real question isn't "is the predictor fixed". No useful predictor is fixed. Every autoregressive model conditions on previously seen tokens. The question is: where do you draw the line on context length and mechanism?

The competition draws it at "the 16MB artifact should be the complete predictor." But a transformer artifact without any context also predicts horribly. Every predictor requires input data to function.

I think the philosophical distinction is blurry, but there is a practical one. The competition measures "how much can you compress into 16MB of weights." The n-gram cache shifts the answer from "a good model of English" to "a good framework for memorizing any specific text." Those are different capabilities with different value. Your point about "best predictor vs best specialist" captures this: even if the mechanism is philosophically continuous with in-context learning, the optimization target diverges.

It's a legitimate debate though, not a clear cut violation as the organizers have suggested.
