
Record: ParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523#1529

Open
msisovic wants to merge 9 commits into openai:main from msisovic:record-parallel-residuals-cutlass-evt

Conversation

@msisovic
Contributor

@msisovic msisovic commented Apr 11, 2026

Record: Improved Parallel Residuals

val_bpb: 1.07531639 (3-seed mean, std 0.0006) | 2.77765390 nats | ~15.96 MB | 8xH100 SXM, 600s | Legal TTT

This submission starts from PR #1523. Most of the newer submissions moved away from my fuller parallel-residual formulation and settled on a simpler GPT-J-style split-lane decoder. This version keeps the strong parts of that newer baseline and reintroduces the useful parts of my parallel residual implementation.

The key architectural change relative to PR #1523 is in the decoder after the split point. Attention and MLP read from different lanes, but neither sublayer writes back immediately. Instead, both outputs are accumulated into the two lanes together at the end of the block:

```
next_lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out
next_lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out
```

That keeps the GPT-J-style parallel-in-time update, while restoring the richer learned routing between the attention and MLP lanes. The other important part is that decoder U-Net skips are still written only into lane0, which preserves the cheaper and more stable skip path from the newer baseline. Attention reads the mixed lane0/x0 path, while MLP reads raw lane1. Final output uses the mean of the two lanes.
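As a rough illustration, the block update described above can be sketched in plain Python/NumPy. The gate names (`attn_resid`, `mlp_resid`, `attn_post`, `mlp_post`) follow the formulas in the body, while the `mix` gate for blending lane0 with the embedding stream `x0`, the scalar-gate simplification, and the function names are illustrative assumptions, not the submission's actual code:

```python
import numpy as np

def parallel_block(lane0, lane1, x0, attn, mlp,
                   attn_resid, mlp_resid, attn_post, mlp_post, mix):
    """Hypothetical sketch of the post-split decoder block.

    attn_post / mlp_post are 2-element gates routing each sublayer's output
    into both lanes; mix blends lane0 with the embedding stream x0 for the
    attention input. All names are illustrative.
    """
    attn_out = attn(mix * lane0 + (1.0 - mix) * x0)  # attention reads the mixed lane0/x0 path
    mlp_out = mlp(lane1)                             # MLP reads raw lane1
    # Both sublayer outputs are accumulated into both lanes at the end of the block.
    next_lane0 = attn_resid * lane0 + attn_post[0] * attn_out + mlp_post[0] * mlp_out
    next_lane1 = mlp_resid * lane1 + attn_post[1] * attn_out + mlp_post[1] * mlp_out
    return next_lane0, next_lane1

def final_hidden(lane0, lane1):
    # Final output uses the mean of the two lanes.
    return 0.5 * (lane0 + lane1)
```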

In practice, that is essentially the only modeling change versus PR #1523, apart from moving PARALLEL_RESIDUAL_START from the baseline's 7 to 8. I ablated the start-layer change separately on top of the plain PR #1523 baseline, without my fuller parallel-residual routing changes, and on its own it gave a mild regression. The other notable requirement is the cutlass_evt_fusion path, which is needed to recover full throughput: PR #1523's logged runs had that path available, but it was not included in the submission folder itself. Without it, the wallclock cap gives up too many steps and the gain disappears.

Results (8xH100 80GB SXM, 600s)

| Seed | Steps | ms/step | Post-EMA BPB | Legal TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|---------------|-----------------|------------------|
| 1337 | 4,698 | 125.00 | 1.0827 | 1.0746 | 2.7758 | 15,956,086 |
| 2024 | 4,746 | 123.72 | 1.0836 | 1.0760 | 2.7794 | 15,959,760 |
| 42 | 4,736 | 123.97 | 1.0832 | 1.0754 | 2.7778 | 15,954,783 |
| Mean | 4,726.67 | 124.23 | 1.0832 | 1.07531639 | 2.77765390 | 15,956,876 |
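As a quick arithmetic check of the Mean row (the 4-decimal BPB columns are rounded from higher-precision per-seed values, so only steps, ms/step, and artifact bytes are recomputed here):

```python
# Per-seed values copied from the results table above.
seeds = {
    1337: dict(steps=4698, ms=125.00, bytes_=15_956_086),
    2024: dict(steps=4746, ms=123.72, bytes_=15_959_760),
    42:   dict(steps=4736, ms=123.97, bytes_=15_954_783),
}
n = len(seeds)
mean_steps = sum(s["steps"] for s in seeds.values()) / n    # ~4726.67
mean_ms = sum(s["ms"] for s in seeds.values()) / n          # ~124.23
mean_bytes = sum(s["bytes_"] for s in seeds.values()) // n  # 15,956,876 (floor)
```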

Reproducibility

```shell
pip install brotli sentencepiece
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp8192
for SEED in 1337 2024 42; do
    SEED=$SEED TTT_ENABLED=1 HASH_EMBED_ENABLED=1 TTT_LR=0.01 MUON_MOMENTUM=0.97 PARALLEL_RESIDUAL_START=8 GPTQ_RESERVE_SECONDS=13 \
    torchrun --standalone --nproc_per_node=8 train_gpt.py
done
```

The cutlass_evt_fusion/ directory should live alongside train_gpt.py in the directory you run from.

CUTLASS EVT Build

I include the prebuilt .so in cutlass_evt_fusion/ only as a convenience for matching environments. If needed for verification, it can be rebuilt from source with:

```shell
git clone https://github.com/NVIDIA/cutlass.git /opt/cutlass
cd /opt/cutlass
git checkout 08185b9c3e90510ee2b656662ed0d53b06d28157
cd -
pip install --no-build-isolation ./cutlass_evt_fusion
```

Artifact Size Note

The reported artifact sizes above follow the challenge's usual accounting of train_gpt.py code bytes plus compressed model bytes. If I also count the custom cutlass_evt_fusion source files that are shipped here for reproducibility, specifically csrc/gemm_act_grad.cu, csrc/torch_binding.cpp, __init__.py, and setup.py, that adds 8,579 bytes. Under that stricter accounting, the mean artifact size would be 15,965,455 bytes instead of 15,956,876.
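The two accountings differ only by the shipped fusion sources; as a trivial consistency check of the figures above:

```python
# Standard accounting: train_gpt.py code bytes + compressed model bytes (mean over seeds).
standard_mean = 15_956_876
# Stricter accounting adds the shipped cutlass_evt_fusion sources:
# csrc/gemm_act_grad.cu, csrc/torch_binding.cpp, __init__.py, setup.py.
fusion_sources = 8_579
strict_mean = standard_mean + fusion_sources  # 15,965,455 bytes
```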

@msisovic msisovic changed the title Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats vs PR #1523 Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Apr 11, 2026
@msisovic msisovic changed the title Record: ParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Apr 11, 2026
@mikeapedia

Awesome submission! I was just looking through your seed1337 log and I noticed your gptq_reserve_seconds is set to 12.0 in row 21 but the hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see if it happened in all 3 runs.

@msisovic msisovic changed the title Record: ImprovedParallelResiduals, 1.0744 BPB / 2.7752 nats, -0.0034 BPB / -0.0088 nats vs PR #1523 Record: ParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523 Apr 11, 2026
@msisovic
Contributor Author

> Awesome submission! I was just looking through your seed1337 log and I noticed your gptq_reserve_seconds is set to 12.0 in row 21 but the hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see if it happened in all 3 runs.

Awesome, thanks for noticing! I reran it with 13s reserved, and as expected it didn't noticeably change the score. However, I noticed that I had accidentally run all three runs with seed 1337, so I corrected that as well; it was a bit of a hit on the score, but it still clears the bar.

@MatoTeziTanka

Review: PR #1529 — Improved Parallel Residuals

Thanks @msisovic — this picks up your own upstream parallel-residual work (PR #1204) and re-grafts it onto @EthanYangTW's PR #1523 split-lane baseline. Audited head SHA 423eb748.

Compliance

  • Pure neural. No n-gram, no byte-lookup, no retrieval side-channels. Decoder is parallel-residual lanes with attention reading mixed lane0/x0 and MLP reading raw lane1, final hidden = mean of the two lanes (_final_parallel_hidden). Parallel block at _parallel_block matches the body's description: attn_resid * laneX + attn_post[X] * attn_out, mlp_resid * laneX + mlp_post[X] * mlp_out.
  • Standard eval. Uses the canonical eval_val path: val_bpb = val_loss / log(2) * (token_count / byte_count), reduced over all validation shards. No custom scoring.
  • Legal TTT (TTT_ENABLED=1) and the standard HASH_EMBED_ENABLED=1, PARALLEL_RESIDUAL_START=8, MUON_MOMENTUM=0.97 knobs from the Record: SP8192 + Triple Recurrence + Banking + Fused MLP + Muon 0.97 — val_bpb 1.0778 (3-seed mean) #1523 lineage.
  • Artifact bytes. submission.json reports artifact_bytes_mean=15,957,888 / max 15,959,005, under the 16 MB cap. You also disclose the +8,579 B stricter accounting for the shipped cutlass_evt_fusion/csrc/* — appreciated; the fusion extension is an optional throughput path and train_gpt.py itself does not import it, so the standard accounting still holds for the scored artifact.
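For reference, the canonical conversion quoted in the eval bullet can be written out directly. The implied tokens-per-byte ratio below is back-derived from this PR's reported (nats, BPB) pair; it is an inference for illustration, not a number stated in the source:

```python
import math

def val_bpb(val_loss_nats, token_count, byte_count):
    # Canonical eval_val conversion: nats per token -> bits per byte.
    return val_loss_nats / math.log(2) * (token_count / byte_count)

# Back-derived from the reported pair (2.77765390 nats, 1.07531639 BPB):
# roughly 0.268 tokens per byte, i.e. ~3.7 bytes per sp8192 token.
implied_ratio = 1.07531639 * math.log(2) / 2.77765390
```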

CPU smoke test (CT2038 proteus-engine, 2026-04-11)

IMPORT_OK seconds=4.64
HAS_HYPERPARAMETERS=True, HAS_GPT=True
MODEL_DIM=512, NUM_HEADS=8, NUM_LAYERS=11, NUM_LOOPS=2
VOCAB_SIZE=8192, TRAIN_SEQ_LEN=2048
PARALLEL_RESIDUAL_START=7 (default; repro overrides to 8)
TTT_EPOCHS=3, TTT_LR=0.005, QK_GAIN_INIT=5.0, MATRIX_LR=0.022
CODE_BYTES=67198
SMOKE_TEST_PASS

Imports, hyperparameters, and GPT construction all load clean under the shared smoke harness.

One thing worth cross-checking

The PR body's per-seed table and the submission.json seed_results disagree on the legal-TTT BPB:

| Seed | PR body | submission.json |
|------|---------|-----------------|
| 1337 | 1.0746 | 1.07484573 |
| 2024 | 1.0760 | 1.07428051 |
| 42 | 1.0754 | 1.07402553 |
| mean | 1.07531639 | 1.07438392 |

Both numbers still beat PR #1523 by a comfortable margin (-0.0034 vs -0.0025), but the grader reads submission.json, so the displayed record and the body narrative should probably be aligned for bookkeeping. Not a correctness issue, just a cleanup.

Verdict

Structurally sound, fully reproducible on a matching 8xH100 SXM box, pure neural, legal TTT, and a principled reunification of the split-lane baseline with your own fuller routing. Clean frontier progress from the upstream author — no objections from me.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): SMOKE_TEST_PASS. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA 423eb748.

andrewbaggio1 added a commit to andrewbaggio1/parameter-golf that referenced this pull request Apr 11, 2026
…ch=1 — val_bpb 1.07636 (3-seed mean)

3-seed mean 1.07636 BPB (std 0.0006), delta -0.00897 nats vs merged SOTA openai#1493.
Novel: TMA fused MLP kernel, Tap-In unigram matching (min_match=1, fires 21% of positions),
improved parallel residuals from openai#1529, parameter banking from openai#1523.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Community Review — Record: ParallelResiduals, 1.0753 BPB

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #1529 ("ImprovedParallelResiduals") implements Test-Time Training via eval_val_sliding_ttt() (lines 1324–1476). The TTT logic follows the score-first-per-chunk pattern (PR #1413 pattern) exactly:

Score phase (lines 1399–1426):
Each chunk's windows are scored under torch.no_grad() before any optimizer step:

with torch.no_grad():
    for bi in range(0, len(my_windows), batch_seqs):
        ...
        logits = compiled_logits(x_batch)
        nll = F.cross_entropy(...)
        loss_sum += scored_nll.sum()

Train phase (lines 1427–1458):
Training only runs if not is_last_chunk — the is_last_chunk guard is present at line 1427, preventing optimizer steps on the final scored chunk:

is_last_chunk = ci == num_chunks - 1
if not is_last_chunk and h.ttt_epochs > 0:
    ...
    for _ep in range(h.ttt_epochs):
        optimizer.zero_grad(set_to_none=True)
        loss = base_model(x, y)
        loss.backward()
        optimizer.step()

The optimizer trains on the current chunk after it has been scored, updating the model for the next chunk. This is the canonical legal pattern.
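The score-then-train discipline can be condensed into a toy loop (illustrative only; `score` and `train_step` stand in for the submission's batched scoring and optimizer logic):

```python
# Minimal sketch of the score-first-per-chunk TTT pattern described above.
def score_first_ttt(chunks, score, train_step, ttt_epochs=3):
    total_nll, trained_chunks = 0.0, []
    num_chunks = len(chunks)
    for ci, chunk in enumerate(chunks):
        # Score phase: every chunk is scored before any update on it.
        total_nll += score(chunk)
        # Train phase: guarded so the final scored chunk never trains.
        is_last_chunk = ci == num_chunks - 1
        if not is_last_chunk and ttt_epochs > 0:
            for _ep in range(ttt_epochs):
                train_step(chunk)  # adapts the model for the *next* chunk
            trained_chunks.append(ci)
    return total_nll, trained_chunks
```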

N-gram check: No ctx_hash ^ (target * primes[k]) pattern found anywhere. The only hash embedding is a bigram-style position embedding: h_idx = (prev_ids * 2039 + input_ids) % h.hash_embed_size (line 1354), which is input[t-1] * constant + input[t] — a legal BigramHash-style positional feature used as an additional learned embedding, not a lookup-table override.
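The quoted hash-index expression, sketched standalone (the zero-padding of the first position is an assumption for illustration):

```python
import numpy as np

# Bigram-style hash index: each position's embedding index depends only on
# (previous token, current token), as quoted from line 1354 above.
def bigram_hash_idx(input_ids, hash_embed_size, mult=2039):
    prev_ids = np.concatenate([[0], input_ids[:-1]])  # position 0 padding is an assumption
    return (prev_ids * mult + input_ids) % hash_embed_size
```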

No SLOT pattern: No mask+optimize+score-same-region logic detected.

Architecture note: The submission adds custom CUTLASS EVT fusion for MLP backward pass (cutlass_evt_fusion) and a Triton TMA fused leaky-relu-squared kernel — these are pure training-efficiency improvements, not eval shortcuts.

Classification: LEGAL_SCORE_FIRST_TTT_CLEAN

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under torch.no_grad() before optimizer.step(), with is_last_chunk guard preventing adaptation on the final scored chunk.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing full train_gpt.py source, cross-checked against deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.

@msisovic
Contributor Author

Thanks for the review @MatoTeziTanka, and for catching the submission.json inconsistency. It was slightly lagging, still showing previously (incorrectly) captured results, and is now updated to match the correct values from the README.

@MatoTeziTanka

Good to see the alignment fix. The review stands — clean TTT, solid architecture work.

resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 12, 2026
The current W2 frontier point is already close to the public best clean-ish line, so the highest-upside architectural import is the improved parallel residual writeback from the openai#1529/openai#1555 family. This patch ports the learned cross-lane lambda mixing into the existing split-lane decoder while keeping the pass-conditioned attention modulation and score-first doc-independent TTT stack intact.

Constraint: Single-node budget means the next experiment needs real upside, not another tiny hyperparameter nudge
Rejected: Tap-In min_match=1 import first | Higher upside on paper, but much riskier on bytes, runtime, and review surface than improved parallel residuals
Confidence: medium
Scope-risk: moderate
Directive: If this lane regresses, treat improved parallel residuals as non-additive with the current W2 modulation stack rather than trying to rescue it with more tuning
Tested: python3 -m py_compile train_gpt.py; lsp diagnostics reported no file-level errors
Not-tested: GPU score, bytes, and runtime on the integrated lane