Record: ParallelResiduals, 1.0753 BPB / 2.7777 nats, -0.0025 BPB / -0.0064 nats vs PR #1523 #1529
msisovic wants to merge 9 commits into openai:main
Conversation
Awesome submission! I was just looking through your seed-1337 log and noticed your `gptq_reserve_seconds` is set to 12.0 in row 21, but the Hessian collection took 12.6 seconds (row 204). So you may want to bump the reserve seconds up to 13 just to be safe. I didn't look through the other logs to see whether it happened in all 3 runs.
Awesome, thanks for noticing! I reran it with 13s reserved, and as expected it didn't noticeably change the score. However, I noticed that I had accidentally run all three runs with seed 1337, so I corrected that as well. That was a bit of a hit on the score, but it still clears the bar.
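The fix above boils down to a simple invariant: the reserved budget must cover the slowest observed collection time. A toy check with the numbers from this exchange (illustrative only, not code from the submission):

```python
# Values from the exchange above; the field name matches the run log.
measured_hessian_seconds = 12.6   # slowest Hessian collection observed (row 204)
gptq_reserve_seconds = 13.0       # bumped from the original 12.0

# The reserve must exceed the worst observed collection time, with headroom.
headroom = gptq_reserve_seconds - measured_hessian_seconds
assert headroom > 0, "reserve too small: Hessian collection would overrun"
print(f"headroom: {headroom:.1f}s")
```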
## Review: PR #1529 — Improved Parallel Residuals

Thanks @msisovic — this picks up your own upstream parallel-residual work (PR #1204) and re-grafts it onto @EthanYangTW's PR #1523 split-lane baseline.

Audited head SHA:

### Compliance
### CPU smoke test (CT2038 proteus-engine, 2026-04-11)

Imports, hyperparameters, and

### One thing worth cross-checking

The PR body's per-seed table and the
Both numbers still beat PR #1523 by a comfortable margin.

### Verdict

Structurally sound, fully reproducible on a matching 8xH100 SXM box, pure neural, legal TTT, and a principled reunification of the split-lane baseline with your own fuller routing. Clean frontier progress from the upstream author — no objections from me.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): SMOKE_TEST_PASS.

AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code at SHA
…ch=1 — val_bpb 1.07636 BPB (3-seed mean, std 0.0006), delta -0.00897 nats vs merged SOTA openai#1493. Novel: TMA fused MLP kernel, Tap-In unigram matching (min_match=1, fires 21% of positions), improved parallel residuals from openai#1529, parameter banking from openai#1523. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Community Review — Record: ParallelResiduals, 1.0753 BPB

Compliance: LOOKS CLEAN — legal score-first-per-chunk TTT (PR #1413 pattern)

PR #1529 ("ImprovedParallelResiduals") implements Test-Time Training via a score phase followed by a train phase.

Score phase (lines 1399–1426):

```python
with torch.no_grad():
    for bi in range(0, len(my_windows), batch_seqs):
        ...
        logits = compiled_logits(x_batch)
        nll = F.cross_entropy(...)
        loss_sum += scored_nll.sum()
```

Train phase (lines 1427–1458):

```python
is_last_chunk = ci == num_chunks - 1
if not is_last_chunk and h.ttt_epochs > 0:
    ...
    for _ep in range(h.ttt_epochs):
        optimizer.zero_grad(set_to_none=True)
        loss = base_model(x, y)
        loss.backward()
        optimizer.step()
```

The optimizer trains on the current chunk after it has been scored, updating the model for the next chunk. This is the canonical legal pattern.

- N-gram check: No
- No SLOT pattern: No mask+optimize+score-same-region logic detected.

Architecture note: The submission adds custom CUTLASS EVT fusion for the MLP backward pass (`cutlass_evt_fusion`) and a Triton TMA fused leaky-relu-squared kernel — these are pure training-efficiency improvements, not eval shortcuts.

Classification: LEGAL_SCORE_FIRST_TTT_CLEAN

Verdict: LOOKS CLEAN — legal TTT implementation matching the PR #1413 (dexhunter) pattern: each chunk scored under

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation, under-16MB artifact cap, ≤600s train + ≤600s eval on 8×H100 SXM). TTT implementation follows the legal score-first discipline.

Reviewed by @MatoTeziTanka — The Agora. Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source, cross-checked against a deterministic AST classifier. If this review misread your code, please call it out so I can re-audit manually.
Thanks for the review @MatoTeziTanka and for catching the
Good to see the alignment fix. The review stands — clean TTT, solid architecture work. |
The current W2 frontier point is already close to the public best clean-ish line, so the highest-upside architectural import is the improved parallel residual writeback from the openai#1529/openai#1555 family. This patch ports the learned cross-lane lambda mixing into the existing split-lane decoder while keeping the pass-conditioned attention modulation and score-first doc-independent TTT stack intact.

- Constraint: Single-node budget means the next experiment needs real upside, not another tiny hyperparameter nudge.
- Rejected: Tap-In min_match=1 import first | Higher upside on paper, but much riskier on bytes, runtime, and review surface than improved parallel residuals.
- Confidence: medium
- Scope-risk: moderate
- Directive: If this lane regresses, treat improved parallel residuals as non-additive with the current W2 modulation stack rather than trying to rescue it with more tuning.
- Tested: `python3 -m py_compile train_gpt.py`; LSP diagnostics reported no file-level errors.
- Not-tested: GPU score, bytes, and runtime on the integrated lane.
Record: Improved Parallel Residuals
val_bpb: 1.07531639 (3-seed mean, std 0.0006) | 2.77765390 nats | ~15.96 MB | 8xH100 SXM, 600s | Legal TTT
This submission starts from PR #1523. Most of the newer submissions moved away from my fuller parallel-residual formulation and settled on a simpler GPT-J-style split-lane decoder. This version keeps the strong parts of that newer baseline and reintroduces the useful parts of my parallel residual implementation.
The key architectural change relative to PR #1523 is in the decoder after the split point. Attention and MLP read from different lanes, but neither sublayer writes back immediately. Instead, both outputs are accumulated into the two lanes together at the end of the block:
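Since the original snippet did not survive extraction here, the block below is a hedged reconstruction of what such a deferred, learned writeback could look like. The `lambda_attn`/`lambda_mlp` names, the per-lane scalar coefficients, and the stand-in sublayers are all assumptions for illustration, not the PR's actual code:

```python
import torch
import torch.nn as nn

class TwoLaneBlock(nn.Module):
    """Illustrative sketch: attention and MLP read different lanes, and
    both outputs are written back into BOTH lanes only at the end of the
    block, via learned cross-lane mixing coefficients (names assumed)."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, dim)  # stand-in for real attention
        self.mlp = nn.Linear(dim, dim)   # stand-in for real MLP
        # One learned scalar per (source sublayer, destination lane).
        self.lambda_attn = nn.Parameter(torch.tensor([1.0, 0.0]))
        self.lambda_mlp = nn.Parameter(torch.tensor([0.0, 1.0]))

    def forward(self, lane0, lane1):
        # Parallel-in-time: both sublayers see the lanes as they were at
        # block entry; neither output is visible to the other sublayer.
        a = self.attn(lane0)
        m = self.mlp(lane1)
        # Deferred joint writeback into the two lanes.
        lane0 = lane0 + self.lambda_attn[0] * a + self.lambda_mlp[0] * m
        lane1 = lane1 + self.lambda_attn[1] * a + self.lambda_mlp[1] * m
        return lane0, lane1

block = TwoLaneBlock(16)
l0, l1 = block(torch.randn(2, 16), torch.randn(2, 16))
print(l0.shape, l1.shape)
```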
That keeps the GPT-J-style parallel-in-time update, while restoring the richer learned routing between the attention and MLP lanes. The other important part is that decoder U-Net skips are still written only into `lane0`, which preserves the cheaper and more stable skip path from the newer baseline. Attention reads the mixed `lane0`/`x0` path, while MLP reads the raw `lane1`. The final output uses the mean of the two lanes.

In practice, that is pretty much the only modeling change here versus PR #1523, together with moving `PARALLEL_RESIDUAL_START` from the baseline's 7 to 8. I ablated that start-layer change separately on top of the plain PR #1523 baseline, without my fuller parallel-residual routing changes, and it gave a mild regression on its own.

The other notable requirement is that I needed the `cutlass_evt_fusion` path to recover the full throughput. PR #1523's logged runs were run with that path available, but it was not included in the submission folder itself. Without it, the wallclock cap gives up too many steps and the gain disappears.

### Results (8xH100 80GB SXM, 600s)
### Reproducibility
The `cutlass_evt_fusion/` directory should live alongside `train_gpt.py` in the directory you run from.

### CUTLASS EVT Build
I include the prebuilt `.so` in `cutlass_evt_fusion/` only as a convenience for matching environments. If needed for verification, it can be rebuilt from source with:

### Artifact Size Note
The reported artifact sizes above follow the challenge's usual accounting of `train_gpt.py` code bytes plus compressed model bytes. If I also count the custom `cutlass_evt_fusion` source files that are shipped here for reproducibility, specifically `csrc/gemm_act_grad.cu`, `csrc/torch_binding.cpp`, `__init__.py`, and `setup.py`, that adds 8,579 bytes. Under that stricter accounting, the mean artifact size would be 15,965,455 bytes instead of 15,956,876.
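As a quick sanity check on the stricter accounting, using only the figures stated above:

```python
# Challenge's usual accounting vs the stricter accounting described above.
base_bytes = 15_956_876     # train_gpt.py code bytes + compressed model bytes
extra_source_bytes = 8_579  # csrc/gemm_act_grad.cu, csrc/torch_binding.cpp,
                            # __init__.py, setup.py
strict_total = base_bytes + extra_source_bytes
print(strict_total)  # 15965455, still below the 16 MB artifact cap
```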