Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204

Open
msisovic wants to merge 20 commits into openai:main from msisovic:hyperconnections_submission


@msisovic msisovic commented Apr 1, 2026

Record: Parallel Residuals + Mini Depth Recurrence

val_bpb: 1.1063 (3-seed mean, std 0.0017) | 1.8679 nats | ~15.94 MB | 8×H100 SXM, 600s | No TTT

I started this submission from PR #1179, which gave me the base training stack I wanted to iterate on here. On top of that, I ported over the mixed-quantization and autoregressive GPTQ path from PR #1105. That was partly a modeling choice and partly a practical one: AR self-generated GPTQ calibration was already a known acceptable path for this challenge, and it let me avoid having the quantization step depend on last-minute training-data access in a way that makes the 10-minute budget awkward to manage.

Results (8×H100 80GB SXM, 600s, no TTT)

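| Seed | Steps | ms/step | Post-EMA BPB | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------|---------|--------------|-------------|-----------------|------------------|
| 1337 | 6,242 | 96.1 | 1.1232 | 1.1066 | 1.8684 | 15,942,395 |
| 42 | 6,248 | 96.0 | 1.1235 | 1.1077 | 1.8704 | 15,919,617 |
| 2024 | 6,240 | 96.2 | 1.1216 | 1.1044 | 1.8648 | 15,946,657 |
| Mean | 6,243 | 96.1 | 1.1228 | 1.1063 | 1.8679 | 15,936,223 |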
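
As a quick sanity check on the headline figure, the per-seed sliding-window BPB values in the table can be aggregated directly. This is a minimal sketch: it assumes the reported "std" is the sample (n - 1) standard deviation, and it works from the 4-decimal table values, so the mean comes out as 1.1062 rather than the exact 1.10625353 computed from unrounded per-seed numbers.

```python
from statistics import mean, stdev

# Sliding-window BPB per seed, copied from the results table.
sliding_bpb = {1337: 1.1066, 42: 1.1077, 2024: 1.1044}

values = list(sliding_bpb.values())
bpb_mean = mean(values)
# stdev() is the sample (n - 1) standard deviation; treating the reported
# "std 0.0017" as a sample std is an assumption of this sketch.
bpb_std = stdev(values)

print(f"mean={bpb_mean:.4f} std={bpb_std:.4f}")
```

Under that assumption the spread rounds to the 0.0017 quoted in the summary line.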

Comparison baseline PR #1179: 1.11053346 BPB (1.87508426 nats).
This run's exact 3-seed mean: 1.10625353 BPB (1.86785780 nats).
Delta vs PR #1179: -0.00722646 nats (-0.00427993 BPB).

Current merged SOTA (2026-03-25 AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112): 1.11473509 BPB (1.88217853 nats).
Delta vs current merged SOTA: -0.01432073 nats (-0.00848156 BPB).
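
The deltas quoted above follow directly from the exact means. The sketch below just redoes that subtraction; the helper name `deltas` is made up for illustration and is not part of the submission.

```python
# Exact means quoted in the text, as (BPB, nats) pairs.
this_run = (1.10625353, 1.86785780)
baselines = {
    "PR #1179": (1.11053346, 1.87508426),
    "merged SOTA": (1.11473509, 1.88217853),
}


def deltas(run, ref):
    """Return (delta_bpb, delta_nats); negative means the run is better."""
    return run[0] - ref[0], run[1] - ref[1]


for name, ref in baselines.items():
    d_bpb, d_nats = deltas(this_run, ref)
    print(f"{name}: {d_nats:+.8f} nats, {d_bpb:+.8f} BPB")
```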

Parallel residuals

I took this idea from my modded-nanogpt record in KellerJordan/modded-nanogpt PR #230 and adapted it to this codebase.

Chronologically, this change actually came last. I am putting it first because it ended up being the single biggest gain on top of the base + mini-depth-recurrence stack. Relative to the under-budget mini-DR baseline (1.8705 val loss / 1.1078 BPB in sliding-window eval), it improved things by roughly another 0.0037 nats and 0.0022 BPB, landing around 1.8668 / 1.1056. This is still a single-run comparison, though, so I do not want to overstate the precision of that delta.

Starting from layer 7, attention and MLP read from different residual lanes, and each sublayer learns how strongly to write back into both lanes.
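
To make the two-lane idea concrete, here is a framework-free sketch of one such block. It is an illustration under assumptions, not the code in train_gpt.py: lanes are plain lists of floats standing in for hidden states, `parallel_block` and its arguments are hypothetical names, both sublayers are assumed to read their lanes before either one writes, and any extra scaling of the residual itself is left out.

```python
def parallel_block(attn_lane, mlp_lane, attn_fn, mlp_fn, w):
    """Apply one two-lane block and return the updated (attn_lane, mlp_lane).

    w maps the four routing names to scalar write strengths.
    """
    # Each sublayer reads only its own lane.
    attn_out = attn_fn(attn_lane)
    mlp_out = mlp_fn(mlp_lane)
    # Each sublayer writes into both lanes, scaled by its learned weights.
    new_attn_lane = [
        x + w["attn_to_attn"] * a + w["mlp_to_attn"] * m
        for x, a, m in zip(attn_lane, attn_out, mlp_out)
    ]
    new_mlp_lane = [
        x + w["attn_to_mlp"] * a + w["mlp_to_mlp"] * m
        for x, a, m in zip(mlp_lane, attn_out, mlp_out)
    ]
    return new_attn_lane, new_mlp_lane


# Toy stand-ins for the real sublayers.
toy_attn = lambda v: [2.0 * x for x in v]
toy_mlp = lambda v: [x + 1.0 for x in v]
weights = {"attn_to_attn": 1.0, "attn_to_mlp": 0.5, "mlp_to_attn": 0.0, "mlp_to_mlp": 1.0}

lanes = parallel_block([1.0, 2.0], [1.0, 2.0], toy_attn, toy_mlp, weights)
print(lanes)
```

Setting `mlp_to_attn` to zero, as in the toy weights here, corresponds to the case where the MLP does not write into the attention lane at all.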

One interesting pattern is that the learned routing is quite asymmetric, which is also what I saw in the modded-nanogpt run: MLP barely writes back into attention's residual stream, especially in the deeper partitioned layers.

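| Virtual layer | Physical layer | attn_to_attn | attn_to_mlp | mlp_to_attn | mlp_to_mlp |
|---------------|----------------|--------------|-------------|-------------|------------|
| 9 | 7 | 1.3030 | 0.8484 | 0.3851 | 1.3043 |
| 10 | 8 | 2.0972 | 0.8114 | 0.0557 | 1.7884 |
| 11 | 9 | 0.4523 | 0.9251 | 0.0098 | 0.2692 |
| 12 | 10 | 1.0153 | -0.0160 | 0.0844 | 0.0844 |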

Despite that pattern, I also tried the follow-up optimization from modded-nanogpt PR #241, where the MLP simply does not write to the attention lane at all, in order to get a speedup. In this repo that brought a slight regression, so I kept the original parallel-residual formulation instead.

Mini Depth Recurrence

Note: Most of the recurrence sweeps under this section were run on an older baseline, and I later transferred the final recipe over to the newer baseline used for this submission.

After some early failed attempts at full recurrence, I backed off to a much smaller version of the idea: instead of recurring the whole stack, I only repeated a couple of middle layers. I had already convinced myself from over-budget probes that extra depth was real, so the question became how much of that gain I could recover with minimal weight sharing.

The main sweeps were simple but informative. Repeating one layer helped, repeating two consecutive layers helped more, and repeating three was already losing to the step-time penalty. I also swept the position of the repeated pair and found a clear sweet spot at layers 4,5, right around the U-Net hinge point. So the useful regime here was not “add recurrence everywhere”, it was “reuse a very small part of the middle of the stack.”

The next improvement was to turn recurrence on only partway through training. Since repeated layers slow every step down, I trained the cheaper non-recurrent model first and only activated recurrence later. In the earlier sweep, always-on recurrence reached about 1.1163 BPB post-TTT, while delayed recurrence improved that to about 1.1153, with RECUR_START_STEP=3000 working well.
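
The delayed schedule can be sketched as a function from training step to the order in which physical layers are run. This is a hypothetical illustration, not the submission's code: `layer_order` is an invented name, the repeated pair is assumed to run a second time immediately after its first pass, and recurrence is assumed to switch on at exactly `recur_start_step`.

```python
def layer_order(step, num_layers=11, recur_layers=(4, 5), recur_start_step=3000):
    """Return the physical layer index executed at each virtual position."""
    order = list(range(num_layers))
    if step < recur_start_step:
        # Early training: the cheaper non-recurrent stack.
        return order
    # Assumption: the repeated block runs again right after its first pass.
    cut = order.index(recur_layers[-1]) + 1
    return order[:cut] + list(recur_layers) + order[cut:]


early = layer_order(step=0)
late = layer_order(step=3000)
print(len(early), early)
print(len(late), late)
```

With these defaults the late-training order has 13 virtual layers, and virtual position 9 maps to physical layer 7, which is consistent with the virtual/physical numbering in the routing table in the parallel-residuals section.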

Finally, because mixed precision left me some parameter budget headroom, I found that the best place to spend it was untying the repeated MLPs while leaving the rest of the recurrent block shared. That gave another small but real improvement. Roughly speaking, mini depth recurrence was worth about 0.003-0.004 nats and 0.002-0.003 BPB over the best under-budget non-recurrent depth probe I had at the time.
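
One way to picture "untie the repeated MLPs, share the rest" is as a plan of which parameter set each virtual layer uses. The sketch below is an assumption-laden illustration: the key names (`attn.4`, `mlp_repeat.4`, and so on) and the `module_plan` helper are invented here, and it ignores the dropped layer 0 attention.

```python
def module_plan(num_layers=11, recur_layers=(4, 5), untie_mlp_layers=(4, 5)):
    """List (attn_key, mlp_key) for every virtual layer once recurrence is on."""
    first_pass = [(f"attn.{i}", f"mlp.{i}") for i in range(num_layers)]
    cut = max(recur_layers) + 1
    repeat = [
        # Attention is always shared with the first pass; the MLP gets its own
        # weights only if the layer is listed in untie_mlp_layers.
        (f"attn.{i}", f"mlp_repeat.{i}" if i in untie_mlp_layers else f"mlp.{i}")
        for i in recur_layers
    ]
    return first_pass[:cut] + repeat + first_pass[cut:]


plan = module_plan()
unique_attn = {attn for attn, _ in plan}
unique_mlp = {mlp for _, mlp in plan}
print(len(plan), len(unique_attn), len(unique_mlp))
```

In this picture the extra parameter cost of recurrence comes only from the two additional MLPs, while the repeated attention layers add compute but no new weights.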

Reproducibility

The main training runs for this submission used the following command:

SEED=$SEED POST_GPTQ_EVAL_ONLY=0 BIGRAM_DIM=112 MIXED_QUANT=1 N_INT6_LAYERS=32 NUM_LAYERS=11 RECUR_LAYERS=4,5 RECUR_START_STEP=3000 REPEAT_UNTIE_MLP=full REPEAT_UNTIE_MLP_LAYERS=4,5 DISABLE_LAYER0_ATTN=1 PARALLEL_RESIDUAL=1 PARALLEL_START_LAYER=7 torchrun --standalone --nproc_per_node=8 train_gpt.py

brotli also needs to be installed for the final artifact path. It is included in the copied requirements.txt.

PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 1, 2026
Architectural innovations from PR openai#1204 (1.1063 BPB record):
- QK_GAIN_INIT=4.0 (from PR openai#1125 sweep, -0.006 BPB)
- Parallel Residuals: dual-lane from physical layer 7+
  - Attn reads lane0, MLP reads lane1, learned cross-lane writes
  - parallel_post_lambdas [N,2,2], parallel_resid_lambdas [N,2]
- Mini Depth Recurrence: repeat layers 4,5 between encoder/decoder
  - Delayed activation at step 3000 (avoids disrupting early training)
  - Tied MLP weights (no extra params, keeps model within 16MB)
- Bigram dim reduced 128->112 for budget headroom
- Refactored forward into _run_backbone() for DRY encoder/decoder/parallel

msisovic commented Apr 1, 2026

PS: One additional thing worth noting is that, in order to save a bit more space for untying the MLPs in recurrence, I dropped layer 0 attention in addition to using mixed quant. This was inspired by looking at the weights with which the layers/blocks contribute back to the residual connection. Layer 0 attention was about an order of magnitude lower than the rest, and dropping it resulted in a very minor degradation while saving on params.

@valerio-oai
Contributor

Hi! This looks potentially interesting: could you add train_gpt.py to make this reviewable?


msisovic commented Apr 2, 2026

@valerio-oai Hi, thanks, glad that you found it interesting! I don't know how I missed that part. train_gpt.py is now included, and you can verify it's the same as the code prepended to the log files.

Edit: The base I was working from was a minified training script designed to save space, so the code doesn't really have space to breathe and multiple lines are merged into one, but I don't think it's too unreadable. LMK if this is a problem.

gptq:collecting hessians from autoregressive data...
gptq:collected hessians for 70 layers (AR self-gen)
gptq:done in 241.8s
wallclock:post_gptq total_elapsed:858.2s train_budget:600.0s
Author


Note for reviewers: this is a leftover log from when GPTQ calibration was done on training data, so I had to double-check that everything up to this point fit within the training budget. I have since switched to AR generation of the calibration sequences, so this log is no longer relevant.

nestamidavaine commented Apr 2, 2026

@msisovic Hi, this is a very nice approach. I came to a similar conclusion: the step-time penalty is too high for many-pass recurrence, but the capacity increase is there. So in #1231 I also grow the number of recurrence steps over time. I actually found that, with the stabilizations I added to reduce error build-up when training a quantized model, TTT can be quite effective. I added a regularization on the magnitudes of the hidden states of the recurring block. Maybe our approaches can be combined in some way.

Maybe just adding the regularization would let you do 3x or 4x recurrence passes.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 2, 2026
…th recurrence

All 9 audit findings addressed (2 CRITICAL, 4 WARNING, 3 INFO):
- C1: Route gradient test validates nonzero after 1 optimizer step
- C2: Mixed quant budget (INT4 MLP, INT6 attn, INT8 rest) fits 15.3MB
- W1: Separate resid_mix_mlp for parallel MLP lane
- W2: Assert parallel doesn't start inside encoder
- W3: Document post-skip recurrence is intentional per PR openai#1204
- W4: SmearGate caches self.fc(x)
- W5: Extract _run_layers() to deduplicate forward/forward_logits
- I2: Test recurrent+parallel overlap on layer 7
- I3: Learnable lane_merge parameter

6 tests pass on CPU in ~21s. Budget verified at 15.3MB for 30.2M params.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 2, 2026
…val_bpb 1.0876 (3-seed mean)

3-seed exact mean: 1.08759808 BPB (std 0.00036912)
Beats merged SOTA PR openai#1019 (1.1147 BPB) by -0.0271 BPB
Welch t = -91.92, df = 3.99, p << 0.01

Built on our PRs openai#549 and openai#1019. Adds Scylla tokenizer (PR openai#1143,
@simon-marcus), parallel residuals + mini depth recurrence (PR openai#1204,
@msisovic), mixed INT5/INT6 quantization + brotli, legal score-first TTT.

All artifacts under 16MB. 8xH100 SXM, 600s training + ~495s TTT eval.
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 3, 2026
… + Legal TTT — val_bpb 1.0819

3-seed mean: 1.0819 BPB (std: 0.00088)
Seeds: 42=1.0808, 1337=1.0829, 2024=1.0821

Integration of community techniques with full attribution:
- Base: PR openai#549, openai#1019 by @abaybektursun
- Scylla tokenizer: PR openai#1143 by @simon-marcus
- Parallel residuals + depth recurrence: PR openai#1204 by @msisovic
- Legal TTT: PR openai#461 by @Christopher-Lee-McClendon

Our engineering: mixed INT5/INT6 quantization, learnable lane merge,
Scylla retokenization pipeline, integration work, CPU e2e test suite.