
Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)#855

Open
aazizyan wants to merge 10 commits into openai:main from aazizyan:research/RecurrenceFix_3Loop_Birkhoff_OutputLN_TimestepScale

Conversation


aazizyan commented Mar 26, 2026

Depth Recurrence in Parameter-Constrained Transformers: A Systematic Study

20 ablation runs across 5 series testing 8 techniques for stabilizing depth recurrence under 16MB int8+zlib quantization. Three novel stabilization techniques enable 3-loop recurrence for the first time in competition history. Five additional techniques tested with documented positive and negative results.

Best Results

| Config | Post-Q BPB | Q-gap | Artifact | Note |
| --- | --- | --- | --- | --- |
| 1+4×3+1 full share + FiLM + sinusoidal depth (Run T) | 1.2624 | +0.0073 | 10.7 MB | Best practical config, ~4.8 MB headroom |
| 1+4×2+1 shared attn + unique MLPs (Run L) | 1.2406 | +0.0073 | 14.7 MB | Best absolute, but no room for SOTA |

Techniques That Work

| Technique | Delta | Cost |
| --- | --- | --- |
| Output-LN | −0.007 BPB | Zero |
| Prelude-coda | −0.016 BPB | More unique params |
| Birkhoff mixing | Enables 3-loop stability | Zero |
| Timestep scaling (γ) | Q-gap −26–30% | ~8 KB FP16 |
| FiLM bias (β) | −0.003 BPB | ~8 KB FP16 |
| Sinusoidal depth encoding | Q-gap −0.0005 | Zero (non-persistent buffer) |
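The PR itself ships no code, but the role of Birkhoff mixing can be sketched as a learned convex combination of the incoming hidden state and the shared block's output. The sigmoid-gated form below is an assumption (consistent with the later note that the mixing weights are sigmoid values in [0, 1]), not the actual implementation; `birkhoff_mix` and `alpha_logit` are illustrative names:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def birkhoff_mix(h, block_out, alpha_logit):
    """Convex combination of the incoming hidden state and the block output.

    With weight w = sigmoid(alpha_logit) in [0, 1], each coordinate becomes
    w * block_out + (1 - w) * h, so the mixed state never leaves the segment
    between the two inputs -- a cheap stability guarantee when the same
    block is looped several times.
    """
    w = sigmoid(alpha_logit)
    return [w * b + (1.0 - w) * x for x, b in zip(h, block_out)]

# Looping a toy "block" (here just tanh) 3 times with mixing:
h = [0.5, -1.0, 2.0]
for _ in range(3):
    block_out = [math.tanh(v) for v in h]
    h = birkhoff_mix(h, block_out, alpha_logit=0.0)  # w = 0.5
```

Because the gate is a convex combination, repeated application cannot blow up the hidden-state norm beyond the block's own output range, which is one plausible reading of why it "enables 3-loop stability" at zero parameter cost.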

Techniques That Don't Work (documented negative results)

| Technique | Result | Why |
| --- | --- | --- |
| Learned depth embeddings | +0.0014 BPB worse | Throughput overhead; values stayed near zero |
| Unique input norm gains | +0.0004 BPB worse | MLP gains didn't move from 1.0; redundant with Output-LN |
| Unique MLPs (attn-only sharing) | −0.026 BPB (best result) | Too expensive: 14.7 MB artifact leaves no SOTA headroom |
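For contrast with the input-space controls that failed, the FiLM scale+shift that did help can be sketched as a per-depth channel-wise affine transform. The γ/β shapes and the two-depth example below are illustrative assumptions, not the study's configuration:

```python
def film(h, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift.

    gamma and beta are small per-depth vectors (kept in FP16 in this PR's
    setup, ~8 KB total), letting one shared block behave slightly
    differently at each recurrence step.
    """
    return [g * x + b for x, g, b in zip(h, gamma, beta)]

# One (gamma, beta) pair per recurrence depth; identity at depth 0.
depth_params = [
    ([1.0, 1.0], [0.0, 0.0]),     # depth 0: identity
    ([1.1, 0.9], [0.05, -0.05]),  # depth 1: hypothetical learned offsets
]
h = [2.0, -1.0]
for gamma, beta in depth_params:
    h = film(h, gamma, beta)
```

The cost is two small vectors per depth, which is why the table above lists it at ~8 KB rather than the megabytes that unique MLP weights would require.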

Key Findings

  1. Timestep scaling helps quantization, not training — float16 passthrough params bypass int8, reducing Q-gap 26–30% with zero pre-quant BPB effect
  2. MLP needs weight-space differentiation, not input-space modulation — unique MLPs give −0.026 BPB, but cheap input-space controls (norms, depth embeddings) give nothing
  3. ALBERT's finding confirmed at 512d — attention sharing is nearly free; FFN sharing causes most of the degradation
  4. Q-gap scales with training duration — short screening runs underestimate quantization problems by 4–7×
  5. Sinusoidal > learned for depth encoding — zero cost, same Q-gap benefit, 0.0015 BPB better due to throughput savings
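The FP16-passthrough mechanism behind finding 1 can be illustrated with a toy symmetric int8 quantizer; the weight values below are made up for illustration and are not from the study:

```python
import struct

def quantize_int8(values):
    """Symmetric per-tensor int8 round-trip: scale to 127 levels, dequantize.

    Toy version: no clamping or zlib stage, just the rounding error source.
    """
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) * scale for v in values]

weights = [0.013, -0.102, 0.057, 0.249]
deq = quantize_int8(weights)
int8_err = max(abs(a - b) for a, b in zip(weights, deq))

# A passthrough parameter stored in FP16 skips int8 rounding entirely; its
# only error is FP16 representation error, well below the int8 step size
# of max|w| / 127 for these weights.
gamma = 1.37
gamma_fp16 = struct.unpack('e', struct.pack('e', gamma))[0]
fp16_err = abs(gamma - gamma_fp16)
assert fp16_err < int8_err
```

This is why a few kilobytes of FP16 γ/β can shrink the post-quantization gap while leaving pre-quant BPB untouched: the passthrough values simply never see the lossy path.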

Validated Stack for SOTA Integration

Output-LN + Birkhoff mixing + FiLM scale+shift + sinusoidal depth encoding. Total FP16 passthrough: ~50KB. Artifact: ~10.7MB. Headroom for SOTA features: ~4.8MB.
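The "zero cost" claim for the depth encoding follows from it being computable on the fly. A sketch, assuming standard transformer-style sin/cos features of the depth index (the exact frequency schedule used in the runs is not stated in this PR):

```python
import math

def sinusoidal_depth_encoding(depth: int, dim: int):
    """Fixed sin/cos features of the recurrence depth index.

    Recomputed at load time (a non-persistent buffer), so unlike learned
    depth embeddings it contributes nothing to the serialized artifact.
    """
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000.0 ** (i / dim))
        enc.append(math.sin(depth * freq))
        enc.append(math.cos(depth * freq))
    return enc[:dim]

# A distinct, fixed code per loop iteration of the shared block:
codes = [sinusoidal_depth_encoding(d, dim=8) for d in range(3)]
```

Each loop iteration gets a distinct deterministic signature, which gives the shared block the same depth-awareness a learned embedding would, without the embedding table in the artifact.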

20 Runs Across 5 Series

  • Series 1 (7 screening runs): technique isolation on 1×H100
  • Series 2 (5 full-scale runs): 8×H100 validation, Run K = first viable 3-loop (1.2659)
  • Series 3 (4 runs): FiLM bias (−0.003) + attention-only sharing (−0.026 but too expensive)
  • Series 4 (4 runs): learned depth embeddings + unique norms (negative result)
  • Series 5 (1 run): sinusoidal depth encoding (free, marginal Q-gap benefit)

See research_notes.md for theory, 14 citations, and detailed analysis.

Credits

Built on insights from:

@aazizyan (Author)

Some untested directions that might be worth exploring:

  • These three techniques on shallow recurrence (repeat 1-2 layers on the SOTA stack) — the Q-gap reduction from timestep scaling could be meaningful at the frontier
  • Int6/GPTQ interaction with Birkhoff mixing — sigmoid values in [0,1] should quantize cleanly at any bit width
  • Output-LN on non-recurrent models — may help even without weight sharing, since it lets MLP see unnormalized inputs while bounding output
  • Gamma cap ablation (2.0 vs 4.0 vs 8.0) — the cap value was chosen empirically, not optimized
  • QAT combined with Birkhoff + Output-LN + timestep scaling — QAT has been tried for recurrence before, but not with these stabilization techniques in place
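The intuition behind the second bullet — that values confined to [0, 1] quantize cleanly at any bit width — can be checked with a toy uniform quantizer; the sample values are arbitrary:

```python
def quantize_unit_interval(x: float, bits: int) -> float:
    """Uniform quantization of a value in [0, 1] to 2**bits levels."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels

# Sigmoid outputs live in a fixed, known range, so the worst-case error is
# just half a step, 1 / (2 * (2**bits - 1)) -- no outlier-driven scale.
for bits in (8, 6, 4):
    step = 1.0 / ((1 << bits) - 1)
    err = max(abs(w - quantize_unit_interval(w, bits))
              for w in (0.11, 0.37, 0.62, 0.93))
    assert err <= step / 2
```

Unlike unconstrained weights, there is no large-magnitude outlier stretching the scale, so even 4- or 6-bit grids keep the mixing weights accurate.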

aazizyan changed the title from "Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)" to "Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)" on Apr 2, 2026

aazizyan commented Apr 2, 2026

Follow-up: PR #1204 (@msisovic, 1.1063 BPB) independently confirms two findings from this study — attention sharing is free while MLP needs unique weights (they use REPEAT_UNTIE_MLP=full), and shallow recurrence beats deep. Techniques from this PR not yet tested on their stack: Output-LN, Birkhoff mixing, FiLM scale+shift.

