
Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)#855

Open
aazizyan wants to merge 10 commits into openai:main from aazizyan:research/RecurrenceFix_3Loop_Birkhoff_OutputLN_TimestepScale

Conversation


aazizyan commented Mar 26, 2026

Depth Recurrence in Parameter-Constrained Transformers: A Systematic Study

20 ablation runs across 5 series testing 8 techniques for stabilizing depth recurrence under 16MB int8+zlib quantization. Three novel stabilization techniques enable 3-loop recurrence for the first time in competition history. Five additional techniques tested with documented positive and negative results.

Best Results

| Config | Post-Q BPB | Q-gap | Artifact | Note |
| --- | --- | --- | --- | --- |
| 1+4×3+1 full share + FiLM + sinusoidal depth (Run T) | 1.2624 | +0.0073 | 10.7 MB | Best practical config, ~4.8 MB headroom |
| 1+4×2+1 shared attn + unique MLPs (Run L) | 1.2406 | +0.0073 | 14.7 MB | Best absolute, but no room for SOTA |

Techniques That Work

| Technique | Delta | Cost |
| --- | --- | --- |
| Output-LN | −0.007 BPB | Zero |
| Prelude-coda | −0.016 BPB | More unique params |
| Birkhoff mixing | Enables 3-loop stability | Zero |
| Timestep scaling (γ) | Q-gap −26–30% | ~8 KB FP16 |
| FiLM bias (β) | −0.003 BPB | ~8 KB FP16 |
| Sinusoidal depth encoding | Q-gap −0.0005 | Zero (non-persistent buffer) |
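The PR itself ships no code, but the role of Birkhoff mixing can be sketched as a learned convex combination of the incoming hidden state and the shared block's output. The sigmoid-gated form below is an assumption (consistent with the later note that the mixing weights are sigmoid values in [0, 1]), not the actual implementation; `birkhoff_mix` and `alpha_logit` are illustrative names:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def birkhoff_mix(h, block_out, alpha_logit):
    """Convex combination of the incoming hidden state and the block output.

    With weight w = sigmoid(alpha_logit) in [0, 1], each coordinate becomes
    w * block_out + (1 - w) * h, so the mixed state never leaves the segment
    between the two inputs -- a cheap stability guarantee when the same
    block is looped several times.
    """
    w = sigmoid(alpha_logit)
    return [w * b + (1.0 - w) * x for x, b in zip(h, block_out)]

# Looping a toy "block" (here just tanh) 3 times with mixing:
h = [0.5, -1.0, 2.0]
for _ in range(3):
    block_out = [math.tanh(v) for v in h]
    h = birkhoff_mix(h, block_out, alpha_logit=0.0)  # w = 0.5
```

Because the gate is a convex combination, repeated application cannot blow up the hidden-state norm beyond the block's own output range, which is one plausible reading of why it "enables 3-loop stability" at zero parameter cost.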

Techniques That Don't Work (documented negative results)

| Technique | Result | Why |
| --- | --- | --- |
| Learned depth embeddings | +0.0014 BPB worse | Throughput overhead; values stayed near zero |
| Unique input norm gains | +0.0004 BPB worse | MLP gains didn't move from 1.0; redundant with Output-LN |
| Unique MLPs (attn-only sharing) | −0.026 BPB (best result) | Too expensive: 14.7 MB artifact leaves no SOTA headroom |
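For contrast with the input-space controls that failed, the FiLM scale+shift that did help can be sketched as a per-depth channel-wise affine transform. The γ/β shapes and the two-depth example below are illustrative assumptions, not the study's configuration:

```python
def film(h, gamma, beta):
    """Feature-wise linear modulation: per-channel scale and shift.

    gamma and beta are small per-depth vectors (kept in FP16 in this PR's
    setup, ~8 KB total), letting one shared block behave slightly
    differently at each recurrence step.
    """
    return [g * x + b for x, g, b in zip(h, gamma, beta)]

# One (gamma, beta) pair per recurrence depth; identity at depth 0.
depth_params = [
    ([1.0, 1.0], [0.0, 0.0]),     # depth 0: identity
    ([1.1, 0.9], [0.05, -0.05]),  # depth 1: hypothetical learned offsets
]
h = [2.0, -1.0]
for gamma, beta in depth_params:
    h = film(h, gamma, beta)
```

The cost is two small vectors per depth, which is why the table above lists it at ~8 KB rather than the megabytes that unique MLP weights would require.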

Key Findings

  1. Timestep scaling helps quantization, not training — float16 passthrough params bypass int8, reducing Q-gap 26–30% with zero pre-quant BPB effect
  2. MLP needs weight-space differentiation, not input-space modulation — unique MLPs give −0.026 BPB, but cheap input-space controls (norms, depth embeddings) give nothing
  3. ALBERT's finding confirmed at 512d — attention sharing is nearly free; FFN sharing causes most of the degradation
  4. Q-gap scales with training duration — short screening runs underestimate quantization problems by 4–7×
  5. Sinusoidal > learned for depth encoding — zero cost, same Q-gap benefit, 0.0015 BPB better due to throughput savings
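The FP16-passthrough mechanism behind finding 1 can be illustrated with a toy symmetric int8 quantizer; the weight values below are made up for illustration and are not from the study:

```python
import struct

def quantize_int8(values):
    """Symmetric per-tensor int8 round-trip: scale to 127 levels, dequantize.

    Toy version: no clamping or zlib stage, just the rounding error source.
    """
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) * scale for v in values]

weights = [0.013, -0.102, 0.057, 0.249]
deq = quantize_int8(weights)
int8_err = max(abs(a - b) for a, b in zip(weights, deq))

# A passthrough parameter stored in FP16 skips int8 rounding entirely; its
# only error is FP16 representation error, well below the int8 step size
# of max|w| / 127 for these weights.
gamma = 1.37
gamma_fp16 = struct.unpack('e', struct.pack('e', gamma))[0]
fp16_err = abs(gamma - gamma_fp16)
assert fp16_err < int8_err
```

This is why a few kilobytes of FP16 γ/β can shrink the post-quantization gap while leaving pre-quant BPB untouched: the passthrough values simply never see the lossy path.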

Validated Stack for SOTA Integration

Output-LN + Birkhoff mixing + FiLM scale+shift + sinusoidal depth encoding. Total FP16 passthrough: ~50KB. Artifact: ~10.7MB. Headroom for SOTA features: ~4.8MB.
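The "zero cost" claim for the depth encoding follows from it being computable on the fly. A sketch, assuming standard transformer-style sin/cos features of the depth index (the exact frequency schedule used in the runs is not stated in this PR):

```python
import math

def sinusoidal_depth_encoding(depth: int, dim: int):
    """Fixed sin/cos features of the recurrence depth index.

    Recomputed at load time (a non-persistent buffer), so unlike learned
    depth embeddings it contributes nothing to the serialized artifact.
    """
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000.0 ** (i / dim))
        enc.append(math.sin(depth * freq))
        enc.append(math.cos(depth * freq))
    return enc[:dim]

# A distinct, fixed code per loop iteration of the shared block:
codes = [sinusoidal_depth_encoding(d, dim=8) for d in range(3)]
```

Each loop iteration gets a distinct deterministic signature, which gives the shared block the same depth-awareness a learned embedding would, without the embedding table in the artifact.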

20 Runs Across 5 Series

  • Series 1 (7 screening runs): technique isolation on 1×H100
  • Series 2 (5 full-scale runs): 8×H100 validation, Run K = first viable 3-loop (1.2659)
  • Series 3 (4 runs): FiLM bias (−0.003) + attention-only sharing (−0.026 but too expensive)
  • Series 4 (4 runs): learned depth embeddings + unique norms (negative result)
  • Series 5 (1 run): sinusoidal depth encoding (free, marginal Q-gap benefit)

See research_notes.md for theory, 14 citations, and detailed analysis.

Credits

Built on insights from:

@aazizyan (Author)

Some untested directions that might be worth exploring:

  • These three techniques on shallow recurrence (repeat 1-2 layers on the SOTA stack) — the Q-gap reduction from timestep scaling could be meaningful at the frontier
  • Int6/GPTQ interaction with Birkhoff mixing — sigmoid values in [0,1] should quantize cleanly at any bit width
  • Output-LN on non-recurrent models — may help even without weight sharing, since it lets MLP see unnormalized inputs while bounding output
  • Gamma cap ablation (2.0 vs 4.0 vs 8.0) — the cap value was chosen empirically, not optimized
  • QAT combined with Birkhoff + Output-LN + timestep scaling — QAT has been tried for recurrence before, but not with these stabilization techniques in place
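The intuition behind the second bullet — that values confined to [0, 1] quantize cleanly at any bit width — can be checked with a toy uniform quantizer; the sample values are arbitrary:

```python
def quantize_unit_interval(x: float, bits: int) -> float:
    """Uniform quantization of a value in [0, 1] to 2**bits levels."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels

# Sigmoid outputs live in a fixed, known range, so the worst-case error is
# just half a step, 1 / (2 * (2**bits - 1)) -- no outlier-driven scale.
for bits in (8, 6, 4):
    step = 1.0 / ((1 << bits) - 1)
    err = max(abs(w - quantize_unit_interval(w, bits))
              for w in (0.11, 0.37, 0.62, 0.93))
    assert err <= step / 2
```

Unlike unconstrained weights, there is no large-magnitude outlier stretching the scale, so even 4- or 6-bit grids keep the mixing weights accurate.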

aazizyan changed the title from "Non-Record: First Viable 3-Loop Recurrence — Birkhoff + Output-LN + Timestep Scaling (val_bpb=1.2659, 14 eff layers from 6 unique blocks)" to "Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)" on Apr 2, 2026

aazizyan commented Apr 2, 2026

Follow-up: PR #1204 (@msisovic, 1.1063 BPB) independently confirms two findings from this study — attention sharing is free while MLP needs unique weights (they use REPEAT_UNTIE_MLP=full), and shallow recurrence beats deep. Techniques from this PR not yet tested on their stack: Output-LN, Birkhoff mixing, FiLM scale+shift.

