
RecurLoRA: Quantization-Stable Shallow Recurrence with Low-Rank Corrective Adapters#1181

Open
Tanush1912 wants to merge 3 commits into openai:main from
Tanush1912:submission/recur-lora-slope09-qkgain4

Conversation

@Tanush1912

Summary

Why this direction

Weight sharing has consistently failed in this competition due to quantization error accumulation across repeated layers (e.g. PR #363: +4.3 BPB at 3 cycles).

However, PR #686 demonstrated that shallow recurrence (<=2 repeats) remains stable under int6 quantization (~1.1182 BPB), suggesting that limited reuse is viable.

RecurLoRA builds on this by introducing per-pass low-rank corrective adapters:

  • Shared base weights capture global structure
  • Rank-2 LoRA adapters specialize each pass (attention only)
  • RMSNorm + learnable alpha mitigate residual drift

This raises effective depth from 11 to 13 layers without incurring the instability of deep recurrence, reallocating parameters that full layer duplication would consume into additional depth under the fixed 16MB budget.
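The mechanism above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the PR's actual code: the class and variable names, hidden size, and the use of a single shared linear layer standing in for the repeated block are all assumptions; only the rank-2 per-pass adapters, the RMSNorm-before-repeat, and the learnable alpha come from the description.

```python
import torch
import torch.nn as nn


def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Minimal RMSNorm (no learned gain), kept inline so the sketch is self-contained.
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


class RecurLoRABlock(nn.Module):
    """Sketch: shared base weights reused across passes, each pass with its
    own rank-2 LoRA pair, RMSNorm before the repeat, and a learnable alpha
    scaling the correction."""

    def __init__(self, d_model: int, n_passes: int = 2, rank: int = 2):
        super().__init__()
        self.base = nn.Linear(d_model, d_model, bias=False)  # shared base weights
        # Warm init: both A and B drawn from N(0, 1e-3), so every adapter
        # parameter receives gradient from the very first step.
        self.lora_A = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d_model) * 1e-3) for _ in range(n_passes)])
        self.lora_B = nn.ParameterList(
            [nn.Parameter(torch.randn(d_model, rank) * 1e-3) for _ in range(n_passes)])
        self.alpha = nn.Parameter(torch.tensor(0.6))  # learnable correction scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for p in range(len(self.lora_A)):
            h = rms_norm(x)  # normalize before each repeat to limit residual drift
            delta = h @ self.lora_A[p].T @ self.lora_B[p].T  # per-pass rank-2 correction
            x = x + self.base(h) + self.alpha * delta  # residual update
        return x
```

The key design point is that `self.base` appears once but is applied `n_passes` times, while each pass gets its own cheap `(A, B)` pair to break the symmetry between repeats.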

Status

Implementation complete and validated for:

  • Forward/backward correctness
  • Gradient flow across recurrent passes (warm-initialized LoRA: A and B active from step 1)
  • Parameter budget compliance (28KB overhead)

Full training runs (3 seeds + ablations) queued pending compute.

Test plan

Novel contribution: shallow recurrence (layers 4,5 repeated once each)
with rank-2 LoRA corrections on attention projections, RMSNorm before
repeat, and learnable alpha scaling. 13 virtual layers from 11 physical
layers at 28KB (0.18%) parameter overhead.

Hyperparameter changes from PR openai#1179 base (1.1105 BPB):
- NEGATIVE_SLOPE: 0.5 -> 0.9 (validated +0.013 BPB in issue openai#140)
- QK_GAIN_INIT: 1.5 -> 4.0 (validated +0.006 BPB in PR openai#1176)
- TTT_ENABLED: 1 (score-first, legal variant)
- WARMDOWN_ITERS: 4000 (extended from 3500)
- BIGRAM_DIM: 160 (from 112)

Status: WIP - awaiting compute for 3-seed validation runs.
Both A and B matrices now initialized with N(0, 1e-3) instead of one
being zero. This ensures all LoRA parameters receive gradients from
step 1, critical in a 600s training budget where delayed activation
wastes precious optimization steps.
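The gradient-flow claim can be verified directly: with the standard cold start (B = 0) the adapter output is identically zero and A receives no gradient on the first step, whereas warm-initializing both matrices gives every LoRA parameter a nonzero gradient immediately. A small sketch (dimensions arbitrary, not the PR's code):

```python
import torch

torch.manual_seed(0)
d, r = 32, 2
x = torch.randn(8, d)

# Cold start: B = 0, so the loss gradient w.r.t. A (which passes through B) vanishes.
A_cold = torch.randn(r, d, requires_grad=True)
B_cold = torch.zeros(d, r, requires_grad=True)
(x @ A_cold.T @ B_cold.T).sum().backward()
print(A_cold.grad.abs().sum())  # tensor(0.)

# Warm start: both matrices small Gaussian, so gradients reach A and B from step 1.
A_warm = (torch.randn(r, d) * 1e-3).requires_grad_()
B_warm = (torch.randn(d, r) * 1e-3).requires_grad_()
(x @ A_warm.T @ B_warm.T).sum().backward()
print(A_warm.grad.abs().sum() > 0)  # tensor(True)
```

Under a 600s budget this matters: the cold-start variant spends its earliest steps training only B before A can move at all.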

Alpha default raised from 0.4 to 0.6 to amplify early correction
signal.
- Rename submission folder to RecurLoRA_Slope09_QKGain4_TTT
- Rewrite README: lead with architectural contribution, add scaling
  hypothesis, constraint-aware framing, prior failure table
- Fix LoRA gradient flow description (warm-init, not cold-start)
- Update submission.json title to match