Non-Record: Depth Recurrence Research — 20 Ablation Runs, 8 Techniques, 5 Series (best val_bpb=1.2624, 14 eff layers from 6 unique blocks)#855
Status: Open
aazizyan wants to merge 10 commits into openai:main from
Conversation
Author:
Some untested directions that might be worth exploring:
Author:
Follow-up: PR #1204 (@msisovic, 1.1063 BPB) independently confirms two findings from this study: attention sharing is free while the MLP needs unique weights (they use REPEAT_UNTIE_MLP=full), and shallow recurrence beats deep. Techniques from this PR not yet tested on their stack: Output-LN, Birkhoff mixing, FiLM scale+shift.
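The parameter-budget intuition behind "attention sharing is free, MLPs need unique weights" can be sketched with a simple count. All dimensions and names below are illustrative assumptions, not values taken from the PR:

```python
# Hypothetical parameter-count sketch for a 3-loop recurrent block.
# D, FF, and LOOPS are illustrative; the PR's actual dims are in the diff.
D = 512           # model width (assumed)
FF = 4 * D        # MLP hidden width (assumed)
LOOPS = 3         # recurrence depth

attn_params = 4 * D * D      # Q, K, V, O projections (biases omitted)
mlp_params = 2 * D * FF      # up + down projections

# Attention tied across loops, MLPs untied (the REPEAT_UNTIE_MLP=full regime):
shared_attn_untied_mlp = attn_params + LOOPS * mlp_params

# Fully untied baseline: every loop gets its own attention and MLP.
fully_untied = LOOPS * (attn_params + mlp_params)

saved = fully_untied - shared_attn_untied_mlp
print(saved == (LOOPS - 1) * attn_params)  # True: sharing attention saves
                                           # (LOOPS-1) copies of its weights
```

Under this accounting, tying attention recovers (LOOPS-1) full attention blocks' worth of the 16MB budget while, per both studies, costing no quality; the same tying applied to the MLP is what degrades BPB.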
Depth Recurrence in Parameter-Constrained Transformers: A Systematic Study
20 ablation runs across 5 series testing 8 techniques for stabilizing depth recurrence under 16MB int8+zlib quantization. Three novel stabilization techniques enable 3-loop recurrence for the first time in competition history. Five additional techniques tested with documented positive and negative results.
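The "14 effective layers from 6 unique blocks" headline can be illustrated with a toy depth-recurrence loop. The block schedule, dimensions, and residual form below are assumptions for illustration; the PR's actual schedule is defined in the diff:

```python
# Toy sketch of depth recurrence: 14 effective layers from 6 unique blocks.
# The schedule and block internals are illustrative stand-ins, not the PR's code.
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy width

# 6 unique parameter sets, each standing in for one transformer block
weights = [rng.standard_normal((D, D)) * 0.01 for _ in range(6)]

def block(x, w):
    # residual update as a stand-in for attention + MLP
    return x + np.tanh(x @ w)

# One possible 14-step schedule that reuses the 6 unique blocks
schedule = [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 4, 5]

x = rng.standard_normal(D)
for i in schedule:
    x = block(x, weights[i])

print(len(schedule), len(set(schedule)))  # 14 2  <- no: prints 14 6
```

The point of the sketch is only the bookkeeping: parameter count scales with the 6 unique blocks (what gets quantized into the 16MB artifact), while compute and effective depth scale with the 14 schedule steps.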
Best Results
Techniques That Work
Techniques That Don't Work (documented negative results)
Key Findings
Validated Stack for SOTA Integration
Output-LN + Birkhoff mixing + FiLM scale+shift + sinusoidal depth encoding. Total FP16 passthrough: ~50KB. Artifact: ~10.7MB. Headroom for SOTA features: ~4.8MB.
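The four components of the validated stack can be sketched together in a few lines. Everything here is an illustrative reconstruction from the technique names alone (dimensions, the residual placement of Output-LN, the Sinkhorn projection used for Birkhoff mixing, and all variable names are assumptions, not the PR's implementation):

```python
# Hedged sketch: sinusoidal depth encoding -> FiLM scale+shift -> Output-LN,
# plus Sinkhorn-normalized Birkhoff mixing. All details are illustrative.
import numpy as np

D = 8  # toy width

def depth_encoding(step, dim=D):
    # sinusoidal encoding of the loop index (positional encoding over depth)
    i = np.arange(dim // 2)
    freq = 1.0 / (10000.0 ** (2 * i / dim))
    return np.concatenate([np.sin(step * freq), np.cos(step * freq)])

def film(x, enc, Wg, Wb):
    # FiLM: per-feature scale + shift conditioned on the depth encoding
    gamma = 1.0 + enc @ Wg
    beta = enc @ Wb
    return gamma * x + beta

def output_ln(x, eps=1e-5):
    # LayerNorm on the block output, before the residual add
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def sinkhorn(logits, iters=50):
    # Birkhoff mixing: push a logit matrix toward the Birkhoff polytope
    # (doubly stochastic matrices) via alternating row/column normalization
    M = np.exp(logits)
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)
        M /= M.sum(axis=0, keepdims=True)
    return M

rng = np.random.default_rng(0)
Wg = rng.standard_normal((D, D)) * 0.01
Wb = rng.standard_normal((D, D)) * 0.01

x = rng.standard_normal(D)
for step in range(3):                      # 3-loop recurrence
    h = film(x, depth_encoding(step), Wg, Wb)
    x = x + output_ln(np.tanh(h))          # Output-LN on the block output

mix = sinkhorn(rng.standard_normal((3, 3)))
print(np.allclose(mix.sum(axis=0), 1.0))   # True: columns sum to 1
```

The FiLM projections and the mixing logits are the only FP16 passthrough tensors in this sketch, which is consistent with the ~50KB figure above: conditioning parameters are tiny relative to the block weights they modulate.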
20 Runs Across 5 Series
See research_notes.md for theory, 14 citations, and detailed analysis.
Credits
Built on insights from: