Record: Combined 3-Layer Recurrence + Parallel Residuals + Polar Express + Brotli — val_bpb 1.1067 (3-seed mean) #1396

Closed

erichroepke wants to merge 1 commit into openai:main from erichroepke:record/combined-3layer-recur-parallel-resid-brotli

Conversation

@erichroepke

Summary

  • val_bpb: 1.1067 (3-seed mean, std 0.0013) — beats SOTA (1.1147) by 0.008 BPB
  • Artifact: 13.87 MB (2.13 MB headroom under 16MB cap)
  • Clean submission — no TTT, no SLOT, no n-gram cache
  • 8×H100 SXM, 600s training

Results

Seed   Sliding BPB   Artifact (bytes)
1337   1.1080        13,866,319
42     1.1055        13,871,505
2025   1.1067        13,861,924
Mean   1.1067

What This Is

I'm a documentary filmmaker, not an ML engineer. I used Claude Opus 4.6 as a co-author to systematically analyze all open PRs in the competition, identified that @Omrigotlieb's #1344 and @dexhunter's #1392 each had techniques the other was missing, and merged them into a single stack that neither had tested.

The strategic decisions were mine. The code comprehension and merge engineering were AI-assisted. This is my first ML submission of any kind.

Novel Contribution

First submission combining 3-layer depth recurrence (from #1344) with parallel residuals (from #1392); neither PR tested this combination. The 2.13 MB of unused artifact headroom is flagged as a future optimization opportunity.
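A minimal sketch of how the two merged techniques interact in a forward pass. All names here (`forward`, `recur_layers`, `parallel_start_layer`, the toy `attn`/`mlp` callables) are illustrative, not the PR's actual code: layers in `RECUR_LAYERS` are re-applied with shared weights, and layers at or past `PARALLEL_START_LAYER` use a parallel residual instead of a sequential one.

```python
# Illustrative sketch (hypothetical names, not the PR's code):
# depth recurrence re-applies selected weight-shared layers, and
# parallel residuals let attn and mlp read the same input.

def block_sequential(x, attn, mlp):
    # standard residual block: the mlp sees the attention output
    x = x + attn(x)
    x = x + mlp(x)
    return x

def block_parallel(x, attn, mlp):
    # parallel residual: both branches read the same input
    return x + attn(x) + mlp(x)

def forward(x, blocks, recur_layers=(3, 4, 5), recur_steps=2,
            parallel_start_layer=7):
    for i, (attn, mlp) in enumerate(blocks):
        steps = recur_steps if i in recur_layers else 1
        for _ in range(steps):  # weight-shared re-application
            if i >= parallel_start_layer:
                x = block_parallel(x, attn, mlp)
            else:
                x = block_sequential(x, attn, mlp)
    return x
```

With 8 blocks and `recur_steps=2`, layers 3–5 run twice, so 11 block applications happen in total even though only 8 layers' worth of weights exist.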

Techniques Combined

Technique                     Source               Setting
3-layer depth recurrence      @Omrigotlieb #1344   RECUR_LAYERS=3,4,5
Parallel residuals            @dexhunter #1392     PARALLEL_START_LAYER=7
Polar Express Newton-Schulz   @Omrigotlieb #1344   4 minimax-optimal steps
MuonEq-R                      #1344                Row-norm before NS
Brotli + byte-shuffle         @dexhunter #1392     Replaces lzma
QK-Gain 5.0                   @dexhunter #1392     Per-head gain
WD=0.105                      @Omrigotlieb #1344   Higher WD for compression
Full Hessian GPTQ int6        Both                 Standard
No TTT                        Clean                Removed for compliance
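For readers unfamiliar with the Newton-Schulz entry above: it is an iteration that pushes a matrix's singular values toward 1, i.e. orthogonalizes it. Polar Express (#1344) replaces the fixed coefficients with minimax-optimal per-step ones; the sketch below only shows the base cubic iteration with the classic (1.5, −0.5) coefficients, in pure Python on a 2×2 matrix.

```python
# Plain cubic Newton-Schulz orthogonalization, pure Python, 2x2.
# Polar Express uses minimax-optimal per-step coefficients instead of
# the fixed (1.5, -0.5) here; this only illustrates the base iteration.

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def newton_schulz(x, steps=8):
    # X <- 1.5*X - 0.5*(X X^T) X drives every singular value toward 1,
    # provided the input's spectral norm is below sqrt(3).
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(2)]
             for i in range(2)]
    return x
```

On a diagonal input the iteration acts independently on each singular value, which makes the convergence easy to check by hand.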

Reproduction

VOCAB_SIZE=1024 QK_GAIN_INIT=5.0 RECUR_LAYERS="3,4,5" \
PARALLEL_START_LAYER=7 MUON_WD=0.105 MUON_EQ_R=1 \
SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py
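The "Brotli + byte-shuffle" row deserves a sketch, since the shuffle is what makes the entropy coder effective on quantized weights: transposing an array of fixed-size items groups byte 0 of every item together, then byte 1, and so on, so similar bytes become adjacent. Brotli itself is not in the Python stdlib, so this hedged sketch uses zlib as a stand-in compressor; `byte_shuffle` and `pack` are hypothetical names, not the PR's code.

```python
import zlib  # stand-in compressor; the PR uses Brotli (not stdlib)

def byte_shuffle(data: bytes, itemsize: int) -> bytes:
    # Transpose an array of fixed-size items: byte 0 of every item
    # first, then byte 1, etc. Similar bytes end up adjacent, which
    # helps the entropy coder on quantized weight arrays.
    assert len(data) % itemsize == 0
    return bytes(data[i + j] for j in range(itemsize)
                 for i in range(0, len(data), itemsize))

def pack(data: bytes, itemsize: int) -> bytes:
    # shuffle, then compress the byte-transposed stream
    return zlib.compress(byte_shuffle(data, itemsize), level=9)
```

For example, three little-endian 16-bit values sharing a high byte (`01 aa 02 aa 03 aa`) shuffle to `01 02 03 aa aa aa`, giving the compressor a run of identical bytes to exploit.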

Test plan

  • 3-seed validation (1337, 42, 2025)
  • All artifacts under 16,000,000 bytes (max: 13,871,505)
  • Clean — no TTT, no SLOT, no n-gram cache
  • Beats merged SOTA by 0.008 BPB

Credits

The techniques belong to the people who invented them. I combined their work.

🤖 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>


Robby955 commented Apr 6, 2026

Your submission score is higher (worse) than those of the PRs (#1344, #1392) you cite, meaning the combination actually made the model worse, not better.

@erichroepke
Author

Withdrawing — ran with SP1024 instead of SP4096, which negated the combination gains. Will resubmit with proper SP4096 data once generated. Thanks for the feedback.

@erichroepke erichroepke closed this Apr 6, 2026
@erichroepke erichroepke deleted the record/combined-3layer-recur-parallel-resid-brotli branch April 6, 2026 04:17
@erichroepke
Author

Hey — just submitted an updated version as PR #1416 (1.07948 BPB). This one combines @clarkkev's #1394 base with @stukenov's #1364 pre-quant TTT. Supersedes this submission. Thanks!
