Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean) #1289
Open
MatoTeziTanka wants to merge 3 commits into openai:main
Conversation
Fork landing page now introduces The Agora with links to the live site, issue templates, and discussions. Original OpenAI README preserved below. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub only discovers workflows from the default branch (main). The workflow checks out gh-pages and runs the pipeline there. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + Legal TTT — val_bpb 1.0819

3-seed mean: 1.0819 BPB (std: 0.00088). Seeds: 42=1.0808, 1337=1.0829, 2024=1.0821.

Integration of community techniques with full attribution:
- Base: PRs openai#549, openai#1019 by @abaybektursun
- Scylla tokenizer: PR openai#1143 by @simon-marcus
- Parallel residuals + depth recurrence: PR openai#1204 by @msisovic
- Legal TTT: PR openai#461 by @Christopher-Lee-McClendon

Our engineering: mixed INT5/INT6 quantization, learnable lane merge, Scylla retokenization pipeline, integration work, CPU e2e test suite.
Result
val_bpb: 1.0819 (3-seed mean, std: 0.00088) | Scylla tokenizer (998 tokens) | 8×H100 SXM
What We Built
This is the 8th submission in the PROTEUS series — an iterative engineering effort across 7 prior PRs (#95, #368, #512, #568, #633, #769, #1274), and documented negative results (INT4 catastrophic, depth recurrence overhead, SWA alone, output-level bigram tables). Each failure informed the next attempt.
Original Engineering in This Submission
- **Sensitivity-driven mixed INT5/INT6 quantization** — dynamic per-layer quantization selection via the `N_INT6_LAYERS` control: INT5 for middle MLP layers, INT6 for attention plus the first/last MLP layers. Our `mixed_quantize_int6()` extends the community function with an `int4_cats` parameter and sensitivity-aware layer routing not present in prior implementations (code: lines 2021-2061, 2533-2573).
- **Learnable lane merge + `resid_mix_mlp`** — we added two learnable parameters on top of the parallel residuals architecture: a scalar `lane_merge` (line 1105) that blends the attention and MLP residual streams, and a per-dimension `resid_mix_mlp` (line 946) that routes MLP vs. attention inputs. Neither parameter exists in the source parallel residuals PR. These additions let the model learn its own mixing strategy rather than using fixed routing.
- **Scylla retokenization pipeline** — a complete pipeline to convert SP1024 FineWeb shards to the Scylla (TokenMonster) vocabulary, with chunked decode/re-encode and validation. PR #1143 (Record: Scylla novel tokenizer + Legal Score-First TTT, val_bpb 1.08056553) introduced the Scylla tokenizer but did not include a retokenization tool; we wrote one from scratch.
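A minimal sketch of the sensitivity-driven routing idea (this is an illustration, not the PR's `mixed_quantize_int6()`; the edge-layer heuristic and the symmetric fake-quantization are our assumptions):

```python
import torch

def route_bits(layer_idx: int, n_layers: int, n_int6_layers: int) -> int:
    # Sensitivity-aware routing sketch: keep INT6 at the edges (first/last
    # layers, which tolerate quantization error worst) and drop the middle
    # layers to INT5. `n_int6_layers` mirrors the PR's N_INT6_LAYERS knob.
    edge = max(1, n_int6_layers // 2)
    return 6 if layer_idx < edge or layer_idx >= n_layers - edge else 5

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor fake quantization onto a signed `bits`-wide grid.
    qmax = 2 ** (bits - 1) - 1            # 31 for INT6, 15 for INT5
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale
```

With an 11-layer model and a budget of 4 INT6 layers, the two outermost layers on each side stay at INT6 and the middle seven fall back to INT5.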
- **Integration engineering** — getting parallel residuals and depth recurrence (PR #1204, Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA), legal TTT (PR #461, Non-record: 11L Depth Recurrence + High-Yield Legal TTT, 1.14458 BPB), the Scylla tokenizer (PR #1143, Record: Scylla novel tokenizer + Legal Score-First TTT, val_bpb 1.08056553), and the base architecture (PR #549, Record: LeakyReLU² + Legal Score-First TTT + Parallel Muon, val_bpb 1.1194 3-seed mean; PR #1019, Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112, val_bpb 1.11473 3-seed mean) to work together in a single training run required solving compatibility issues across 5 independent codebases. This is non-trivial systems work.
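The learnable mixing described above can be sketched as a small module (parameter names follow the PR; the sigmoid gating and zero initialization are our assumptions, not the submitted implementation):

```python
import torch
import torch.nn as nn

class LaneMix(nn.Module):
    # Sketch of the two learnable parameters added on top of parallel
    # residuals: a scalar `lane_merge` that blends the attention and MLP
    # residual streams, and a per-dimension `resid_mix_mlp` that routes
    # the MLP input between the two lanes.
    def __init__(self, dim: int):
        super().__init__()
        self.lane_merge = nn.Parameter(torch.zeros(()))      # scalar blend
        self.resid_mix_mlp = nn.Parameter(torch.zeros(dim))  # per-dim gate

    def mlp_input(self, attn_lane: torch.Tensor, mlp_lane: torch.Tensor):
        # Zero init -> sigmoid = 0.5, i.e. an even 50/50 split at start.
        g = torch.sigmoid(self.resid_mix_mlp)
        return g * mlp_lane + (1 - g) * attn_lane

    def merge(self, attn_lane: torch.Tensor, mlp_lane: torch.Tensor):
        a = torch.sigmoid(self.lane_merge)
        return a * attn_lane + (1 - a) * mlp_lane
```

Because both gates are learned, the model can drift anywhere between pure sequential routing and fully parallel lanes instead of committing to a fixed blend.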
- **CPU e2e test suite** — 10 automated test cases: import validation, hyperparameter extraction, model creation, forward pass, code size, quantization+artifact size, step time projection, quantization MSE analysis, scale timing benchmark, and weight distribution analysis. Full pre-flight before any GPU spend.
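The pre-flight idea is cheap invariant checks on CPU before renting GPUs. A two-check sketch (the real suite has 10 cases; `run_preflight`, its signature, and its checks are illustrative):

```python
import torch

def run_preflight(model, vocab_size: int = 998, seq_len: int = 32) -> bool:
    # CPU sanity checks before any GPU spend: the model must accept a batch
    # of token ids, emit logits of the right shape, and stay finite.
    x = torch.randint(0, vocab_size, (1, seq_len))
    with torch.no_grad():
        logits = model(x)
    assert logits.shape == (1, seq_len, vocab_size), f"bad shape {logits.shape}"
    assert torch.isfinite(logits).all(), "non-finite logits in CPU forward"
    return True
```

Running this against the full model takes seconds on a laptop and catches shape bugs, NaNs, and import errors that would otherwise burn paid GPU hours.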
- **Controlled slope experiment** — a 7-point LeakyReLU negative-slope sweep (0.1–0.9) under identical conditions showing monotonic improvement, posted on issue #140. A follow-up A/B test on this architecture showed slope=0.9 is 0.0054 BPB worse with parallel residuals — the parallel lanes prefer more aggressive gating at 0.5. We published the negative result too.
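For reference, one plausible reading of the swept activation (the squaring here does not preserve sign, matching the common ReLU² convention from prior speedrun PRs; the exact form in PROTEUS is our assumption):

```python
import torch
import torch.nn as nn

class LeakyReLUSquared(nn.Module):
    # LeakyReLU(negative_slope) followed by squaring. The slope is the knob
    # swept in the 7-point experiment (0.1 to 0.9); sign handling assumed.
    def __init__(self, slope: float = 0.5):
        super().__init__()
        self.act = nn.LeakyReLU(negative_slope=slope)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(x) ** 2
```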
Community Infrastructure
Builds On (full attribution)
This submission integrates techniques from the community, and every component is credited.
Note on Scylla: PR #1143 was closed by the author after byte-accounting errors (~4-6% BPB inflation). Our implementation is verified immune: `base_bytes[i] = len(token_i.encode('utf-8'))` for all 998 tokens, `has_leading_space` and `is_boundary_token` are both all-False, and the five zero-byte tokens are correctly handled. All 5 eval functions use identical byte-counting logic.

Architecture
11L/512d/8H/4KV, MLP 3× LeakyReLU(0.5)², XSA last 4, Partial RoPE 16d, LN Scale, BigramHash, SmearGate, VE 128d (layers 9-10), EMA 0.997, QAT, Mixed INT5/INT6+LZMA, Muon optimizer, Parallel Residuals (from layer 7), Mini Depth Recurrence (layers 4-5, from step 3000), Legal Score-First TTT. Scylla tokenizer (998 tokens, TokenMonster-derived).
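The byte-accounting guarantee from the Scylla note above can be checked mechanically. A minimal sketch (the function name, signature, and list-of-strings vocabulary representation are our assumptions):

```python
def check_base_bytes(vocab, claimed):
    # Return indices where a claimed per-token byte count disagrees with the
    # UTF-8 ground truth. Any non-empty result signals the class of
    # byte-accounting error that inflated BPB in the original Scylla PR;
    # zero-byte tokens are legal and simply must claim 0.
    return [
        i for i, (tok, b) in enumerate(zip(vocab, claimed))
        if b != len(tok.encode("utf-8"))
    ]
```

Run once over the 998-token vocabulary before evaluation; an empty list means every BPB denominator is grounded in real UTF-8 byte lengths.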
Compliance (per #677)
- N-gram tables disabled by default (`NGRAM_ENABLED` defaults to `0`)

Note on PR #1274
The prior submission was self-closed due to incorrect attribution that misrepresented community techniques as our own. This resubmission corrects that with explicit provenance for every integrated component.
Platform
RunPod 8×H100 80GB SXM, PyTorch 2.11.0+cu128.
Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.