
Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)#1289

Open

MatoTeziTanka wants to merge 3 commits into openai:main from MatoTeziTanka:proteus-v16-scylla

Conversation


MatoTeziTanka commented Apr 3, 2026

Result

val_bpb: 1.0819 (3-seed mean, std: 0.00088) | Scylla tokenizer (998 tokens) | 8×H100 SXM

| Seed | Sliding Window BPB | Roundtrip BPB | Steps | Train Time |
| ---- | ------------------ | ------------- | ----- | ---------- |
| 42   | 1.08075 | 1.10284 | 5,884 | 600.1 s |
| 1337 | 1.08289 | 1.10489 | 5,905 | 600.0 s |
| 2024 | 1.08213 | 1.10421 | 5,894 | 600.0 s |

What We Built

This is the 8th submission in the PROTEUS series — an iterative engineering effort across 7 prior PRs (#95, #368, #512, #568, #633, #769, #1274) and documented negative results (catastrophic INT4, depth-recurrence overhead, SWA alone, output-level bigram tables). Each failure informed the next attempt.

Original Engineering in This Submission

  1. Sensitivity-driven mixed INT5/INT6 quantization — Dynamic per-layer quantization selection via N_INT6_LAYERS control. INT5 for middle MLP layers, INT6 for attention + first/last MLP layers. Our mixed_quantize_int6() extends the community function with int4_cats parameter and sensitivity-aware layer routing not present in prior implementations (code: lines 2021-2061, 2533-2573).
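The routing idea can be sketched as follows. This is a simplified stand-in, not the submission's `mixed_quantize_int6()`: the function names and the exact sensitivity heuristic are illustrative assumptions.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization to a signed `bits`-bit grid; returns codes and scale."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for INT5, 31 for INT6
    scale = float(np.abs(w).max()) / qmax or 1.0      # guard all-zero tensors
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def mixed_quantize(layers, n_layers):
    """Sensitivity-aware routing: attention and the first/last MLP layers keep
    INT6; middle MLP layers drop to INT5."""
    out = {}
    for name, w in layers.items():
        idx = int(name.split(".")[1])
        is_middle_mlp = name.endswith("mlp") and 0 < idx < n_layers - 1
        bits = 5 if is_middle_mlp else 6
        q, scale = quantize_symmetric(w, bits)
        out[name] = (q, scale, bits)
    return out
```

The design point is that the bit-width decision is a pure function of layer position, so the artifact size is deterministic and can be budgeted before training.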

  2. Learnable lane merge + resid_mix_mlp — We added two learnable parameters on top of the parallel residuals architecture: a scalar lane_merge (line 1105) that blends attention and MLP residual streams, and a per-dimension resid_mix_mlp (line 946) that routes MLP vs attention inputs. Neither parameter exists in the source parallel residuals PR. These are our additions that let the model learn its own mixing strategy rather than using fixed routing.
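A minimal sketch of the two mixers described above. `lane_merge` and `resid_mix_mlp` are the parameter names from this PR; the linear lanes are placeholders for the real attention/MLP blocks, and the sigmoid gating is an assumption about how the blend is bounded.

```python
import torch
import torch.nn as nn

class ParallelLanes(nn.Module):
    """Parallel attention/MLP residual lanes with two learnable mixers."""
    def __init__(self, dim):
        super().__init__()
        self.attn_lane = nn.Linear(dim, dim)                 # placeholder lane
        self.mlp_lane = nn.Linear(dim, dim)                  # placeholder lane
        self.lane_merge = nn.Parameter(torch.tensor(0.0))    # scalar blend
        self.resid_mix_mlp = nn.Parameter(torch.zeros(dim))  # per-dim routing

    def forward(self, resid_attn, resid_mlp):
        # route each channel of the MLP input between the two residual streams
        mix = torch.sigmoid(self.resid_mix_mlp)
        mlp_in = mix * resid_mlp + (1 - mix) * resid_attn
        a = self.attn_lane(resid_attn)
        m = self.mlp_lane(mlp_in)
        # learned scalar blend of the lane outputs
        g = torch.sigmoid(self.lane_merge)
        return g * a + (1 - g) * m
```

Initializing both parameters at zero starts the model at an even 50/50 blend, so gradient descent only moves the routing away from neutral if it helps.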

  3. Scylla retokenization pipeline — Complete pipeline to convert SP1024 FineWeb shards to the Scylla (TokenMonster) vocabulary. Chunked decode/re-encode with validation. PR Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553) #1143 introduced the Scylla tokenizer but did not include a retokenization tool — we wrote one from scratch.
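The decode/re-encode loop with roundtrip validation reduces to something like the sketch below. The tokenizer callables are hypothetical stand-ins for the SP1024 decoder and the Scylla (TokenMonster) encoder/decoder.

```python
def retokenize(src_ids, src_decode, dst_encode, dst_decode, chunk=4096):
    """Chunked decode/re-encode with validation: decode a span of source-vocab
    ids to text, re-encode under the target vocab, and verify the target
    decode reproduces the text exactly before keeping the chunk."""
    out = []
    for i in range(0, len(src_ids), chunk):
        text = src_decode(src_ids[i:i + chunk])
        ids = dst_encode(text)
        if dst_decode(ids) != text:
            raise ValueError(f"lossy re-encode in chunk starting at id {i}")
        out.extend(ids)
    return out
```

Note that a real pipeline must also place chunk boundaries on source-token boundaries (as done here by slicing ids, not bytes) so multi-byte characters are never split mid-sequence.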

  4. Integration engineering — Getting parallel residuals and depth recurrence (#1204), legal TTT (#461), the Scylla tokenizer (#1143), and the base architecture (#549, #1019) to work together in a single training run required solving compatibility issues across 5 independent codebases. This is non-trivial systems work.

  5. CPU e2e test suite — 10 automated test cases: import validation, hyperparameter extraction, model creation, forward pass, code size, quantization+artifact size, step time projection, quantization MSE analysis, scale timing benchmark, and weight distribution analysis. Full pre-flight before any GPU spend.
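Two of the cheapest gates from such a suite can be sketched as below; this is an illustration of the pre-flight idea under assumed names and thresholds, not the submission's harness.

```python
def preflight(artifact_bytes, projected_step_s, target_steps,
              budget_bytes=16 * 1024 * 1024, wallclock_s=600.0):
    """CPU pre-flight gates run before any GPU spend: artifact-size budget
    and projected-steps-fit-wallclock. Returns (ok, list of failed checks)."""
    checks = {
        "artifact_under_16mb": artifact_bytes <= budget_bytes,
        "steps_fit_wallclock": projected_step_s * target_steps <= wallclock_s,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed, failed
```

Running checks like these on CPU first turns a failed GPU run (wasted spend) into a failed assertion (free).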

  6. Controlled slope experiment — 7-point LeakyReLU negative slope sweep (0.1–0.9) under identical conditions showing monotonic improvement, posted on issue #140. Follow-up A/B test on this architecture showed slope=0.9 is 0.0054 BPB worse with parallel residuals — the parallel lanes prefer more aggressive gating at 0.5. Published the negative result too.
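For reference, one plausible sign-preserving reading of the LeakyReLU(slope)² activation swept above — the exact definition lives in PR #493, so treat this form as an assumption:

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU followed by a signed square: magnitudes are squared while
    the sign is kept, so negative pre-activations stay distinguishable."""
    y = F.leaky_relu(x, negative_slope=slope)
    return torch.sign(y) * y * y
```

Under this form the slope enters the negative branch quadratically (output -(slope·x)² for x < 0), which is one reason small slope changes can have outsized effects in a sweep.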

Community Infrastructure


Builds On (full attribution)

This submission integrates techniques from the community. Every component is credited:

| Component | What We Used | PR | Original Author |
| --- | --- | --- | --- |
| Base architecture | LeakyReLU² + Parallel Muon | #549 | @abaybektursun |
| GPTQ + XSA + BigramHash | AR Self-Gen GPTQ calibration | #1019 | @abaybektursun |
| Scylla tokenizer | TokenMonster-derived 998-token vocab | #1143 | @simon-marcus |
| Parallel residuals + depth recurrence | Separate attn/MLP lanes from layer 7 | #1204 | @msisovic |
| Legal TTT framework | Score-first SGD with frozen early blocks | #461 | @Christopher-Lee-McClendon |
| LeakyReLU² activation | Squared LeakyReLU | #493 | @parinzee |
| XSA | Exclusive Self-Attention | #265 | @unnir |
| BigramHash + SmearGate | Hash bigram embeddings + gate | #102 | @unnir |
| Value Embeddings | VE128 on deep layers | #414 | @signalrush |

Note on Scylla: PR #1143 was closed by the author after byte-accounting errors (~4-6% BPB inflation). Our implementation is verified against that failure mode: base_bytes[i] = len(token_i.encode('utf-8')) for all 998 tokens, has_leading_space and is_boundary_token are both all-False, and the five zero-byte tokens are handled correctly. All 5 eval functions use identical byte-counting logic.
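The byte accounting reduces to the following (the toy vocabulary stands in for the 998-token Scylla vocab):

```python
import math

def build_byte_table(vocab):
    """Per-token UTF-8 byte counts used as the BPB denominator; tokens that
    decode to the empty string legitimately contribute zero bytes."""
    return [len(tok.encode("utf-8")) for tok in vocab]

def bits_per_byte(total_nll_nats, base_bytes, token_ids):
    """BPB = total negative log-likelihood in nats / (ln 2 * total bytes)."""
    total_bytes = sum(base_bytes[t] for t in token_ids)
    return total_nll_nats / (math.log(2) * total_bytes)
```

The ~4-6% inflation class of bug comes from counting characters (or double-counting leading-space markers) instead of UTF-8 bytes in the denominator; deriving `base_bytes` directly from `encode('utf-8')` makes that mistake impossible by construction.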


Architecture

11L/512d/8H/4KV, MLP 3× LeakyReLU(0.5)², XSA last 4, Partial RoPE 16d, LN Scale, BigramHash, SmearGate, VE 128d (layers 9-10), EMA 0.997, QAT, Mixed INT5/INT6+LZMA, Muon optimizer, Parallel Residuals (from layer 7), Mini Depth Recurrence (layers 4-5, from step 3000), Legal Score-First TTT. Scylla tokenizer (998 tokens, TokenMonster-derived).

Compliance (per #677)

  • 8×H100 SXM training
  • 10-minute wallclock (600s)
  • Artifact ≤ 16 MB — Verified. Raw serialized model is byte-identical across runs (112,437,059 bytes). With mixed_quant+brotli compression: 15,006,141 bytes (seed 42), 15,012,088 (seed 1337), 15,019,048 (seed 2024). All under budget. Architecture config confirmed identical between measured runs and this submission (30,126,957 params, same quantization scheme).
  • No n-gram cache at eval (NGRAM_ENABLED defaults to 0)
  • No two-pass rescoring
  • Score-first TTT (tokens scored before weight update)
  • Autoregressive eval (causal)
  • 3-seed validation (42: 1.0808, 1337: 1.0829, 2024: 1.0821)

Note on PR #1274

Prior submission was self-closed due to incorrect attribution that misrepresented community techniques as our own. This resubmission corrects that with explicit provenance for every integrated component.

Platform

RunPod 8×H100 80GB SXM, PyTorch 2.11.0+cu128.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

Mato and others added 3 commits March 29, 2026 10:23
Fork landing page now introduces The Agora with links to the live site,
issue templates, and discussions. Original OpenAI README preserved below.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub only discovers workflows from the default branch (main).
The workflow checks out gh-pages and runs the pipeline there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + Legal TTT — val_bpb 1.0819

3-seed mean: 1.0819 BPB (std: 0.00088)
Seeds: 42=1.0808, 1337=1.0829, 2024=1.0821

Integration of community techniques with full attribution:
- Base: PR openai#549, openai#1019 by @abaybektursun
- Scylla tokenizer: PR openai#1143 by @simon-marcus
- Parallel residuals + depth recurrence: PR openai#1204 by @msisovic
- Legal TTT: PR openai#461 by @Christopher-Lee-McClendon

Our engineering: mixed INT5/INT6 quantization, learnable lane merge,
Scylla retokenization pipeline, integration work, CPU e2e test suite.
