
Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)#1289

Open

MatoTeziTanka wants to merge 3 commits into openai:main from MatoTeziTanka:proteus-v16-scylla

Conversation


MatoTeziTanka commented Apr 3, 2026

Result

val_bpb: 1.0819 (3-seed mean, std: 0.00088) | Scylla tokenizer (998 tokens) | 8×H100 SXM

| Seed | Sliding Window BPB | Roundtrip BPB | Steps | Train Time |
| ---- | ------------------ | ------------- | ----- | ---------- |
| 42   | 1.08075 | 1.10284 | 5,884 | 600.1 s |
| 1337 | 1.08289 | 1.10489 | 5,905 | 600.0 s |
| 2024 | 1.08213 | 1.10421 | 5,894 | 600.0 s |

What We Built

This is the 8th submission in the PROTEUS series — an iterative engineering effort across 7 prior PRs (#95, #368, #512, #568, #633, #769, #1274) and documented negative results (catastrophic INT4, depth-recurrence overhead, SWA alone, output-level bigram tables). Each failure informed the next attempt.

Original Engineering in This Submission

  1. Sensitivity-driven mixed INT5/INT6 quantization — Dynamic per-layer quantization selection via N_INT6_LAYERS control. INT5 for middle MLP layers, INT6 for attention + first/last MLP layers. Our mixed_quantize_int6() extends the community function with int4_cats parameter and sensitivity-aware layer routing not present in prior implementations (code: lines 2021-2061, 2533-2573).
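The routing idea can be sketched as follows. This is a simplified stand-in, not the submission's `mixed_quantize_int6()`: the function names and the exact sensitivity heuristic are illustrative assumptions.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization to a signed `bits`-bit grid; returns codes and scale."""
    qmax = 2 ** (bits - 1) - 1                        # 15 for INT5, 31 for INT6
    scale = float(np.abs(w).max()) / qmax or 1.0      # guard all-zero tensors
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def mixed_quantize(layers, n_layers):
    """Sensitivity-aware routing: attention and the first/last MLP layers keep
    INT6; middle MLP layers drop to INT5."""
    out = {}
    for name, w in layers.items():
        idx = int(name.split(".")[1])
        is_middle_mlp = name.endswith("mlp") and 0 < idx < n_layers - 1
        bits = 5 if is_middle_mlp else 6
        q, scale = quantize_symmetric(w, bits)
        out[name] = (q, scale, bits)
    return out
```

The design point is that the bit-width decision is a pure function of layer position, so the artifact size is deterministic and can be budgeted before training.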

  2. Learnable lane merge + resid_mix_mlp — We added two learnable parameters on top of the parallel residuals architecture: a scalar lane_merge (line 1105) that blends attention and MLP residual streams, and a per-dimension resid_mix_mlp (line 946) that routes MLP vs attention inputs. Neither parameter exists in the source parallel residuals PR. These are our additions that let the model learn its own mixing strategy rather than using fixed routing.
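A minimal sketch of the two mixers described above. `lane_merge` and `resid_mix_mlp` are the parameter names from this PR; the linear lanes are placeholders for the real attention/MLP blocks, and the sigmoid gating is an assumption about how the blend is bounded.

```python
import torch
import torch.nn as nn

class ParallelLanes(nn.Module):
    """Parallel attention/MLP residual lanes with two learnable mixers."""
    def __init__(self, dim):
        super().__init__()
        self.attn_lane = nn.Linear(dim, dim)                 # placeholder lane
        self.mlp_lane = nn.Linear(dim, dim)                  # placeholder lane
        self.lane_merge = nn.Parameter(torch.tensor(0.0))    # scalar blend
        self.resid_mix_mlp = nn.Parameter(torch.zeros(dim))  # per-dim routing

    def forward(self, resid_attn, resid_mlp):
        # route each channel of the MLP input between the two residual streams
        mix = torch.sigmoid(self.resid_mix_mlp)
        mlp_in = mix * resid_mlp + (1 - mix) * resid_attn
        a = self.attn_lane(resid_attn)
        m = self.mlp_lane(mlp_in)
        # learned scalar blend of the lane outputs
        g = torch.sigmoid(self.lane_merge)
        return g * a + (1 - g) * m
```

Initializing both parameters at zero starts the model at an even 50/50 blend, so gradient descent only moves the routing away from neutral if it helps.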

  3. Scylla retokenization pipeline — Complete pipeline to convert SP1024 FineWeb shards to the Scylla (TokenMonster) vocabulary. Chunked decode/re-encode with validation. PR Record: Scylla (novel tokenizer) + Legal Score-First TTT (val_bpb: 1.08056553) #1143 introduced the Scylla tokenizer but did not include a retokenization tool — we wrote one from scratch.
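The decode/re-encode loop with roundtrip validation reduces to something like the sketch below. The tokenizer callables are hypothetical stand-ins for the SP1024 decoder and the Scylla (TokenMonster) encoder/decoder.

```python
def retokenize(src_ids, src_decode, dst_encode, dst_decode, chunk=4096):
    """Chunked decode/re-encode with validation: decode a span of source-vocab
    ids to text, re-encode under the target vocab, and verify the target
    decode reproduces the text exactly before keeping the chunk."""
    out = []
    for i in range(0, len(src_ids), chunk):
        text = src_decode(src_ids[i:i + chunk])
        ids = dst_encode(text)
        if dst_decode(ids) != text:
            raise ValueError(f"lossy re-encode in chunk starting at id {i}")
        out.extend(ids)
    return out
```

Note that a real pipeline must also place chunk boundaries on source-token boundaries (as done here by slicing ids, not bytes) so multi-byte characters are never split mid-sequence.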

  4. Integration engineering — Getting parallel residuals and depth recurrence (#1204), legal TTT (#461), the Scylla tokenizer (#1143), and the base architecture (#549, #1019) to work together in a single training run required solving compatibility issues across 5 independent codebases. This is non-trivial systems work.

  5. CPU e2e test suite — 10 automated test cases: import validation, hyperparameter extraction, model creation, forward pass, code size, quantization+artifact size, step time projection, quantization MSE analysis, scale timing benchmark, and weight distribution analysis. Full pre-flight before any GPU spend.
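Two of the cheapest gates from such a suite can be sketched as below; this is an illustration of the pre-flight idea under assumed names and thresholds, not the submission's harness.

```python
def preflight(artifact_bytes, projected_step_s, target_steps,
              budget_bytes=16 * 1024 * 1024, wallclock_s=600.0):
    """CPU pre-flight gates run before any GPU spend: artifact-size budget
    and projected-steps-fit-wallclock. Returns (ok, list of failed checks)."""
    checks = {
        "artifact_under_16mb": artifact_bytes <= budget_bytes,
        "steps_fit_wallclock": projected_step_s * target_steps <= wallclock_s,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed, failed
```

Running checks like these on CPU first turns a failed GPU run (wasted spend) into a failed assertion (free).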

  6. Controlled slope experiment — 7-point LeakyReLU negative slope sweep (0.1–0.9) under identical conditions showing monotonic improvement, posted on issue #140. Follow-up A/B test on this architecture showed slope=0.9 is 0.0054 BPB worse with parallel residuals — the parallel lanes prefer more aggressive gating at 0.5. Published the negative result too.
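For reference, one plausible sign-preserving reading of the LeakyReLU(slope)² activation swept above — the exact definition lives in PR #493, so treat this form as an assumption:

```python
import torch
import torch.nn.functional as F

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU followed by a signed square: magnitudes are squared while
    the sign is kept, so negative pre-activations stay distinguishable."""
    y = F.leaky_relu(x, negative_slope=slope)
    return torch.sign(y) * y * y
```

Under this form the slope enters the negative branch quadratically (output -(slope·x)² for x < 0), which is one reason small slope changes can have outsized effects in a sweep.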

Community Infrastructure


Builds On (full attribution)

This submission integrates techniques from the community. Every component is credited:

| Component | What We Used | PR | Original Author |
| --- | --- | --- | --- |
| Base architecture | LeakyReLU² + Parallel Muon | #549 | @abaybektursun |
| GPTQ + XSA + BigramHash | AR Self-Gen GPTQ calibration | #1019 | @abaybektursun |
| Scylla tokenizer | TokenMonster-derived 998-token vocab | #1143 | @simon-marcus |
| Parallel residuals + depth recurrence | Separate attn/MLP lanes from layer 7 | #1204 | @msisovic |
| Legal TTT framework | Score-first SGD with frozen early blocks | #461 | @Christopher-Lee-McClendon |
| LeakyReLU² activation | Squared LeakyReLU | #493 | @parinzee |
| XSA | Exclusive Self-Attention | #265 | @unnir |
| BigramHash + SmearGate | Hash bigram embeddings + gate | #102 | @unnir |
| Value Embeddings | VE128 on deep layers | #414 | @signalrush |

Note on Scylla: PR #1143 was closed by the author after byte-accounting errors (~4-6% BPB inflation). Our implementation is verified against that failure mode: base_bytes[i] = len(token_i.encode('utf-8')) for all 998 tokens, has_leading_space and is_boundary_token are both all-False, and the five zero-byte tokens are handled correctly. All 5 eval functions use identical byte-counting logic.
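The byte accounting reduces to the following (the toy vocabulary stands in for the 998-token Scylla vocab):

```python
import math

def build_byte_table(vocab):
    """Per-token UTF-8 byte counts used as the BPB denominator; tokens that
    decode to the empty string legitimately contribute zero bytes."""
    return [len(tok.encode("utf-8")) for tok in vocab]

def bits_per_byte(total_nll_nats, base_bytes, token_ids):
    """BPB = total negative log-likelihood in nats / (ln 2 * total bytes)."""
    total_bytes = sum(base_bytes[t] for t in token_ids)
    return total_nll_nats / (math.log(2) * total_bytes)
```

The ~4-6% inflation class of bug comes from counting characters (or double-counting leading-space markers) instead of UTF-8 bytes in the denominator; deriving `base_bytes` directly from `encode('utf-8')` makes that mistake impossible by construction.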


Architecture

11L/512d/8H/4KV, MLP 3× LeakyReLU(0.5)², XSA last 4, Partial RoPE 16d, LN Scale, BigramHash, SmearGate, VE 128d (layers 9-10), EMA 0.997, QAT, Mixed INT5/INT6+LZMA, Muon optimizer, Parallel Residuals (from layer 7), Mini Depth Recurrence (layers 4-5, from step 3000), Legal Score-First TTT. Scylla tokenizer (998 tokens, TokenMonster-derived).

Compliance (per #677)

  • 8×H100 SXM training
  • 10-minute wallclock (600s)
  • Artifact ≤ 16 MB — Verified. Raw serialized model is byte-identical across runs (112,437,059 bytes). With mixed_quant+brotli compression: 15,006,141 bytes (seed 42), 15,012,088 (seed 1337), 15,019,048 (seed 2024). All under budget. Architecture config confirmed identical between measured runs and this submission (30,126,957 params, same quantization scheme).
  • No n-gram cache at eval (NGRAM_ENABLED defaults to 0)
  • No two-pass rescoring
  • Score-first TTT (tokens scored before weight update)
  • Autoregressive eval (causal)
  • 3-seed validation (42: 1.0808, 1337: 1.0829, 2024: 1.0821)

Note on PR #1274

Prior submission was self-closed due to incorrect attribution that misrepresented community techniques as our own. This resubmission corrects that with explicit provenance for every integrated component.

Platform

RunPod 8×H100 80GB SXM, PyTorch 2.11.0+cu128.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

Mato and others added 3 commits March 29, 2026 10:23
Fork landing page now introduces The Agora with links to the live site,
issue templates, and discussions. Original OpenAI README preserved below.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GitHub only discovers workflows from the default branch (main).
The workflow checks out gh-pages and runs the pipeline there.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… + Legal TTT — val_bpb 1.0819

3-seed mean: 1.0819 BPB (std: 0.00088)
Seeds: 42=1.0808, 1337=1.0829, 2024=1.0821

Integration of community techniques with full attribution:
- Base: PR openai#549, openai#1019 by @abaybektursun
- Scylla tokenizer: PR openai#1143 by @simon-marcus
- Parallel residuals + depth recurrence: PR openai#1204 by @msisovic
- Legal TTT: PR openai#461 by @Christopher-Lee-McClendon

Our engineering: mixed INT5/INT6 quantization, learnable lane merge,
Scylla retokenization pipeline, integration work, CPU e2e test suite.
