Non-Record: BPB 1.1334 — 7000-Step Training + Mixed Int6/Int8 Quantization + Legal TTT#598

Open
Christopher-Lee-McClendon wants to merge 1 commit into openai:main from Christopher-Lee-McClendon:submission/11L-gepa-mixed-quant-7k-legal-ttt

Conversation

@Christopher-Lee-McClendon commented Mar 24, 2026

Non-Record: 4×A100-40GB

val_bpb = 1.1334 | Pre-TTT: 1.1476 | Artifact: 15.70 MB (headroom: 297 KB)


What Changed

Extended training to 7000 steps (from the typical 5200) with a longer warmdown cosine anneal (step 3500→7000), combined with mixed int6/int8 quantization to keep the 27M-parameter model under 16 MB. Legal score-first TTT (10 epochs SGD with momentum) yields a further −0.0142 BPB improvement.
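The warmdown described above (constant LR, then a cosine anneal over steps 3500→7000) can be sketched as follows. This is a minimal illustration of that schedule shape only; the function name and `base_lr` value are hypothetical, not taken from the PR.

```python
import math

def lr_at_step(step, base_lr=1e-3, warmdown_start=3500, total_steps=7000):
    """Constant LR until warmdown_start, then cosine-anneal to zero at total_steps."""
    if step < warmdown_start:
        return base_lr
    # Fraction of the way through the warmdown window, clamped to [0, 1].
    t = min((step - warmdown_start) / (total_steps - warmdown_start), 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))
```

At step 3500 this returns the full base LR; at step 5250 (halfway through the warmdown) half of it; at step 7000 approximately zero.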

|                | This Work | Prior VE128+RoPE (30ep TTT) |
|----------------|-----------|------------------------------|
| Pre-TTT BPB    | 1.1476    | 1.1609                       |
| Post-TTT BPB   | 1.1334    | 1.1425                       |
| Training steps | 7,000     | 5,200                        |
| TTT epochs     | 10        | 30                           |
| Eval time      | 2,194 s   | 3,662 s                      |
| Artifact size  | 15.70 MB  | 15.65 MB                     |

The base model improvement (−0.0133 pre-TTT) comes from longer training plus the GEPA architecture. Fewer TTT epochs (10 vs 30) mean faster eval (40% less wall time) at the cost of a smaller TTT gain (−0.0142 vs −0.0184).

Architecture: 11L GEPA

  • 11 unique layers (no depth recurrence), d=512, 8Q/4KV GQA heads
  • ReLU² (Star-ReLU) activation in 3× MLP (1536 hidden)
  • Cross-sequence attention (XSA) on last 4 layers
  • Exponential moving average (decay 0.997)
  • Bigram hash embeddings (2048 buckets, 128d)
  • Partial RoPE (16/64 dims) with YARN scaling
  • Value embeddings (128d) on layers 9–10
  • U-Net skip connections across layer pairs
  • LN depth scaling (1/√(layer+1))
  • Late QAT with GPTQ-lite clip search (5 candidates/row), enabled at step 6476
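Of the components above, partial RoPE is easy to show concretely: only 16 of the 64 head dims are rotated, the rest pass through unchanged. The sketch below (NumPy, single head, YARN scaling omitted) is illustrative; the function name and pairing layout are assumptions, not the PR's actual implementation.

```python
import numpy as np

def partial_rope(x, n_rot=16, base=10000.0):
    """Rotate the first n_rot of the head dims with rotary embeddings;
    pass the remaining head_dim - n_rot dims through untouched.
    x: (seq_len, head_dim), a single attention head for illustration."""
    seq_len, head_dim = x.shape
    half = n_rot // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:n_rot]                     # the paired dims
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, n_rot:]], axis=-1)
```

Position 0 gets zero rotation (cos = 1, sin = 0), and dims 16–63 are identical before and after, which is what makes the RoPE "partial."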

Mixed Quantization

Dual-scheme compression for the 27M-parameter model:

  • Int6 per-row (GPTQ-lite): attention projections + MLP weights (bulk of params)
  • Int8 per-tensor (scalar scale): layer norms, value embeddings, biases, embedding tables

27.5 MB payload → 15.63 MB after zstd-22 (3.89× compression) + 76 KB code = 15.70 MB total.
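A minimal sketch of the int6 per-row scheme: one floating-point scale per weight row, integer codes in [−31, 31]. This shows only the basic symmetric round-to-nearest step; the PR's GPTQ-lite clip search (5 candidates/row) would additionally try shrunken clip ranges per row and keep whichever minimizes reconstruction error. Function names here are illustrative, not from the submission.

```python
import numpy as np

def quant_int6_per_row(w):
    """Symmetric per-row int6 quantization: codes in [-31, 31], fp scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequant(q, scale):
    """Reconstruct approximate fp weights from codes and per-row scales."""
    return q.astype(np.float32) * scale
```

The round-to-nearest step bounds per-element error by half a scale unit per row; the clip search trades a little extra clipping error on outliers for finer resolution on the bulk of each row.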

TTT Protocol (Legal Score-First)

SGD with momentum (0.9) at lr=0.002, 10 epochs per 32K-token chunk, stride=64, freezing first 2 blocks. Score-first: every token scored under torch.inference_mode() before any weight update.
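The "score-first" ordering is the legality-critical part: each chunk must be scored with the current weights before any gradient from that chunk touches them. A structural sketch of that loop, with `score_fn`/`grad_fn` as hypothetical stand-ins for the model's BPB evaluation and backward pass (chunking stride and the frozen first-2-blocks detail are omitted):

```python
import numpy as np

def ttt_score_first(chunks, score_fn, grad_fn, params,
                    lr=0.002, momentum=0.9, epochs=10):
    """Score-first TTT: score every chunk under frozen weights BEFORE applying
    any update derived from that chunk, then adapt with SGD + momentum."""
    scores = []
    velocity = np.zeros_like(params)
    for chunk in chunks:
        scores.append(score_fn(params, chunk))   # score first (no leakage)
        for _ in range(epochs):                  # then 10 epochs of adaptation
            velocity = momentum * velocity - lr * grad_fn(params, chunk)
            params = params + velocity
    return scores, params
```

In the real protocol the scoring pass runs under torch.inference_mode(), which is what guarantees no gradient state exists when the chunk's tokens are scored.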

Limitations

  • Single seed (42) — no variance estimate. Acceptable for non-record but results may vary across seeds.
  • No ablation of individual GEPA components. The architecture combines multiple techniques without isolating their contributions.
  • No LeakyReLU: PR #537 ("Non-Record: BPB 1.13872 — LeakyReLU(0.5)² + Per-Layer LR Legal TTT (3 seeds)") showed LeakyReLU(0.5)² helps (−0.0035 BPB). This submission uses standard ReLU² instead; combining the two is an obvious next step.

Credits

Thanks to all contributors to the parameter-golf competition.

- Non-record submission: 1.1334 BPB, 15.70 MB artifact (4×A100-40GB)
- Mixed quantization: int6 per-row for MLP/attn, int8 per-tensor for rest
- 7000 training steps (vs 5200 baseline) with GEPA architecture
- Legal score-first TTT: SGD 10 epochs, -0.0142 BPB gain
- Beats prior non-record best (1.1425) by 0.009 BPB