Skip to content

Non-record: 3x MLP + Quantization-Aware Training (STE)#1595

Closed
seekerPrice wants to merge 1 commit intoopenai:mainfrom
seekerPrice:submission/mlp3x-qat-int6
Closed

Non-record: 3x MLP + Quantization-Aware Training (STE)#1595
seekerPrice wants to merge 1 commit intoopenai:mainfrom
seekerPrice:submission/mlp3x-qat-int6

Conversation

@seekerPrice
Copy link
Copy Markdown

Summary

  • 3x MLP expansion + Quantization-Aware Training with Straight-Through Estimator
  • Local MLX result: val_bpb = 2.2240 (baseline 2.2290, -0.005 BPB improvement)
  • H100 validation pending (applying for compute credits)

Key Innovation: QAT-STE

Injects int6 quantization noise during training via straight-through estimator, starting at 30% of iterations. The model learns weight distributions that are robust to post-training quantization — instead of training blind and hoping quantization doesn't hurt.

Ablation Results (300 steps, SP1024, M5 MacBook)

Config val_bpb vs Baseline
Baseline (2x MLP) 2.2290
3x MLP (ours) 2.2240 -0.005
10 layers 2.2249 -0.004
QK gain 5.0 2.2624 +0.033
Depth recurrence 2.2899 +0.061

Test plan

  • H100 20K-step training with QAT
  • 3-seed validation (seeds 42, 314, 999)
  • SP4096/SP8192 tokenizer integration
  • Test QAT quantization gap vs standard GPTQ

🤖 Generated with Claude Code

- 3x MLP expansion: proven -0.005 BPB vs baseline (2.2240 vs 2.2290)
- QAT with straight-through estimator: trains int6-friendly weights
- Full ablation table included in README
- H100 validation pending

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@seekerPrice
Copy link
Copy Markdown
Author

Progress Update (April 14)

Significant breakthroughs since initial submission:

Results (local M5, SP1024/SP4096)

Config Steps val_bpb
Baseline (unmodified) 300 2.2290
3xMLP + 10L + fixed warmdown 300 2.1387
10L + 3xMLP + LeakyReLU(0.5)² 2000 1.6366
SP4096 casefold + 11L + 4xMLP + SOTA params 3000 1.6101
SP4096 + MicroNorm + CPMR + NEKP (running) 3000 TBD (tracking -0.29 loss advantage)

Key Discoveries

  1. Warmdown bug: WARMDOWN_ITERS=1200 with ITERATIONS=300 starts warmdown at step 0, capping LR at 25%. Fixed with proper scaling.
  2. LeakyReLU(0.5)²: Validated -0.003 BPB from SOTA analysis. Preserves negative gradient flow.
  3. 3-AI collaborative invention (Claude + Gemini + Codex): Produced 8 novel techniques, killed 6 bad ideas through cross-critique.

Novel Inventions (implemented, testing)

  • MicroNorm: Block-aligned normalization matching int6 quantization groups
  • CPMR: Causal Prefix Mean Residual — inject running context mean
  • NEKP: Nash Kurtosis Penalty — penalize weight outliers for better int6
  • Dichotomous Temperature: Per-head learned attention sharpness

Next Steps

  • Awaiting H100 compute credits for full 20K-step validation
  • SP8192 casefold tokenizer planned
  • Score-first TTT implementation ready

🤖 Generated with Claude Code

@seekerPrice
Copy link
Copy Markdown
Author

Superseded by PR #1612 which achieves val_bpb=1.5096 via tuned hyperparameters (-0.05 BPB vs SOTA defaults, reproducible A/B test). The 3x MLP + QAT approach in this PR was validated at only 300 steps with weaker results (2.2240 BPB). Focusing efforts on PR #1612 and follow-up H100 validation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant