34 commits
cc6a63e
Add LongContext4096 + QAT Int4 16L experiments
FlashyFlash3011 Mar 21, 2026
538bfa6
Fix warmdown and rope_base in LongContext4096 script
FlashyFlash3011 Mar 22, 2026
54b969a
Add LongContext4096 + Full SOTA stack experiment (2-line diff from PR…
FlashyFlash3011 Mar 24, 2026
e4ee89f
Add 4 experiments: fix QAT mismatch, add int6/int4 bank-QAT scripts
FlashyFlash3011 Mar 25, 2026
15e6d9e
Add run_experiments.sh and reset RESULTS.md with correct iteration co…
FlashyFlash3011 Mar 25, 2026
ee26d04
Fix BASE path in run_experiments.sh
FlashyFlash3011 Mar 25, 2026
b7eb0ed
Fix lzma preset, TTT stride, add QAT exp to run script
FlashyFlash3011 Mar 25, 2026
b688621
results: 2026-03-25_LongContext4096_Int6_QAT seed1337
Mar 26, 2026
b937791
add recompress_l9.py utility
FlashyFlash3011 Mar 26, 2026
12edb34
exp6: Int6_QAT_2048 — same as Exp5 but ctx=2048 for size+speed fix
FlashyFlash3011 Mar 26, 2026
0b6146d
exp6: full bank QAT + submission.json
FlashyFlash3011 Mar 26, 2026
b992b40
exp7: Int6_QAT_2048_LateBank — late bank QAT + MLP_MULT=2.75 + 2048 ctx
FlashyFlash3011 Mar 26, 2026
b371be9
results: 2026-03-26_Int6_QAT_2048_LateBank seed1337
Mar 26, 2026
3c73097
fix: clamp QAT range -32->-31 to match export symmetric range
FlashyFlash3011 Mar 26, 2026
8ccc5d2
reset: remove failed exps, add BankQAT_2048train_4096eval (Option B)
FlashyFlash3011 Mar 26, 2026
4c492e6
exp: LongContext4096 + BankQAT + GatedAttn + ValueResid + zstd-22
FlashyFlash3011 Mar 26, 2026
97d4cda
fix: lzma-9 compression, TTT epochs=1/lr=0.001/freeze=4 to prevent fo…
FlashyFlash3011 Mar 26, 2026
1cc698c
tune: bank_qat_threshold 0.15->0.05 (less warmdown noise), target_mb …
FlashyFlash3011 Mar 26, 2026
74c4ce7
exp: 2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT
FlashyFlash3011 Mar 27, 2026
d1563e1
exp: 2026-03-27_GPTQLite_QAT_MaxLZMA_LegalTTT — lzma-9, bank_qat=0.05…
FlashyFlash3011 Mar 27, 2026
4355194
fix: add git identity + save_and_push after each seed for auto-commit…
FlashyFlash3011 Mar 27, 2026
7fc776c
cleanup: remove old experiments, slim run_experiments.sh to GPTQLite …
FlashyFlash3011 Mar 27, 2026
6351d76
fix: remove TTT cosine LR decay — use constant ttt_lr across all chunks
FlashyFlash3011 Mar 28, 2026
3dd38f2
revert: restore TTT cosine LR decay
FlashyFlash3011 Mar 28, 2026
9db1393
official: lock submission command, update blurb with seed1337 result
FlashyFlash3011 Mar 29, 2026
0e03e27
results: GPTQLite_QAT_MaxLZMA_LegalTTT seed1337
FlashyFlash3011 Mar 29, 2026
6b3441b
docs: rewrite README — Pure Velocity strategy, seed1337 results
FlashyFlash3011 Mar 29, 2026
ad1e888
cleanup: remove run_experiments.sh and RESULTS.md, rename seed1337 log
FlashyFlash3011 Mar 29, 2026
3277196
results: seed1337 — 1.11901 BPB, 15.851MB, 7155 steps
FlashyFlash3011 Mar 29, 2026
cabd694
results: GPTQLite_QAT_MaxLZMA_LegalTTT seed42
Mar 29, 2026
9cd1489
results: seed42 — 1.11961 BPB, 15.858MB, 7156 steps
FlashyFlash3011 Mar 30, 2026
211abea
docs: add planned changes, beyond-constraints section, and headroom/s…
FlashyFlash3011 Mar 30, 2026
7566466
docs: remove base script reference, fix duplicate sections, clean up …
FlashyFlash3011 Mar 30, 2026
92e02e0
docs: remove checkmark emoji
FlashyFlash3011 Mar 30, 2026
4 changes: 3 additions & 1 deletion .gitignore
```diff
@@ -8,4 +8,6 @@ data/manifest.json
 data/docs_selected.jsonl
 .mypy_cache/
 .venv
-logs/
+logs/
+*.pt
+*.ptz
```
@@ -0,0 +1,106 @@
# GPTQLite: Pure Velocity & TTT Preservation

**Target val_bpb: < 1.1194** (beat leaderboard #1)

## Results (8×H100 80GB SXM)

| Seed | step_avg | steps | Pre-TTT bpb | **Post-TTT bpb** | TTT gain | TTT time | Artifact |
|------|----------|-------|-------------|-----------------|----------|----------|----------|
| 1337 | 83.87ms | 7155 | 1.12164 | **1.11901** | -0.00263 | 421.9s | 15.851MB |
| 42 | 83.86ms | 7156 | 1.12229 | **1.11961** | -0.00268 | 423.2s | 15.858MB |
| 2025 | — | — | — | **—** | — | — | — |
| **Mean** | — | — | — | **—** | — | — | — |

## Strategy: Pure Velocity & TTT Preservation

Initial attempts tried to maximize model capacity (GatedAttention, ValueResidual, BigramHash=2048). Ablations showed these features add ~1.5ms/step overhead and destabilize TTT, costing more in training steps than they gain in quality under the 10min/16MB constraint.

The winning strategy strips the model to its leanest form — more training steps, cleaner TTT.

## Key Changes

### 1. Architecture Speed Diet
| Flag | Value | Why |
|------|-------|-----|
| `GATED_ATTENTION` | **0** (disabled) | Adds ~1.5ms/step overhead — costs 130+ training updates over 600s |
| `VALUE_RESIDUAL` | **0** (disabled) | Same overhead, no net gain under the time constraint |
| `BIGRAM_VOCAB_SIZE` | **1536** | Keeps artifact lean; 2048 pushed artifact over limit |

### 2. SWA Removed
| Flag | Value | Why |
|------|-------|-----|
| `SWA_ENABLED` | **0** | Pointlessly copied hundreds of MB of tensors GPU→CPU every 50 steps; the script applies EMA weights at the end, not SWA. Disabling it buys ~30 extra training steps. |
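The EMA-at-the-end behaviour that replaces SWA can be sketched in a few lines. This is a plain-Python stand-in (scalar weights instead of GPU tensors, and `ema_update` is an illustrative name, not the script's actual API): nothing is ever copied off-device, so there is no per-snapshot GPU→CPU cost.

```python
def ema_update(ema, weights, decay=0.997):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * w.

    Unlike an SWA snapshot, nothing is copied off-device; the running
    average lives alongside the live weights (scalars here for clarity).
    """
    return {k: decay * ema[k] + (1 - decay) * weights[k] for k in ema}

# At the end of training the EMA weights are applied once,
# replacing the live weights before evaluation.
```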

### 3. QAT Simplified
| Flag | Value | Why |
|------|-------|-----|
| `QAT_ENABLED` | not set (off) | Full QAT from step 1 adds math overhead throughout training |
| `LATE_QAT_THRESHOLD` | **0.15** | Quantization activates only in the final 15% of warmdown |
| `BANK_QAT_THRESHOLD` | **0** | Bank QAT was snapping finely tuned FP32 TTT weights back to Int6 mid-evaluation, causing catastrophic forgetting |
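The late-QAT gating can be sketched as follows. This is an illustrative reconstruction, not the training script's actual code: fake-quantization switches on only once training enters the final `LATE_QAT_THRESHOLD` fraction of warmdown, and the int6 grid is clamped to [-31, 31] to match the symmetric export range (per the "clamp QAT range -32->-31" fix in the commit log).

```python
def qat_active(step, total_iters, warmdown_iters, late_qat_threshold=0.15):
    """True once training is inside the final `late_qat_threshold`
    fraction of the warmdown phase (names are illustrative)."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return False
    frac_through_warmdown = (step - warmdown_start) / warmdown_iters
    return frac_through_warmdown >= 1.0 - late_qat_threshold

def fake_quant_int6(w, scale):
    """Symmetric int6 fake-quant: round to the grid, clamp to [-31, 31]
    to match the exporter's symmetric range, then dequantize."""
    q = max(-31, min(31, round(w / scale)))
    return q * scale
```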

## Unchanged

| Feature | Setting |
|---------|---------|
| **Architecture** | 11L, 512d, 8H, 4KV, 3× MLP |
| **Activation** | LeakyReLU(0.5)² — hardcoded |
| **XSA** | Last 4 layers |
| **VE** | dim=128, layers 9,10 |
| **Partial RoPE** | 16/64 dims, NTK scaling |
| **LN Scale** | 1/√(layer+1) |
| **EMA** | decay=0.997, applied at end of training |
| **Quantization** | GPTQ-lite int6 + zstd-22 |
| **Optimizer** | Parallel Muon + Parameter Banking — all LRs/WDs identical |
| **Legal TTT** | score-first, 3 epochs, freeze=0, lr=0.002, SGD+momentum(0.9) |
| **Training** | TRAIN_SEQ_LEN=2048, EVAL_STRIDE=64, WARMDOWN_ITERS=3500 |
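The quantize-then-compress export in the table above can be illustrated with a minimal sketch. Assumptions are flagged loudly: the real exporter is GPTQ-lite with per-group scales (per-tensor here), and the submission compresses with zstd level 22, whereas this sketch uses the stdlib `lzma` module so it runs without extra dependencies.

```python
import lzma
import struct

def quantize_int6_symmetric(weights):
    """Per-tensor symmetric int6: scale so the largest |w| maps to 31,
    keep one float scale plus one small integer per weight.
    (Sketch only; GPTQ-lite uses per-group scales and error feedback.)"""
    scale = max(abs(w) for w in weights) / 31 or 1.0
    q = [max(-31, min(31, round(w / scale))) for w in weights]
    return scale, q

def pack_and_compress(scale, q, preset=9):
    """Serialize the scale and 6-bit codes, then compress the stream.
    (The submission uses zstd-22; stdlib lzma stands in here.)"""
    raw = struct.pack("f", scale) + bytes((v & 0x3F) for v in q)
    return lzma.compress(raw, preset=preset)
```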

## Run Command

```bash
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=4 GATED_ATTENTION=0 VALUE_RESIDUAL=0 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 ROPE_DIMS=16 LN_SCALE=1 \
MUON_WD=0.04 ADAM_WD=0.04 MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
LATE_QAT_THRESHOLD=0.15 BANK_QAT_THRESHOLD=0 SWA_ENABLED=0 \
TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 \
ITERATIONS=9000 WARMDOWN_ITERS=3500 MAX_WALLCLOCK_SECONDS=600 \
DATA_PATH=$DATA TOKENIZER_PATH=$TOK SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee train_seed1337.log
```
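The `TTT_*` flags above implement a "legal" score-first loop: each evaluation chunk is scored with the current weights *before* the model adapts on it, so a chunk's own content never leaks into its own score. A minimal sketch of that ordering, with illustrative function names standing in for the model's score and update calls:

```python
def legal_ttt(score_fn, train_fn, chunks, epochs=3):
    """Score-first test-time training: score each chunk, then run
    `epochs` adaptation passes on it before moving on. `score_fn` and
    `train_fn` are stand-ins for the real model calls."""
    total = 0.0
    for chunk in chunks:
        total += score_fn(chunk)        # evaluate first (legal)
        for _ in range(epochs):
            train_fn(chunk)             # then adapt on the scored chunk
    return total
```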

## Timing Budget (actual, seed 1337)

| Phase | Time |
|-------|------|
| Training (wallclock cap) | 600s |
| EMA apply + diagnostic eval | ~2s |
| int6 roundtrip eval | ~6s |
| Sliding window eval (2048, stride=64) | ~75s |
| Legal TTT (3ep, all blocks, 2048 ctx) | ~425s |
| **Total eval** | **~508s** |
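The ~75s sliding-window eval scores every token once with up to 2048 tokens of left context, advancing 64 tokens per forward pass. A stride-generic sketch (with `token_nll_fn` standing in for a model call that returns per-token NLL in nats) plus the BPB conversion:

```python
import math

def sliding_window_nll(token_nll_fn, tokens, window=2048, stride=64):
    """Score each token with up to `window` tokens of left context,
    advancing `stride` tokens per pass; only the newest `stride`
    tokens of each window contribute to the total."""
    total_nll, scored, pos = 0.0, 0, 0
    while pos < len(tokens):
        start = max(0, pos + stride - window)
        ctx = tokens[start:pos + stride]
        n_new = min(stride, len(tokens) - pos)
        total_nll += sum(token_nll_fn(ctx)[-n_new:])
        scored += n_new
        pos += stride
    return total_nll, scored

def bpb(total_nll_nats, total_bytes):
    """Bits-per-byte from summed NLL in nats."""
    return total_nll_nats / (total_bytes * math.log(2))
```

With stride 64 each pass scores 64 fresh tokens against nearly full context, trading ~32× more forward passes for a tighter context than a non-overlapping 2048-token eval.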

## Features Explored but Disabled (Not Used in Final Submission)

These changes were designed, implemented, and tested but disabled because the 10min/16MB constraint made their compute overhead a net negative. They remain in the codebase.

| Feature | Expected Δ BPB | Why disabled | Why it helps with more budget |
|---------|---------------|--------------|-------------------------------|
| **GatedAttention** (PR #841) | -0.002 to -0.005 | +1.5ms/step → 130+ lost training steps | Per-head sigmoid gates improve attention expressivity; pays off with 30min+ training |
| **ValueResidual** (PR #841) | included above | Same compute overhead | Layer-0 value injection improves gradient flow across deep layers |
| **BigramHash=2048** | -0.001 to -0.002 | Pushed artifact over 16MB limit | More bigram vocabulary = better subword context modeling |
| **QAT from step 1** | -0.001 to -0.003 | Overhead throughout all ~7000 steps | Full-run quantization adaptation significantly reduces post-quant degradation |
| **BANK_QAT_THRESHOLD > 0** | enables compression | Corrupts TTT weights mid-evaluation | With a larger artifact budget, enables aggressive int6 compression of a much bigger model |

### Headroom & Scaling Evidence

The final submission sits at **~15.851MB** — leaving ~149KB of the 16MB budget unused. Attempts to fill that headroom by increasing `BIGRAM_VOCAB_SIZE` to 1664 and then 2048 produced worse BPB and pushed the artifact over the limit, confirming the model is already well-optimized for this constraint.

In an uncapped scenario (larger artifact budget + longer training), all of these levers can be opened simultaneously for significantly better BPB than the current 1.119x.

## Credits

- **LeakyReLU² activation**: PR #493 by @parinzee, PR #518 by @sofiabod
- **Optimizer (Parameter Banking + Parallel Muon)**: PR #399 by @abaybektursun
- **TTT recipe**: PR #461 by @Christopher-Lee-McClendon
- **Base model**: PR #414 by @signalrush
@@ -0,0 +1,9 @@
{
"name": "GPTQLite_QAT_MaxLZMA_LegalTTT",
"val_bpb": null,
"bytes_total": null,
"blurb": "11L 512d LeakyReLU(0.5)^2 + XSA-4 + Partial RoPE + LN Scale + VE128 + EMA(0.997) + GPTQ-lite int6 + zstd-22 + Legal score-first TTT (3ep SGD momentum=0.9, all blocks, lr=0.002) + Parameter Banking + Parallel Muon. Seed 1337: 1.11964 BPB, 15.861MB.",
"author": "FlashyFlash3011",
"github_id": "FlashyFlash3011",
"date": "2026-03-27"
}