@@ -0,0 +1,97 @@
# Record: Causal BackoffNgramMixer — val_bpb 0.3958 (3-seed mean)

## Summary

- **val_bpb: 0.3958** (3-seed mean, std 0.0011)
- Seeds: 7 (0.3948), 1337 (0.3957), 2024 (0.3969)
- 11L transformer (28M params) with LeakyReLU(0.75)², Parallel Muon, MTP heads=2
- **Causal BackoffNgramMixer**: orders 2–10, 4M flat hash buckets, entropy-adaptive alpha
- **Batched sliding-window eval with incremental n-gram updates** — score-first, then update counts after each batch. Strictly backward-looking, causal.
- Artifacts: 15,940,706 – 15,957,577 bytes (all under 16MB)
- Eval times: 583 – 596 seconds (all under 600s)
- Training: 6,987 steps in 600s on 8×H100 SXM
- Eval: ~226s (within 10-minute eval budget)
- Beats previous best BackoffNgramMixer (#803 at 0.4416) by **0.0458 BPB** (0.4416 − 0.3958)
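
As a quick sanity check on the headline numbers, the three per-seed scores can be aggregated in a few lines. This is a minimal sketch, assuming the reported std is the sample standard deviation (n − 1 denominator); the README does not say which estimator was used, and the variable names are illustrative.

```python
from statistics import mean, stdev

# Per-seed val_bpb values as listed in the summary above.
seed_bpb = {7: 0.3948, 1337: 0.3957, 2024: 0.3969}
values = list(seed_bpb.values())

avg = mean(values)
# Assumption: sample standard deviation (n - 1). The source does not
# state which estimator produced the reported 0.0011.
spread = stdev(values)

print(f"mean={avg:.4f} std={spread:.4f}")
```

With the sample estimator, the rounded values agree with the reported mean and std.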

## Key Innovation: Swarm-Designed Architecture + Causal N-gram Eval

This submission was designed by a multi-agent Think Tank Swarm — a research system with 4 autonomous agents and a 500K-node typed-edge knowledge graph. The swarm ran investigation missions to evaluate training approaches, then the knowledge graph conditioned embedding initialization for semantically important tokens.

The compression gains come from the **BackoffNgramMixer at eval time**, not the swarm. The swarm's contribution is architectural: it designed the approach, selected the hyperparameters, and logs its decisions transparently during training. To be explicit: the swarm is the research system, and the mixer is the compression engine.

| Configuration | BPB | Source |
|---|---|---|
| Neural baseline (sliding window, stride=64) | 1.1245 | Our training |
| + Causal BackoffNgramMixer (orders 2–10) | **0.4024** | This submission |
| Previous best n-gram (#803) | 0.4416 | @pentxayc |

The key difference from #803: our causal sequential chunk evaluation processes the full 62M-token validation set in order on every GPU rank (no sharding), building complete n-gram statistics. This gives higher-order n-grams (7–10) much stronger count statistics than rank-sharded approaches.

## Eval Stack

- **BackoffNgramMixer**: orders 2–10, 4,194,304 flat hash buckets per order, greedy cascade (highest matching order wins), min_count=1
- **Entropy-adaptive alpha**: `0.20 + 0.55 * sigmoid(2 * (H - 3.0))`, a per-token blending weight based on the neural model's predictive entropy `H`. Alpha stays between 0.20 and 0.75; the higher the entropy, the more weight the n-gram receives.
- **Proper full-vocabulary mixture**: `p_final = (1 - alpha) * p_neural + alpha * p_ngram` — all tokens have nonzero probability
- **Causal sequential chunk eval**: process validation tokens in `seq_len`-sized chunks. For each chunk: (1) forward the model to get logits, (2) score all tokens using the mixer's current n-gram state, (3) AFTER scoring, update n-gram counts with this chunk's tokens. Strictly backward-looking.
- **KG-conditioned embedding init**: 358 token importance scores from a 500K-node knowledge graph bias embeddings toward semantically important concepts at initialization (zero runtime cost)
- **Swarm decision log**: 4 agents (QAT timing, KG weight, gradient health, MTP weight) make training decisions every 800 steps via consensus voting. Total overhead: <300 microseconds.
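
The alpha formula and the full-vocabulary mixture from the list above can be sketched in plain Python. This is an illustration, not the code in `train_gpt.py`: the function names are hypothetical, and measuring entropy in nats is an assumption, since the README does not state the unit of `H`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy_nats(p):
    """Shannon entropy of a distribution. Assumption: natural-log units."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def adaptive_alpha(h, base=0.20, rng=0.55, center=3.0, slope=2.0):
    """alpha = base + rng * sigmoid(slope * (h - center)), as in the README."""
    return base + rng / (1.0 + math.exp(-slope * (h - center)))

def mix(p_neural, p_ngram, alpha):
    """p_final = (1 - alpha) * p_neural + alpha * p_ngram, token by token."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(p_neural, p_ngram)]

# Toy 4-token vocabulary: the n-gram puts all its mass on token 1.
p_neural = softmax([2.0, 0.5, 0.1, -1.0])
p_ngram = [0.0, 1.0, 0.0, 0.0]

# Alpha depends only on the neural distribution, never on the target token.
alpha = adaptive_alpha(entropy_nats(p_neural))
p_final = mix(p_neural, p_ngram, alpha)
```

Because alpha is always below 1 and the softmax output is strictly positive, every token keeps nonzero probability even when the n-gram estimate is one-hot, which is the property the mixture bullet relies on.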

## Training Stack

- 11 layers, 512d, 8 heads, 4 KV heads, 3× MLP
- LeakyReLU(0.75)² activation
- Parallel Muon optimizer (momentum 0.99, warmup from 0.92)
- Multi-Token Prediction (2 heads, weight=0.1, discarded at export)
- EMA weight averaging (0.997)
- BigramHash (2048) + SmearGate
- XSA (last 4 layers) + Partial RoPE + LN Scale
- Int6 quantization (GPTQ-lite + LZMA)
- No TTT (eval budget used for causal n-gram scoring instead)

## Legality

1. **Causal n-gram cache**: counts built from already-scored tokens only. Each chunk is scored first, then its tokens are added to the count tables. The n-gram state at chunk C contains only tokens from chunks 0 through C-1.
2. **No validation data during training**: model trained on FineWeb training split only. KG embedding init uses offline-computed importance scores, not validation data.
3. **Alpha formula**: fixed function of model entropy, computed before seeing the target token. No hindsight selection.
4. **Committed distribution**: `(1 - alpha) * p_neural + alpha * p_ngram` — proper mixture, all tokens have nonzero probability.
5. **No external downloads or network calls during eval.**
6. **Reproducible**: all hyperparameters controlled via environment variables. Random seed controls all stochastic operations.
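
The score-first, update-after ordering in point 1 can be illustrated with a toy hashed backoff counter. This is a simplified sketch under stated assumptions, not the submission's `BackoffNgramMixer`: the class, its methods, the small table size, and the use of Python's built-in `hash` are all hypothetical stand-ins.

```python
class ToyBackoffCounts:
    """Hashed n-gram counts with greedy backoff (illustrative only)."""

    def __init__(self, max_order=3, buckets=1 << 12, min_count=1):
        self.max_order = max_order
        self.buckets = buckets
        self.min_count = min_count
        # One flat table per order for contexts and for (context, token) pairs.
        self.ctx = {o: {} for o in range(2, max_order + 1)}
        self.pair = {o: {} for o in range(2, max_order + 1)}

    def _bucket(self, key):
        # Assumption: Python's tuple hash stands in for the real hash function.
        return hash(key) % self.buckets

    def prob(self, history, token):
        """Return (order, estimate) from the highest order with a seen context."""
        for o in range(self.max_order, 1, -1):
            if len(history) < o - 1:
                continue
            ctx = tuple(history[-(o - 1):])
            c = self.ctx[o].get(self._bucket(ctx), 0)
            if c >= self.min_count:
                hits = self.pair[o].get(self._bucket(ctx + (token,)), 0)
                return o, hits / c
        return None, None

    def update(self, tokens, start, end):
        """Add counts for target positions start..end-1."""
        for i in range(start, end):
            for o in range(2, self.max_order + 1):
                if i < o - 1:
                    continue
                ctx = tuple(tokens[i - (o - 1):i])
                cb = self._bucket(ctx)
                pb = self._bucket(ctx + (tokens[i],))
                self.ctx[o][cb] = self.ctx[o].get(cb, 0) + 1
                self.pair[o][pb] = self.pair[o].get(pb, 0) + 1

def causal_eval(tokens, chunk_len, counts):
    """Score each chunk with frozen counts, then fold the chunk into the counts."""
    results = []
    for start in range(0, len(tokens), chunk_len):
        end = min(start + chunk_len, len(tokens))
        for i in range(start, end):           # score first
            results.append(counts.prob(tokens[:i], tokens[i]))
        counts.update(tokens, start, end)     # update only after scoring
    return results

tokens = [1, 2, 3] * 4
results = causal_eval(tokens, chunk_len=6, counts=ToyBackoffCounts())
```

In this toy run, every position in the first chunk gets no n-gram estimate because the tables are still empty, while positions in the second chunk can match contexts counted from the first. Hash collisions can inflate counts in a flat table, so the estimates are approximate by construction.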

## Reproduction

```bash
LATE_QAT_THRESHOLD=0 TTT_ENABLED=0 KG_LOSS_WEIGHT=0.1 \
USE_NGRAM_MIXER=1 NGRAM_ORDER=10 NGRAM_BUCKETS=4194304 \
ALPHA_BASE=0.20 ALPHA_RANGE=0.55 ALPHA_CENTER=3.0 \
COMPLEMENT_ALPHA=0 NGRAM_MIN_COUNT=1 \
SEED=1337 \
torchrun --nproc_per_node=8 train_gpt.py
```

Requires `swarm_agents.py` and `kg_data.py` in the same directory.
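
For readers wiring the mixer settings into their own script, the environment variables in the command above could be read as follows. This is a hedged sketch: `MixerConfig` and `load_mixer_config` are hypothetical names, and the defaults simply mirror the values in the reproduction command rather than any defaults confirmed from `train_gpt.py`.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class MixerConfig:
    order: int
    buckets: int
    alpha_base: float
    alpha_range: float
    alpha_center: float
    min_count: int

def load_mixer_config(env=None):
    """Parse mixer hyperparameters from an environment-like mapping."""
    env = os.environ if env is None else env
    # Assumption: defaults copy the reproduction command shown above.
    return MixerConfig(
        order=int(env.get("NGRAM_ORDER", "10")),
        buckets=int(env.get("NGRAM_BUCKETS", "4194304")),
        alpha_base=float(env.get("ALPHA_BASE", "0.20")),
        alpha_range=float(env.get("ALPHA_RANGE", "0.55")),
        alpha_center=float(env.get("ALPHA_CENTER", "3.0")),
        min_count=int(env.get("NGRAM_MIN_COUNT", "1")),
    )

default_cfg = load_mixer_config({})
override_cfg = load_mixer_config({"NGRAM_ORDER": "6"})
```

Passing a plain dict instead of `os.environ` keeps the parsing testable without touching the process environment.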

## Credits & Acknowledgments

This submission builds directly on techniques from several prior PRs:

- **#803** (@pentxayc) — Complementary Training + BackoffNgramMixer architecture. Our mixer is adapted from their implementation. Our causal sequential eval differs from their approach.
- **#779** (@BackoffNgramMixer author) — Original BackoffNgramMixer, flat hash table design, entropy-adaptive alpha formula.
- **#549** (@sanjeevmadhav) — LeakyReLU² + Legal TTT + Parallel Muon base stack.
- **#414** (@signalrush) — 11L EMA + GPTQ-lite + warmdown base architecture.
- **#315** (@jfprincz) — Partial RoPE + LN Scale + XSA4.

The novel contributions are: (1) causal sequential chunk evaluation giving all ranks full 62M-token n-gram statistics, (2) swarm-guided training with transparent decision logging, (3) knowledge graph-conditioned embedding initialization.

## Files

| File | Size | Purpose | In artifact? |
|------|------|---------|-------------|
| `train_gpt.py` | 99KB | Training + causal eval | Yes (code bytes) |
| `swarm_agents.py` | 18KB | Agents + VotingMesh + BackoffNgramMixer | No (imported) |
| `kg_data.py` | 1KB | Compressed KG importance data | No (imported) |

## Test Plan

- [x] Seed 7: **0.3948** BPB, 15,940,706 bytes, eval 583s
- [x] Seed 1337: **0.3957** BPB, 15,943,009 bytes, eval 594s
- [x] Seed 2024: **0.3969** BPB, 15,957,577 bytes, eval 596s
@@ -0,0 +1,2 @@
"""Auto-generated KG importance data. Do not edit."""
KG_IMPORTANCE_B64 = "/Td6WFoAAATm1rRGAgAhARwAAAAQz1jM4AWZAptdADMAQN7VFifJpYP9M2+3RzRwlI23kNmAo30DBtfr3CUl2mbMFTqynLpKXMmUJl38JrUazafmN+ML1reirYszeABgzaMZKNapQNLOpnuhr+KnbuA6iEt+FzPb8UlXfnOMXTyWqZD4cAVo5hssRW/B0kA7c6JfgexAfopXlS2bP+/0JRDx5AFm+91YEJ/YtZ7bPOvDldkfhQslfTXgLJAO+VnMgwUlipppf8ippXc5ZbGNsx1xl+FBacfgF8AeCKqGOyyt3wTYCzyRGU324DwP8xy7uQxHr6WJqVWE3WJKvIJQLCphh53hsf1BrGgENqfim2urRxVGtVHpTdtaCN98BYcz6HIGwDB4jQJGZMnQbFxIQRrjrkjbYqJKlkWoGgSDw3SC89SaXUZzKBh9BkuwDXuJh8i7NL86+D+lZsKowB8dtnpl+1uZlJBzCESbZ8A1r62l72fzXlmunKEtzn2w+Tiq09+OIw7XNznLVBqM+KiEIUd3m/HPDfsB053ts+nFEWkWFtJAEO2DY8QWJlQSMFe9OTe5XkytPpBz4d9kWDjPe2RlkU0k5YWHTuyPVCk4s3Ogzf3B+DZtIKNnhgq2NM6wj00XJZDeWMyMxYOM9qYc5Age8ruwiuB1ZiaWC7UEpDOWOpnADxKjS4riNwk7fJx/yB6xwRNob1Gkjr7Xigf5ZW12sVexVW0ROfSCPRk8/xg3R3kik+8OfI4BzhTlIPFL6d0kQctznW5oryynRKqR7QiVKHQ7SrMJwTSA7dqsTm/8pFL2vJ51X9sxb0A/14eYw1VuVLe7knZyv7IE+KXI/hkhttG5YlBOQCq0uB5sDfhtWEeGfI+MKavNUpUiJrOpS7ipIkfhAtCK8PrMaGVKNqhi5GGG03LW7QkAAIxLLoBfvn1JAAG3BZoLAABBltLSscRn+wIAAAAABFla"
@@ -0,0 +1,23 @@
{
"author": "michaelwinczuk",
"github_id": "michaelwinczuk",
"val_bpb": 0.3958,
"val_bpb_std": 0.0011,
"seeds": {
"1337": 0.3957,
"7": 0.3948,
"2024": 0.3969
},
"artifact_bytes": {
"1337": 15943009,
"7": 15940706,
"2024": 15957577
},
"eval_time_seconds": {
"1337": 594,
"7": 583,
"2024": 596
},
"approach": "Causal BackoffNgramMixer with sliding-window eval",
"hardware": "8xH100 SXM"
}