@@ -0,0 +1,83 @@
# DominationV2 + BOS-Reset Bigram Cache + TTT

**val_bpb: 1.1382** (3-seed mean, std 0.0010) | **~15.5 MB** | 8xH100 SXM

## Results (8xH100 80GB SXM, PyTorch 2.9.1+cu128)

| Seed | step_avg | steps | val_bpb | Artifact |
|------|----------|-------|---------|----------|
| 1337 | 69.7ms | 8,611 | **1.1371** | 15,504,722 |
| 42 | 69.8ms | 8,605 | **1.1385** | 15,579,418 |
| 2025 | 69.7ms | 8,621 | **1.1389** | 15,505,762 |
| **Mean** | **69.7ms** | **8,612** | **1.1382** | |

### Timing Budget

| Phase | Time |
|-------|------|
| Training (8,611 steps @ 69.7ms) | 600s |
| TTT (3 epochs) | ~10s |
| Sliding window + cache eval | ~223s |
| **Total eval** | **~233s** |
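
As a quick sanity check on the table above, the training time can be recomputed as steps times average step time, using the per-seed figures from the results table. This is only arithmetic on the reported (rounded) numbers, so small deviations from 600s are expected; the variable names are illustrative.

```python
# (steps, step_avg in ms) per seed, copied from the results table.
runs = {1337: (8611, 69.7), 42: (8605, 69.8), 2025: (8621, 69.7)}

# Training wall time in seconds implied by the rounded step averages.
train_seconds = {seed: steps * ms / 1000.0 for seed, (steps, ms) in runs.items()}

# Eval time is the sum of the two eval phases in the timing table.
ttt_seconds = 10
sliding_eval_seconds = 223
total_eval_seconds = ttt_seconds + sliding_eval_seconds

for seed, secs in train_seconds.items():
    print(f"seed {seed}: ~{secs:.1f}s training")
print(f"total eval: ~{total_eval_seconds}s")
```

Each seed lands within about a second of the 600s training budget, with the excess coming from rounding in the reported step averages.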

## BOS-Reset Bigram Cache

An eval-time bigram cache is applied during sliding-window evaluation, after the quantization round trip and TTT.

For each scored token, the cache tracks bigram counts from already-scored tokens within the current document and blends the resulting cache probability with the model's probability:

```
p_final = (1 - alpha_eff) * p_model + alpha_eff * p_cache

p_cache = count(prev, target) / count(prev)
alpha_eff = 0.20 * count / (count + 8)    # scales with observed data
alpha_eff *= (entropy / max_entropy)      # higher when model is uncertain
```

The cache resets at every BOS token (document boundary) and is updated only after each token is scored (score-first, the same ordering as TTT in PR #549).
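
The blending rule and score-first ordering can be sketched in plain Python as follows. This is a minimal illustration, not the code in `train_gpt.py`: the `BigramCache` class, the `score_stream` helper, and the choice to take `count` as the number of times the previous token has been seen and to reset when the previous token is BOS are all assumptions made for the example.

```python
import math
from collections import defaultdict

CACHE_ALPHA = 0.20
CACHE_TAU = 8.0
CACHE_ENTROPY_POWER = 1.0


class BigramCache:
    """Document-local bigram counts (hypothetical helper, not the author's code)."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.pair_counts = defaultdict(int)  # (prev, target) -> count
        self.prev_counts = defaultdict(int)  # prev -> count

    def blend(self, prev, target, p_model, entropy, max_entropy):
        n = self.prev_counts.get(prev, 0)  # assumption: "count" means count(prev)
        if n == 0:
            return p_model  # nothing observed yet, fall back to the model
        p_cache = self.pair_counts.get((prev, target), 0) / n
        alpha = CACHE_ALPHA * n / (n + CACHE_TAU)
        alpha *= (entropy / max_entropy) ** CACHE_ENTROPY_POWER
        return (1.0 - alpha) * p_model + alpha * p_cache

    def update(self, prev, target):
        self.pair_counts[(prev, target)] += 1
        self.prev_counts[prev] += 1


def score_stream(tokens, model_dists, bos_id):
    """Blended probability of each tokens[i], i >= 1, given tokens[i - 1].

    model_dists[i] is the model's full distribution for position i.
    """
    cache = BigramCache()
    blended = []
    for i in range(1, len(tokens)):
        prev, target = tokens[i - 1], tokens[i]
        if prev == bos_id:
            cache.reset()  # assumption: a new document starts right after BOS
        dist = model_dists[i]
        entropy = -sum(p * math.log(p) for p in dist if p > 0)
        max_entropy = math.log(len(dist))
        blended.append(cache.blend(prev, target, dist[target], entropy, max_entropy))
        cache.update(prev, target)  # score first, then update
    return blended


# Toy usage: vocabulary of 4 tokens, token 0 acts as BOS, uniform model.
tokens = [0, 1, 2, 1, 2, 0, 1, 2]
uniform = [[0.25] * 4 for _ in tokens]
probs = score_stream(tokens, uniform, bos_id=0)
```

In the toy run, the second occurrence of the bigram `(1, 2)` gets a probability slightly above the model's 0.25, while the same bigram after the second BOS falls back to 0.25 because the counts were cleared. Since both `p_model` and `p_cache` are normalized distributions, the blend stays normalized.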

## Architecture

DominationV2 stack:

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 4KV) |
| MLP | 3x relu² |
| U-Net | 5 encoder + 6 decoder with skip connections |
| XSA | Last 4 layers |
| SmearGate | Per-dimension blend with previous token |
| BigramHash | 2048 buckets, dim=128 |
| OrthoInit | Orthogonal init with depth scaling |
| EMA | Decay=0.997 |
| Quantization | Mixed int6/int8 + zstd-22 |
| TTT | 3 epochs, lr=1e-4 |

### Cache Settings

| Parameter | Value |
|-----------|-------|
| CACHE_ALPHA | 0.20 |
| CACHE_TAU | 8.0 |
| CACHE_ENTROPY_POWER | 1.0 |
| Eval stride | 64 |
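
To illustrate what an eval stride of 64 means, the sketch below enumerates sliding windows so that every token is scored exactly once, each with as much left context as the window allows. The `sliding_windows` helper and the window length used in the example are hypothetical; the submission's actual evaluation loop may differ.

```python
def sliding_windows(n_tokens, seq_len, stride):
    """Return (start, end, score_from) triples.

    The model would be run on tokens[start:end]; only tokens[score_from:end]
    are scored, since earlier tokens were scored by a previous window.
    """
    spans = []
    scored_up_to = 0
    end = min(seq_len, n_tokens)
    while True:
        start = max(0, end - seq_len)
        spans.append((start, end, scored_up_to))
        scored_up_to = end
        if end >= n_tokens:
            break
        end = min(end + stride, n_tokens)
    return spans


# Example with a hypothetical 128-token window and the stride from the table.
spans = sliding_windows(n_tokens=200, seq_len=128, stride=64)
for start, end, score_from in spans:
    print(f"context [{start}:{end}] scores [{score_from}:{end}]")
```

A smaller stride gives each scored token more context at the cost of more forward passes, which is consistent with sliding-window evaluation dominating the eval time budget.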

## Run Command

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024
pip install zstandard

cd records/track_10min_16mb/2026-03-27_DominationV2_BigramCache_TTT

DATA_PATH=../../../data/datasets/fineweb10B_sp1024 \
TOKENIZER_PATH=../../../data/tokenizers/fineweb_1024_bpe.model \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Credits

- DominationV2 base: built on upstream PR #64 and PR #198
- Bigram cache: inspired by classical cache language models (Grave et al., 2016)
- TTT: adapted from PR #461
@@ -0,0 +1,17 @@
{
  "author": "Shouryamaan Jain",
  "github_id": "shouryamaanjain",
  "name": "DominationV2 + BOS-Reset Bigram Cache + TTT",
  "blurb": "DominationV2 (11L, 3x relu², XSA-4, EMA, SmearGate, BigramHash, OrthoInit, mixed int6/int8 + zstd-22) with eval-time BOS-reset bigram cache after TTT. Cache builds document-local bigram counts from already-scored tokens, blended with model probabilities gated by entropy. 3-seed mean: 1.1382 (std 0.0010).",
  "date": "2026-03-27",
  "val_loss": 1.91991417,
  "val_bpb": 1.13708132,
  "bytes_total": 15504722,
  "seeds": {
    "1337": {"val_loss": 1.91991417, "val_bpb": 1.13708132, "bytes": 15504722},
    "42": {"val_loss": 1.92231620, "val_bpb": 1.13850394, "bytes": 15579418},
    "2025": {"val_loss": 1.92302940, "val_bpb": 1.13892633, "bytes": 15505762}
  },
  "mean_bpb": 1.1382,
  "std_bpb": 0.0010
}
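
The `mean_bpb` and `std_bpb` fields can be reproduced from the per-seed values with a few lines of Python. The snippet below is only a consistency check on the reported numbers; it suggests the standard deviation is the sample (n-1) form, which is an inference from the arithmetic rather than something the submission states.

```python
import statistics

# Per-seed val_bpb values copied from the "seeds" field above.
seed_bpb = {
    "1337": 1.13708132,
    "42": 1.13850394,
    "2025": 1.13892633,
}

values = list(seed_bpb.values())
mean_bpb = statistics.mean(values)
sample_std = statistics.stdev(values)       # n - 1 in the denominator
population_std = statistics.pstdev(values)  # n in the denominator

print(f"mean_bpb   = {mean_bpb:.4f}")
print(f"sample std = {sample_std:.4f}")
print(f"pop. std   = {population_std:.4f}")
```

The sample standard deviation rounds to the reported 0.0010, while the population form rounds to 0.0008. Note also that the top-level `val_bpb` and `val_loss` fields match seed 1337 rather than the 3-seed mean.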