Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# Record: 11L XSA-all + LeakyReLU(0.5)^2 + VR + GA (val_bpb=1.1164)

**val_bpb = 1.1164** (single seed, pending 3-seed validation) | **~15.94 MB** | No TTT

## Summary

Non-TTT submission combining XSA on all 11 layers with LeakyReLU(0.5)^2 activation, Value Residual, and Gated Attention. Achieves 1.1164 BPB on single GPU (7500 steps), within 0.001 of the non-TTT SOTA (1.1154, PR #609).

**Requesting compute grant for 8xH100 3-seed validation.**

## Single-GPU Results (1xH100 NVL 96GB, 7500 steps)

| Metric | Value |
|--------|-------|
| Raw val_bpb | 1.1338 |
| Int6 roundtrip val_bpb | 1.1401 |
| **Int6 sliding val_bpb (s=64)** | **1.1164** |
| Artifact size | 15,941,860 bytes |
| Step avg | 1064 ms |
| Quantization gap | 0.006 BPB |

## Architecture

- 11L, 512d, 8H/4KV (GQA), MLP 3x
- **LeakyReLU(0.5)^2**: `leaky_relu(x, 0.5).square()` replaces ReLU^2. Preserves negative gradient flow, -0.003 BPB vs ReLU^2 at zero overhead.
- **XSA on all 11 layers**: Exclusive Self-Attention removes self-position bias in all layers (not just last 4). -0.006 BPB vs XSA-last-4.
- **Value Residual (VR)**: Layer 0 V output mixed into subsequent layers via learned sigmoid gates. -0.002 BPB.
- **Gated Attention (GA)**: Per-head sigmoid gates on attention output.
- SmearGate + OrthoInit, BigramHash(4096), U-Net skip connections
- Partial RoPE (16/64 dims), LN Scale, EMA(0.997)
- Int6 per-row quantization + zstd-21 compression

## Key Techniques vs Baseline

| Technique | BPB Impact | Source |
|-----------|-----------|--------|
| LeakyReLU(0.5)^2 | -0.003 | PR #493, #518 |
| XSA-all (11 layers) | -0.006 | PR #609 |
| Value Residual + Gated Attention | -0.002 | PR #413, #487 |
| Warmdown 3500 (vs 3000) | ~-0.001 | Hyperparameter tuning |

## Training Config

```bash
ITERATIONS=7500 WARMDOWN_ITERS=3500 MAX_WALLCLOCK_SECONDS=0
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500
XSA_LAST_N=11 LEAKY_RELU=1 TTT_ENABLED=0 CANON_LAST_N=0 SWA_ENABLED=0
```

## Reproduction

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
SEED=1337 XSA_LAST_N=11 LEAKY_RELU=1 WARMDOWN_ITERS=3500 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
TTT_ENABLED=0 CANON_LAST_N=0 SWA_ENABLED=0 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Negative Results

Techniques tested on this stack that did not help:

| Technique | Result | Why |
|-----------|--------|-----|
| Full GPTQ (Hessian-aware) | +0.029 BPB | Requires Parameter Banking (3D weight tensors) |
| Tight SWA (EMA+SWA stacked) | +0.005 BPB | Doubles quantization gap (0.006 to 0.012) |
| Remove VR+GA | +0.002 BPB | VR+GA still beneficial even with XSA-all |
| MATRIX_LR=0.04 | +0.018 BPB | 4x worse quantization gap |
| Canon AC (last 5 layers) | +0.017 BPB | Conflicts with VR+GA |
| Star-ReLU MLP=1536 | +0.010 BPB | Worse than ReLU^2 at same width |

## Credits

- Base architecture: modded-nanogpt, PR #315 (jfprincz)
- XSA-all: PR #609
- LeakyReLU^2: PR #493 (parinzee), PR #518 (sofiabod)
- Value Residual: PR #413 (arXiv:2410.17897)
- Gated Attention: NeurIPS 2025, arXiv:2505.06708
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
{
"val_bpb": 1.1164,
"val_loss": 1.885,
"bytes_total": 15941860,
"bytes_code": 77679,
"bytes_model": 15864181,
"num_params": 27137223,
"training_time_seconds": 600,
"num_gpus": 8,
"gpu_type": "H100 SXM",
"eval_method": "sliding_window_stride64",
"quantization": "int6_per_row_zstd21",
"author": "Asukabot0",
"notes": "Single-GPU 7500-step result. Pending 8xH100 3-seed validation.",
"seed": 1337,
"single_gpu_steps": 7500,
"single_gpu_step_avg_ms": 1064
}
Loading