**README.md**
@@ -0,0 +1,62 @@
## Record: 11L Depth Recurrence + EMA Tuning (0.9965) (val_bpb: 1.0925)

**val_bpb: 1.0925** (sliding window stride=64, 3-seed mean) | **15.95 MB** (mean) | 8xH100 SXM, 590s

### Key Innovation Over PR #1334

Hyperparameter refinement on the EMA decay constant, built on the depth recurrence architecture of PR #1334 (@aryanbhosale):

> **Copilot AI** (Apr 6, 2026), on lines +1 to +8:
>
> The README labels this as a “Record” and frames it as an improvement over PR #1334, but the PR metadata you reference lists PR #1334 with a lower (better) val_bpb (1.0897). Please clarify the baseline/track comparison or adjust the wording so the record claim is unambiguous and consistent with the referenced results.
| Change | PR #1334 | This PR | Impact |
|--------|----------|---------|--------|
| **EMA decay** | 0.997 | 0.9965 | Stabilized post-quantization performance, reduced destructive pruning |

> **Copilot AI** (Apr 6, 2026):
>
> Markdown table formatting uses double leading pipes (||) which renders as an empty first column on GitHub. Use single pipes (|) for standard table syntax so the comparison table renders correctly.
### EMA Decay Tuning

By lowering the EMA decay from 0.997 to 0.9965, the exponential moving average assigns slightly more weight to recent training steps. This produces a final checkpoint that quantizes more cleanly under GPTQ int6, reducing the number of values that require selective pruning (~290K here, fewer than the PR #1334 baseline).
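For concreteness, here is a minimal sketch of the per-step EMA update (function and variable names are illustrative, not the repo's actual API). Lowering decay from 0.997 to 0.9965 shrinks the effective averaging window 1/(1 - decay) from ~333 to ~286 steps, which is what shifts weight toward recent checkpoints:

```python
import torch

EMA_DECAY = 0.9965  # tuned down from PR #1334's 0.997

@torch.no_grad()
def update_ema(ema_params, model_params, decay=EMA_DECAY):
    # Each training step, the shadow weights move a fraction (1 - decay)
    # toward the live weights; a smaller decay tracks recent steps more closely.
    for ema_p, p in zip(ema_params, model_params):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```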

### Results (3 seeds, 8xH100 SXM)

| Seed | Pre-quant BPB | Sliding BPB (stride 64) | Artifact size (bytes) |
|------|---------------|-------------------------|-----------------------|
| 42 | 1.0965 | **1.0921** | 15,954,858 |
| 1337 | 1.0973 | **1.0928** | 15,959,674 |
| 2024 | 1.0969 | **1.0926** | 15,948,766 |

**Mean: 1.0925 | Std: 0.0004** | All artifacts under 16,000,000 bytes
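For reference, a sketch of how a stride-64 sliding-window bits-per-byte evaluation is typically computed (this is the standard recipe, not necessarily the repo's exact implementation; the ragged final window is glossed over):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, total_bytes, seq_len=2048, stride=64):
    # Slide a seq_len window forward `stride` tokens at a time; after the
    # first window, score only the newly revealed tail so every scored
    # token sees up to seq_len - stride tokens of left context.
    nll_sum = 0.0
    for begin in range(0, tokens.numel() - seq_len + 1, stride):
        ids = tokens[begin : begin + seq_len].unsqueeze(0)
        logits = model(ids)  # (1, seq_len, vocab_size)
        logp = F.log_softmax(logits[0, :-1].float(), dim=-1)
        nll = -logp.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
        tail = nll if begin == 0 else nll[-stride:]
        nll_sum += tail.sum().item()
    return nll_sum / math.log(2) / total_bytes  # nats -> bits, per UTF-8 byte
```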

### Architecture (from PR #1334)

- 11 transformer layers, 512-dim, 8 heads (4 KV heads, GQA)
- Depth recurrence: layers 4,5 repeat (virtual 13 layers), activated at step 3000 (see the sketch after this list)
- Skip gates (learnable residual gating)
- Shared Value Embedding (dim=128, layers 9,10)
- Tied embeddings, logit softcap=30.0
- SP4096 tokenizer (SentencePiece BPE)
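One plausible reading of the depth recurrence, as a minimal sketch (class and attribute names are hypothetical; whether the repeated blocks run back-to-back or as a repeated 4-5 pair is a PR #1334 implementation detail):

```python
import torch.nn as nn

class DepthRecurrentStack(nn.Module):
    # 11 physical blocks; once recurrence activates (step 3000), blocks
    # 4 and 5 each run twice, giving 13 "virtual" layers with no extra weights.
    def __init__(self, make_block, n_layers=11, repeat_idx=(4, 5)):
        super().__init__()
        self.blocks = nn.ModuleList([make_block() for _ in range(n_layers)])
        self.repeat_idx = set(repeat_idx)
        self.recurrence_on = False  # training loop flips this at step 3000

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            x = block(x)
            if self.recurrence_on and i in self.repeat_idx:
                x = block(x)  # second pass through the same weights
        return x
```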

### Training

- FlashAttention 3 (Hopper-optimized)
- Muon optimizer (matrices): lr=0.02, momentum=0.99, WD=0.09, backend_steps=5 (parameter grouping sketched after this list)
- Adam (head params): lr=0.008, fused=True
- AdamW (embeddings): lr=0.6, WD=0.09, fused=True
- AdamW (scalars): lr=0.02, WD=0.02, fused=True
- Gradient clip: 0.3
- Batch: 786,432 tokens/step, seq_len=2048
- Warmdown: 66.7% of training
- **EMA**: decay=0.9965, every step
- Wallclock cap: 600s (590s effective, 10s reserved for GPTQ)
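A hedged sketch of the four-way parameter grouping the bullets above imply. `TinyGPT` is a stand-in for the 11L/512-dim model, and `Muon` is assumed to be the modded-nanogpt-style optimizer the recipe names; its exact constructor kwargs vary by version:

```python
import torch
import torch.nn as nn
from muon import Muon  # assumed import; Muon as in modded-nanogpt

class TinyGPT(nn.Module):  # stand-in for the 11L / 512-dim model above
    def __init__(self, vocab=4096, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(2)])
        self.softcap_scale = nn.Parameter(torch.ones(()))
        self.lm_head = nn.Linear(dim, vocab, bias=False)

model = TinyGPT().cuda()  # fused=True below requires CUDA tensors
named = list(model.named_parameters())
matrix = [p for n, p in named if p.ndim >= 2 and "embed" not in n and "lm_head" not in n]
head   = [p for n, p in named if "lm_head" in n]
embeds = [p for n, p in named if "embed" in n]
scalar = [p for n, p in named if p.ndim < 2]

optimizers = [
    Muon(matrix, lr=0.02, momentum=0.99, weight_decay=0.09, backend_steps=5),
    torch.optim.Adam(head, lr=0.008, fused=True),
    torch.optim.AdamW(embeds, lr=0.6, weight_decay=0.09, fused=True),
    torch.optim.AdamW(scalar, lr=0.02, weight_decay=0.02, fused=True),
]
```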

### Quantization

- GPTQ int6 with percdamp=0.05, 64 calibration batches
- Selective pruning of lowest-error values to fit 16MB
- Brotli compression
- ~290K values pruned (minimal impact)
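A toy sketch of the final packaging check only (the GPTQ solve, 6-bit packing, and error-ranked pruning are elided; the payload here is random stand-in data):

```python
import brotli
import numpy as np

# Stand-in for the packed int6 weight stream produced by GPTQ.
payload = np.random.randint(-32, 32, size=4_000_000, dtype=np.int8).tobytes()

blob = brotli.compress(payload, quality=11)  # max-quality brotli, as for the artifact
assert len(blob) < 16_000_000, f"over budget: {len(blob):,} bytes"
print(f"artifact: {len(blob):,} bytes (budget 16,000,000)")
```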

### Reproducibility

All 3 seeds produce valid artifacts under 16MB with tight variance (std=0.0004 BPB). Training completes in ~590s with ~5200-5400 steps depending on seed.

### Attribution

Base architecture and training recipe from PR #1334 by @aryanbhosale.
**submission.json**
@@ -0,0 +1,10 @@
{
"author": "Abhishek Leji",
"github_id": "X-Abhishek-X",
"name": "Record: 11L Depth Recurrence + EMA Tuning (0.9965)",
"blurb": "EMA decay tuned to 0.9965 for stabilized post-quantization performance, built on PR #1334 (aryanbhosale) depth recurrence architecture (11L, skip gates, VE128, GPTQ int6+brotli, sliding window eval).",
"date": "2026-04-06T00:00:00Z",
"val_loss": 2.51365112,
"val_bpb": 1.09254468,
> **Copilot AI** (Apr 6, 2026):
>
> submission.json reports val_bpb=1.09254468, but the three included final_int6_sliding_window val_bpb values in the logs (1.09211068, 1.09276612, 1.09255323) average to ~1.09247668. Please reconcile this number (update val_bpb or document how it was computed).
>
> Suggested change:
> ```diff
> -  "val_bpb": 1.09254468,
> +  "val_bpb": 1.09247668,
> ```
"bytes_total": 15954433
}