Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522)#477
Pull request overview
Adds a new 10-minute / 16MB track record submission that introduces a count-initialized exact bigram logit head (with int4 nibble packing) alongside XSA, Partial RoPE, and LN scaling, plus the associated training artifacts.
Changes:
- New `train_gpt.py` implementing the exact bigram logit head init/quantization path and the model/training configuration used for the run.
- Added run artifacts (`train.log`, `submission.json`) and a README describing techniques/results.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train_gpt.py | Implements the submission’s model/training loop, count-based bigram logit initialization, and mixed int6 + int4-packed quantization export. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train.log | Captures training/eval output used to report the submission result. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/submission.json | Records final metrics and artifact sizing for the submission. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/README.md | Documents the technique stack, run command, quantization scheme, and results. |
```python
raise ValueError("head_dim must be even for RoPE")
self.use_xsa = use_xsa
# Partial RoPE: only rotate rope_dims dims (0 = all dims)
self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
```
rope_dims is used to build the Rotary cache and to slice Q/K for RoPE, but it isn’t validated/clamped. If ROPE_DIMS is odd or larger than head_dim, Rotary(dim=rope_dims) will produce cos/sin tables that don’t match the sliced tensors, causing a runtime shape error. Consider clamping to min(rope_dims, head_dim) and enforcing rope_dims % 2 == 0 (or defaulting to the nearest lower even value).
Suggested change: replace

```python
self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
```

with:

```python
# Interpret rope_dims <= 0 as "all dims", then clamp to head_dim and enforce even size.
requested_rope_dims = self.head_dim if rope_dims <= 0 else rope_dims
max_valid_dims = min(requested_rope_dims, self.head_dim)
# Ensure rope dimensions are even to match Rotary/cache shapes.
if max_valid_dims % 2 != 0:
    max_valid_dims -= 1
if max_valid_dims <= 0:
    raise ValueError(
        f"Invalid rope_dims={rope_dims}: must be positive and at least 2, "
        f"and no larger than head_dim={self.head_dim}."
    )
self.rope_dims = max_valid_dims
```
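For context, the behavior being suggested can be sketched end to end. This is a minimal NumPy illustration of partial RoPE with the proposed clamp-and-round-down validation; the function name, the first-half/second-half pair split, and the base frequency are assumptions for illustration, not the submission's actual code:

```python
import numpy as np

def partial_rope(x: np.ndarray, rope_dims: int, base: float = 10000.0) -> np.ndarray:
    """Rotate only the first `rope_dims` dims of each position; pass the rest through.

    x: (seq_len, head_dim). The pair-split convention (first half vs. second half)
    is an assumption; the actual train_gpt.py may slice differently.
    """
    seq_len, head_dim = x.shape
    # Clamp to head_dim and round down to even, mirroring the suggested validation.
    rope_dims = min(rope_dims if rope_dims > 0 else head_dim, head_dim)
    rope_dims -= rope_dims % 2
    x_rot, x_pass = x[:, :rope_dims], x[:, rope_dims:]
    half = rope_dims // 2
    inv_freq = base ** (-np.arange(half) / half)           # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    # Un-rotated tail dims are concatenated back unchanged.
    return np.concatenate([rotated, x_pass], axis=-1)
```

Because the rotation is norm-preserving on the rotated slice and an identity on the tail, a mismatched `rope_dims` here degrades gracefully instead of producing the cos/sin shape error described above.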
```python
ema_enabled = bool(int(os.environ.get("EMA_ENABLED", "0")))
ema_decay = float(os.environ.get("EMA_DECAY", 0.997))
```

```python
# Exact bigram logit head (replaces BigramHash when enabled)
```
The comment says the exact bigram logit head “replaces BigramHash when enabled”, but the model still instantiates/uses BigramHashEmbedding whenever bigram_vocab_size > 0 (default 4096), even if BIGRAM_LOGIT_HEAD=1. Either update the comment (and README if needed) or conditionally disable self.bigram when bigram_logit_head is enabled to match the stated behavior.
Suggested change: replace

```python
# Exact bigram logit head (replaces BigramHash when enabled)
```

with:

```python
# Exact bigram logit head (can be enabled alongside BigramHash)
```
Summary

Novel Contributions

Count-Initialized Exact Bigram Logit Head
A 1024x1024 lookup table initialized from corpus bigram transition probabilities (`B[a,b] = log p(b|a) - log p(b)`) before training. Provides a strong Markov prior from step 0 — the neural network only needs to refine it. Applied BEFORE logit softcap. No other submission uses this approach.

Int4 Nibble Packing
Custom `pack_i4`/`unpack_i4` functions pack signed int4 values into uint8 bytes (two values per byte). Applied to the bigram logit table, halving its storage from ~1MB to ~524KB.

Adopted Techniques (from published papers/PRs)
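The two novel pieces above can be sketched together in a few lines of NumPy. This is an illustrative version only: the Laplace smoothing, dtypes, and low-nibble-first byte layout are assumptions, and the helper names merely mirror the `pack_i4`/`unpack_i4` described above rather than reproducing the submission's actual implementation:

```python
import numpy as np

def count_init_bigram_table(tokens: np.ndarray, vocab: int, smoothing: float = 1.0) -> np.ndarray:
    """Build B[a, b] = log p(b|a) - log p(b) from corpus bigram counts.

    Smoothing choice and dtypes are assumptions for this sketch.
    """
    counts = np.full((vocab, vocab), smoothing, dtype=np.float64)
    # Accumulate bigram transition counts over consecutive token pairs.
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    log_p_b_given_a = np.log(counts / counts.sum(axis=1, keepdims=True))
    unigram = counts.sum(axis=0)
    log_p_b = np.log(unigram / unigram.sum())
    return (log_p_b_given_a - log_p_b[None, :]).astype(np.float32)

def pack_i4(vals: np.ndarray) -> np.ndarray:
    """Pack signed int4 values in [-8, 7] into uint8 bytes, two per byte (low nibble first)."""
    nib = (vals.astype(np.int16) & 0xF).astype(np.uint8)
    return (nib[0::2] | (nib[1::2] << 4)).astype(np.uint8)

def unpack_i4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_i4: recover signed int4 values from packed bytes."""
    nib = np.stack([packed & 0xF, packed >> 4], axis=1).reshape(-1).astype(np.int8)
    return np.where(nib >= 8, nib - 16, nib)
```

At vocab 1024, a 1024x1024 int4 table packs into 1024*1024/2 = 524,288 bytes, consistent with the ~524KB figure above.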
Results
Test plan
Built on SOTA baseline by @thwu1 (PR #180).
Generated with Claude Code