
Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522) #477

Closed
harsha-gouru wants to merge 1 commit into openai:main from harsha-gouru:submission/2026-03-22_CountInitBigram_XSA_PartialRoPE

Conversation

@harsha-gouru

Summary

  • val_bpb: 1.1522 (sliding window stride=64, post int5/int6+zstd quantization roundtrip)
  • 10 layers, 512 dim, 8 heads / 4 KV heads, tied embeddings
  • Artifact: 15,384,232 bytes (15.38 MB)

Novel Contributions

Count-Initialized Exact Bigram Logit Head

A 1024x1024 lookup table initialized from corpus bigram transition probabilities (B[a,b] = log p(b|a) - log p(b)) before training. This provides a strong Markov prior from step 0, so the neural network only needs to refine it. The table is applied before the logit softcap. No other submission uses this approach.
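The count-based initialization described above could be sketched as follows (a minimal illustration; the function name, smoothing scheme, and use of NumPy are assumptions, not the submission's actual train_gpt.py code):

```python
import numpy as np

def count_init_bigram_table(tokens: np.ndarray, vocab_size: int,
                            smoothing: float = 1.0) -> np.ndarray:
    """Build B[a, b] = log p(b|a) - log p(b) from corpus bigram counts.

    Add-one style smoothing (an assumption here) keeps unseen bigrams finite.
    """
    counts = np.full((vocab_size, vocab_size), smoothing, dtype=np.float64)
    # Accumulate counts for every adjacent token pair (a, b) in the corpus.
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    # Conditional log-probability log p(b|a): normalize each row.
    log_p_b_given_a = np.log(counts) - np.log(counts.sum(axis=1, keepdims=True))
    # Marginal log-probability log p(b): normalize the column totals.
    unigram = counts.sum(axis=0)
    log_p_b = np.log(unigram) - np.log(unigram.sum())
    return (log_p_b_given_a - log_p_b[None, :]).astype(np.float32)
```

Subtracting log p(b) centers the table so that tokens merely common overall get no boost; only genuinely predictive transitions carry positive logits.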

Int4 Nibble Packing

Custom pack_i4/unpack_i4 functions pack signed int4 values into uint8 bytes (two values per byte). Applied to the bigram logit table, halving its storage from ~1MB to ~524KB.
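A minimal sketch of such nibble packing (the semantics are assumed: signed int4 values in [-8, 7] stored as two's-complement nibbles, even-index value in the low nibble):

```python
import numpy as np

def pack_i4(x: np.ndarray) -> np.ndarray:
    """Pack signed int4 values (range [-8, 7]) two-per-byte into uint8."""
    assert x.size % 2 == 0, "need an even number of values"
    # Two's-complement: masking with 0x0F maps e.g. -1 -> 15, -8 -> 8.
    u = (x.astype(np.int8) & 0x0F).astype(np.uint8)
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_i4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_i4: recover the signed int4 values."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    # Sign-extend: nibbles >= 8 represent negative values.
    return np.where(out >= 8, out - 16, out).astype(np.int8)
```

The roundtrip is lossless, so the only cost of halving the table's storage is the pack/unpack pass at export/load time.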

Adopted Techniques (from published papers/PRs)

Results

Steps: 6267 at 95.75 ms/step (8xH100 SXM, 600s wallclock)
Pre-SWA val_bpb: 1.1563
Post-SWA+quant val_bpb: 1.1522
Artifact: 15.38 MB (0.62 MB headroom)

Test plan

  • Verified artifact under 16MB limit
  • Verified post-quant roundtrip bpb matches
  • Ran on 8xH100 SXM within 600s wallclock
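A quantization-roundtrip check like the one in the test plan could be sketched as below (assumptions throughout: symmetric per-tensor int6 quantization, and stdlib zlib standing in for the zstd used by the submission):

```python
import zlib
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization: map floats to [-31, 31]."""
    scale = max(float(np.abs(w).max()) / 31.0, 1e-12)  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def roundtrip_check(w: np.ndarray) -> bool:
    """Compress, decompress, dequantize, and bound the reconstruction error."""
    q, scale = quantize_int6(w)
    blob = zlib.compress(q.tobytes())  # zstd in the actual submission
    q2 = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
    w2 = q2.astype(np.float32) * scale
    # Round-to-nearest error is at most half a quantization step.
    return float(np.abs(w - w2).max()) <= 0.5 * scale + 1e-6
```

Since compression is lossless, any bpb drift after the roundtrip must come from the quantization step itself, which this bound pins down.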

Built on SOTA baseline by @thwu1 (PR #180).

Generated with Claude Code


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 23, 2026 00:15
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10-minute / 16MB track record submission that introduces a count-initialized exact bigram logit head (with int4 nibble packing) alongside XSA, Partial RoPE, and LN scaling, plus the associated training artifacts.

Changes:

  • New train_gpt.py implementing the exact bigram logit head init/quantization path and the model/training configuration used for the run.
  • Added run artifacts (train.log, submission.json) and a README describing techniques/results.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train_gpt.py | Implements the submission's model/training loop, count-based bigram logit initialization, and mixed int6 + int4-packed quantization export. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train.log | Captures training/eval output used to report the submission result. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/submission.json | Records final metrics and artifact sizing for the submission. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/README.md | Documents the technique stack, run command, quantization scheme, and results. |


```python
            raise ValueError("head_dim must be even for RoPE")
        self.use_xsa = use_xsa
        # Partial RoPE: only rotate rope_dims dims (0 = all dims)
        self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
```

Copilot AI Mar 23, 2026


rope_dims is used to build the Rotary cache and to slice Q/K for RoPE, but it isn’t validated/clamped. If ROPE_DIMS is odd or larger than head_dim, Rotary(dim=rope_dims) will produce cos/sin tables that don’t match the sliced tensors, causing a runtime shape error. Consider clamping to min(rope_dims, head_dim) and enforcing rope_dims % 2 == 0 (or defaulting to the nearest lower even value).

Suggested change:

```diff
-self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
+# Interpret rope_dims <= 0 as "all dims", then clamp to head_dim and enforce even size.
+requested_rope_dims = self.head_dim if rope_dims <= 0 else rope_dims
+max_valid_dims = min(requested_rope_dims, self.head_dim)
+# Ensure rope dimensions are even to match Rotary/cache shapes.
+if max_valid_dims % 2 != 0:
+    max_valid_dims -= 1
+if max_valid_dims <= 0:
+    raise ValueError(
+        f"Invalid rope_dims={rope_dims}: must be positive and at least 2, "
+        f"and no larger than head_dim={self.head_dim}."
+    )
+self.rope_dims = max_valid_dims
```

```python
ema_enabled = bool(int(os.environ.get("EMA_ENABLED", "0")))
ema_decay = float(os.environ.get("EMA_DECAY", 0.997))

# Exact bigram logit head (replaces BigramHash when enabled)
```

Copilot AI Mar 23, 2026


The comment says the exact bigram logit head “replaces BigramHash when enabled”, but the model still instantiates/uses BigramHashEmbedding whenever bigram_vocab_size > 0 (default 4096), even if BIGRAM_LOGIT_HEAD=1. Either update the comment (and README if needed) or conditionally disable self.bigram when bigram_logit_head is enabled to match the stated behavior.

Suggested change:

```diff
-# Exact bigram logit head (replaces BigramHash when enabled)
+# Exact bigram logit head (can be enabled alongside BigramHash)
```

@harsha-gouru harsha-gouru deleted the submission/2026-03-22_CountInitBigram_XSA_PartialRoPE branch March 23, 2026 00:20