
Record: 10L CountInitBigram + XSA + PartialRoPE (val_bpb=1.1522) #477

Closed
harsha-gouru wants to merge 1 commit into openai:main from harsha-gouru:submission/2026-03-22_CountInitBigram_XSA_PartialRoPE

Conversation

@harsha-gouru

Summary

  • val_bpb: 1.1522 (sliding window stride=64, post int5/int6+zstd quantization roundtrip)
  • 10 layers, 512 dim, 8 heads / 4 KV heads, tied embeddings
  • Artifact: 15,384,232 bytes (15.38 MB)

Novel Contributions

Count-Initialized Exact Bigram Logit Head

A 1024x1024 lookup table initialized from corpus bigram transition probabilities (B[a,b] = log p(b|a) - log p(b)) before training. This provides a strong Markov prior from step 0, so the neural network only needs to refine it. The table is applied before the logit softcap. No other submission uses this approach.
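The count-based initialization described above could be sketched as follows (a minimal illustration; the function name, smoothing scheme, and use of NumPy are assumptions, not the submission's actual train_gpt.py code):

```python
import numpy as np

def count_init_bigram_table(tokens: np.ndarray, vocab_size: int,
                            smoothing: float = 1.0) -> np.ndarray:
    """Build B[a, b] = log p(b|a) - log p(b) from corpus bigram counts.

    Add-one style smoothing (an assumption here) keeps unseen bigrams finite.
    """
    counts = np.full((vocab_size, vocab_size), smoothing, dtype=np.float64)
    # Accumulate counts for every adjacent token pair (a, b) in the corpus.
    np.add.at(counts, (tokens[:-1], tokens[1:]), 1.0)
    # Conditional log-probability log p(b|a): normalize each row.
    log_p_b_given_a = np.log(counts) - np.log(counts.sum(axis=1, keepdims=True))
    # Marginal log-probability log p(b): normalize the column totals.
    unigram = counts.sum(axis=0)
    log_p_b = np.log(unigram) - np.log(unigram.sum())
    return (log_p_b_given_a - log_p_b[None, :]).astype(np.float32)
```

Subtracting log p(b) centers the table so that tokens merely common overall get no boost; only genuinely predictive transitions carry positive logits.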

Int4 Nibble Packing

Custom pack_i4/unpack_i4 functions pack signed int4 values into uint8 bytes (two values per byte). Applied to the bigram logit table, halving its storage from ~1MB to ~524KB.
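A minimal sketch of such nibble packing (the semantics are assumed: signed int4 values in [-8, 7] stored as two's-complement nibbles, even-index value in the low nibble):

```python
import numpy as np

def pack_i4(x: np.ndarray) -> np.ndarray:
    """Pack signed int4 values (range [-8, 7]) two-per-byte into uint8."""
    assert x.size % 2 == 0, "need an even number of values"
    # Two's-complement: masking with 0x0F maps e.g. -1 -> 15, -8 -> 8.
    u = (x.astype(np.int8) & 0x0F).astype(np.uint8)
    return (u[0::2] | (u[1::2] << 4)).astype(np.uint8)

def unpack_i4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_i4: recover the signed int4 values."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    # Sign-extend: nibbles >= 8 represent negative values.
    return np.where(out >= 8, out - 16, out).astype(np.int8)
```

The roundtrip is lossless, so the only cost of halving the table's storage is the pack/unpack pass at export/load time.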

Adopted Techniques (from published papers/PRs)

Results

Steps: 6267 at 95.75 ms/step (8xH100 SXM, 600s wallclock)
Pre-SWA val_bpb: 1.1563
Post-SWA+quant val_bpb: 1.1522
Artifact: 15.38 MB (0.62 MB headroom)

Test plan

  • Verified artifact under 16MB limit
  • Verified post-quant roundtrip bpb matches
  • Ran on 8xH100 SXM within 600s wallclock
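A quantization-roundtrip check like the one in the test plan could be sketched as below (assumptions throughout: symmetric per-tensor int6 quantization, and stdlib zlib standing in for the zstd used by the submission):

```python
import zlib
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-tensor int6 quantization: map floats to [-31, 31]."""
    scale = max(float(np.abs(w).max()) / 31.0, 1e-12)  # guard all-zero tensors
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def roundtrip_check(w: np.ndarray) -> bool:
    """Compress, decompress, dequantize, and bound the reconstruction error."""
    q, scale = quantize_int6(w)
    blob = zlib.compress(q.tobytes())  # zstd in the actual submission
    q2 = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(w.shape)
    w2 = q2.astype(np.float32) * scale
    # Round-to-nearest error is at most half a quantization step.
    return float(np.abs(w - w2).max()) <= 0.5 * scale + 1e-6
```

Since compression is lossless, any bpb drift after the roundtrip must come from the quantization step itself, which this bound pins down.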

Built on SOTA baseline by @thwu1 (PR #180).

Generated with Claude Code


Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 23, 2026 00:15
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 10-minute / 16MB track record submission that introduces a count-initialized exact bigram logit head (with int4 nibble packing) alongside XSA, Partial RoPE, and LN scaling, plus the associated training artifacts.

Changes:

  • New train_gpt.py implementing the exact bigram logit head init/quantization path and the model/training configuration used for the run.
  • Added run artifacts (train.log, submission.json) and a README describing techniques/results.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train_gpt.py | Implements the submission's model/training loop, count-based bigram logit initialization, and mixed int6 + int4-packed quantization export. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/train.log | Captures training/eval output used to report the submission result. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/submission.json | Records final metrics and artifact sizing for the submission. |
| records/track_10min_16mb/2026-03-22_10L_CountInitBigram_XSA_PartialRoPE/README.md | Documents the technique stack, run command, quantization scheme, and results. |


```python
            raise ValueError("head_dim must be even for RoPE")
        self.use_xsa = use_xsa
        # Partial RoPE: only rotate rope_dims dims (0 = all dims)
        self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
```

Copilot AI Mar 23, 2026


rope_dims is used to build the Rotary cache and to slice Q/K for RoPE, but it isn’t validated/clamped. If ROPE_DIMS is odd or larger than head_dim, Rotary(dim=rope_dims) will produce cos/sin tables that don’t match the sliced tensors, causing a runtime shape error. Consider clamping to min(rope_dims, head_dim) and enforcing rope_dims % 2 == 0 (or defaulting to the nearest lower even value).

Suggested change:

```diff
-self.rope_dims = rope_dims if rope_dims > 0 else self.head_dim
+# Interpret rope_dims <= 0 as "all dims", then clamp to head_dim and enforce even size.
+requested_rope_dims = self.head_dim if rope_dims <= 0 else rope_dims
+max_valid_dims = min(requested_rope_dims, self.head_dim)
+# Ensure rope dimensions are even to match Rotary/cache shapes.
+if max_valid_dims % 2 != 0:
+    max_valid_dims -= 1
+if max_valid_dims <= 0:
+    raise ValueError(
+        f"Invalid rope_dims={rope_dims}: must be positive and at least 2, "
+        f"and no larger than head_dim={self.head_dim}."
+    )
+self.rope_dims = max_valid_dims
```

```python
ema_enabled = bool(int(os.environ.get("EMA_ENABLED", "0")))
ema_decay = float(os.environ.get("EMA_DECAY", 0.997))

# Exact bigram logit head (replaces BigramHash when enabled)
```

Copilot AI Mar 23, 2026


The comment says the exact bigram logit head “replaces BigramHash when enabled”, but the model still instantiates/uses BigramHashEmbedding whenever bigram_vocab_size > 0 (default 4096), even if BIGRAM_LOGIT_HEAD=1. Either update the comment (and README if needed) or conditionally disable self.bigram when bigram_logit_head is enabled to match the stated behavior.

Suggested change:

```diff
-# Exact bigram logit head (replaces BigramHash when enabled)
+# Exact bigram logit head (can be enabled alongside BigramHash)
```

@harsha-gouru harsha-gouru deleted the submission/2026-03-22_CountInitBigram_XSA_PartialRoPE branch March 23, 2026 00:20