Implement EngramLite + Complementary Training + BackoffNgramMixer stack targeting sub-1.0 BPB #2

Draft
Copilot wants to merge 2 commits into copilot/brainstorm-novel-approaches from copilot/improve-bpb-to-sub-1-0

Conversation


Copilot AI commented Apr 1, 2026

Implements in train_gpt_mlx_kl.py the three highest-expected-value innovations from the PR #1 analysis. Combined estimated improvement: ~0.08–0.18 BPB over the 1.10 baseline, targeting sub-1.0.

EngramLite (replaces BigramHash by default)

ENGRAM_LITE_ENABLED=1 (new default). Gated multi-head bigram+trigram hash embedding.

  • K hash heads per n-gram order reduce collision rate without increasing vocab-size tables
  • Per-order learned gate initialized suppressed (sigmoid(-2) ≈ 0.12), so it grows only when lookups prove trustworthy; this suppresses the +0.0049 BPB noise that raw TrigramHash causes (Non-record: 11L XSA-all + Full GPTQ + Selective Pruning (val_bpb=1.1154, 3-seed) openai/parameter-golf#609)
  • Bottleneck projection (embed_dim=256 → vocab_size) regularizes vs. direct mapping
  • Separate ENGRAM_HASH_SIZE (default 8192) knob; BIGRAM_HASH_SIZE still controls plain BigramHash fallback
# Effective capacity: 2 orders × 2 heads × 8192 × 256 = 8.39M params + 0.26M proj
# vs. BigramHash: 16384 × 1024 = 16.78M params (direct)
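The gated multi-head hash lookup described above can be sketched as follows. This is an illustrative numpy reconstruction, not the MLX implementation: the prime multipliers, initialization scale, and mean-over-heads pooling are assumptions; only the shapes (2 orders × 2 heads × 8192 × 256 plus a 256 → vocab projection) and the sigmoid(-2) gate init come from the description.

```python
import numpy as np

HASH_SIZE, EMBED_DIM, VOCAB, N_HEADS = 8192, 256, 1024, 2
PRIMES = [1000003, 999983, 999979]  # assumed: one mixing prime per head

rng = np.random.default_rng(0)
# One (n_heads, hash_size, embed_dim) table per n-gram order (2 = bigram, 3 = trigram).
tables = {order: rng.normal(0, 0.02, (N_HEADS, HASH_SIZE, EMBED_DIM))
          for order in (2, 3)}
gates = {order: -2.0 for order in (2, 3)}       # suppressed at init: sigmoid(-2) ~ 0.12
proj = rng.normal(0, 0.02, (EMBED_DIM, VOCAB))  # bottleneck projection to logits

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ngram_hash(ctx, head):
    # Mix the context token ids with a per-head prime, then bucket.
    h = 0
    for tok in ctx:
        h = (h * PRIMES[head] + tok) % (1 << 61)
    return h % HASH_SIZE

def engram_logit_bias(tokens, t):
    """Gated bigram+trigram hash-embedding logit bias for position t."""
    bias = np.zeros(VOCAB)
    for order in (2, 3):
        if t < order - 1:
            continue  # not enough left context for this order
        ctx = tokens[t - order + 1 : t + 1]
        emb = np.mean([tables[order][k, ngram_hash(ctx, k)]
                       for k in range(N_HEADS)], axis=0)
        bias += sigmoid(gates[order]) * (emb @ proj)
    return bias

bias = engram_logit_bias([5, 17, 42, 7], t=3)
```

Averaging K heads with independent hash functions means a collision in one head is diluted by the others, which is how the collision rate drops without growing any single table.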

BackoffNgramMixer (eval-only, zero artifact cost)

NGRAM_MIXER_ENABLED=1. Causal Laplace-smoothed n-gram LM (orders 1–4, configurable) mixed with neural model at eval time.

  • Two-phase eval: (1) sliding-window neural pass stores per-absolute-position NLL; (2) sequential causal pass mixes distributions
  • NLL_mix = -log((1-α)·P_neural + α·P_ngram), which is mathematically valid because the mixed distribution still sums to 1
  • Seeded from pre-built training-data bigram stats (warm start, still causal); then updated from scored val tokens
  • NGRAM_ALPHA=0.25, NGRAM_MAX_ORDER=4
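The mixture step above reduces to a few lines. A minimal sketch, assuming toy bigram counts and a made-up neural probability (the real mixer runs orders 1–4 with backoff and is seeded from training-data stats):

```python
import numpy as np

def laplace_bigram_prob(counts, ctx, tok, vocab_size):
    # Laplace (add-one) smoothed conditional P(tok | ctx).
    row = counts.get(ctx, {})
    total = sum(row.values())
    return (row.get(tok, 0) + 1) / (total + vocab_size)

def mixed_nll(p_neural, p_ngram, alpha=0.25):
    # NLL of the convex mixture; both inputs are valid probabilities,
    # so the mixture is a valid distribution and the NLL is well defined.
    return -np.log((1 - alpha) * p_neural + alpha * p_ngram)

counts = {7: {3: 10, 5: 2}}  # toy causal bigram counts: context 7 seen 12 times
p_ng = laplace_bigram_prob(counts, ctx=7, tok=3, vocab_size=4)  # (10+1)/(12+4)
nll = mixed_nll(p_neural=0.4, p_ngram=p_ng, alpha=0.25)
```

When the n-gram model is confident on a token the neural model fumbles, the mixture NLL drops below the neural NLL alone; since the mixing happens only at eval time, it adds zero parameters to the artifact.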

Complementary Training

COMPLEMENT_ALPHA=0.5. Down-weights tokens already well-predicted by bigrams, forcing specialization on hard tokens.

  • Pre-computes bigram stats at startup via vectorized np.bincount on first training shard (~2s)
  • Per-token weight: clip(1 − α·P_bigram(target|ctx), 0.1, 1.0), mean-normalized to preserve effective LR
  • Module-level globals _COMPLEMENT_BIGRAM_PROBS/_COMPLEMENT_ALPHA captured by mx.compile at graph-trace time
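The weighting recipe above (vectorized `np.bincount` bigram stats, then `clip(1 − α·P_bigram, 0.1, 1.0)` with mean normalization) can be sketched like this; the tiny corpus and vocab size are illustrative assumptions:

```python
import numpy as np

VOCAB, ALPHA, MIN_W = 8, 0.5, 0.1

tokens = np.array([1, 2, 1, 2, 1, 3, 1, 2])  # toy training shard

# Vectorized bigram counts: flatten each (prev, next) pair into one index.
pairs = tokens[:-1] * VOCAB + tokens[1:]
counts = np.bincount(pairs, minlength=VOCAB * VOCAB).reshape(VOCAB, VOCAB)
row_tot = counts.sum(axis=1, keepdims=True)
probs = counts / np.maximum(row_tot, 1)  # P_bigram(next | prev)

def complement_weights(prev, tgt):
    # Tokens the bigram model already predicts well get weight < 1,
    # floored at MIN_W so no token is fully ignored.
    w = np.clip(1.0 - ALPHA * probs[prev, tgt], MIN_W, 1.0)
    return w / w.mean()  # mean-normalize to preserve the effective LR

w = complement_weights(tokens[:-1], tokens[1:])
```

The mean normalization is the important subtlety: without it, down-weighting easy tokens would silently shrink the average loss scale and act like a learning-rate cut.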

SkipGramHashEmbedding (disabled by default)

SKIPGRAM_HASH_SIZE=2048 to enable. Hash of non-contiguous positions (t-1,t-3) and (t-1,t-5) with per-pattern learned gating. Captures structured repetition (HTML, code, prose templates).
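A toy sketch of the skip-gram hashing over non-contiguous positions. The offset patterns (t-1, t-3) and (t-1, t-5), the 2048-entry tables, and per-pattern gating follow the description; the embedding dim, prime, and gate init are assumptions carried over from the EngramLite defaults:

```python
import numpy as np

HASH_SIZE, EMBED_DIM = 2048, 64
PATTERNS = [(1, 3), (1, 5)]  # offsets back from position t

rng = np.random.default_rng(0)
tables = rng.normal(0, 0.02, (len(PATTERNS), HASH_SIZE, EMBED_DIM))
gates = np.full(len(PATTERNS), -2.0)  # per-pattern learned gate, suppressed at init

def skipgram_embed(tokens, t):
    out = np.zeros(EMBED_DIM)
    for p, (a, b) in enumerate(PATTERNS):
        if t - b < 0:
            continue  # pattern reaches before the sequence start
        h = (tokens[t - a] * 1000003 + tokens[t - b]) % HASH_SIZE
        out += (1.0 / (1.0 + np.exp(-gates[p]))) * tables[p, h]
    return out

# Periodic repetition (e.g. alternating markup tokens) makes the skip
# patterns hit the same buckets repeatedly, which is the signal captured.
e = skipgram_embed([4, 9, 4, 9, 4, 9], t=5)
```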

Infra changes

  • GPT._apply_hash_biases() helper shared by loss() and token_losses()
  • SplitOptimizers extended to route skipgram_hash.* params (2D→Muon, 1D→Adam)
  • _MIN_COMPLEMENT_WEIGHT = 0.1, _MIN_PROB = 1e-45 named constants
  • n_heads bounds check against available primes in EngramLiteEmbedding.__init__
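The SplitOptimizers routing rule for the new `skipgram_hash.*` params boils down to a dimensionality check. A hypothetical sketch (no real optimizers are constructed; the rule "2D → Muon, 1D → Adam" is the only part taken from the description):

```python
import numpy as np

def route_params(params):
    # Matrices (ndim == 2) go to the Muon-style optimizer; vectors and
    # scalars (gates, biases) go to Adam.
    return {name: ("muon" if value.ndim == 2 else "adam")
            for name, value in params.items()}

params = {
    "skipgram_hash.table": np.zeros((2048, 64)),  # 2D -> Muon
    "skipgram_hash.gates": np.zeros(2),           # 1D -> Adam
}
routing = route_params(params)
```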
Original prompt

Analyse this Pull Request and brainstorm ways to improve and push the bpb to sub 1.0 : https://github.com/kailean/parameter-golf/pull/1/files

…Gram for sub-1.0 BPB

Agent-Logs-Url: https://github.com/kailean/parameter-golf/sessions/205d0581-0f54-4b88-b48f-39c66db315dc

Co-authored-by: kailean <49617037+kailean@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Analyze pull request and brainstorm improvements for BPB" to "Implement EngramLite + Complementary Training + BackoffNgramMixer stack targeting sub-1.0 BPB" on Apr 1, 2026
Copilot AI requested a review from kailean April 1, 2026 17:11