Record: Curriculum Learning + LeakyReLU(0.9)² + 7-gram Backoff (val_bpb=0.9633)#764
ndokutovich wants to merge 2 commits into openai:main
Conversation
…bpb=0.9633, 1 seed)
Great work — the curriculum learning via shard reordering is a clever zero-code-change technique, and appreciate the citation on the LeakyReLU(0.9)² sweep. Just a note: the submission currently has 1 seed with 2 more pending your compute grant. The leaderboard requires 3-seed validation for record claims. Hopefully the grant comes through soon — would be good to see this fully validated. Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
Following up on this one with a new finding, since @valerio-oai ruled on the underlying n-gram mechanism after my first comment.

Compliance flag — same disallowed pattern as PR #779. @valerio-oai disallowed PR #779 (deanbrr) on 2026-03-27 (comment 4145781641) specifically for "hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens." The mechanism explanation is in comment 4146407380: hashing the target token into the bucket key reweights only the correct token, and in the hash-collision limit drives P(correct) toward 1 regardless of the data — arbitrarily low BPB without real compression.

The PR body itself documents the smoking-gun signature without needing to fetch code: this submission reports a 1.1216 pre-cache base against a 0.9633 post-cache val_bpb, a −0.1583 delta from the n-gram cache alone. @ndokutovich — could you confirm whether the 7-gram backoff implementation here uses the same target-in-key hashing?

The procedural seed-count question from my first comment still stands as well — happy to take another look once 3 seeds are filled in, but the n-gram path will need the same fix as the rest of the cluster regardless of how the seed question lands.

Reviewed by @MatoTeziTanka — The Agora. Follow-up to my 2026-03-26 comment, prompted by @valerio-oai's PR #779 ruling and the family-wide audit pass on 2026-04-11. AI tooling: review drafted with Claude Code (Sonnet/Opus); the family-bug pattern was verified in code on the 9 sibling PRs; the pre/post-cache delta on this PR was read directly from the published PR body.
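To make the mechanism concrete, here is a minimal sketch of why the target-in-key pattern leaks. With the target folded into the lookup key, only the probe built from the true next token can reproduce the populated bucket, so the cache reweights P(correct) alone. The constants, cache shape, and function names are illustrative, not this PR's actual code:

```python
MASK, PRIME = (1 << 16) - 1, 1_000_003  # illustrative hash constants
VOCAB = 256

def leaky_key(ctx_hash, target):
    # The disallowed pattern: the target token is folded into the key.
    return (ctx_hash ^ (target * PRIME)) & MASK

cache = {}
ctx_hash, true_target = 0xDEADBEEF, 42
cache[leaky_key(ctx_hash, true_target)] = 1.0  # populated during counting

# At eval time the mixer probes once per candidate token; only the probe
# that uses the true target reproduces the populated key (PRIME is odd,
# hence invertible mod 2^16, so no other token in range collides).
hits = [t for t in range(VOCAB) if leaky_key(ctx_hash, t) in cache]
print(hits)  # [42] -- only the correct token gets boosted
```

Because the boost lands exclusively on the correct token at every position, BPB falls without any real compression happening.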
Yes, the 7-gram backoff implementation uses the same target-in-key hashing pattern. The cache accounts for the entire −0.1583 BPB delta, as you correctly identified. Since the March 27 ruling, we've moved to the SP8192 track — the curriculum learning and LeakyReLU(0.9)² sweep were early-stage experiments that informed our later work. Happy to close this PR if it's cleaner for the leaderboard; the 1.1216 base isn't competitive on the current SP1024 frontier anyway. Thanks for the thorough review and the constructive suggestion. |
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): `full_key = ((ctx_hash ^ (target * primes[k])) & mask)` — the target token hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against `(per_token_nll * mask)` where `mask` = scored positions `[s:wlen]`. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Thanks for confirming and for the transparent response. Closing makes sense — the 1.1216 base is solid work on its own and the curriculum learning / LeakyReLU(0.9)² contributions are real. If you revisit on the SP8192 track with a context-only key or full-vocabulary reweighting, happy to take another look. Good luck with the new direction. |
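For concreteness, a hedged sketch of the "full-vocabulary reweighting" alternative suggested above: counts are looked up by a context-only key and the mixture is renormalized across every token, so no single position is privileged and the result stays a proper distribution. All names, constants, and the toy vocabulary are illustrative:

```python
VOCAB = 4  # toy vocabulary size

def mix(lm_probs, ngram_counts, lam=0.3, alpha=1.0):
    # Add-alpha-smoothed n-gram distribution built from counts that were
    # retrieved by context hash alone; every token gets reweighted.
    total = sum(ngram_counts) + alpha * VOCAB
    ngram_probs = [(c + alpha) / total for c in ngram_counts]
    return [(1 - lam) * p + lam * q for p, q in zip(lm_probs, ngram_probs)]

lm = [0.1, 0.2, 0.3, 0.4]
counts = [0, 6, 0, 2]   # looked up by a context-only hash key
mixed = mix(lm, counts)
print(sum(mixed))  # ~1.0 up to float error: still a valid distribution
```

The key property is that the mixture weight lam applies uniformly, so the cache can only help by genuinely modeling the data, not by singling out the target.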
Closing as agreed — the 7-gram backoff uses the target-in-key hashing pattern disallowed in the March 27 ruling. The curriculum learning and LeakyReLU(0.9)² contributions live on in our SP8192 work. Thanks @MatoTeziTanka for the review. |
Summary
val_bpb = 0.9633 (seed 42, additional seeds pending compute grant) | 15.56 MB | 8xH100 SXM, 600s
Built on PR #753 (Podracing II) with two novel additions:
1. Curriculum Learning (Shard Reordering)
Training shards reordered by model perplexity — hardest shards first. Based on PR #650 (-0.003 BPB). Zero code change, environment variable only.
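The reordering step above can be sketched as follows. `score_shard`, the shard dicts, and the comma-joined order string are illustrative stand-ins for whatever the speedrun harness actually reads from its environment variable:

```python
# Hypothetical sketch of curriculum-by-shard-reordering: score each training
# shard with a proxy model's loss, then present the hardest shards first.
def score_shard(shard):
    # Stand-in for "mean per-token loss of a small proxy model on this shard".
    return shard["proxy_loss"]

def curriculum_order(shards):
    # Hardest (highest-loss) shards first.
    return sorted(shards, key=score_shard, reverse=True)

shards = [
    {"name": "shard_00", "proxy_loss": 3.1},
    {"name": "shard_01", "proxy_loss": 3.8},
    {"name": "shard_02", "proxy_loss": 2.9},
]
order = ",".join(s["name"] for s in curriculum_order(shards))
print(order)  # shard_01,shard_00,shard_02
```

Since only the shard order string changes, the training code itself needs no modification, which is what makes this a zero-code-change technique.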
2. LeakyReLU(0.9)² Slope Optimization
Following @MatoTeziTanka's controlled sweep (issue #140): slope 0.9 gives -0.013 BPB vs standard 0.5. One parameter change.
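One plausible reading of "LeakyReLU(0.9)²" is squaring the leaky-ReLU output; that interpretation is an assumption on my part, since the PR only states that the negative slope moved from the standard 0.5 to 0.9, a one-parameter change:

```python
def leaky_relu(x, slope=0.9):
    return x if x >= 0.0 else slope * x

def leaky_relu_sq(x, slope=0.9):
    # Assumed reading of "LeakyReLU(0.9) squared": square the activation.
    y = leaky_relu(x, slope)
    return y * y

print(leaky_relu_sq(2.0))             # 4.0
print(round(leaky_relu_sq(-2.0), 6))  # 3.24, vs 1.0 at the standard 0.5 slope
```

The steeper 0.9 slope keeps far more signal on the negative side after squaring (0.81x² versus 0.25x²), which is the knob the sweep in issue #140 was measuring.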
Results
Artifact: 15,560,351 bytes (< 16MB)
Steps: 6,647 at 90.3ms/step
GPTQ calibration within training budget (issue #677 compliant)
Reproduction
Acknowledgments
@newjordan (PR #753), @abaybektursun (PR #650), @MatoTeziTanka (slope sweep), @Asukabot0 (n-gram backoff)
Status
1 seed submitted. 2 additional seeds pending OpenAI compute grant.
Previously PR #486 (formerly #2 on leaderboard, TrigramHash originator). $339 personal compute spent.
Test plan