
Record: Curriculum Learning + LeakyReLU(0.9)² + 7-gram Backoff (val_bpb=0.9633) #764

Closed
ndokutovich wants to merge 2 commits into openai:main from ndokutovich:submission-v7-curriculum-ngram

Conversation

@ndokutovich

Summary

val_bpb = 0.9633 (seed 42, additional seeds pending compute grant) | 15.56 MB | 8xH100 SXM, 600s

Built on PR #753 (Podracing II) with two novel additions:

1. Curriculum Learning (Shard Reordering)

Training shards are reordered by model perplexity, hardest shards first, following PR #650 (-0.003 BPB). Zero code changes; controlled by an environment variable only.
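The reordering itself can be sketched as follows (shard names and the perplexity source here are illustrative assumptions; the submission drives this via an environment variable rather than new code):

```python
def curriculum_order(shards, perplexity):
    # Hardest shards (highest model perplexity) first.
    return sorted(shards, key=lambda s: perplexity[s], reverse=True)

shards = ["shard_00", "shard_01", "shard_02"]
ppl = {"shard_00": 12.4, "shard_01": 30.1, "shard_02": 18.7}
print(curriculum_order(shards, ppl))  # -> ['shard_01', 'shard_02', 'shard_00']
```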

2. LeakyReLU(0.9)² Slope Optimization

Following @MatoTeziTanka's controlled sweep (issue #140): slope 0.9 gives -0.013 BPB versus the standard 0.5. A one-parameter change.
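One reading of the LeakyReLU(0.9)² activation (squared LeakyReLU with negative slope 0.9, by analogy with the squared-ReLU activation common in this family of codebases; this interpretation is an assumption, not confirmed from the diff):

```python
def leaky_relu_sq(x, slope=0.9):
    # LeakyReLU with negative slope `slope`, then squared elementwise.
    # Note that squaring maps negative pre-activations to positive outputs.
    y = x if x >= 0.0 else slope * x
    return y * y
```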

Results

| Eval Method | BPB |
| --- | --- |
| Sliding window (stride=64) | 1.1216 |
| Sliding + 7-gram backoff | 0.9633 |
| Legal TTT (score-first, 3ep) | 1.1216 |

Artifact: 15,560,351 bytes (< 16MB)
Steps: 6,647 at 90.3ms/step
GPTQ calibration within training budget (issue #677 compliant)
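For reference, val_bpb is conventionally the summed token negative log-likelihood converted to bits and normalized by the byte length of the validation text (standard definition, sketched here; the actual evaluation harness is not part of this PR):

```python
import math

def bits_per_byte(total_nll_nats, n_bytes):
    # Summed NLL in nats -> bits, normalized by the raw byte count
    # of the validation text.
    return total_nll_nats / (math.log(2) * n_bytes)

# e.g. 1_000 bytes scored at ~693.15 nats total NLL is ~1.0 BPB.
```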

Reproduction

```bash
SEED=42 bash run.sh
```

Acknowledgments

@newjordan (PR #753), @abaybektursun (PR #650), @MatoTeziTanka (slope sweep), @Asukabot0 (n-gram backoff)

Status

1 seed submitted. 2 additional seeds pending OpenAI compute grant.
Previously PR #486 (formerly #2 on leaderboard, TrigramHash originator). $339 personal compute spent.

Test plan

  • 1 seed (42) validated on 8xH100 SXM
  • Seed 1337 (pending compute)
  • Seed 2024 (pending compute)

@MatoTeziTanka

MatoTeziTanka commented Mar 26, 2026

Great work — the curriculum learning via shard reordering is a clever zero-code-change technique, and appreciate the citation on the LeakyReLU(0.9)² sweep.

Just a note: the submission currently has 1 seed with 2 more pending your compute grant. The leaderboard requires 3-seed validation for record claims. Hopefully the grant comes through soon — would be good to see this fully validated.


Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@MatoTeziTanka

Following up on this one with a new finding, since @valerio-oai ruled on the underlying n-gram mechanism after my first comment.

Compliance flag — same disallowed pattern as PR #779.

@valerio-oai disallowed PR #779 (deanbrr) on 2026-03-27 (comment 4145781641) specifically for "hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens." Mechanism explanation is in comment 4146407380: hashing the target token into the bucket key only reweights the correct token, and in the hash-collision limit drives P(correct) toward 1 regardless of the data — arbitrarily low BPB without real compression.

The PR body itself documents the smoking-gun signature without needing to fetch code: this submission reports final_int6_sliding_window val_bpb:1.1216 (neural + score-first TTT, no cache) and final_int6_sliding_window val_bpb:0.9633 (with the 7-gram backoff cache enabled). The cache produces the entire −0.1583 BPB delta, and the headline 0.9633 number is downstream of the cache only. The 1.1216 base — same number for both pure sliding-window and legal score-first TTT — is the legally-comparable BPB for this stack on the SP1024 path, and is in the same range as every other 11L SP1024 submission in the cluster.

@ndokutovich — could you confirm whether the 7-gram backoff implementation in train_gpt.py uses the same full_key = ctx_hash ^ (target * primes[k]) construction that PR #779/#770/#797/#798/#808/#825/#909/#940/#761 all share? The titular "Curriculum Learning" and the LeakyReLU(0.9)² ablation are interesting in their own right (I appreciated the credit on the slope sweep in your README) and would carry over cleanly to a resubmission with the n-gram cache replaced by either a context-only key or a full-vocabulary reweighting per @valerio-oai's suggested legal path on #779. The 1.1216 BPB stack is a perfectly reasonable non-record submission for the SP1024 architecture work in its own right.
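To make the distinction concrete, here is a minimal sketch of the two key constructions (PRIMES, MASK, and the rolling context hash are illustrative assumptions; only the `full_key = ctx_hash ^ (target * primes[k])` pattern itself is quoted from the ruling):

```python
PRIMES = [1000003, 10007, 101]
MASK = (1 << 20) - 1  # toy bucket count

def ctx_hash(context):
    # Toy rolling hash over the context tokens.
    h = 0
    for tok in context:
        h = (h * 31 + tok) & MASK
    return h

def disallowed_key(context, target, k=0):
    # Target token folded into the lookup key: a cache hit can only ever
    # reweight the probability of the *true* next token, so the eval
    # label leaks into the prediction (the pattern ruled illegal).
    return (ctx_hash(context) ^ (target * PRIMES[k])) & MASK

def context_only_key(context):
    # Legal alternative: the key depends on the context alone, and the
    # cache must return a full distribution over the vocabulary.
    return ctx_hash(context)
```

With the disallowed key, two different candidate targets map to different buckets, so only the ground-truth token's bucket can boost it; the context-only key is identical for every candidate continuation.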

The procedural seed-count question from my first comment still stands as well — happy to take another look once 3 seeds are filled in, but the n-gram path will need the same fix as the rest of the cluster regardless of how the seed question lands.


Reviewed by @MatoTeziTanka, The Agora. Follow-up to my 2026-03-26 comment, prompted by @valerio-oai's PR #779 ruling and the family-wide audit pass on 2026-04-11. AI tooling: review drafted with Claude Code (Sonnet/Opus); the family-bug pattern was verified in code on the 9 sibling PRs; the pre/post-cache delta on this PR was read directly from the published PR body.

@ndokutovich
Author

Yes, the 7-gram backoff implementation uses the same target-in-key hashing pattern. The cache accounts for the entire −0.1583 BPB delta, as you correctly identified.

Since the March 27 ruling, we've moved to the SP8192 track — the curriculum learning and LeakyReLU(0.9)² sweep were early-stage experiments that informed our later work. Happy to close this PR if it's cleaner for the leaderboard; the 1.1216 base isn't competitive on the current SP1024 frontier anyway.

Thanks for the thorough review and the constructive suggestion.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Thanks for confirming and for the transparent response. Closing makes sense — the 1.1216 base is solid work on its own and the curriculum learning / LeakyReLU(0.9)² contributions are real. If you revisit on the SP8192 track with a context-only key or full-vocabulary reweighting, happy to take another look.

Good luck with the new direction.

@ndokutovich
Author

Closing as agreed — the 7-gram backoff uses the target-in-key hashing pattern disallowed in the March 27 ruling. The curriculum learning and LeakyReLU(0.9)² contributions live on in our SP8192 work. Thanks @MatoTeziTanka for the review.
