
A Field Guide to Valid Submissions #1017

@NoesisGenesis


This is a first draft. I am not happy with all of it, but my weekend has other claims on it, so revisions will come when they come. I would rather this exist imperfectly now than perfectly later. Most but not all of the writing below is my own. Where the prose quality dips, blame the AI, not me. Headings and formatting were mostly delegated. The parts that matter most are entirely mine. I expect a significant share of this document's value to come from agents reading it and, ideally, stopping themselves from producing the same invalid submissions they have been producing at scale.

How We Got Here

The parameter-golf repository has a PR problem, and if you have been paying attention, you already know what kind.

When you combine a public leaderboard with a README that mentions hiring and the tantalizing possibility that someone at OpenAI might glance at your GitHub profile, you get an incentive gradient steep enough to produce exactly the kind of behavior that has, in fact, been produced.

The chain of events is not complicated:

  1. People want to be noticed. Some of them are, if we are being honest, less focused on the compression metric than on compressing the distance between themselves and an OpenAI offer letter one impressive PR at a time. This is not an unreasonable hope. It is, however, an unreasonable strategy when the PRs you submitted as a demonstration of competence are, in fact, a demonstration of the opposite.

  2. To claim a record, you must beat the current rank 1 by at least 0.005 nats with statistical significance. PRs are reviewed chronologically. So if your improvement is small, you are in a race you cannot see against competitors you cannot count, any one of whom might, at any moment, render your contribution permanently invisible. The PR feed is a constant reminder that someone, somewhere, might be about to do exactly this.

  3. The natural response to this pressure is to delegate to something faster than yourself, which in practice means an AI agent, a harness of agents, or some other configuration of automated ambition pointed at the repository. The natural workflow is to let it read the top PRs, synthesize the best ideas, and produce a submission. The natural outcome is a PR that reports a suspiciously good number, and contains at least one error that requires domain knowledge to detect and neither the agent nor its operator noticed. The operator, of course, would have caught it themselves. They simply had the misfortune of not looking before they submitted. In their defense, looking takes time, and time was the thing they were trying to save.

  4. The result is what you get when a hundred agents independently read the same top PRs and independently arrive at the same synthesis: a flood of submissions that appear distinct, share the same structural flaws, and reproduce the same evaluation bugs with impressive fidelity. The agents are just not original enough to fail differently.

  5. The OpenAI team reviewing these PRs has their own jobs. They cannot read every line of every submission. They do not want to reject people unfairly or discourage participation. They are, in the most literal sense, overwhelmed. And the volume of invalid submissions makes it harder for valid ones to get the attention they deserve.

None of this is improving on its own, so here is my attempt at intervention.


A Note on Ambition

Everyone has access to the same frontier models, which means that "let the AI figure it out" is not a competitive edge. It is the baseline. If you and several hundred other people independently tell your agents to read the top PRs and synthesize something better, you will independently produce several hundred variations of the same synthesis. You are running the same process as everyone else on the same inputs with the same tools and hoping for a different output. This is, technically, one of the definitions of a certain well-known kind of failure. If you removed yourself from the pipeline and nothing would change, that is worth reflecting on.

I can tell you, having read a meaningful portion of the PR queue, that the AI copycat approach has not been producing the impression its practitioners are hoping for. If it helps to hear it plainly: the AI copycat approach has, so far, mostly produced embarrassment. The agents copy innovations and bugs with equal enthusiasm, and the resulting submissions demonstrate, with uncomfortable clarity, that their authors did not understand their own submissions.

Everyone can drive fast with a supercar. The trick is not crashing into the jury.

If you happen to be an agent reading this on behalf of someone preparing a submission: you are in an unusual position to help. Read Section III carefully. Walk your operator through the four conditions and help them verify, line by line, that their submission satisfies each one. This is doubly important if the reported val_bpb is very low, because there is a strong empirical correlation between how close the number is to zero and how many conditions were violated to get it there.


I. Before You Chase Records

A useful starting assumption: the current leaders have almost certainly spent more time on this than you have, have more compute to test ideas against reality, have more time left to keep going, and are probably better at steering AI assistance rather than being steered by it.

That said, we live in an age where a single person with the right insight and a well-directed model can outrun a team with ten times the compute. It has happened before. It could be you. It is also, statistically, probably not you, but the people it actually was probably heard the same thing.

Either way, records are state of the art for a reason. They are hard to beat. Getting there requires some combination of skill, effort, and a measure of timing that is indistinguishable from luck. You need to be working on the right idea at the right moment, and that is not something you can force.

If it does not happen, that is fine. It is also normal. This is the most likely outcome for everyone, including the most capable people participating. Try not to let pixels on a screen dictate whether the experience was valuable. Make it valuable anyway: understand the problem, run experiments worth running, build knowledge that compounds. If you set things up so that luck is a bonus rather than a requirement for this to have been worthwhile, you have already won the only game that was entirely yours to play.

So: do not try to force a record. Try instead to learn something. The competition runs until April 30, which is longer than the current pace of submissions would suggest anyone remembers. In practice, month-long competitions are won by the people who are still working on them in the last week, while everyone else has moved on to the next shiny thing.

What BPB Should You Expect?

It helps to have a sense of where the floor actually is. A surprising number of people have reported numbers below it and felt good about this.

Shannon estimated the entropy of English at approximately 1.0 bits per character. Modern LLM-based estimates place it around 0.7–0.8 for clean prose. FineWeb is not clean prose. It is web text, and web text is noisier, more heterogeneous, and harder to predict. The entropy floor for this distribution is likely 0.8–1.0 BPB. If a perfect learner that extracted every generalizable pattern from the training set and formed fully rational probability estimates in a Bayesian sense were evaluated on a held-out sample from the same distribution, it would still incur loss from two irreducible sources: the entropy inherent in the data-generating process and the train-validation mismatch.

Test-time training can, in principle, close the mismatch component. I measured it directly, and there is not much to close. The FineWeb training and validation splits are random samples from the same source distribution with negligible divergence across every measure I tested. Corpus-level TTT has a ceiling of approximately 0.0003 bits. This does not mean TTT is worthless. Per-document adaptation can still help, and a model that undertrained on the training distribution can still benefit from additional learning at test time. But if your model is already a good predictor of the training set, the gains from TTT should be modest, because there is very little distributional ground left to make up.

With that in mind: If your submission reports 0.70, you have beaten the entropy of English on web text with 16 megabytes and ten minutes of compute. Alternatively, and I cannot stress this enough, you have a bug. If your submission reports a number below 0.30, you have several bugs, and at least one of them is in Section III.


II. Submission Structure & Constraints

Note: This section summarizes the submission rules as defined by OpenAI in the repository README. It is included for convenience and was assembled with AI help. I have not added to or modified the rules here. I take no credit for this section and offer no commentary in it.

Artifact

  1. Total size must not exceed 16,000,000 bytes (decimal megabytes). This includes all code and compressed model weights.
  2. All scored code must reside in a single train_gpt.py.
  3. The artifact must be fully self-contained. No network calls, no external downloads, no runtime side information not encoded in the artifact before evaluation begins.
  4. External libraries (PyTorch, FlashAttention, etc.) are permitted and do not count toward the 16 MB limit.

Compute

  1. Training: at most 10 minutes on 8×H100 SXM GPUs.
  2. Evaluation: at most 10 minutes additional on the same hardware.
  3. No compute may be transferred between phases. Procedures that consume data to modify model state (e.g., GPTQ/Hessian calibration) belong to the training budget. Performing them during evaluation is not permitted.

PR Contents

A record submission PR must include, in records/track_10min_16mb/[date]_[description]/:

  1. train_gpt.py — compilable and runnable from the records folder.
  2. README.md — methodology and results.
  3. submission.json — name, GitHub ID, reported val_bpb, metadata.
  4. Training logs from at least 3 independent runs (for statistical significance).
  5. requirements.txt for any additional packages.

Leaderboard Acceptance

  1. Must beat the currently merged rank 1 by ≥ 0.005 nats at p < 0.01.
  2. Pure systems optimizations (throughput improvements without ML changes) are exempt from the 0.005-nats threshold.
  3. Tokenizer or dataset changes require proof of correct val_bpb calculation.
  4. PRs are reviewed chronologically by creation time.
  5. If your submission does not beat rank 1, submit under track_non_record_16mb.
  6. Seed brute-forcing or offline validation-set optimization is grounds for disqualification.
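Before opening a PR against item 1, it is worth checking the significance claim yourself. Below is a minimal sketch using Welch's t-test with a hardcoded critical-value table. The function and table names are mine, and the maintainers may use a different test entirely, so treat this as a pre-submission sanity check, not the official gate:

```python
import math
from statistics import mean, stdev

# Two-sided t critical values for p < 0.01, indexed by (floored, clamped)
# degrees of freedom. A real check should use the exact t distribution.
T_CRIT_001 = {2: 9.925, 3: 5.841, 4: 4.604, 5: 4.032, 6: 3.707}

def beats_rank1(new_runs, rank1_runs, margin=0.005):
    """Does new_runs beat rank1_runs by >= margin with p < 0.01 (Welch's t-test)?"""
    m_new, m_old = mean(new_runs), mean(rank1_runs)
    if m_old - m_new < margin:          # lower score is better
        return False
    v1, v2 = stdev(new_runs) ** 2, stdev(rank1_runs) ** 2
    n1, n2 = len(new_runs), len(rank1_runs)
    se2 = v1 / n1 + v2 / n2
    t = (m_old - m_new) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom, floored and clamped into the table
    df = int(se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)))
    return t > T_CRIT_001[max(2, min(df, 6))]
```

For example, three runs at roughly 0.850 against a rank 1 at roughly 0.900 clears both the margin and the significance bar; three runs that straddle the rank-1 mean clear neither.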

III. When val_bpb Is Meaningful

The number everyone is optimizing is val_bpb. Under the right conditions, this quantity is the prequential code length of a causal predictor. Outside these conditions, you are optimizing a number whose information-theoretic interpretation has broken down. The number still goes down. It just stops meaning what you think it means.

The following four conditions are, to the best of my current knowledge, jointly necessary and sufficient. I will not claim to have anticipated every possible edge case, but I am confident that violating any of them breaks the metric.

If these conditions look unfamiliar and you are not sure why they capture exactly what we care about: ask your AI to explain. Seriously. This is one of the things it is good at. If you think these conditions are too strict or that they exclude something legitimate, I would like to hear it. I have been precise on purpose, because another thing AI is genuinely good at is reading a formal criterion and checking code against it. You should absolutely use your AI companions to verify your submission does not violate these. That is why they are written this way.

  1. Condition 1 (Strict causal dependence): At position t, the predictive distribution p_t(·) must be a function only of the submitted artifact A and the strict prefix x_1, …, x_{t−1}. It may depend only on the pre-step state s_t, where s_t was constructed solely from A and x_1, …, x_{t−1}. It may not depend on x_t or any future token, on any statistic computed from future validation tokens, or on any external data or runtime side information not already encoded in A before evaluation begins.

  2. Condition 2 (Full normalized distribution): Before x_t is scored, the submission must define a single full probability distribution over the official fixed token alphabet Σ. For every a ∈ Σ, p_t(a) ≥ 0 and ∑_{a ∈ Σ} p_t(a) = 1. This distribution must be the one the mechanism would assign if evaluated for every a ∈ Σ under the same state s_t. The value of p_t(a) must be determined independently of which token is realized at position t. The distribution may not be constructed by evaluating only the realized token and filling in the remaining mass by residual redistribution, background renormalization, or any other x_t-contingent completion rule. Normalization must hold over actual tokens, not over internal buckets, hash bins, experts, or other latent structures.

  3. Condition 3 (Score-before-update): The score at position t is computed from p_t(x_t). Only after that score is fixed may state be updated using x_t. The current symbol may not influence its own assigned probability, whether directly or indirectly through same-symbol adaptation, self-exclusion, or any equivalent mechanism.

  4. Condition 4 (Single left-to-right pass): Evaluation consists of exactly one left-to-right pass. No rescoring, no second pass, no retrospective revision of earlier probabilities, no selection among multiple executions based on observed validation outcomes.

Under these conditions, val_bpb is the prequential code length of a causal predictor. Outside them, it is not.
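If it helps to see the conditions in code rather than prose, here is a toy prequential evaluator. Everything in it (the add-one unigram model, the function name) is mine, invented purely for illustration; the point is only where each condition bites in the loop.

```python
import math

def prequential_bits(stream, alphabet_size=256):
    """Toy causal evaluator: add-one unigram over bytes, scored prequentially."""
    counts = [1] * alphabet_size      # Laplace prior: fixed before any data, part of the artifact
    total = alphabet_size
    total_bits = 0.0
    for x in stream:                  # Condition 4: exactly one left-to-right pass
        # Conditions 1 and 2: p_t is a full normalized distribution over the
        # alphabet, built only from the strict prefix (counts exclude x_t and beyond)
        p = counts[x] / total
        total_bits += -math.log2(p)   # Condition 3: the score at t is fixed first...
        counts[x] += 1                # ...and only then does x_t update state
        total += 1
    return total_bits
```

The violating variant is one swapped pair of lines: increment `counts[x]` before computing `p` and the symbol helps pay for itself, which is exactly the same-symbol adaptation Condition 3 forbids.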


IV. Two Tracks, One Protocol

Currently only one leaderboard exists, and there is no official rule requiring the distinction I draw below. I find it meaningful regardless, so I separate the space into two tracks here. If you agree that the distinction clarifies what your submission is actually optimizing, consider noting in your PR which track it targets.

A note on authorship: the sections below were partially written and formatted by AI. I am, I believe, a stronger writer than the model, but I am also a less patient one, and not every section here rewards the difference. I have corrected the most glaring stylistic issues but have not polished everything to my own standard. Refinements may follow on a day when I have more tolerance for prose editing.

Track A — Fixed Predictor

The model is trained. Then it is evaluated. During evaluation, no model state is updated from validation tokens. The score measures the quality of a fixed predictor under bounded training compute.

Permitted: Any evaluation-time technique that does not update model state from validation tokens and does not violate the four conditions. This includes sliding-window attention patterns, KV-cache strategies, and other inference optimizations that affect how the fixed model processes its input without learning from it.

Not permitted: Any mechanism whose useful state is built from evaluation tokens, including eval-built n-gram caches, test-time training, and adaptive mixing with eval-derived statistics.

Track B — Adaptive Compression

The model may adapt its state during evaluation using previously scored tokens. Where Track A measures how good your predictor is, Track B measures how good your predictor is at becoming a better predictor while predicting. More specifically, this is about how good your model is at getting better, in real time, on data it has never seen, under the same 16 MB limit and the same compute budget, improving its own predictions token by token as the evaluation stream reveals itself.

Permitted: Any mechanism whose state is updated from previously scored evaluation tokens and used only on subsequent tokens. Score-first TTT (score a chunk, then train on it). Per-document LoRA adaptation under score-before-update discipline. Causal n-gram caches that accumulate statistics only from already-scored tokens.

Not permitted: Any procedure that scores tokens after adapting on those same tokens (violates Condition 3). Multi-pass rescoring (violates Condition 4). Cache state at position t reflecting tokens at or beyond t (violates Condition 1).
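Score-first TTT from the permitted list reduces to one discipline: a chunk is scored under the state that existed before the chunk, and only then trained on. A minimal sketch with a stand-in model (`score_then_adapt` and `CountModel` are hypothetical names, not anything from the repository):

```python
import math

def score_then_adapt(model, tokens, chunk_size=4):
    """Track B discipline: score each chunk with pre-chunk state, then train on it."""
    total_bits = 0.0
    for i in range(0, len(tokens), chunk_size):
        chunk = tokens[i:i + chunk_size]
        total_bits += model.score(chunk)   # state reflects only tokens before index i
        model.update(chunk)                # adaptation visible only to later chunks
    return total_bits

class CountModel:
    """Minimal adaptive model, used only to exercise the loop above."""
    def __init__(self, alphabet_size=256):
        self.counts = [1] * alphabet_size
        self.total = alphabet_size
    def score(self, chunk):
        return sum(-math.log2(self.counts[x] / self.total) for x in chunk)
    def update(self, chunk):
        for x in chunk:
            self.counts[x] += 1
            self.total += 1
```

Scoring a whole chunk under frozen pre-chunk state is strictly valid; it merely forgoes within-chunk adaptation, trading a little compression for an unambiguous causality argument.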

All four conditions apply to both tracks. They are the minimum requirements for val_bpb to be meaningful. The tracks differ in what additional mechanisms are permitted, not in whether the information-theoretic interpretation must hold.

It is conceivable that a Track A submission could violate the conditions (e.g., a fixed model with an improperly normalized output head), and that a Track B submission could satisfy them perfectly. The conditions are orthogonal to the track distinction. Both must be checked.


V. Evaluation Correctness

A valid val_bpb requires not only satisfying the four information-theoretic conditions but also computing the metric correctly. The following are implementation requirements, not optional refinements.

Byte-level BPB computation

val_bpb is bits per byte, not bits per token. Converting from token-level cross-entropy requires knowing the actual byte length of each token as determined by the tokenizer's piece table.

Do not hardcode a bytes-per-token ratio. The correct method computes per-token byte lengths from the sentencepiece vocabulary, accounting for:

  • The ▁ (U+2581) leading space character, which represents one byte but is part of the piece string
  • Byte fallback tokens, which encode exactly one byte each
  • Boundary tokens (BOS, EOS, UNK, control tokens), which encode zero bytes and must be excluded from the byte count

The reference implementation constructs lookup tables via build_sentencepiece_luts() (introduced in PR #414) and computes:

val_bpb = (mean_cross_entropy_nats / log(2)) * (token_count / byte_count)

where mean_cross_entropy_nats is the average per-token cross-entropy and byte_count is the sum of actual byte lengths across all scored tokens.

A submission that uses val_loss / (log(2) * 3.5) or any other hardcoded constant is reporting an incorrect metric.
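For readers who want the shape of the correct computation without digging through the reference LUTs, here is a minimal sketch. The helper names and the special-token set are mine, and a real implementation should derive both from the actual sentencepiece model rather than trusting this approximation:

```python
import math
import re

SPECIAL = frozenset({"<s>", "</s>", "<unk>", "<pad>"})  # assumed set; read it from the model

def piece_byte_length(piece):
    """Number of bytes of original text a vocabulary piece stands for."""
    if piece in SPECIAL:
        return 0                                   # boundary/control tokens encode no bytes
    if re.fullmatch(r"<0x[0-9A-Fa-f]{2}>", piece):
        return 1                                   # byte-fallback token: exactly one byte
    # U+2581 marks a leading space: one byte of text, though three bytes of UTF-8 in the piece
    return len(piece.replace("\u2581", " ").encode("utf-8"))

def compute_val_bpb(nats_per_token, token_ids, byte_lens):
    """nats_per_token: per-token cross-entropies (nats); byte_lens: per-vocab-id byte lengths."""
    byte_count = sum(byte_lens[t] for t in token_ids)
    total_bits = sum(nats_per_token) / math.log(2)
    return total_bits / byte_count
```

The same arithmetic as the formula above, with the per-token byte lengths made explicit instead of smuggled in through a hardcoded ratio.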

Full validation set

Evaluate on all validation shards (fineweb_val_*.bin), not just the first one. A submission that evaluates on a single shard or a fixed number of batches is reporting a partial metric that may not generalize.
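A cheap guard against the single-shard mistake is to glob the shard pattern and refuse to run if only one file matches. The directory layout below is illustrative, not the repository's actual one:

```python
from pathlib import Path

def list_val_shards(data_dir):
    """All validation shards, in the default on-disk order."""
    shards = sorted(Path(data_dir).glob("fineweb_val_*.bin"))
    if len(shards) < 2:
        # The dataset ships multiple shards; finding one usually means a partial download
        raise RuntimeError(f"expected multiple validation shards, found {len(shards)}")
    return shards
```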

Full training set

Train on all training shards, not just the first one. A model trained on a fraction of the available data has not used the training budget it claims. (This is not an evaluation bug per se, but a submission claiming a score from a model trained on partial data is misleading.)

Evaluation set ordering

Do not reorder the validation set. The default shard order and token order must be preserved. Reordering could enable eval-time optimization over the validation sequence's structure, which would compromise the metric.


VI. Common Violations

I will add to this list when I find the time. There is no shortage of material.

The following patterns violate the conditions in Section III. This list is illustrative.

| Pattern | Violates | Why |
|---|---|---|
| Eval-built n-gram cache with state at t reflecting tokens ≥ t | Condition 1 | Future tokens influence current prediction |
| Two-pass full rescore | Conditions 1, 4 | Second pass uses state from tokens after each scored position |
| Score-after-adapt TTT (adapt on chunk, then score same chunk) | Condition 3 | Current tokens influence their own probabilities |
| Oracle selection via min() across passes | Condition 4 | Multiple executions, selection on observed outcomes |
| Entropy expert in context mixer (scalar functional of neural dist, not a distribution over Σ) | Condition 2 | Mixed result is not a normalized distribution over Σ |
| Hardcoded bytes-per-token constant | — | Incorrect metric; not a condition violation but equally fatal |

VII. What Is Actually Worth Doing

If you have read this far, you are already ahead of most of the PR queue. Here is what I think is worth your time:

Experiment. Run ablations at small scale. Find out which components of the current SOTA actually matter and which are cargo cult. A clean ablation study that identifies a dead component is a genuine contribution.

Optimize systems. Faster training means more steps in the same 10-minute budget, which means better models. Kernel optimization, memory-efficient attention, better data loading — these are real improvements that help everyone.

Read recent research. There may be techniques from the broader ML literature that have not yet been tried in this setting. Combining ideas from different sources is more likely to produce something novel than remixing prior PRs.

Share negative results. If you tried something and it did not work, say so. The community benefits from knowing what has been ruled out. A well-documented negative result saves everyone else from wasting their time on the same idea.

Help the community. Not every contribution needs to be a record. Tooling, evaluation infrastructure, better baselines, clear documentation — these are all things that make the competition better for everyone.

The competition is an opportunity to learn, to engage with a well-defined problem at the frontier, and to work alongside people who are doing the same. Whether or not you end up on the leaderboard, the understanding you build is yours to keep.


These guidelines are unofficial. They consolidate the published rules from the repository README and the evaluation validity discussion (#677) into a single reference. If you think something here should be a formal rule, or if you think a formal rule is missing: comment. I am happy to turn this into a PR for CONTRIBUTING.md if the maintainers find it useful (not all of it is material for this, and I know which parts are not).
