Description
Problem
Submissions share code but not trained weights. Anyone who wants to evaluate, compare, or build on a submission must re-run the full 10-minute 8xH100 training from scratch. This is expensive, wasteful of training credits, puts contributors without H100 access at a disadvantage, and blocks a whole category of downstream work.
Why this matters
Compute
Every time someone wants to compare against a prior submission, they re-train it. Multiply that across contributors and seeds and we're burning significant H100-hours reproducing identical runs. Even with the compute grant, credits are finite. The current setup favors contributors who can afford to re-run freely over those who need to be deliberate with their budget.
Reusing the research benchmark
This repo is accumulating dozens of small LMs trained on the same data with diverse architectures. With published weights this becomes a standardized collection of tiny LMs useful for interpretability research, architecture comparison, and compression studies. Without weights, that collection never materializes.
Downstream tooling
Post-training quantization experiments, interpretability tools, distillation, and model merging are all bottlenecked by "step 0: re-train the model." Shipping weights makes these lines of work trivially accessible.
Implementation
- A shared HuggingFace repo (e.g. openai/parameter-golf-weights) where each record submission uploads its final_model.pt and/or the compressed artifact
- The eval harness could upload automatically after a successful run, or submitters could upload manually as part of the PR checklist
- Reasonable overhead: the compressed artifacts are already under 20 MB. Full-precision checkpoints are ~50-100 MB
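To make the workflow concrete, here is a minimal sketch of what a submission-upload helper could look like. Everything here is an assumption for illustration: the `artifact_path` layout, the `upload_weights` helper, and the use of `huggingface_hub.upload_file` are hypothetical, not something the repo ships today, and the repo id is just the example name from above.

```python
def checkpoint_size_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Estimated full-precision checkpoint size in MB (fp32 by default)."""
    return n_params * bytes_per_param / 1e6


def artifact_path(submission_id: str, kind: str = "final_model.pt") -> str:
    """Hypothetical layout: one folder per record submission in the shared repo."""
    return f"{submission_id}/{kind}"


def upload_weights(submission_id: str, local_path: str,
                   repo_id: str = "openai/parameter-golf-weights") -> None:
    """Push a checkpoint to the shared HF repo.

    Sketch only: requires `huggingface_hub` installed and a write token
    configured; could be called by the eval harness after a successful run.
    """
    from huggingface_hub import upload_file  # lazy import: optional dependency
    upload_file(
        path_or_fileobj=local_path,
        path_in_repo=artifact_path(submission_id),
        repo_id=repo_id,
        repo_type="model",
    )


# Sanity check on the size estimate: a ~25M-parameter model at fp32
# lands around 100 MB, consistent with the ~50-100 MB range above.
print(checkpoint_size_mb(25_000_000))
```

The size math is why the overhead stays reasonable: even at full precision, these tiny LMs are in the tens-of-megabytes range, well within HuggingFace's comfort zone for a shared model repo.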
Open questions
- Required or opt-in?
- Where should it be hosted?
- Compressed artifact or full-precision final_model.pt, or both?