Distillation:
allocate part of training time to train a bigger (>16MB) model --> distill into a 16MB student.
- The core idea: use part of your 10-minute budget to train a larger model (or a wider/deeper version of your model that exceeds 16MB), then spend the remaining time distilling it into the 16MB-compliant student. You're effectively laundering extra capacity through the training process. The soft targets from the teacher carry richer information than the hard token labels — dark knowledge about which wrong tokens are "almost right" — and this is well-established to help small models punch above their weight. The practical version that fits the rules would be something like: spend minutes 0–6 training a 32MB model on FineWeb, then minutes 6–10 training the 16MB student on a mix of hard cross-entropy and KL divergence against the teacher's logits. Everything runs in a single train_gpt.py, fully reproducible. No external data, no pre-computation — just two phases in one script.
- I also wonder if you can do test-time distillation (TTD): we have 10 minutes of eval time, so obviously some form of TTT will win and take advantage of that. Can we do this larger-to-smaller distillation at test time as well?
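The two-phase loss above can be sketched in plain Python (the `distill_loss` name and the `alpha`/`T` knobs are hypothetical defaults to tune; a real run would do this on GPU tensors over full batches):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(student_logits, teacher_logits, target, alpha=0.5, T=2.0):
    """Hard CE vs. the true token, mixed with KL(teacher || student)
    on temperature-softened logits; the soft term is scaled by T^2,
    as in Hinton et al.'s formulation."""
    ce = -math.log(softmax(student_logits)[target])
    q_t = softmax(teacher_logits, T)
    q_s = softmax(student_logits, T)
    kl = sum(t * math.log(t / s) for t, s in zip(q_t, q_s))
    return (1 - alpha) * ce + alpha * (T ** 2) * kl
```

In phase two the same batch would feed both models, with the teacher in no-grad mode so only the student pays for a backward pass.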
Systems-side efficiency
- Systems optimizations can only take you toward the lower bound implied by your algorithmic design; they don't move the bound itself.
- Helpful to first find a good algorithm --> then optimize the system.
- Might be a good idea to run >10 min or >16MB while testing out algorithm ideas --> find a good one --> systems-optimize it down to the limits.
- Another perspective: intense systems optimization lets you fit the maximum-size model and run it as fast as possible.
- Two sides to optimize: model size and model throughput. (1) is self-explanatory: a bigger model has more capacity to learn representations. (2) means maximizing forward/backward throughput to pump out as many gradient steps as possible within the budget.
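A back-of-envelope sketch of that size/throughput tradeoff (the step times are made-up numbers; the ~524k tokens/step figure is the baseline's):

```python
def steps_in_budget(budget_s, step_time_s):
    """Optimizer steps that fit in the training budget."""
    return int(budget_s // step_time_s)

def tokens_seen(budget_s, step_time_s, tokens_per_step):
    """Total training tokens: a bigger (slower-stepping) model sees fewer."""
    return steps_in_budget(budget_s, step_time_s) * tokens_per_step

# Hypothetical step times for a smaller vs. bigger model at 524,288 tokens/step:
small = tokens_seen(600, 0.25, 524_288)  # 2400 steps, ~1.26B tokens
big = tokens_seen(600, 0.50, 524_288)    # 1200 steps, ~629M tokens
```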
Quantization
- Continuing from above, this is the highest-return systems-side optimization.
- Quantize a large model --> fit it into the 16MB budget.
- int6 or int4 quantization → pack more parameters → combine with all the eval tricks (sliding window, long context)
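Rough capacity math for the 16MB budget, plus a minimal signed-int4 pack/unpack. Illustrative only: a real format also needs per-group scales, which eat some of the budget, and the helper names here are made up.

```python
def max_params(budget_bytes, bits_per_param):
    """Parameters that fit in the size budget at a given bit-width
    (ignoring header/scale overhead, which real formats need)."""
    return (budget_bytes * 8) // bits_per_param

BUDGET = 16 * 2**20  # 16 MiB
# fp16 -> ~8.4M params, int8 -> ~16.8M, int4 -> ~33.6M
caps = {bits: max_params(BUDGET, bits) for bits in (16, 8, 6, 4)}

def pack_int4(vals):
    """Pack signed 4-bit values (-8..7) two per byte."""
    if len(vals) % 2:
        vals = vals + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(vals[::2], vals[1::2]):
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)

def unpack_int4(data):
    """Inverse of pack_int4: recover signed 4-bit values."""
    vals = []
    for b in data:
        for nib in (b & 0xF, b >> 4):
            vals.append(nib - 16 if nib >= 8 else nib)
    return vals
```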
The 1-Bit Paradigm: BitNet b1.58 and Ternary Networks
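The b1.58 idea constrains weights to {-1, 0, +1} (~1.58 bits each). A pure-Python sketch of the paper's absmean weight quantization — a simplification, since the paper also quantizes activations, and Python's banker's `round` stands in for the paper's round-and-clip:

```python
def absmean_ternarize(weights, eps=1e-8):
    """Scale by the mean |w|, then round and clip to {-1, 0, +1}.
    Returns (ternary_weights, scale); dequantize as w_q * scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    tern = [max(-1, min(1, round(w / scale))) for w in weights]
    return tern, scale
```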
Fused Triton Megakernels
- We only have 600 seconds for training; PyTorch's per-op dispatch and kernel-launch overhead will be too much.
- Fusing ops into a few large Triton kernels keeps the 8xH100 SXM cluster's execution units saturated instead of waiting on the host.
Eval tricks
Sliding window
- Baseline chops validation into non-overlapping 1024-token chunks, so tokens get 0–1023 context (avg ~512)
- Sliding window with stride=64 scores only the rightmost 64 tokens per window, giving every token 960+ context
- This alone is worth ~0.032 BPB improvement with zero training changes
- Eval time goes from ~16s to ~70s on 8×H100 — well within the 10-min eval budget
- The LoRA TTT ablation showed that most of TTT's apparent gains were actually just the strided eval, not the gradient steps
- Key params:
EVAL_STRIDE=64, EVAL_BATCH_SEQS=1024 (batch windows for GPU utilization)
- Smaller stride = more context per token but slower eval; stride=64 seems like a good tradeoff given the time budget
- This is now table stakes — every competitive submission will include it
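The window bookkeeping is just index math; a sketch (hypothetical helper, mirroring the EVAL_STRIDE/window sizes above) that tiles the validation stream so every token is scored exactly once:

```python
def strided_windows(n_tokens, window=1024, stride=64):
    """Return (window_start, score_from, score_to) spans: each window
    covers [start, end) but only [score_from, end) is scored, so the
    scored spans tile [0, n_tokens) exactly once."""
    spans = []
    scored_to = 0
    start = 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, scored_to, end))
        scored_to = end
        start += stride
    return spans
```

The first window scores all its tokens (no better context is available); every later window scores only its rightmost `stride` tokens, each of which sees at least `window - stride` tokens of in-window context.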
Long context
- Training at seq_len 2048 or 4096 instead of 1024 helps the model learn longer-range dependencies
- 4k seq length + tuned hyperparams got 1.2014 BPB (before sliding window eval was even applied on top)
- Tradeoff: longer sequences = fewer steps in the 10-min budget (each step is slower), so you need to compensate with batch size / LR adjustments
- The warmdown submission found that well-trained models (many steps) benefit less from aggressive length extrapolation at eval — eval@1408 (1.375×) was optimal vs eval@2048 hurting
- NTK-RoPE can extend effective context at eval time beyond training length, but gains depend on how well-trained the model is
- Combining train@2048+ with sliding window eval at that longer length is likely a strong combo that nobody has fully exploited yet
- Batch token budget may need reduction (the 4k submission used 393k tokens/step instead of 524k) to fit longer sequences in memory
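For the NTK-RoPE piece, the usual trick is to rescale the RoPE base by the train-to-eval length ratio. A sketch — the `scale**(d/(d-2))` exponent is the commonly used NTK-aware formula, but treat it as an assumption to check against your RoPE implementation:

```python
def ntk_rope_inv_freq(head_dim, base=10000.0, scale=1.0):
    """Per-pair inverse RoPE frequencies with NTK-aware base rescaling.
    scale = eval_len / train_len (1.0 reproduces plain RoPE)."""
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return [ntk_base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```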
Born-Again Networks
- train the 16MB model, then re-initialize and train a fresh 16MB model using the first one as teacher. Multiple generations of this can sometimes improve results. The question is whether you have time budget for multiple generations.
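The loop itself is trivial; the open question is the budget split. A skeleton, with a hypothetical `train_fn(teacher, seconds) -> model` callable and a naive even split (itself a knob to tune):

```python
def born_again(train_fn, generations, budget_s):
    """Born-again loop: each generation trains a fresh student
    against the previous generation as teacher (None for gen 0)."""
    per_gen = budget_s / generations
    teacher = None
    for _ in range(generations):
        teacher = train_fn(teacher, per_gen)
    return teacher
```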
Ensembles
- This is obvious from the Kaggle days.
- Of course, compute is a problem.
Data augmentation
- Obvious one: expand docs via data augmentation; see the recent megadocs paper.
- Big question whether the augmentation can even pay off within the 10 minutes we have for training.
- Test-time augmentation?
Prefix cost
- A recent approach on the leaderboard stores weights in the prefix, so the cost is eaten there. But could it be done better?