
Ideas braindump #1

@rosikand


Distillation

Allocate part of the training time to train a bigger (>16MB) model --> distill it into the student.

  • The core idea: use part of your 10-minute budget to train a larger model (or a wider/deeper version of your model that exceeds 16MB), then spend the remaining time distilling it into the 16MB-compliant student. You're effectively laundering extra capacity through the training process. The soft targets from the teacher carry richer information than the hard token labels — dark knowledge about which wrong tokens are "almost right" — and this is well-established to help small models punch above their weight. The practical version that fits the rules would be something like: spend minutes 0–6 training a 32MB model on FineWeb, then minutes 6–10 training the 16MB student on a mix of hard cross-entropy and KL divergence against the teacher's logits. Everything runs in a single train_gpt.py, fully reproducible. No external data, no pre-computation — just two phases in one script.
  • I also wonder if you can do test-time distillation (TTD): you have 10 minutes of eval time, so obviously some form of TTT will win and take advantage of that. Can we do this larger-to-smaller distillation at test time as well?
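The two-phase idea above hinges on the distillation loss itself: a mix of hard cross-entropy against the token labels and KL divergence against the teacher's temperature-softened logits. A minimal per-token sketch in plain Python (the `alpha` and `T` values are illustrative assumptions, not tuned numbers; a real run would do this batched in the training framework):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=2.0):
    """alpha * hard CE + (1 - alpha) * T^2 * KL(teacher || student).

    The T^2 factor keeps the soft-target gradient scale comparable to the
    hard CE term when the temperature changes.
    """
    p_s = softmax(student_logits)           # student probs at T=1 for hard CE
    ce = -math.log(p_s[hard_label])
    p_t = softmax(teacher_logits, T)        # soft targets ("dark knowledge")
    p_sT = softmax(student_logits, T)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_sT))
    return alpha * ce + (1 - alpha) * (T * T) * kl
```

When student and teacher logits agree, the KL term vanishes and only the hard CE remains; disagreement with the teacher adds loss even on tokens the student labels correctly.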

Systems-side efficiency

  • Systems optimizations basically get you as close to the lower bound as possible for a given algorithmic design.
  • Helpful to first find a good algorithm --> then optimize the system.
  • Might be a good idea to run >10 min or >16MB while testing out algorithm ideas --> find a good one --> systems-optimize it down to the limits.
  • Another perspective: intense systems optimization lets you fit the maximum-size model, as fast as possible.
  • Two sides to optimize: model size and model throughput. (1) Self-explanatory: bigger model, more capacity to learn representations. (2) We need to maximize the throughput of the model's forward/backward pass to pump out as many gradient steps as possible.
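The throughput side reduces to simple arithmetic: the wall-clock budget fixed at 600 s, step time and tokens-per-step determine total gradient steps and tokens seen. A trivial sketch (the 0.25 s/step and 524k tokens/step numbers are purely illustrative):

```python
def training_budget(step_time_s, tokens_per_step, budget_s=600):
    """How many optimizer steps and total training tokens fit in the
    wall-clock budget, ignoring startup/compile overhead."""
    steps = int(budget_s // step_time_s)
    return steps, steps * tokens_per_step

# hypothetical numbers: 0.25 s/step at 524,288 tokens/step
steps, tokens = training_budget(0.25, 524_288)  # 2400 steps, ~1.26B tokens
```

Halving step time doubles both steps and tokens, which is why kernel-level speedups translate directly into data seen.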

Quantization

  • Continuing from above, this is the highest-return systems-side optimization.
  • Quantize a large model --> fit it into the 16MB budget.
  • int6 or int4 quantization → pack more parameters → combine with all the eval tricks (sliding window, long context)
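The storage win from int4 is mechanical: two 4-bit codes per byte, so a weight tensor shrinks 4x versus fp16 before counting scale factors. A minimal packing sketch in plain Python (real int4 weight quantization also needs per-group scales and a dequant kernel; this shows only the bit-level layout, and the function names are my own):

```python
def pack_int4(vals):
    """Pack pairs of unsigned 4-bit codes (0..15) into single bytes,
    high nibble first."""
    assert len(vals) % 2 == 0 and all(0 <= v < 16 for v in vals)
    return bytes((hi << 4) | lo for hi, lo in zip(vals[0::2], vals[1::2]))

def unpack_int4(packed):
    """Inverse of pack_int4: recover the 4-bit codes from each byte."""
    out = []
    for b in packed:
        out += [b >> 4, b & 0x0F]
    return out
```

With this layout, a 32MB fp16 model's weights occupy ~8MB as int4 codes, leaving headroom in the 16MB budget for scales and embeddings.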

The 1-Bit Paradigm: BitNet b1.58 and Ternary Networks
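For reference, the b1.58 scheme quantizes each weight matrix to ternary values {-1, 0, +1} times a single scale, using the absmean of the weights. A minimal sketch of that quantizer (list-based for clarity; the paper applies it per weight matrix during training, not post hoc):

```python
def ternary_quantize(weights, eps=1e-8):
    """BitNet b1.58-style absmean quantization:
    gamma = mean(|W|); W_q = clip(round(W / gamma), -1, +1)."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / gamma))) for w in weights]
    return q, gamma
```

At ~1.58 bits per weight this beats even int4 on density, though training-time quantization-aware tricks are needed to keep quality.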

Fused Triton Megakernels

  • We only have 600 seconds for training; PyTorch dispatch overhead will be too much.
  • Interface directly with the silicon execution units of the 8xH100 SXM cluster.

Eval tricks

Sliding window

  • Baseline chops validation into non-overlapping 1024-token chunks, so tokens get 0–1023 context (avg ~512)
  • Sliding window with stride=64 scores only the rightmost 64 tokens per window, giving every token 960+ context
  • This alone is worth ~0.032 BPB improvement with zero training changes
  • Eval time goes from ~16s to ~70s on 8×H100 — well within the 10-min eval budget
  • The LoRA TTT ablation showed that most of TTT's apparent gains were actually just the strided eval, not the gradient steps
  • Key params: EVAL_STRIDE=64, EVAL_BATCH_SEQS=1024 (batch windows for GPU utilization)
  • Smaller stride = more context per token but slower eval; stride=64 seems like a good tradeoff given the time budget
  • This is now table stakes — every competitive submission will include it
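The stride logic above can be captured in a few lines: slide a fixed-size window forward by `stride` tokens and score only the window's rightmost `stride` positions, so every scored token (past warmup) sees `window - stride` = 960 tokens of context. An indexing-only sketch (batching and the model forward pass are omitted; parameter names follow the bullet above):

```python
def strided_windows(n_tokens, window=1024, stride=64):
    """Return (start, score_from) pairs. Each window covers
    [start, score_from + stride) and scores only [score_from,
    score_from + stride), so scored tokens get >= window - stride context."""
    out = []
    pos = 0
    while pos < n_tokens:
        start = max(0, pos + stride - window)
        out.append((start, pos))
        pos += stride
    return out
```

Versus non-overlapping 1024-token chunks, this runs window/stride = 16x more forward passes, which is where the ~16s → ~70s eval-time increase comes from.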

Long context

  • Training at seq_len 2048 or 4096 instead of 1024 helps the model learn longer-range dependencies
  • 4k seq length + tuned hyperparams got 1.2014 BPB (before sliding window eval was even applied on top)
  • Tradeoff: longer sequences = fewer steps in the 10-min budget (each step is slower), so you need to compensate with batch size / LR adjustments
  • The warmdown submission found that well-trained models (many steps) benefit less from aggressive length extrapolation at eval — eval@1408 (1.375×) was optimal vs eval@2048 hurting
  • NTK-RoPE can extend effective context at eval time beyond training length, but gains depend on how well-trained the model is
  • Combining train@2048+ with sliding window eval at that longer length is likely a strong combo that nobody has fully exploited yet
  • Batch token budget may need reduction (the 4k submission used 393k tokens/step instead of 524k) to fit longer sequences in memory
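On the NTK-RoPE bullet: the commonly used "NTK-aware" trick rescales the rotary base so the lowest frequencies stretch to the longer eval length while the highest stay near their trained values. A sketch of just the frequency math, assuming the standard formulation (scale factor s = eval_len / train_len, base' = base * s^(d/(d-2)) for head dimension d):

```python
def ntk_rope_base(train_len, eval_len, head_dim, base=10000.0):
    """NTK-aware RoPE base rescaling for eval beyond the training length."""
    s = eval_len / train_len
    return base * s ** (head_dim / (head_dim - 2))

def rope_inv_freqs(head_dim, base):
    """Standard RoPE inverse frequencies for an even head dimension."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

For the eval@1408 sweet spot above, s = 1.375, so with 64-dim heads the base grows by roughly 1.375^(64/62) ≈ 1.39x; no retraining is needed since only the position encoding changes.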

Born-Again Networks

  • Train the 16MB model, then re-initialize and train a fresh 16MB model using the first one as teacher. Multiple generations of this can sometimes improve results. The question is whether the time budget allows multiple generations.

Ensembles

  • This is obvious from the Kaggle days.
  • Of course, compute is a problem.

Data augmentation

  • Obvious one. Expand docs via data augmentation; see the recent megadocs paper.
  • Big question: whether the augmentation can even be useful within the 10 minutes we have for training.
  • Test-time augmentation?

Prefix cost

  • A recent approach on the leaderboard stores weights in the prefix, so the cost is eaten there. But could this be done better?
