# NanoPoor

NanoGPT-speedrunning for the poor T4 enjoyers

Colab Notebook

Inspired by Modded NanoGPT and my goat Jonas Geiping (Cramming), I trained a custom GPT I've been working on over at Dagonet and got it to a 3.28 val loss on a single T4.

## Important! Note / Future bugfix

As @main_horse pointed out, I wrote a method where the DSMoE class sends the current token to all experts and then applies the router weights. That removes the router's hard selection and turns it into a soft weighting instead. Hard routing gets the loss ~0.1 lower, or about 10 steps faster, but wallclock time per step is 2x longer and init was 8x longer; working on GEMMs.
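
For reference, here is a minimal PyTorch sketch of the difference (not the repo's actual DSMoE code; the `ToyMoE` name, shapes, and defaults are made up for illustration): the soft version runs every expert on every token and mixes the outputs with the router weights, while the hard version only runs each token through its top-k experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, dim=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward_soft(self, x):
        # Soft/buggy version: every expert sees every token, outputs are
        # mixed by the router weights -> no hard selection at all.
        w = F.softmax(self.router(x), dim=-1)                      # (N, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (N, dim, n_experts)
        return (outs * w.unsqueeze(-2)).sum(dim=-1)

    def forward_hard(self, x):
        # Intended version: each token only goes through its top-k experts.
        w = F.softmax(self.router(x), dim=-1)
        topw, topi = w.topk(self.top_k, dim=-1)                    # (N, top_k)
        topw = topw / topw.sum(dim=-1, keepdim=True)               # renormalize kept weights
        out = torch.zeros_like(x)
        for e_idx, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = topi[..., slot] == e_idx                    # tokens routed to this expert
                if mask.any():
                    out[mask] += topw[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

The Python-level loop over experts in `forward_hard` is also roughly why the hard version is slower per step until the per-expert matmuls get fused into grouped GEMMs.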

Caveats:

- Fewer params than the 120M of the main speedrun, for stability
- The data was just a 1B-token subset of finewebedu10b, not filtered or anything; I had only processed that much at the time, will probably fix this later

## Runs

| Ranking | Time - Date | Data | Person | Description | Log |
|---|---|---|---|---|---|
| 1 | 7.09m - 4/5/25 | ~3.27M tok (1024 * 8 * 4 * 100) | Vatsa | GPT-2 tokenizer with shrunk vocab_size; also shrunk head_lm and n_experts for stability; fewer params, now at ~73M | log |
| 2 | 11.69m - 4/4/25 | ~3.93M tok (1024 * 8 * 4 * 120) | Vatsa | lr tuning (5e-4) | log |
| 3 | 14.86m - 4/2/25 | ~5.21M tok (1024 * 8 * 4 * 160) | Vatsa | 3x lr, removed ckpt saves every step, less printing | log |
| 4 | 15.04m - 4/1/25 | ~3.89M tok (1024 * 5 * 4 * 190) | Vatsa | Used Muon instead | log |
| 5 | 37.17m - 4/1/25 | ~6.14M tok (1024 * 5 * 4 * 300) | Vatsa | Added PSGD | log |
| 6 | 70.61m - 3/31/25 | ~14M tok (1024 * 6 * 4 * 570) | Vatsa | First run, has DS-MoE, MLA+NSA hybrid, RoPE, etc. | log |
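
The Data column is just the product of the numbers in parentheses; my read of the factors is seq_len * batch_size * grad_accum * steps, but the exact names are a guess. For example:

```python
# e.g. the 7.09m run from the table above
tokens = 1024 * 8 * 4 * 100
print(f"{tokens:,} tokens")   # 3,276,800 -> ~3.27M tok
```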

## Unofficial Runs

| Ranking | Time - Date | Data | Person | Description | Log |
|---|---|---|---|---|---|
| 1st | 7.63m - 4/1/25 | ~6.96M tok (1024 * 10 * 4 * 170) | Vatsa | Reran the (15.04m - 4/1/25) setup on an A100 to see how it looks on a real GPU | log |
