the purpose of this guide is to demonstrate how to train a Byte-Pair Encoding (BPE) tokenizer at a scale that's actually usable, thanks to GPUs. by "usable" I mean two things: 1) most guides on the internet (e.g. Karpathy's) run not only on the CPU but, even worse, in pure Python, meaning they're too slow to run on a large dataset, and 2) you need a large dataset in order to avoid a handful of large documents biasing the distribution. the common practice I've observed is to just use pre-trained tokenizers, but I prefer doing things from scratch as it 1) allows for experimentation at the tokenizer level and 2) ensures understanding. I have seen tokenizers built in faster languages such as Rust, but I'm a GPU programmer, not a Rust programmer, and I'd bet GPUs are still much faster and capable of handling larger datasets for this task (somebody please fact-check me on that).
the repo is split into three parts.
pip install -r requirements.txt
- first up is a traditional training run on the CPU, for demonstration purposes, similar to Karpathy's lesson or the tiktoken example implementation. use the arguments `-v` to set the vocabulary size and `-n` to set the number of characters (not documents!) of Fineweb to train on (defaults 1000 and 2^20 respectively). this will train on the order of single-digit merges per second
python train_on_CPU.py -v 1000 -n 1048576
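
under the hood, the CPU version is the textbook algorithm: count every adjacent pair of token ids, merge the most frequent pair into a new id, and repeat until the vocabulary is full. a minimal sketch (my simplification, ignoring the regex pre-split and other details of the actual script) looks something like this:

```python
# minimal sketch of CPU BPE training -- not the repo's exact code
from collections import Counter

def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge_pair(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    ids = list(text.encode("utf-8"))        # start from raw bytes (ids 0-255)
    merges = {}
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)   # most frequent adjacent pair
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return merges
```

each iteration walks the entire dataset in Python, which is exactly why this version tops out at single-digit merges per second.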
- next is the actual GPU algorithm (derived initially from this repo, although I had to change the algorithm significantly in order to support regex), which of course requires an Nvidia GPU. you might be able to get it running on Apple or AMD GPUs, although they often don't yet have full support for rarely-used operations, so I won't guarantee it. this will train on the order of triple-digit merges per second for the same sized dataset, a two-orders-of-magnitude speedup
python train_on_GPU.py -v 1000 -n 1048576
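
the speedup comes from the fact that the expensive part of each merge, counting pair frequencies, is trivially parallel. a rough sketch of the idea (again my simplification, ignoring the regex chunk boundaries the real kernels have to respect):

```python
# rough sketch of GPU pair counting -- illustrative only, not the repo's kernels
import torch

def most_frequent_pair(ids: torch.Tensor, vocab_size: int):
    """ids: 1-D integer tensor of token ids living on the GPU."""
    # fuse each adjacent pair (a, b) into the single integer a * vocab_size + b
    pairs = ids[:-1].long() * vocab_size + ids[1:].long()
    uniq, counts = torch.unique(pairs, return_counts=True)  # counted in parallel on-device
    best = int(uniq[counts.argmax()])
    return best // vocab_size, best % vocab_size

ids = torch.randint(0, 256, (1 << 20,), device="cuda")
print(most_frequent_pair(ids, 256))
```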
- finally, this is what you really came here for: training on multiple GPUs. values here default to a vocabulary of (2^16)-2 so that token ids fit in int16 (we need those last two values for something else) and a character count of 2^27 (the highest power of 2 that fits in 8GB of VRAM, assuming vocab size <= (2^16)-2). my guess for how large a character count each 80GB GPU could handle is a bit over a billion, but I'll come back and update this readme once I've actually tested that limit. this has a couple of key upgrades over the prior script (each sketched after the launch command below):
  - using pytorch's distributed package to communicate between GPUs. the algorithm that decides what to communicate is more complicated than what you'll see in regular ML model training because the tensor shapes are heterogeneous across GPUs
  - pytorch doesn't support many operations on unsigned integer dtypes, so in order to get the most out of int16 (or int32 if your vocab size is huge) I had to implement a simple trick reminiscent of countable-infinity proofs
  - if you've got enough high-end GPUs, there's a good chance your collective GPU VRAM has more capacity than your CPU's RAM, since each byte-level character takes up only 8 bits while being loaded but is represented as 16 or even 32 bits on the GPU. to prevent an OOM error in CPU RAM, we download the data, pre-tokenize it at the character level, and store it in .bin files, then load it onto the GPUs in chunks. a boring and cumbersome edit really, but unfortunately necessary if you're using 8x A100s or similar
torchrun --nproc_per_node=8 train_on_many_GPUs.py -v 65534 -n 1073741824
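
to give a flavor of the distributed part: each rank counts pairs in its own shard, and the counts get summed across GPUs so every rank agrees on the same merge. the sketch below does this with a dense count vector and `all_reduce`, which only works for a toy vocab; the real script has to be cleverer, because the per-rank candidate tensors have different shapes and a dense (2^16)^2 count vector would never fit in memory.

```python
# simplified sketch of the cross-GPU agreement step -- the real algorithm is more involved
import torch
import torch.distributed as dist

def agree_on_best_pair(local_ids: torch.Tensor, vocab_size: int):
    """Each rank counts its local adjacent pairs, then counts are summed across
    ranks so every GPU picks the same merge."""
    pairs = local_ids[:-1].long() * vocab_size + local_ids[1:].long()
    counts = torch.bincount(pairs, minlength=vocab_size * vocab_size)
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)   # global pair frequencies
    best = int(counts.argmax())
    return best // vocab_size, best % vocab_size
```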
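as for the signed-integer workaround, the flavor of it is a bijection between the unsigned range 0..65535 and int16's signed range, echoing the 0, -1, 1, -2, 2, ... enumeration used to show the integers are countable. below is a zigzag-style mapping purely as an illustration of that kind of trick (not necessarily the script's exact mapping):

```python
# zigzag-style bijection between unsigned 0..65535 and the signed int16 range
# (illustration of the general idea only, not necessarily the script's exact mapping)
import torch

def to_signed(u: torch.Tensor) -> torch.Tensor:
    """Map unsigned ids 0..65535 into int16 without collisions:
    even ids go to 0, 1, 2, ... and odd ids go to -1, -2, -3, ..."""
    return torch.where(u % 2 == 0, u // 2, -(u // 2) - 1).to(torch.int16)

def to_unsigned(s: torch.Tensor) -> torch.Tensor:
    """Invert the mapping back to 0..65535."""
    s = s.long()
    return torch.where(s >= 0, 2 * s, -2 * s - 1)

u = torch.arange(65536)
assert torch.equal(to_unsigned(to_signed(u)), u)   # round-trips the full 16-bit range
```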
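and for the chunked loading, the idea is just to keep the pre-tokenized bytes memory-mapped on disk and only widen them to 16 (or 32) bits as each chunk is copied to VRAM. something along these lines (the file name and chunk size here are placeholders, not the repo's actual values):

```python
# sketch of streaming a pre-tokenized .bin shard onto the GPU in chunks
# (file name and chunk size are placeholders, not the repo's actual values)
import numpy as np
import torch

CHUNK = 1 << 27  # characters moved to the GPU at a time

def stream_chunks(path):
    data = np.memmap(path, dtype=np.uint8, mode="r")  # bytes stay on disk, not in CPU RAM
    for start in range(0, len(data), CHUNK):
        chunk = np.asarray(data[start:start + CHUNK], dtype=np.int16)  # widen just before the copy
        yield torch.from_numpy(chunk).cuda()

for ids in stream_chunks("fineweb_shard_000.bin"):
    ...  # run the merge-counting step on this chunk
```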
that's all! if you're interested in guides/demos for amateurs that actually border on big-LLM-lab capabilities, rather than comically tiny (and therefore not actually usable) toy examples, check out my other repo gpt-lab. it's currently in alpha, but my plan is to do something similar to this repo for the entire LLM pre-training process, in a manner that helps amateurs (with hopefully a little bit of self-funding) run reasonable-ish scale experiments from scratch in a replicable and quickly iterable way