Nanotron is a library for pretraining transformer models. It is designed to be easy to use, fast, and scalable, and is built with the following principles in mind:
- Simplicity: Nanotron is designed to be easy to use. It provides a simple and flexible API to pretrain models on custom datasets.
- Scalability: Nanotron uses the latest techniques to train models more efficiently at scale.
- Speed: This version of Nanotron focuses on HPC-oriented optimizations, typically made available via C++ extensions.
We recommend using Spack to install this version of Nanotron.
```bash
git clone -c feature.manyFiles=true --depth=2 https://github.com/spack/spack.git
git clone https://github.com/korovod/korovod-spack-packages.git
cd spack/bin
./spack repo add ../../korovod-spack-packages
./spack install py-nanotron
```
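Once the build finishes, a quick sanity check is to load the package into your shell and import it. This is a minimal sketch: `spack load` is a standard Spack command, and `nanotron` is assumed to be the Python module name exposed by `py-nanotron`.

```bash
# Make py-nanotron and its dependencies available in the current shell
./spack load py-nanotron

# Import the module to confirm the installation (module name assumed to be `nanotron`)
python -c "import nanotron; print(nanotron.__file__)"
```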
> **Tip:** It is advised to maintain a proper Spack environment to ensure reproducibility.
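As a minimal sketch (all commands are standard Spack; the environment name `nanotron-env` is arbitrary), a dedicated environment could be set up like this:

```bash
# Enable Spack's shell integration so environments can be activated in this shell
# (path is relative to spack/bin)
. ../share/spack/setup-env.sh

# Create and activate a named environment (the name is arbitrary)
spack env create nanotron-env
spack env activate nanotron-env

# Record the spec in the environment and install it
spack add py-nanotron
spack install
```

The environment's `spack.yaml` and generated `spack.lock` can then be kept under version control to rebuild the exact same software stack later.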
To install a C++ extension, simply use the corresponding Spack variant:
```bash
./spack install py-nanotron +py-datastates
```
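If you are unsure which extensions are available, the standard `spack info` command lists a package's variants (assuming the korovod recipe declares its C++ extensions as variants, as the command above suggests):

```bash
# Show the variants, versions and dependencies declared for py-nanotron
./spack info py-nanotron
```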
> **Tip:** We log to wandb automatically if it's installed (and the command above will install it). If you don't want to use wandb, you can run `wandb disabled`.
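For reference, here is a minimal sketch of two ways to turn wandb off (standard wandb CLI and environment variable; logging can be restored later with `wandb enabled`):

```bash
# Disable wandb logging for this user/machine
wandb disabled

# Or disable it only for the current shell session via the environment
export WANDB_MODE=disabled
```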
First, have a look at the Ultrascale Playbook, a comprehensive guide to scaling LLM training efficiently with Nanotron.
A good starting point is understanding the memory usage of a given model configuration. The Nanotron team created a tool for this purpose: just paste your YAML configuration to generate memory diagrams.
The following command will train a tiny Llama model on a single node with 8 GPUs. The model will be saved in the `checkpoints` directory as specified in the config file.

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
```
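To scale the same run beyond one node, the standard torchrun rendezvous flags apply. This is a sketch only: it assumes two 8-GPU nodes, that `<node0-hostname>` is replaced with the first node's address, and that the parallelism sizes in the YAML config are adjusted so that dp × tp × pp matches the total number of GPUs.

```bash
# Launch once per node, with --node_rank=0 on the first node and --node_rank=1 on the second
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nnodes=2 --nproc_per_node=8 \
    --node_rank=0 --master_addr=<node0-hostname> --master_port=29500 \
    run_train.py --config-file examples/config_tiny_llama.yaml
```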
You can then run inference from a saved checkpoint:

```bash
torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1
# We could set a larger TP for faster generation, and a larger PP in case of very large models.
```
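As a concrete sketch of that comment, generation with tensor parallelism of 2 would be launched like this (assuming the checkpoint at `checkpoints/10/` exists and that the number of launched processes has to match tp × pp):

```bash
# Two processes for TP=2, PP=1: the launched world size matches tp * pp
torchrun --nproc_per_node=2 run_generate.py --ckpt-path checkpoints/10/ --tp 2 --pp 1
```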
You can find more examples in the `/examples` directory:
| Example | Description |
|---|---|
| `custom-dataloader` | Plug a custom dataloader to nanotron |
| `datatrove` | Use the datatrove library to load data |
| `doremi` | Use DoReMi to speed up training |
| `mamba` | Train an example Mamba model |
| `moe` | Train an example Mixture-of-Experts (MoE) model |
| `mup` | Use spectral µTransfer to scale up your model |
| `examples/config_tiny_llama_with_s3_upload.yaml` | For automatically uploading checkpoints to S3 |
We're working on adding more examples soon! Feel free to open a PR to add your own example. 🚀
We currently support the following features:
- 3D parallelism (DP+TP+PP)
- Expert parallelism for MoEs
- AFAB and 1F1B schedules for PP
- Explicit APIs for TP and PP which enable easy debugging
- ZeRO-1 optimizer
- FP32 gradient accumulation
- Parameter tying/sharding
- Custom module checkpointing for large models
- Spectral µTransfer parametrization for scaling up neural networks
- Mamba example
- Asynchronous checkpointing
And we have on our roadmap:
- Data-parallel checkpointing for reducing I/O pressure
- FP8 training
- Bucketed ZeRO-1
- ZeRO-3 optimizer (a.k.a FSDP)
- `torch.compile` support
- Ring attention
- Interleaved 1F1B schedule
- Interleaved offloading of ZeRO states
- Efficient expert parallelism
The following models are currently supported:
- Mistral 7B
- Qwen
- Llama 3.2
- Llama 3.1
- StarCoder2
We thank the Hugging Face team for their work on the original project.
We would like to thank everyone working on LLMs, especially those who share their work openly and from whom we took great inspiration: NVIDIA for Megatron-LM/apex, Microsoft for DeepSpeed, HazyResearch for flash-attn, and ANL for datastates.