In this tutorial, we practice fine-tuning a large language model. We will use a selection of techniques that allow us to train models that would not otherwise fit in GPU memory:
- gradient accumulation
- reduced precision
- activation checkpointing
- CPU offload
- parameter-efficient fine-tuning
- distributed training across multiple GPUs
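To build intuition for the first of these techniques, here is a minimal framework-free sketch of gradient accumulation. It uses a hypothetical 1-D linear model with MSE loss (not code from the lab) to show that averaging the gradients of several micro-batches before the optimizer step reproduces the full-batch gradient, which is why accumulation lets us simulate a large batch with less GPU memory:

```python
# Hypothetical sketch of gradient accumulation, using a toy
# 1-D linear model y_hat = w * x with MSE loss (not the lab's code).

def grad(w, xs, ys):
    # dL/dw for L = mean((w*x - y)^2) over the given batch
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Full-batch gradient: what we would compute if the whole batch fit in memory
g_full = grad(w, xs, ys)

# Gradient accumulation: split into two equal micro-batches and
# average their gradients before taking a single optimizer step
micro_batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
g_accum = sum(grad(w, mx, my) for mx, my in micro_batches) / len(micro_batches)

# The accumulated gradient matches the full-batch gradient
assert abs(g_full - g_accum) < 1e-9
```

In a real training loop the same idea appears as calling `backward()` on each micro-batch (which sums gradients in place) and stepping the optimizer only every N micro-batches; only one micro-batch's activations need to be in memory at a time.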
Follow along at Large-scale model training on Chameleon.
This lab has two parts:
- single/: single-GPU large-model training, requires an A100 80GB or H100 GPU
- multi/: multi-GPU large-model training, requires 4x A100 80GB or 4x H100 GPUs
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.