In this tutorial, we explore some of the infrastructure and platform requirements for training large models and for supporting the training of many models by many teams. We focus specifically on scheduling training jobs on a GPU cluster using Ray.
Follow along at Train ML models with Ray.
Note: this tutorial requires advance reservation of specific hardware! You will need a node with 2 GPUs suitable for model training. You should reserve a 3-hour block for the Ray experiment.
You can use either:
- a gpu_mi100 at CHI@TACC (but make sure the one you select has 2 GPUs), or
- a compute_liqid at CHI@TACC (again, make sure the one you select has 2 GPUs)
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.