
teaching-on-testbeds/mltrain-chi


In this tutorial, we explore some of the infrastructure and platform requirements for large model training, and for supporting the training of many models by many teams. We focus specifically on scheduling training jobs on a GPU cluster using Ray.

Follow along at Train ML models with Ray.
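To give a sense of what this looks like, here is a minimal sketch of scheduling GPU training jobs with Ray. The `train_model` function and its config are hypothetical stand-ins for a real training loop, and `ray.init(num_gpus=2)` simulates the tutorial's 2-GPU node on a local machine:

```python
import ray

# Start (or connect to) a Ray cluster. num_gpus=2 simulates the 2-GPU
# node used in this tutorial; on a real cluster, ray.init() would be
# pointed at the cluster address instead.
ray.init(num_gpus=2)

# Each task asks the Ray scheduler for one GPU, so with two GPUs
# available, two training jobs can run concurrently.
@ray.remote(num_gpus=1)
def train_model(config):
    # Placeholder for a real training loop.
    return {"config": config, "status": "done"}

futures = [train_model.remote({"lr": lr}) for lr in (1e-3, 1e-4)]
print(ray.get(futures))
```

Ray queues any task that cannot get a GPU immediately and runs it as resources free up, which is the scheduling behavior this tutorial explores at cluster scale.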

Note: this tutorial requires advance reservation of specific hardware! You will need a node with 2 GPUs suitable for model training, and you should reserve a 3-hour block for the Ray experiment. (A sketch of creating such a reservation follows the hardware list below.)

You can use either:

  • a gpu_mi100 node at CHI@TACC (make sure the one you select has 2 GPUs), or
  • a compute_liqid node at CHI@TACC (again, make sure the one you select has 2 GPUs)
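As a sketch of how such a reservation might be created with the python-chi client: the project ID and lease name below are placeholders, and the exact helper names and signatures may differ across python-chi versions, so treat this as illustrative rather than authoritative:

```python
import chi
from chi import lease

chi.use_site("CHI@TACC")
chi.set("project_name", "CHI-XXXXXX")  # placeholder: your project ID

# Reserve one GPU node for a 3-hour block.
reservations = []
lease.add_node_reservation(reservations, count=1, node_type="gpu_mi100")

start_date, end_date = lease.lease_duration(hours=3)
l = lease.create_lease("mltrain-lease", reservations,
                       start_date=start_date, end_date=end_date)
lease.wait_for_active(l["id"])  # block until the lease becomes active
```

Swap node_type for "compute_liqid" to reserve that hardware instead, and confirm before leasing that the specific node you select has 2 GPUs.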

This material is based upon work supported by the National Science Foundation under Grant No. 2230079.

