teaching-on-testbeds/mltrain-chi: Deploy Ray on Chameleon.

In this tutorial, we explore some of the infrastructure and platform requirements for large model training, and for supporting the training of many models by many teams. We focus specifically on scheduling training jobs on a GPU cluster using Ray.
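To make "scheduling training jobs on a GPU cluster" concrete, here is a toy sketch of the core problem a cluster scheduler solves: jobs arrive in a queue, each declares how many GPUs it needs, and a job may start only when that many GPUs are free. This is a simplified illustration, not Ray's scheduler. The `Job` class and `schedule` function are hypothetical, and the model treats execution as discrete "rounds" in which all running jobs finish together. In Ray itself, tasks and actors declare their resource needs (for example with `num_gpus` in `@ray.remote(...)`), and Ray's scheduler places them using logic far more sophisticated than this sketch.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    """A hypothetical training job that requests a number of GPUs."""
    name: str
    gpus: int


def schedule(jobs, total_gpus):
    """Simulate strict FIFO admission of jobs onto a node with `total_gpus` GPUs.

    Returns a list of "rounds"; each round lists the jobs that run
    concurrently. When a round ends, all of its GPUs are released.
    Strict FIFO means the job at the head of the queue blocks later
    jobs, even if a later job would fit in the remaining GPUs.
    """
    queue = deque(jobs)
    rounds = []
    while queue:
        free = total_gpus
        running = []
        # Admit jobs from the head of the queue while their GPU request fits.
        while queue and queue[0].gpus <= free:
            job = queue.popleft()
            free -= job.gpus
            running.append(job.name)
        if not running:
            # The head job can never run on this node.
            raise ValueError(f"job {queue[0].name!r} needs more GPUs than the node has")
        rounds.append(running)
    return rounds


if __name__ == "__main__":
    jobs = [Job("a", 1), Job("b", 1), Job("c", 2), Job("d", 1)]
    print(schedule(jobs, total_gpus=2))
```

On a 2-GPU node, the two 1-GPU jobs `a` and `b` share the first round, the 2-GPU job `c` must wait for both GPUs to be free, and `d` is held behind `c` by the FIFO rule. That last effect (head-of-line blocking) is one reason real schedulers use smarter placement policies.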

Follow along at Train ML models with Ray.

Note: this tutorial requires advance reservation of specific hardware! You will need a node with 2 GPUs suitable for model training. You should reserve a 3-hour block for the Ray experiment.

You can use either:

a gpu_mi100 node at CHI@TACC (make sure the one you select has 2 GPUs), or
a compute_liqid node at CHI@TACC (again, make sure the one you select has 2 GPUs).
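The selection rule above can be expressed as a simple filter. This is a minimal sketch, assuming you have already gathered each candidate node's site, node type, and GPU count from wherever you look up node properties. The `Node` record and the `pick_nodes` helper are hypothetical and are not part of any Chameleon API; the node names in the usage example are made up.

```python
from dataclasses import dataclass

# Node types and site named in this tutorial.
SUPPORTED_NODE_TYPES = {"gpu_mi100", "compute_liqid"}
REQUIRED_SITE = "CHI@TACC"
REQUIRED_GPUS = 2  # the Ray experiment needs a node with 2 GPUs


@dataclass
class Node:
    """Hypothetical description of a candidate node."""
    name: str
    site: str
    node_type: str
    gpu_count: int


def is_suitable(node):
    """True if the node is a supported type at CHI@TACC with at least 2 GPUs."""
    return (
        node.site == REQUIRED_SITE
        and node.node_type in SUPPORTED_NODE_TYPES
        and node.gpu_count >= REQUIRED_GPUS
    )


def pick_nodes(nodes):
    """Return the names of candidate nodes that meet the requirements."""
    return [n.name for n in nodes if is_suitable(n)]


if __name__ == "__main__":
    candidates = [
        Node("node-1", "CHI@TACC", "gpu_mi100", 2),
        Node("node-2", "CHI@TACC", "compute_liqid", 1),  # only 1 GPU: rejected
        Node("node-3", "CHI@TACC", "compute_liqid", 2),
    ]
    print(pick_nodes(candidates))
```

The GPU-count check matters because, as noted above, not every node of these types has 2 GPUs; matching on node type alone is not enough.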

This material is based upon work supported by the National Science Foundation under Grant No. 2230079.

