In this tutorial, we explore system-level optimizations for model serving. You will:
- learn how to wrap a model in an HTTP endpoint using FastAPI (a minimal sketch follows this list)
- explore system-level optimizations, including concurrency and batching, in Triton Inference Server (see the configuration sketch below)
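As a preview of the FastAPI portion, here is a minimal sketch of wrapping a model behind an HTTP endpoint. The `/predict` route, the request/response schemas, and the stand-in `run_model` function are hypothetical placeholders for illustration, not the tutorial's actual code.

```python
# Minimal sketch: serve a "model" behind an HTTP endpoint with FastAPI.
# run_model is a placeholder; in practice you would call your loaded model here.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    # hypothetical input: a flat list of feature values
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float

def run_model(features: list[float]) -> float:
    # stand-in for real inference; returns the mean of the inputs
    return sum(features) / max(len(features), 1)

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    return PredictResponse(prediction=run_model(req.features))
```

Saved as `app.py`, this could be served with `uvicorn app:app --host 0.0.0.0 --port 8000` and queried with a POST request to `/predict`.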
Follow along at *System optimizations for serving machine learning models*.
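For the Triton portion, concurrency and batching are controlled per model through a `config.pbtxt` file. The sketch below is illustrative, with a hypothetical model name and placeholder batch sizes; it shows the two knobs the tutorial explores: `instance_group` (how many model instances run concurrently, and on which GPUs) and `dynamic_batching` (how Triton coalesces queued requests into batches).

```protobuf
# Hypothetical config.pbtxt illustrating concurrency and batching knobs.
name: "example_model"
max_batch_size: 8

# Concurrency: run one model instance on each of the two P100 GPUs
instance_group [
  { count: 1, kind: KIND_GPU, gpus: [ 0 ] },
  { count: 1, kind: KIND_GPU, gpus: [ 1 ] }
]

# Batching: let Triton combine queued requests, waiting up to
# 100 microseconds to form a larger batch before running inference
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```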
Note: this tutorial requires advance reservation of specific hardware! You should reserve:
- A `gpu_p100` node at CHI@TACC, which has two NVIDIA P100 GPUs

You will also need a 3-hour block of time.
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.