teaching-on-testbeds/serve-system-chi

In this tutorial, we explore system-level optimizations for model serving. You will:

  • learn how to wrap a model in an HTTP endpoint using FastAPI (a minimal sketch follows this list)
  • explore system-level optimizations, including concurrency and batching, in Triton Inference Server (an illustrative configuration also follows below)
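
To make the first item concrete, here is a minimal sketch of wrapping a model in a FastAPI endpoint. The model file (`model.pt`), the PyTorch backend, the request schema, and the `/predict` path are all illustrative assumptions, not details taken from the tutorial itself:

```python
# Minimal FastAPI wrapper around a model. The model file, input schema,
# and endpoint path are hypothetical; adapt them to your own model.
import torch
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

model = torch.jit.load("model.pt")  # hypothetical TorchScript model
model.eval()

class PredictRequest(BaseModel):
    features: list[float]  # flat input vector; actual shape is model-specific

@app.post("/predict")
def predict(req: PredictRequest):
    with torch.no_grad():  # inference only, no gradient tracking
        x = torch.tensor(req.features).unsqueeze(0)  # add a batch dimension
        y = model(x)
    return {"prediction": y.squeeze(0).tolist()}
```

If this lives in a file named `app.py`, you can serve it with `uvicorn app:app --host 0.0.0.0 --port 8000` and send POST requests to `/predict`.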
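For the second item, concurrency and batching are typically enabled in a Triton model's `config.pbtxt`. The model name, backend, and batch sizes below are assumptions for illustration; the general pattern is `instance_group` for concurrent model instances and `dynamic_batching` for server-side batching:

```
# Illustrative config.pbtxt; the name, platform, and sizes are hypothetical.
name: "my_model"
platform: "onnxruntime_onnx"   # backend depends on your model format
max_batch_size: 16

# Concurrency: one model instance on each of the node's two P100 GPUs
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

# Dynamic batching: group individual requests into server-side batches,
# waiting up to 100 microseconds to form a preferred batch size
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

Raising `count` or widening `preferred_batch_size` trades per-request latency for throughput, which is the kind of trade-off these optimizations expose.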

Follow along at System optimizations for serving machine learning models.

Note: this tutorial requires advance reservation of specific hardware! You should reserve:

  • A gpu_p100 node at CHI@TACC, which has two NVIDIA P100 GPUs

and you will need a 3-hour block of time.


This material is based upon work supported by the National Science Foundation under Grant No. 2230079.
