In this tutorial, we explore system-level optimizations for model serving. You will:
- learn how to wrap a model in an HTTP endpoint using FastAPI (a minimal sketch follows this list)
- explore system-level optimizations, including concurrency and batching, in Triton Inference Server (see the configuration sketch below)
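As a preview of the FastAPI portion, here is a minimal sketch of wrapping a model behind an HTTP endpoint. The `/predict` route, the request/response schemas, and the stand-in `run_model` function are hypothetical placeholders for illustration, not the tutorial's actual code.

```python
# Minimal sketch: serve a "model" behind an HTTP endpoint with FastAPI.
# run_model is a placeholder; in practice you would call your loaded model here.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    # hypothetical input: a flat list of feature values
    features: list[float]

class PredictResponse(BaseModel):
    prediction: float

def run_model(features: list[float]) -> float:
    # stand-in for real inference; returns the mean of the inputs
    return sum(features) / max(len(features), 1)

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    return PredictResponse(prediction=run_model(req.features))
```

Saved as `app.py`, this could be served with `uvicorn app:app --host 0.0.0.0 --port 8000` and queried with a POST request to `/predict`.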
Follow along at *System optimizations for serving machine learning models*.
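For the Triton portion, concurrency and batching are controlled per model through a `config.pbtxt` file. The sketch below is illustrative, with a hypothetical model name and placeholder batch sizes; it shows the two knobs the tutorial explores: `instance_group` (how many model instances run concurrently, and on which GPUs) and `dynamic_batching` (how Triton coalesces queued requests into batches).

```protobuf
# Hypothetical config.pbtxt illustrating concurrency and batching knobs.
name: "example_model"
max_batch_size: 8

# Concurrency: run one model instance on each of the two P100 GPUs
instance_group [
  { count: 1, kind: KIND_GPU, gpus: [ 0 ] },
  { count: 1, kind: KIND_GPU, gpus: [ 1 ] }
]

# Batching: let Triton combine queued requests, waiting up to
# 100 microseconds to form a larger batch before running inference
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```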
Note: this tutorial requires advance reservation of specific hardware! You should reserve:
- A `gpu_p100` node at CHI@TACC, which has two NVIDIA P100 GPUs

You will also need a 3-hour block of time.
This material is based upon work supported by the National Science Foundation under Grant No. 2230079.