
Commit 4a58342

Share observations
1 parent f144253 commit 4a58342

File tree (1 file changed: +13 −1 lines changed)

  • serverless-fleets/tutorials/inferencing

serverless-fleets/tutorials/inferencing/README.md

Lines changed: 13 additions & 1 deletion
@@ -4,6 +4,9 @@ This tutorial provides a comprehensive guide on using Serverless GPUs to perform

![](../../images/inferencing-highlevel-architecture.png)

+
+## Use Case
+
The concrete example extracts the temperature and duration of a set of cookbook recipes (from [recipebook](https://github.com/dpapathanasiou/recipebook)) by using vLLM. Such a cookbook recipe looks like:
```
{
@@ -63,7 +66,7 @@ Key steps covered in the tutorial:

> Note: The tutorial uses the [IBM Granite-4.0-Micro](https://huggingface.co/ibm-granite/granite-4.0-micro) model, which vLLM downloads from Hugging Face during the first run. Since `~/.cache/huggingface` in the container is mounted to the COS bucket, the model is downloaded from COS on subsequent runs. (Tip: advanced users might want to create a separate bucket to act as a model cache.)
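The extraction described in the use case could be sketched as a request to vLLM's OpenAI-compatible chat endpoint. This is a hypothetical illustration, not the tutorial's actual code: the prompt wording, the recipe snippet, and the endpoint shape (`POST /v1/chat/completions`) are assumptions on my part.

```python
import json

# Hypothetical sketch of a request body for vLLM's OpenAI-compatible
# chat endpoint (POST /v1/chat/completions). The prompt wording and the
# recipe snippet are illustrative assumptions, not the tutorial's code.
recipe_text = "Preheat the oven to 180C and bake for 45 minutes."

payload = {
    "model": "ibm-granite/granite-4.0-micro",
    "messages": [
        {
            "role": "user",
            "content": (
                "Extract the oven temperature and the total duration from "
                "the following recipe. Answer as JSON with the keys "
                "'temperature' and 'duration'.\n\n" + recipe_text
            ),
        }
    ],
}

print(json.dumps(payload, indent=2))
```

Sending one such request per recipe is what the batch of inferencing calls in the durations below consists of.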

-## Duration
+## Durations

The different phases of the use case take some time depending on the GPU family, the size of the container image and the size of the large language model. The following durations are expected.

@@ -81,6 +84,15 @@ The different phases of the use case take some time depending on the GPU family,

| **Inferencing per recipe and worker** | **0.028s** | **0.001s** |
| | | |

+## Observations
+
+In this example, a single worker with 8xH100 can perform 8,000 inferencing calls within 8 seconds. Since the worker has some initialization overhead, the H100 only pays off when batches are very large.
+
+H100s also allow running larger models, such as [Granite-4.0-H-Small](https://huggingface.co/ibm-granite/granite-4.0-h-small) with 32B parameters or [Qwen3-Coder-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Coder-30B-A3B-Instruct), which yield higher quality but also higher cost and lower throughput.
+
+For development use cases or small models, the L40S is a good alternative, as it has faster initialization times.
+
+However, for production use cases with sufficiently large batches, the H100 is more cost-efficient and recommended.
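The throughput claim above can be sanity-checked against the per-call figures in the durations table with simple arithmetic (the two per-call numbers are taken from the table; everything else is plain multiplication):

```python
# Cross-check the Observations against the per-call figures in the table.
calls = 8000                # inferencing calls in one batch
per_call_h100 = 0.001       # s per recipe and worker on 8xH100 (from the table)
per_call_l40s = 0.028       # s per recipe and worker on L40S (from the table)

h100_total = calls * per_call_h100   # 8 seconds: matches "8,000 calls within 8 seconds"
l40s_total = calls * per_call_l40s   # ~224 seconds for the same batch

print(f"8xH100: {h100_total:.0f}s, L40S: {l40s_total:.0f}s")
```

This pure-compute comparison ignores the initialization overhead mentioned above, which is exactly why the H100 advantage only materializes on large batches.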

## Steps
