5 changes: 5 additions & 0 deletions .github/workflows/cicd-main.yml
@@ -252,6 +252,11 @@ jobs:
timeout: 20
runner: ${{ needs.pre-flight.outputs.runner_prefix }}-gpu-x2
test-data-path: ${{ needs.pre-flight.outputs.test_data_path }}
- test-name: L2_vLLM_Deployment
test-folder: vllm_deployment
timeout: 30
runner: ${{ needs.pre-flight.outputs.runner_prefix }}-gpu-x2
test-data-path: ${{ needs.pre-flight.outputs.test_data_path }}
needs: [pre-flight, cicd-unit-tests]
runs-on: ${{ matrix.runner }}
name: ${{ matrix.test-name }}
43 changes: 43 additions & 0 deletions .github/workflows/config/.secrets.baseline
@@ -195,6 +195,45 @@
"line_number": 101
}
],
"tests/unit_tests/launcher/test_template.py": [
{
"type": "Secret Keyword",
"filename": "tests/unit_tests/launcher/test_template.py",
"hashed_secret": "829c3804401b0727f70f73d4415e162400cbe57b",
"is_verified": false,
"line_number": 25
}
],
"tests/unit_tests/config/test_allowed_import_prefixes.py": [
{
"type": "Secret Keyword",
"filename": "tests/unit_tests/config/test_allowed_import_prefixes.py",
"hashed_secret": "a761ce3a45d97e41840a788495e85a70d1bb3815",
"is_verified": false,
"line_number": 80
},
{
"type": "Secret Keyword",
"filename": "tests/unit_tests/config/test_allowed_import_prefixes.py",
"hashed_secret": "4fa2519ad2d85de09d4f42a73eb361a01efc1549",
"is_verified": false,
"line_number": 81
},
{
"type": "Secret Keyword",
"filename": "tests/unit_tests/config/test_allowed_import_prefixes.py",
"hashed_secret": "77ace0be5333763f83a5456cf55fa362e23a2538",
"is_verified": false,
"line_number": 97
},
{
"type": "Secret Keyword",
"filename": "tests/unit_tests/config/test_allowed_import_prefixes.py",
"hashed_secret": "aa6878b1c31a9420245df1daffb7b223338737a3",
"is_verified": false,
"line_number": 101
}
],
"tests/unit_tests/launcher/test_template.py": [
{
"type": "Secret Keyword",
@@ -205,5 +244,9 @@
}
]
},
"generated_at": "2026-03-03T20:22:24Z"
}
298 changes: 298 additions & 0 deletions docs/guides/llm/deployment.md
@@ -0,0 +1,298 @@
# Deploying Models with vLLM and SGLang

NeMo AutoModel saves every checkpoint in **native Hugging Face format** (Safetensors + `config.json` + tokenizer).
This means the same checkpoint directory that NeMo AutoModel writes can be loaded directly (without any **conversion step**) by any tool in the Hugging Face ecosystem, including the two most popular inference engines: [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang).

Just point the engine at your checkpoint path and serve:

```bash
vllm serve checkpoints/epoch_0_step_100/model/consolidated/ --port 8000
```

Both engines expose an **OpenAI-compatible API**, so you can swap them in without changing client code.
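
Because the API surface is identical, a client only needs a different base URL to switch engines. The sketch below builds such a request with only the Python standard library; the `chat_request` helper and the port numbers are illustrative, not part of either engine:

```python
import json
from urllib import request

def chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 64) -> request.Request:
    """Build an OpenAI-compatible /v1/chat/completions request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# The same request works against vLLM (port 8000) or SGLang (port 30000):
req = chat_request("http://localhost:8000", "Qwen/Qwen2.5-0.5B-Instruct", "Hello")
# with request.urlopen(req) as resp:          # requires a running server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```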

:::{seealso}
- [Fine-Tuning Guide](finetune.md) — train or adapt a model before deployment.
- [Checkpointing Guide](../checkpointing.md) — checkpoint formats, consolidation, and Safetensors output.
:::

---

## Prerequisites

| Requirement | Minimum |
|-------------|---------|
| GPU | NVIDIA GPU with 8 GB+ VRAM (16 GB+ recommended for 1B+ models) |
| CUDA | 11.8+ |
| Python | 3.9+ |

Install either engine (or both):

```bash
pip install vllm # vLLM
pip install "sglang[all]" # SGLang
```

:::{tip}
If you are inside the NeMo AutoModel Docker container, vLLM is already installed.
:::

---

## Checkpoint Requirements

vLLM and SGLang load models in **Hugging Face format**. A valid checkpoint directory looks like:

```text
my-checkpoint/
├── config.json
├── tokenizer.json (or tokenizer_config.json + tokenizer.model)
├── model.safetensors (or sharded model-00001-of-*.safetensors + model.safetensors.index.json)
└── generation_config.json (optional)
```

NeMo AutoModel produces this layout automatically when you set `save_consolidated: true` and `model_save_format: safetensors` in the checkpoint config (the defaults).
See the [Checkpointing guide](../checkpointing.md) for details.

:::{important}
If your checkpoint directory contains **sharded DCP files** (`.distcp`) rather than consolidated Safetensors, you need to consolidate them first.
Use the AutoModel checkpoint utility or the recipe's `save_consolidated: true` flag to produce a Hugging Face-compatible directory.
:::
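
Before serving, it can be worth sanity-checking a directory against the layout above. The `is_servable_checkpoint` helper below is a convenience sketch, not an AutoModel API:

```python
from pathlib import Path

def is_servable_checkpoint(path: str) -> bool:
    """Heuristically check a directory against the Hugging Face layout above."""
    p = Path(path)
    has_config = (p / "config.json").is_file()
    has_weights = any(p.glob("*.safetensors"))  # consolidated or sharded
    has_tokenizer = any(
        (p / name).is_file() for name in ("tokenizer.json", "tokenizer_config.json")
    )
    return has_config and has_weights and has_tokenizer
```

If this returns `False` because the directory only holds `.distcp` shards, consolidate the checkpoint first as described above.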

---

## Quick Local Test with a Small Model

The examples below use **`Qwen/Qwen2.5-0.5B-Instruct`** (~0.5B parameters, ~1 GB on disk).
It runs comfortably on a single GPU with 4 GB+ VRAM and does not require any gated-access agreement, making it ideal for a quick smoke test.

:::{note}
If you already have access to `meta-llama/Llama-3.2-1B-Instruct`, you can substitute it in any of the commands below.
Just make sure your Hugging Face token is set:
```bash
export HF_TOKEN=hf_xxxxx
```
:::

### Deploy with vLLM

#### Option A: OpenAI-compatible API server

```bash
vllm serve Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

In a separate terminal, send a request:

```bash
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
    }' | python3 -m json.tool
```

#### Option B: Python offline inference

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is the capital of France?"], sampling_params=params)
print(outputs[0].outputs[0].text)
```

### Deploy with SGLang

#### Option A: OpenAI-compatible API server

```bash
python -m sglang.launch_server \
    --model-path Qwen/Qwen2.5-0.5B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```

Query it with the same `curl` style (change the port to `30000`):

```bash
curl -s http://localhost:30000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64
    }' | python3 -m json.tool
```

#### Option B: Python engine API

```python
import sglang as sgl

llm = sgl.Engine(model_path="Qwen/Qwen2.5-0.5B-Instruct")
output = llm.generate("What is the capital of France?", sampling_params={"max_new_tokens": 64})
print(output["text"])
```

---

## Deploy a Local Checkpoint

After fine-tuning with NeMo AutoModel, pass the **consolidated checkpoint path** instead of a Hugging Face Hub name.

### vLLM

```bash
vllm serve /path/to/checkpoints/epoch_0_step_100/model/consolidated/ \
    --host 0.0.0.0 \
    --port 8000
```

```python
from vllm import LLM, SamplingParams

llm = LLM(model="/path/to/checkpoints/epoch_0_step_100/model/consolidated/")
outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```

### SGLang

```bash
python -m sglang.launch_server \
    --model-path /path/to/checkpoints/epoch_0_step_100/model/consolidated/ \
    --host 0.0.0.0 \
    --port 30000
```

### Separate tokenizer

If the tokenizer is not inside the checkpoint directory, point to it explicitly:

```bash
# vLLM
vllm serve /path/to/checkpoint --tokenizer /path/to/tokenizer

# SGLang
python -m sglang.launch_server \
    --model-path /path/to/checkpoint \
    --tokenizer-path /path/to/tokenizer
```

---

## Deploy a LoRA / PEFT Adapter

vLLM supports serving LoRA adapters on top of a base model without merging weights.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", enable_lora=True)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(
["What is the capital of France?"],
sampling_params=params,
lora_request=LoRARequest("my-adapter", 1, "/path/to/lora/adapter"),
)
print(outputs[0].outputs[0].text)
```

Alternatively, use the NeMo AutoModel `vLLMHFExporter` helper:

```python
from nemo.export.vllm_hf_exporter import vLLMHFExporter

exporter = vLLMHFExporter()
exporter.export(model="/path/to/base/model", enable_lora=True)
exporter.add_lora_models(lora_model_name="my-adapter", lora_model="/path/to/lora/adapter")
print(exporter.forward(input_texts=["How are you?"], lora_model_name="my-adapter"))
```

---

## Common Configuration Flags

### vLLM

| Flag | Purpose |
|------|---------|
| `--tensor-parallel-size N` | Shard the model across N GPUs |
| `--dtype auto` | Auto-detect dtype (float16 / bfloat16) |
| `--max-model-len 4096` | Cap the maximum context length |
| `--gpu-memory-utilization 0.9` | Fraction of GPU memory to allocate |
| `--quantization awq` | Load a pre-quantized model (awq, gptq, etc.) |
| `--enforce-eager` | Disable CUDA graphs (useful for debugging) |

### SGLang

| Flag | Purpose |
|------|---------|
| `--tp N` | Tensor parallelism across N GPUs |
| `--dtype auto` | Auto-detect dtype |
| `--mem-fraction-static 0.85` | GPU memory fraction for KV cache |
| `--quantization awq` | Load a pre-quantized model |
| `--context-length 4096` | Maximum context length |
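
For scripted launches, these flags compose mechanically into a command line. The helper below is a hypothetical convenience; only the flag names come from the vLLM table above:

```python
import shlex
from typing import Optional

def vllm_serve_command(model: str, tp: int = 1,
                       max_model_len: Optional[int] = None,
                       gpu_mem_util: Optional[float] = None) -> str:
    """Assemble a `vllm serve` command from the flags in the table above."""
    parts = ["vllm", "serve", model, "--tensor-parallel-size", str(tp)]
    if max_model_len is not None:
        parts += ["--max-model-len", str(max_model_len)]
    if gpu_mem_util is not None:
        parts += ["--gpu-memory-utilization", str(gpu_mem_util)]
    return shlex.join(parts)  # shell-safe quoting

print(vllm_serve_command("Qwen/Qwen2.5-0.5B-Instruct", tp=2, max_model_len=4096))
```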

---

## Multi-GPU Deployment

For models that exceed the memory of a single GPU, increase the tensor-parallelism degree:

```bash
# vLLM on 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000

# SGLang on 4 GPUs
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-70B-Instruct \
    --tp 4 \
    --port 30000
```
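
A rough way to size the tensor-parallel degree is per-GPU weight memory: parameter count × bytes per parameter ÷ TP degree, before KV cache and activations. A back-of-the-envelope sketch, assuming bf16 weights at 2 bytes per parameter:

```python
def weight_gib_per_gpu(n_params_billion: float, tp: int, bytes_per_param: int = 2) -> float:
    """Approximate per-GPU weight memory in GiB (bf16/fp16 = 2 bytes/param)."""
    return n_params_billion * 1e9 * bytes_per_param / tp / 2**30

# Llama-3.1-70B in bf16 across 4 GPUs: ~32.6 GiB of weights per GPU,
# leaving headroom for KV cache and activations on 80 GB cards.
print(round(weight_gib_per_gpu(70, 4), 1))
```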

---

## Docker Deployment

Run vLLM as a self-contained container:

```bash
docker run --gpus all -p 8000:8000 \
    -v /path/to/local/checkpoint:/model \
    vllm/vllm-openai:latest \
    --model /model
```

Or use the NeMo AutoModel container (vLLM is pre-installed):

```bash
docker run --gpus all -it --rm \
    --shm-size=8g \
    -p 8000:8000 \
    -v /path/to/checkpoint:/model \
    nvcr.io/nvidia/nemo-automodel:25.11.00 \
    vllm serve /model --host 0.0.0.0 --port 8000
```

---

## Troubleshooting

| Problem | Solution |
|---------|----------|
| `FileNotFoundError: config.json` | The checkpoint path must point to the directory that **contains** `config.json`. If you used NeMo AutoModel, this is the `model/consolidated/` subdirectory. |
| `torch.cuda.OutOfMemoryError` | Reduce `--max-model-len`, lower `--gpu-memory-utilization`, or increase `--tensor-parallel-size`. |
| Tokenizer not found | Pass `--tokenizer` (vLLM) or `--tokenizer-path` (SGLang) explicitly. |
| Gated model 401 error | Set `export HF_TOKEN=hf_xxxxx` or run `huggingface-cli login`. |
| Slow first request | The first request warms up CUDA graphs and the KV cache. Subsequent requests will be faster. |