
vLLM NVIDIA Wrapper

A Python wrapper around vLLM for running LLM inference on NVIDIA GPUs. Includes an interactive menu for selecting models, automatic GPU detection, and YAML-based configuration profiles.

Currently set up for Ubuntu (tested on 24.04).

Features

  • Interactive menu for model selection and configuration
  • Automatic GPU count detection and tensor parallelism setup
  • YAML configuration profiles for common models (Qwen, Gemma, LLaMA)
  • Interactive setup for creating custom model configurations
  • vRAM requirement estimation - predict memory usage before downloading models
  • Full CLI support with all vLLM parameters
  • OpenAI-compatible API server

Quick Start

  1. Run the setup script:

    ./setup.sh
  2. Activate the environment:

    source activate_vllm.sh
  3. Check GPU configuration:

    python multi_gpu_config.py
  4. Estimate vRAM requirements:

    python estimate_vram.py
  5. Test basic inference:

    python basic_inference.py
  6. Start API server:

    python api_server.py

    Interactive Menu

    The menu provides:

    • Profile selection with automatic GPU detection
    • Interactive profile creation
    • One-time manual configuration
    • Profile listing and management

    Command Line Interface

    # List available profiles
    python api_server.py --list-profiles
    
    # Use a profile
    python api_server.py --profile qwen3-30b-a3b-gptq-int4
    python api_server.py --profile redhat-gemma-3-27b-it-quantized-w4a16
    
    # Override profile settings
    python api_server.py --profile redhat-gemma-3-27b-it-quantized-w4a16 --port 8080 --max-model-len 8192
    
    # Force specific GPU count
    python api_server.py --profile qwen3-30b-a3b-gptq-int4 --tensor-parallel-size 1
    
    # Manual configuration with auto GPU detection (see the sketch after this list)
    python api_server.py \
      --model "meta-llama/Llama-2-13b-hf" \
      --tensor-parallel-size auto \
      --gpu-memory-utilization 0.9 \
      --max-model-len 4096 \
      --dtype float16
    
    # Use custom profile file
    python api_server.py --profile ~/my-profiles/custom.yaml
    
    # Show all options
    python api_server.py --help
  7. Pre-download models (optional):

    # List all models from profiles
    python predownload.py --list
    
    # Download specific model (bypasses vLLM memory checks)
    python predownload.py --model mistralai/Magistral-Small-2509
    
    # Download from profile
    python predownload.py --profile qwen3-30b-a3b-gptq-int4
    
    # Download all profile models
    python predownload.py --all
  8. Monitor GPU usage:

    python monitor_gpus.py
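
The --tensor-parallel-size auto option shown above works by counting how many GPUs are visible on the machine. A minimal sketch of that kind of detection (illustrative only; multi_gpu_config.py and the wrapper may implement it differently):

# Illustrative sketch of GPU-count detection for tensor parallelism.
# Not necessarily the wrapper's exact logic.
import torch

def detect_tensor_parallel_size():
    # Fall back to a single GPU if CUDA is unavailable.
    if not torch.cuda.is_available():
        return 1
    return torch.cuda.device_count()

print(f"Detected GPUs: {detect_tensor_parallel_size()}")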

Environment Reactivation

To reactivate the environment from anywhere:

source /path/to/vllm-nvidia/activate_vllm.sh

API Usage

The server is OpenAI-compatible. Start with:

# Activate environment first
source activate_vllm.sh

# Interactive menu
python api_server.py

# Or directly with a profile
python api_server.py --profile qwen3-30b-a3b-gptq-int4

Then test with curl:

curl -X POST "http://localhost:8000/v1/chat/completions" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "RedHatAI/gemma-3-27b-it-quantized.w4a16",
       "messages": [{"role": "user", "content": "Hello, how are you?"}],
       "max_tokens": 100
     }'
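
Because the server speaks the OpenAI API, you can also call it from Python with the openai client package (a minimal sketch; it assumes the openai package is installed and the gemma profile above is serving on port 8000):

# Minimal sketch using the openai client against the local vLLM server.
# Assumes the server started above is listening on port 8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="RedHatAI/gemma-3-27b-it-quantized.w4a16",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)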

Custom Profiles

Create your own model profiles as YAML files in the profiles/ directory:

name: my_model
description: My custom model configuration
model: path/to/model
tensor_parallel_size: auto  # or use a number like 2
gpu_memory_utilization: 0.95
max_model_len: 16384
dtype: bfloat16

See profiles/README.md for detailed documentation.
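
To sanity-check a profile before launching the server, you can load it with PyYAML (a quick sketch; the file name profiles/my_model.yaml is just the example above, and the wrapper itself may parse profiles differently):

# Quick sketch: load and inspect a profile file with PyYAML.
import yaml

with open("profiles/my_model.yaml") as f:  # hypothetical file name
    profile = yaml.safe_load(f)

print(profile["model"], profile.get("tensor_parallel_size"), profile.get("max_model_len"))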

vRAM Estimation Tool

The included estimate_vram.py tool helps predict memory requirements before downloading models:

# Activate environment first
source activate_vllm.sh

# Analyze all model profiles
python estimate_vram.py

# Show memory requirements per GPU instead of total
python estimate_vram.py --per-gpu

# Get optimization suggestions for models that don't fit
python estimate_vram.py --suggest

# Analyze a specific profile
python estimate_vram.py --profile qwen3-30b-a3b-gptq-int4

# Show verbose breakdown
python estimate_vram.py --verbose

The tool:

  • Fetches actual model sizes from HuggingFace API
  • Estimates KV cache and activation memory (rough formula sketched below)
  • Shows which models will fit in your available vRAM
  • Explains why some models fail to download
  • Provides optimization suggestions
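
As a rough illustration of the KV cache part of such an estimate (not necessarily the exact formula estimate_vram.py uses), the cache grows linearly with layer count, KV heads, head size, and context length:

# Back-of-the-envelope KV cache size, illustrative only.
def kv_cache_gib(num_layers, num_kv_heads, head_dim, max_model_len,
                 bytes_per_elem=2):
    # 2x for keys and values, one cache entry per layer per token position.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * max_model_len * bytes_per_elem)
    return total_bytes / 1024**3

# Hypothetical 30B-class configuration at a 16384-token context.
print(f"~{kv_cache_gib(48, 8, 128, 16384):.1f} GiB of KV cache")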

Pre-download Tool

For models that vLLM thinks won't fit but you want to try anyway:

# Activate environment first
source activate_vllm.sh

# List models and their download status
python predownload.py --list

# Download without vLLM's memory checks
python predownload.py --model mistralai/Magistral-Small-2509

# Download all models from profiles
python predownload.py --all

# Force re-download
python predownload.py --model some/model --force

This bypasses vLLM's memory estimation and downloads models directly from HuggingFace.
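
If you want to script the same thing, the Hugging Face Hub client can fetch a repository into the standard cache (a sketch under the assumption that huggingface_hub is installed; predownload.py itself may do more, such as filtering files or reading profiles):

# Sketch: fetch model weights into the ~/.cache/huggingface/ cache so
# vLLM finds them on the next server start. Illustrative only.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="mistralai/Magistral-Small-2509")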

Troubleshooting

  1. Out of Memory: Reduce gpu_memory_utilization or max_model_len (example below)
  2. Slow Loading: Ensure models are cached in ~/.cache/huggingface/
  3. CUDA Errors: Check nvidia-smi and restart if needed
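
For the out-of-memory case, the override flags shown earlier let you retry a profile with a smaller footprint, for example:

# Retry a profile with a shorter context and less aggressive memory use
python api_server.py --profile qwen3-30b-a3b-gptq-int4 --max-model-len 8192 --gpu-memory-utilization 0.85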
