A Python wrapper around vLLM for running LLM inference on NVIDIA GPUs. Includes an interactive menu for selecting models, automatic GPU detection, and YAML-based configuration profiles.
Currently set up for Ubuntu (tested on 24.04).
- Interactive menu for model selection and configuration
- Automatic GPU count detection and tensor parallelism setup
- YAML configuration profiles for common models (Qwen, Gemma, LLaMA)
- Interactive setup for creating custom model configurations
- vRAM requirement estimation - predict memory usage before downloading models
- Full CLI support with all vLLM parameters
- OpenAI-compatible API server
Run the setup script:
./setup.sh
Activate the environment:
source activate_vllm.sh
Check GPU configuration:
python multi_gpu_config.py
Estimate vRAM requirements:
python estimate_vram.py
Test basic inference:
python basic_inference.py
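For reference, a minimal offline-inference call through vLLM's Python API looks roughly like the sketch below; the model name and prompt are placeholders, and basic_inference.py in this repo may be configured differently:
# Minimal offline-inference sketch using vLLM's Python API.
# Model name and prompt are placeholders, not this repo's defaults.
from vllm import LLM, SamplingParams

prompts = ["Explain tensor parallelism in one sentence."]
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

# tensor_parallel_size should match the number of GPUs to shard across.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", tensor_parallel_size=1)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)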
Start API server:
python api_server.py
Interactive Menu
The menu provides:
- Profile selection with automatic GPU detection
- Interactive profile creation
- One-time manual configuration
- Profile listing and management
Command Line Interface
# List available profiles
python api_server.py --list-profiles

# Use a profile
python api_server.py --profile qwen3-30b-a3b-gptq-int4
python api_server.py --profile redhat-gemma-3-27b-it-quantized-w4a16

# Override profile settings
python api_server.py --profile redhat-gemma-3-27b-it-quantized-w4a16 --port 8080 --max-model-len 8192

# Force specific GPU count
python api_server.py --profile qwen3-30b-a3b-gptq-int4 --tensor-parallel-size 1

# Manual configuration with auto GPU detection
python api_server.py \
  --model "meta-llama/Llama-2-13b-hf" \
  --tensor-parallel-size auto \
  --gpu-memory-utilization 0.9 \
  --max-model-len 4096 \
  --dtype float16

# Use custom profile file
python api_server.py --profile ~/my-profiles/custom.yaml

# Show all options
python api_server.py --help
Pre-download models (optional):
# List all models from profiles
python predownload.py --list

# Download specific model (bypasses vLLM memory checks)
python predownload.py --model mistralai/Magistral-Small-2509

# Download from profile
python predownload.py --profile qwen3-30b-a3b-gptq-int4

# Download all profile models
python predownload.py --all
Monitor GPU usage:
python monitor_gpus.py
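If you prefer a programmatic check, the sketch below polls per-GPU memory and utilization with pynvml; it is an illustration only, not the contents of monitor_gpus.py:
# Illustrative GPU polling with pynvml (assumed installed); not monitor_gpus.py itself.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        print(f"GPU {i}: {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB used, "
              f"{util.gpu}% utilization")
finally:
    pynvml.nvmlShutdown()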
To reactivate the environment from anywhere:
source /path/to/vllm-nvidia/activate_vllm.sh
The server is OpenAI-compatible. Start with:
# Activate environment first
source activate_vllm.sh
# Interactive menu
python api_server.py
# Or directly with a profile
python api_server.py --profile qwen3-30b-a3b-gptq-int4
Then test with curl:
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "RedHatAI/gemma-3-27b-it-quantized.w4a16",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"max_tokens": 100
}'
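The same endpoint can also be called from Python with the official openai client. The sketch below assumes the default localhost:8000 address and the model name used above; the API key is a placeholder since vLLM does not require one unless configured:
# Sketch of calling the vLLM server through its OpenAI-compatible API.
# Assumes the server is running on localhost:8000 with the model shown above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="RedHatAI/gemma-3-27b-it-quantized.w4a16",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=100,
)
print(response.choices[0].message.content)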
Create your own model profiles as YAML files in the profiles/ directory:
name: my_model
description: My custom model configuration
model: path/to/model
tensor_parallel_size: auto # or use a number like 2
gpu_memory_utilization: 0.95
max_model_len: 16384
dtype: bfloat16
See profiles/README.md for detailed documentation.
The included estimate_vram.py tool helps predict memory requirements before downloading models:
# Activate environment first
source activate_vllm.sh
# Analyze all model profiles
python estimate_vram.py
# Show memory requirements per GPU instead of total
python estimate_vram.py --per-gpu
# Get optimization suggestions for models that don't fit
python estimate_vram.py --suggest
# Analyze a specific profile
python estimate_vram.py --profile qwen3-30b-a3b-gptq-int4
# Show verbose breakdown
python estimate_vram.py --verbose
The tool:
- Fetches actual model sizes from HuggingFace API
- Estimates KV cache and activation memory
- Shows which models will fit in your available vRAM
- Explains why some models fail to download
- Provides optimization suggestions
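As a rough, simplified sketch of the arithmetic involved (the actual tool also queries the HuggingFace API and accounts for quantization and activations): weights need roughly parameter count × bytes per parameter, plus a KV cache that grows with context length. The helper below and its default layer/hidden-size values are illustrative assumptions, not taken from estimate_vram.py:
# Back-of-the-envelope VRAM estimate; estimate_vram.py is more detailed.
def estimate_vram_gb(num_params_b, bytes_per_param=2,
                     num_layers=32, hidden_size=4096,
                     max_model_len=4096, kv_dtype_bytes=2):
    # Params are given in billions, so params_b * bytes_per_param ~ GB.
    weights_gb = num_params_b * bytes_per_param
    # KV cache per token: 2 (K and V) * layers * hidden_size * bytes.
    # Models with grouped-query attention need less than this.
    kv_per_token = 2 * num_layers * hidden_size * kv_dtype_bytes
    kv_cache_gb = kv_per_token * max_model_len / 1e9
    return weights_gb + kv_cache_gb

# Example: a 13B model in fp16 with a 4096-token context (~28 GB)
print(f"{estimate_vram_gb(13):.1f} GB (approx.)")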
For models that vLLM thinks won't fit but you want to try anyway:
# Activate environment first
source activate_vllm.sh
# List models and their download status
python predownload.py --list
# Download without vLLM's memory checks
python predownload.py --model mistralai/Magistral-Small-2509
# Download all models from profiles
python predownload.py --all
# Force re-download
python predownload.py --model some/model --force
This bypasses vLLM's memory estimation and downloads models directly from HuggingFace.
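Direct downloads of this kind can be done with huggingface_hub's snapshot_download, roughly as sketched below; predownload.py wraps this sort of call with its own profile handling:
# Sketch of a direct HuggingFace download into the standard cache.
# Gated models may additionally require `huggingface-cli login` or a token.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="mistralai/Magistral-Small-2509",  # model name taken from the example above
)
print(f"Model cached at: {path}")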
- Out of Memory: Reduce gpu_memory_utilization or max_model_len
- Slow Loading: Ensure models are cached in ~/.cache/huggingface/
- CUDA Errors: Check nvidia-smi and restart if needed
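For a quick programmatic sanity check of what PyTorch sees (a minimal sketch, independent of this repo's scripts):
# Check CUDA visibility and free memory per GPU from Python.
import torch

print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")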