An OpenAI-compatible API server for the Qwen2.5-VL vision-language model, enabling multimodal conversations with image understanding capabilities.
- OpenAI-compatible API endpoints
- Support for vision-language tasks
- Image analysis and description
- Base64 image handling
- JSON response formatting
- System resource monitoring
- Health check endpoint
- CUDA/GPU support with Flash Attention 2
- Docker containerization
- Python 3.9.12
- Docker and Docker Compose
- NVIDIA GPU with CUDA support (recommended)
- NVIDIA Container Toolkit
- At least 24GB GPU VRAM (for 7B model)
- 32GB+ system RAM recommended
- Clone the repository:

  ```bash
  git clone https://github.com/phildougherty/qwen2.5-VL-inference-openai.git
  cd qwen2.5-VL-inference-openai
  ```

- Download the model:

  ```bash
  mkdir -p models
  ./download_model.py
  ```

- Start the service:

  ```bash
  docker-compose up -d
  ```

- Test the API:

  ```bash
  curl http://localhost:9192/health
  ```
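If you would rather script this check, here is a minimal readiness poll in Python. It is a sketch that assumes only that `/health` returns HTTP 200 once the server is ready:

```python
import time
import requests

# Poll the health endpoint until the server reports ready.
# Assumes only that /health returns HTTP 200 once the model is loaded.
for attempt in range(30):
    try:
        if requests.get("http://localhost:9192/health", timeout=5).status_code == 200:
            print("Server is up")
            break
    except requests.RequestException:
        pass  # container may still be starting
    time.sleep(10)
else:
    print("Server did not become healthy in time")
```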
Specifies the port to listen on for OpenAI-compatible HTTP requests. Default: 9192
Specifies the model to load. This will be downloaded automatically if it does not exist.
Default: Qwen2.5-VL-7B-Instruct
Choices: Qwen2.5-VL-3B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen2.5-VL-72B-Instruct
Resumes a failed download.
Enables bitsandbytes quantization. Choices: int8, int4
Lists available models and their capabilities.
```bash
curl http://localhost:9192/v1/models | jq .
```
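Because the endpoint follows the OpenAI schema, the standard `openai` Python client works as well. A sketch; the API key value is a placeholder, since this server does not check it:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
# The api_key is a placeholder; this server ignores it.
client = OpenAI(base_url="http://localhost:9192/v1", api_key="unused")

for model in client.models.list():
    print(model.id)
```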
Main endpoint for chat completions with vision support.
Example with text:
```bash
curl -X POST http://localhost:9192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
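The same request through the `openai` Python client, as a sketch against the local server (again with a placeholder API key):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9192/v1", api_key="unused")

# Plain-text chat completion against the local server.
response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```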
Example with image:
```bash
curl -X POST http://localhost:9192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What do you see in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,..."
            }
          }
        ]
      }
    ]
  }'
```
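The `base64,...` payload above is elided for brevity. A sketch of producing it from a local file and sending the full request, where `photo.jpg` is a hypothetical path to any local JPEG:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9192/v1", api_key="unused")

# Read a local image and encode it as a base64 data URL.
# "photo.jpg" is a placeholder path; any local JPEG works.
with open("photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```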
Health check endpoint providing system information.
```bash
curl http://localhost:9192/health
```
Environment variables in docker-compose.yml:
- `NVIDIA_VISIBLE_DEVICES`: GPU device selection
- `QWEN_MODEL`: Selects the Qwen2.5-VL model to load
In OpenWebUI admin panel, add a new OpenAI API endpoint:
- Base URL: `http://<server name>:9192/v1`
- API Key: (leave blank)
The model will appear in the model selection dropdown with vision capabilities enabled.
Minimum:
- NVIDIA GPU with 24GB VRAM
- 16GB System RAM
- 50GB disk space
Recommended:
- NVIDIA RTX 3090 or better
- 32GB System RAM
- 100GB SSD storage
```yaml
services:
  qwen-vl-api:
    build: .
    ports:
      - "9192:9192"
    volumes:
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    shm_size: '8gb'
    restart: unless-stopped
```
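Mounting `./models` into the container keeps the downloaded weights across container rebuilds, so the download step from the quick start only has to run once. The generous `shm_size` is presumably sized for PyTorch's shared-memory use during inference.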
To run in development mode:
```bash
# Install dependencies
pip install -r requirements.txt

# Run the server
python app.py
```
The API includes comprehensive logging and monitoring:
- System resource usage
- GPU utilization
- Request/response timing
- Error tracking
View logs:
```bash
docker-compose logs -f
```
The API includes robust error handling for:
- Invalid requests
- Image processing errors
- Model errors
- System resource issues
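Client code can treat all of these uniformly as non-2xx HTTP responses. A minimal handling sketch with `requests`; the exact error body shape is an assumption, so only the status code is relied on:

```python
import requests

payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
}

resp = requests.post(
    "http://localhost:9192/v1/chat/completions",
    json=payload,
    timeout=120,
)
if resp.ok:
    print(resp.json()["choices"][0]["message"]["content"])
else:
    # The exact error body is server-specific; log status and raw text.
    print(f"Request failed with HTTP {resp.status_code}: {resp.text}")
```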
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Qwen team for the base model
- FastAPI for the web framework
- Transformers library for model handling
For issues and feature requests, please use the GitHub issue tracker.