An OpenAI-compatible API server for the Qwen2.5-VL vision-language model, enabling multimodal conversations with image understanding capabilities.
- OpenAI-compatible API endpoints
- Support for vision-language tasks
- Image analysis and description
- Base64 image handling
- JSON response formatting
- System resource monitoring
- Health check endpoint
- CUDA/GPU support with Flash Attention 2
- Docker containerization
- Python 3.9.12
- Docker and Docker Compose
- NVIDIA GPU with CUDA support (recommended)
- NVIDIA Container Toolkit
- At least 24GB GPU VRAM (for 7B model)
- 32GB+ system RAM recommended
- Clone the repository:

  ```bash
  git clone https://github.com/phildougherty/qwen2.5-VL-inference-openai.git
  cd qwen2.5-VL-inference-openai
  ```

- Download the model:

  ```bash
  mkdir -p models
  ./download_model.py
  ```

- Start the service:

  ```bash
  docker-compose up -d
  ```

- Test the API:

  ```bash
  curl http://localhost:9192/health
  ```
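If you would rather script this check, here is a minimal readiness poll in Python. It is a sketch that assumes only that `/health` returns HTTP 200 once the server is ready:

```python
import time
import requests

# Poll the health endpoint until the server reports ready.
# Assumes only that /health returns HTTP 200 once the model is loaded.
for attempt in range(30):
    try:
        if requests.get("http://localhost:9192/health", timeout=5).status_code == 200:
            print("Server is up")
            break
    except requests.RequestException:
        pass  # container may still be starting
    time.sleep(10)
else:
    print("Server did not become healthy in time")
```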
Specifies the port to listen on for OpenAI-compatible HTTP requests. Default: 9192
Specifies the model to load. This will be downloaded automatically if it does not exist.
Default: Qwen2.5-VL-7B-Instruct
Choices: Qwen2.5-VL-3B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen2.5-VL-72B-Instruct
Resumes a failed download.
Enables bitsandbytes quantization. Choices: int8, int4
Lists available models and their capabilities.
```bash
curl http://localhost:9192/v1/models | jq .
```
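Because the endpoint follows the OpenAI schema, the standard `openai` Python client works as well. A sketch; the API key value is a placeholder, since this server does not check it:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server.
# The api_key is a placeholder; this server ignores it.
client = OpenAI(base_url="http://localhost:9192/v1", api_key="unused")

for model in client.models.list():
    print(model.id)
```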
Main endpoint for chat completions with vision support.
Example with text:
```bash
curl -X POST http://localhost:9192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
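The same request through the `openai` Python client, as a sketch against the local server (again with a placeholder API key):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9192/v1", api_key="unused")

# Plain-text chat completion against the local server.
response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```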
Example with image:
```bash
curl -X POST http://localhost:9192/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What do you see in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,..."
            }
          }
        ]
      }
    ]
  }'
```
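The `base64,...` payload above is elided for brevity. A sketch of producing it from a local file and sending the full request, where `photo.jpg` is a hypothetical path to any local JPEG:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9192/v1", api_key="unused")

# Read a local image and encode it as a base64 data URL.
# "photo.jpg" is a placeholder path; any local JPEG works.
with open("photo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```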
Health check endpoint providing system information.
```bash
curl http://localhost:9192/health
```
Environment variables in docker-compose.yml:
- `NVIDIA_VISIBLE_DEVICES`: GPU device selection
- `QWEN_MODEL`: Selects the Qwen2.5-VL model to load
In OpenWebUI admin panel, add a new OpenAI API endpoint:
- Base URL: `http://<server name>:9192/v1`
- API Key: (leave blank)
The model will appear in the model selection dropdown with vision capabilities enabled.
Minimum:
- NVIDIA GPU with 24GB VRAM
- 16GB System RAM
- 50GB disk space
Recommended:
- NVIDIA RTX 3090 or better
- 32GB System RAM
- 100GB SSD storage
```yaml
services:
  qwen-vl-api:
    build: .
    ports:
      - "9192:9192"
    volumes:
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    shm_size: '8gb'
    restart: unless-stopped
```
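Mounting `./models` into the container keeps the downloaded weights across container rebuilds, so the download step from the quick start only has to run once. The generous `shm_size` is presumably sized for PyTorch's shared-memory use during inference.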
To run in development mode:
```bash
# Install dependencies
pip install -r requirements.txt

# Run the server
python app.py
```
The API includes comprehensive logging and monitoring:
- System resource usage
- GPU utilization
- Request/response timing
- Error tracking
View logs:
```bash
docker-compose logs -f
```
The API includes robust error handling for:
- Invalid requests
- Image processing errors
- Model errors
- System resource issues
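Client code can treat all of these uniformly as non-2xx HTTP responses. A minimal handling sketch with `requests`; the exact error body shape is an assumption, so only the status code is relied on:

```python
import requests

payload = {
    "model": "Qwen2.5-VL-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
}

resp = requests.post(
    "http://localhost:9192/v1/chat/completions",
    json=payload,
    timeout=120,
)
if resp.ok:
    print(resp.json()["choices"][0]["message"]["content"])
else:
    # The exact error body is server-specific; log status and raw text.
    print(f"Request failed with HTTP {resp.status_code}: {resp.text}")
```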
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Qwen team for the base model
- FastAPI for the web framework
- Transformers library for model handling
For issues and feature requests, please use the GitHub issue tracker.