Long Xing* · Xiaoyi Dong* · Yuhang Zang · Yuhang Cao · Jianze Liang · Qidong Huang · Jiaqi Wang · Feng Wu · Dahua Lin
📖Paper |🤗CapRL-3B Model | 🤗CapRL-2M Dataset |🤗CapRL Collection | 🤗Daily Paper
🌈 We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B. By applying the CapRL training framework, initializing from Qwen2.5-VL-3B, and training on a carefully filtered 75K QA dataset, we obtain a highly capable captioner, CapRL-3B.


- 🚀 [10/15/2025] We release QA curation code.
- 🚀 [09/25/2025] We release CapRL repository, model, evaluation code and dataset.
- 🔥 Remarkable visual understanding of charts, infographics, and documents: CapRL-3B achieves perception accuracy and visual-information coverage comparable to Qwen2.5-VL-72B.
- 🔥 Well-organized output: the outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
- 🔥 Detailed descriptions of natural images: CapRL-3B covers the valid visual information in natural images comprehensively while producing fewer hallucinations.
- Release training code.
- Release 75k QA dataset.
- Release CapRL-series on stronger base model.
```bash
git clone https://github.com/InternLM/CapRL.git
cd CapRL
conda create -n CapRL python=3.10
conda activate CapRL
bash setup.sh
```
If you want to use CapRL-3B for captioning, you can follow exactly the same inference approach as the Qwen2.5-VL series.
We recommend using vLLM to speed up inference.
Run the command below to start an OpenAI-compatible API server:
vllm serve "/PATH/CapRL-3B" \
--trust-remote-code \
--tensor-parallel-size=1 \
--pipeline-parallel-size=1 \
--gpu_memory_utilization=0.95 \
--served-model-name=caprl \
--port 8000 \
--host 0.0.0.0
Then you can use the chat API as below (see the OpenAI API protocol documentation for more details):
```python
import base64

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_qwen},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=4096,  # set a generation limit appropriate for your use case
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```
This part of the code is in the `QA_data_curation` folder, which contains all four steps for generating QA data:
- QA generation. Use Qwen2.5-VL-72B to generate 5 QAs for each image. The generation process launches a vLLM service and uses multi-threading for speed.
- QA extraction. Extract QAs through format matching.
- Question answering with Qwen2.5-VL-3B. Use Qwen2.5-VL-3B to answer each question both with and without the image. The parameter `ROTATE_NUM` controls how many times each question is answered; if a question is answered only once, the randomness is too high and can easily lead to misjudgment.
- Question filtering. We keep QA pairs with `visual acc` higher than 0.75 and `text acc` lower than 0.25, which avoids text-only leakage and ensures the model can answer the question correctly when the image is provided (see the sketch after this list).
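The filtering step boils down to two accuracy estimates per question. Below is a minimal sketch, assuming each question carries `ROTATE_NUM` correctness judgments collected with and without the image; the field names `with_image_correct` and `without_image_correct` are illustrative, not the exact keys used in `QA_data_curation`.

```python
def filter_qa_pairs(qa_pairs, visual_acc_thresh=0.75, text_acc_thresh=0.25):
    """Keep questions that require the image to be answered correctly.

    `qa_pairs` is assumed to be a list of dicts whose `with_image_correct` /
    `without_image_correct` fields are lists of ROTATE_NUM 0/1 judgments;
    these field names are illustrative, not the repo's exact keys.
    """
    kept = []
    for qa in qa_pairs:
        visual_acc = sum(qa["with_image_correct"]) / len(qa["with_image_correct"])
        text_acc = sum(qa["without_image_correct"]) / len(qa["without_image_correct"])
        # Answerable from the image, but not from the question text alone.
        if visual_acc > visual_acc_thresh and text_acc < text_acc_thresh:
            kept.append(qa)
    return kept
```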
Our CapRL-2M dataset is available on 🔗 Hugging Face.
It includes images from ShareGPT-1M and DenseFusion-1M, with high-quality captions re-annotated using CapRL-3B, totaling 2M samples.
In our JSONL files, we provide the captions along with their corresponding image paths. The images can be downloaded from ShareGPT-1M and DenseFusion-1M.
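If you want to pair the captions with the downloaded images, a minimal loading sketch is shown below; the `image` and `caption` field names are assumptions about the JSONL schema, so check the actual keys in the released files.

```python
import json
from pathlib import Path

# Minimal sketch for reading CapRL-2M annotations. The `image` and `caption`
# field names are assumptions; check the actual keys in the released JSONL.
def load_caprl_jsonl(jsonl_path, image_root):
    samples = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append({
                "image_path": Path(image_root) / record["image"],
                "caption": record["caption"],
            })
    return samples
```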
To reproduce the pretraining experiments presented in our paper:
- Initialize Qwen2.5-VL. Follow the steps in the notebook `initiallize_vlm_3b.ipynb` to set up the Qwen2.5-VL model for training.
- Training. You can then use LLaMA-Factory directly to run the training process.
We evaluate caption quality by decoupling the traditional VQA (Visual Question Answering) task:
- First, a model generates a caption for the image.
- Then, a language model answers questions based solely on the generated caption.
This approach allows us to assess the informational quality and completeness of the generated captions — if the language model can accurately answer visual questions based only on the caption, then the caption is likely high-quality.
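The evaluation therefore makes two calls per question: one captioning call and one text-only QA call against the caption. The sketch below illustrates this two-stage flow using the OpenAI-compatible client from the inference section; serving both the captioner and the answering model behind one endpoint, the served model names (`caprl`, `caprl-eval`), and the prompts are all assumptions for illustration, and the full implementation lives in the evaluation scripts described below.

```python
# Minimal sketch of the decoupled (caption -> text-only QA) evaluation.
# Prompts, served model names, and the single shared endpoint are assumptions;
# see Prism_Evaluation/Eval_CapRL.py for the actual implementation.
def caption_then_answer(client, image_data_url, question):
    # Step 1: the captioner describes the image (without seeing the question).
    caption = client.chat.completions.create(
        model="caprl",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_data_url}},
                {"type": "text", "text": "Please describe this image in detail."},
            ],
        }],
        max_tokens=4096,
    ).choices[0].message.content

    # Step 2: a text-only model answers the question from the caption alone.
    answer = client.chat.completions.create(
        model="caprl-eval",  # e.g. CapRL-Eval-3B served under this name (assumption)
        messages=[{
            "role": "user",
            "content": f"Image description:\n{caption}\n\nQuestion: {question}\n"
                       "Answer the question using only the description.",
        }],
        max_tokens=256,
    ).choices[0].message.content
    return caption, answer
```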
The complete evaluation scripts can be found in the `Prism_Evaluation` folder, with the core implementation located in `Eval_CapRL.py`.
The model used for answering questions based on captions is CapRL-Eval-3B, a fine-tuned version of Qwen2.5-VL-3B. For tasks such as ChartQA (which are not multiple-choice), it provides more stable output formatting.
You can specify `--reward-model-path` as the path to CapRL-Eval-3B in `Eval_CapRL.py`.




Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Usage must also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
- Open-LLaVA-NeXT: Thanks for the impressive open-source dataset.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!