Long Xing* · Xiaoyi Dong* · Yuhang Zang · Yuhang Cao · Jianze Liang · Qidong Huang · Jiaqi Wang · Feng Wu · Dahua Lin
📖Paper |🤗CapRL-3B Model | 🤗CapRL-2M Dataset |🤗CapRL Collection | 🤗Daily Paper
🌈 We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B. By applying the CapRL training framework, initializing from Qwen2.5-VL-3B, and training on a carefully filtered 75K QA dataset, we obtain a highly capable captioner, CapRL-3B.


- 🚀 [10/15/2025] We release QA curation code.
- 🚀 [09/25/2025] We release CapRL repository, model, evaluation code and dataset.
- 🔥 Remarkable visual understanding of charts, infographics, and documents: CapRL-3B achieves perception accuracy and visual-information coverage comparable to Qwen2.5-VL-72B.
- 🔥 Well-organized output: the outputs of CapRL-3B are relatively well-structured, making them clear and easy to understand.
- 🔥 Detailed descriptions of natural images: CapRL-3B covers the valid visual information in natural images comprehensively while producing fewer hallucinations.
- Release training code.
- Release 75k QA dataset.
- Release CapRL-series on stronger base model.
```bash
git clone https://github.com/InternLM/CapRL.git
cd CapRL
conda create -n CapRL python=3.10
conda activate CapRL
bash setup.sh
```
If you want to use CapRL-3B for captioning, you can follow exactly the same inference approach as the Qwen2.5-VL series.
We recommend using vLLM to speed up inference.
Run the command below to start an OpenAI-compatible API server:
vllm serve "/PATH/CapRL-3B" \
--trust-remote-code \
--tensor-parallel-size=1 \
--pipeline-parallel-size=1 \
--gpu_memory_utilization=0.95 \
--served-model-name=caprl \
--port 8000 \
--host 0.0.0.0
Then you can use the chat API as below (see the OpenAI API protocol documentation for more details):
```python
import base64

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_qwen},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=4096,  # set a generation limit appropriate for your use case
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```
This part of the code is in the `QA_data_curation` folder, which contains all four steps for generating QA data:
- QA generation. Use Qwen2.5-VL-72B to generate 5 QAs for each image. The generation process launches a vLLM service and uses multi-threading for speed.
- QA extraction. Extract QAs through format matching.
- Question answering with Qwen2.5-VL-3B. Use Qwen2.5-VL-3B to answer each question both with and without the image. The parameter `ROTATE_NUM` controls how many times each question is answered; if a question is answered only once, the randomness is too high and can easily lead to misjudgment.
- Question filtering. We keep QA pairs with `visual acc` higher than 0.75 and `text acc` lower than 0.25, which avoids text-only leakage and ensures the model can answer the question correctly when the image is provided (see the sketch after this list).
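The filtering step boils down to two accuracy estimates per question. Below is a minimal sketch, assuming each question carries `ROTATE_NUM` correctness judgments collected with and without the image; the field names `with_image_correct` and `without_image_correct` are illustrative, not the exact keys used in `QA_data_curation`.

```python
def filter_qa_pairs(qa_pairs, visual_acc_thresh=0.75, text_acc_thresh=0.25):
    """Keep questions that require the image to be answered correctly.

    `qa_pairs` is assumed to be a list of dicts whose `with_image_correct` /
    `without_image_correct` fields are lists of ROTATE_NUM 0/1 judgments;
    these field names are illustrative, not the repo's exact keys.
    """
    kept = []
    for qa in qa_pairs:
        visual_acc = sum(qa["with_image_correct"]) / len(qa["with_image_correct"])
        text_acc = sum(qa["without_image_correct"]) / len(qa["without_image_correct"])
        # Answerable from the image, but not from the question text alone.
        if visual_acc > visual_acc_thresh and text_acc < text_acc_thresh:
            kept.append(qa)
    return kept
```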
Our CapRL-2M dataset is available on 🔗 Hugging Face.
It includes images from ShareGPT-1M and DenseFusion-1M, with high-quality captions re-annotated using CapRL-3B, totaling 2M samples.
In our JSONL files, we provide the captions along with their corresponding image paths. The images can be downloaded from ShareGPT-1M and DenseFusion-1M.
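If you want to pair the captions with the downloaded images, a minimal loading sketch is shown below; the `image` and `caption` field names are assumptions about the JSONL schema, so check the actual keys in the released files.

```python
import json
from pathlib import Path

# Minimal sketch for reading CapRL-2M annotations. The `image` and `caption`
# field names are assumptions; check the actual keys in the released JSONL.
def load_caprl_jsonl(jsonl_path, image_root):
    samples = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            samples.append({
                "image_path": Path(image_root) / record["image"],
                "caption": record["caption"],
            })
    return samples
```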
To reproduce the pretraining experiments presented in our paper:
- Initialize Qwen2.5-VL. Follow the steps in the notebook `initiallize_vlm_3b.ipynb` to set up the Qwen2.5-VL model for training.
- Training. You can then use LLaMA-Factory directly to run the training process.
We evaluate caption quality by decoupling the traditional VQA (Visual Question Answering) task:
- First, a model generates a caption for the image.
- Then, a language model answers questions based solely on the generated caption.
This approach allows us to assess the informational quality and completeness of the generated captions — if the language model can accurately answer visual questions based only on the caption, then the caption is likely high-quality.
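The evaluation therefore makes two calls per question: one captioning call and one text-only QA call against the caption. The sketch below illustrates this two-stage flow using the OpenAI-compatible client from the inference section; serving both the captioner and the answering model behind one endpoint, the served model names (`caprl`, `caprl-eval`), and the prompts are all assumptions for illustration, and the full implementation lives in the evaluation scripts described below.

```python
# Minimal sketch of the decoupled (caption -> text-only QA) evaluation.
# Prompts, served model names, and the single shared endpoint are assumptions;
# see Prism_Evaluation/Eval_CapRL.py for the actual implementation.
def caption_then_answer(client, image_data_url, question):
    # Step 1: the captioner describes the image (without seeing the question).
    caption = client.chat.completions.create(
        model="caprl",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_data_url}},
                {"type": "text", "text": "Please describe this image in detail."},
            ],
        }],
        max_tokens=4096,
    ).choices[0].message.content

    # Step 2: a text-only model answers the question from the caption alone.
    answer = client.chat.completions.create(
        model="caprl-eval",  # e.g. CapRL-Eval-3B served under this name (assumption)
        messages=[{
            "role": "user",
            "content": f"Image description:\n{caption}\n\nQuestion: {question}\n"
                       "Answer the question using only the description.",
        }],
        max_tokens=256,
    ).choices[0].message.content
    return caption, answer
```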
The complete evaluation scripts can be found in the `Prism_Evaluation` folder, with the core implementation located in `Eval_CapRL.py`.
The model used for answering questions based on captions is CapRL-Eval-3B, a fine-tuned version of Qwen2.5-VL-3B. For tasks such as ChartQA (which are not multiple-choice), it provides more stable output formatting.
You can specify `--reward-model-path` as the path to CapRL-Eval-3B in `Eval_CapRL.py`.




Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Usage must also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
- Open-LLaVA-NeXT: Thanks for the impressive open-source dataset.
- VLMEvalKit: the amazing open-source suite for evaluating various LMMs!