🏆 Leaderboard | 📄 KOFFVQA Arxiv | 🤗 KOFFVQA Dataset
This is the official repository for the evaluation code for the paper "KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language".
- 2025-04-25: Our leaderboard now includes evaluation results for a total of 81 well-known VLMs, covering both open- and closed-source models. We have also refactored the evaluation code to make it easier to use and to support a much wider range of models.
- 2025-04-01: Our paper KOFFVQA has been released and accepted to CVPRW 2025, the Workshop on Benchmarking and Expanding AI Multimodal Approaches (BEAM 2025) 🎉
- 2025-01-21: Evaluation code and dataset release
- 2024-12-06: Leaderboard Release!
KOFFVQA is a meticulously curated benchmark for Visual Question Answering (VQA) in Korean, featuring:
- 275 image-question pairs
- 10 diverse tasks
- Explicit criteria-based scoring (no subjective metrics!)
Each sample includes:
- An image
- A corresponding question
- A set of objective evaluation criteria
To ensure consistent and reliable evaluation, KOFFVQA follows a two-stage pipeline:
- A Vision-Language Model (VLM) generates a response.
- A Large Language Model (LLM) judge scores the response based on the given criteria.
This approach enables free-form generation while maintaining objective and reproducible evaluation.
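In code terms, the pipeline boils down to a sketch like the following (the helper names `vlm.generate` and `judge.grade` are hypothetical; the actual logic lives in `generate.py` and `evaluate.py`):

```python
# Conceptual sketch of the two-stage pipeline. The object and method names
# are hypothetical; the real implementation is in generate.py / evaluate.py.
def score_sample(vlm, judge, image, question, criteria):
    # Stage 1: the evaluated VLM produces a free-form answer in Korean.
    response = vlm.generate(image, question)
    # Stage 2: the judge LLM grades the answer strictly against the
    # pre-defined, objective criteria attached to this question.
    return judge.grade(response, criteria)
```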
⚠️ If you're using a gated Huggingface model (e.g., `gemma-2-9b-it`), make sure to log in first:

```bash
huggingface-cli login
```

Enter a token with access permission (`read` scope) if prompted.
Install the dependencies:

```bash
pip install -r requirements.txt
```

For API models (e.g., GPT-4o):

```bash
API_KEY=YOUR_API_KEY python generate.py --model gpt-4o
python evaluate.py --predfile result/gpt-4o/KOFFVQA_gen.xlsx
```

For open-source Huggingface models (e.g., Qwen2.5-VL):

```bash
python generate.py --model Qwen/Qwen2.5-VL-7B-Instruct
python evaluate.py --predfile result/Qwen2.5-VL-7B-Instruct/KOFFVQA_gen.xlsx
```
`generate.py` uses the provided model to generate responses for each question in the benchmark, saving them to an output file in `.xlsx` format.
```bash
API_KEY=YOUR_API_KEY python generate.py \
    --model MODEL_NAME \
    [--data DATA_PATH] \
    [--outfile OUTPUT_FILE_PATH] \
    [--verbose]
```
Supported API Model Families:
- OpenAI: `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-4.1`, `gpt-4.1-mini`, `gpt-4.1-nano`, `o1`, `o3`, `o3-mini`, `o4-mini`, `gpt-4.5`
- Gemini: `gemini-2.5-pro`, `gemini-1.5-flash`, `gemini-1.5-pro`, `gemini-2.0-flash-001`, `gemini-2.0-pro-exp-02-05`, `gemini-1.5-flash-002`, etc.
- Claude: `claude-3-7-sonnet`, `claude-3-5-sonnet`, `claude-3-opus`, `claude-3-haiku`, etc.
The script detects the API type based on the model name and uses the correct backend.
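Roughly, the routing works like the sketch below (illustrative only; the function name and exact prefixes are assumptions, and the repository's actual dispatch may differ):

```python
# Simplified illustration of model-name-based backend routing; the actual
# dispatch logic in generate.py may differ.
def pick_backend(model_name: str) -> str:
    if model_name.startswith(("gpt-", "o1", "o3", "o4")):
        return "openai"
    if model_name.startswith("gemini"):
        return "gemini"
    if model_name.startswith("claude"):
        return "anthropic"
    return "huggingface"  # otherwise treated as a local or Hub model
```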
Single GPU or automatic multi-GPU with `device_map='auto'`:
```bash
python generate.py \
    --model MODEL_NAME_OR_PATH \
    [--data DATA_PATH] \
    [--adapter ADAPTER_PATH] \
    [--outfile OUTPUT_FILE_PATH] \
    [--verbose]
```
Multi-GPU with `torchrun`:
```bash
torchrun --nproc_per_node N generate.py \
    --model MODEL_NAME_OR_PATH \
    [--data DATA_PATH] \
    [--adapter ADAPTER_PATH] \
    [--outfile OUTPUT_FILE_PATH] \
    [--verbose]
```
If not an API model, the model name must be either the name of a model on Huggingface, or a path to a local directory containing the model in Huggingface format. Optionally, a path to a directory containing a LoRA adapter may be provided as well.
If there is not enough GPU memory to fit the entire model on one GPU, running `generate.py` with `python` instead of `torchrun` will automatically load the model across multiple GPU devices using Huggingface's `device_map='auto'`.
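For reference, the single-process path is equivalent to a standard `transformers` load with `device_map='auto'` (an illustrative sketch using a Qwen2-VL checkpoint, not the repository's actual loading code):

```python
# Illustrative only: generate.py performs the equivalent of this internally
# when run with python (not torchrun).
import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shards layers across all visible GPUs as needed
)
```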
The `architectures` attribute of the model's configuration file is used to determine how to load the model and use it for generation. The following values are supported (a snippet for checking a model's declared architecture is given below, after the list):
- `MllamaForConditionalGeneration`
- `Qwen2VLForConditionalGeneration`
- `MiniCPMV`
- `InternVLChatModel`
- `Ovis`
- `InternLMXComposer2ForCausalLM`
- `LlavaNextForConditionalGeneration`
- `LlavaLlamaForCausalLM`
- *`LlavaQwenForCausalLM`
- *`LlavaForConditionalGeneration`
- **`MolmoForCausalLM`
- `Llama4ForConditionalGeneration`
- `Qwen2_5OmniModel`
- `AyaVisionForConditionalGeneration`
- `Idefics3ForConditionalGeneration`
- `Phi3VForCausalLM`
- `AriaForConditionalGeneration`
- `Gemma3ForConditionalGeneration`
- `Phi4MMForCausalLM`
- `Mistral3ForCausalLM`
- `KimiVLForCausalLM`
- `HCXVisionForCausalLM`
(*Requires `pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git`)
(**Only Pixtral models in Huggingface format are supported.)
Some models may not work even with compatible architectures if they've been significantly modified.
Additionally, Pixtral models are supported, but due to their non-standard format, only locally downloaded checkpoints can be used. To use Pixtral models, `pip install mistral_inference` is required. (Community-made checkpoints that follow the usual Huggingface format can be used in the same way as other models, in which case the architecture will be `LlavaForConditionalGeneration`.)
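A quick way to check which architecture a model declares (and therefore whether it appears in the list above) is to inspect its configuration with `transformers`; this snippet is not part of the repository:

```python
# Check a model's declared architecture, which generate.py uses to pick a loader.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", trust_remote_code=True)
print(config.architectures)  # ['Qwen2VLForConditionalGeneration']
```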
- Default input: `data/KOFFVQA.tsv` (automatically downloaded if missing)
- Default output: `result/{model_name}_{adapter}/{benchmark_name}_gen.xlsx`
`evaluate.py` takes the output file of the generation process as input and uses a local judge LLM to grade the responses using each question's grading criteria.
Single GPU or automatic multi-GPU with `device_map='auto'`:
```bash
python evaluate.py \
    --predfile INPUT_FILE_PATH \
    [--outfile JUDGEMENT_FILE_PATH] \
    [--scorefile SCORE_FILE_PATH] \
    [--judge JUDGE_NAME] \
    [--verbose]
```
Multi-GPU with `torchrun`:
```bash
torchrun --nproc_per_node N evaluate.py \
    --predfile INPUT_FILE_PATH \
    [--outfile JUDGEMENT_FILE_PATH] \
    [--scorefile SCORE_FILE_PATH] \
    [--judge JUDGE_NAME] \
    [--verbose]
```
Default judge: `google/gemma-2-9b-it` (requires Huggingface login)
Chosen for:
- Good Korean language understanding
- Lightweight enough to run locally
- Judgement file (`.xlsx`): Full LLM judge responses for each evaluation (for debugging)
- Score file (`.csv`): Numeric evaluation scores for each task + the overall benchmark score
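For quick inspection, the score file can be loaded with pandas (a minimal sketch; the path below is hypothetical and should match whatever was passed to `--scorefile`, and column names are not assumed here):

```python
# Minimal sketch for inspecting the per-task score file produced by evaluate.py.
import pandas as pd

# Hypothetical path; use your actual --scorefile output.
scores = pd.read_csv("result/gpt-4o/KOFFVQA_score.csv")
print(scores.to_string(index=False))
```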
- Fine-tune the judge model to improve grading accuracy
- Add more data to the benchmark (KOFFVQA-HARD)
BibTeX:
```bibtex
@article{kim2025koffvqa,
  title={KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language},
  author={Kim, Yoonshik and Jung, Jaeyoon},
  journal={arXiv preprint arXiv:2503.23730},
  year={2025}
}
```