🎉 Accepted by NeurIPS 2025 🎉
TL;DR: We present an emotion-centric video foundation model trained with fine-grained captions and rationales via affective-tree reasoning guidance, achieving high-level emotional intelligence for video understanding.
- 🔥2025-12-01: Pre-computed results, inference, and evaluation code released.
- 🔥2025-12-01: Repository created.
- 2025-09-18: Kling-VidEmo has been accepted to NeurIPS 2025!
conda create -n VidEmo python=3.9
conda activate VidEmo
python -m pip install -r requirements.txt
cd ms-swift
python -m pip install -e .
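Optionally, you can sanity-check the environment before moving on. The snippet below is a minimal sketch; it assumes `requirements.txt` provides PyTorch and that the editable install above registers the `ms-swift` distribution, so adjust the names if your setup differs.

```python
# sanity_check.py -- optional quick check of the VidEmo environment.
# Assumes requirements.txt provides PyTorch and that the editable install
# above registers the "ms-swift" distribution; adjust names if they differ.
from importlib.metadata import PackageNotFoundError, version

try:
    import torch
    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
except ImportError:
    print("PyTorch is not installed; check requirements.txt")

try:
    print(f"ms-swift {version('ms-swift')} installed")
except PackageNotFoundError:
    print("ms-swift not found; re-run `python -m pip install -e .` inside ms-swift/")
```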
In (a), the data taxonomy organizes the dataset into three primary face perception tasks: Emotion Intelligence, Expression Analysis, and Attribute Perception, covering a wide range of facial features and emotional attributes. (b) The data distribution plots show the relative face area and video duration across different datasets, illustrating the diversity of video data in Emo-CFG. (c) The annotation distribution includes the breakdown of facial views (head, half, full) and video length, accompanied by a word cloud highlighting the most frequently annotated terms, such as “neutral”, “face”, and “expression”. (d) The data statistics compare Emo-CFG with other emotion and video datasets, showing that Emo-CFG provides a richer set of annotations and label types, including fine-grained emotions, rationales, and comprehensive video data, making it a unique and valuable resource for emotion-centric research.
The dataset folder should be structured as follows (a small layout-check sketch is provided after the tree):
Emo-CFG
├── jsons
│ ├── curation
│ │ ├── concat_receipt.py
│ │ ├── v1
│ │ │ └── source.txt
│ │ ├── v2
│ │ │ └── source.txt
│ │ ├── v3
│ │ │ └── source.txt
│ │ ├── v4
│ │ │ └── source.txt
│ │ └── v5
│ ├── test
│ │ ├── attribute
│ │ │ ├── full
│ │ │ └── sampled
│ │ ├── caption
│ │ │ ├── full
│ │ │ └── sampled
│ │ ├── emotion
│ │ │ ├── full
│ │ │ └── sampled
│ │ └── qa
│ │ ├── full
│ │ └── sampled
│ └── train
│ ├── attribute
│ │ ├── full
│ │ └── sampled
│ ├── caption
│ │ ├── full
│ │ └── sampled
│ ├── emotion
│ │ ├── full
│ │ └── sampled
│ ├── qa
│ │ ├── full
│ │ └── sampled
│ └── rationale
│ ├── full
│ └── sampled
└── videos
├── AFEW
├── AffWild2
├── CAER
├── CASME
├── CAS(ME)2
├── CASME2
├── CelebV-HQ
├── CelebV-Text
├── Dfew
├── FERV39K
├── MAFW
├── MEAD
├── MELD
├── Mer2023
├── MOSEI
├── MOSI
├── PERR
├── RAVDESS
└── SIMS
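As a convenience, the following sketch checks that a local copy of Emo-CFG matches the layout above. The default root path is a placeholder and the check only covers the `jsons` splits and the `videos` folder; adapt it to your setup.

```python
# check_emo_cfg.py -- verify a local Emo-CFG copy follows the expected layout.
# The default root path is a placeholder; pass your own path on the command line.
import sys
from pathlib import Path

EXPECTED_JSON_SPLITS = {
    "test": ["attribute", "caption", "emotion", "qa"],
    "train": ["attribute", "caption", "emotion", "qa", "rationale"],
}

def check_layout(root: Path) -> list:
    """Return the expected sub-directories that are missing under `root`."""
    missing = []
    for split, tasks in EXPECTED_JSON_SPLITS.items():
        for task in tasks:
            for variant in ("full", "sampled"):
                path = root / "jsons" / split / task / variant
                if not path.is_dir():
                    missing.append(path)
    if not (root / "videos").is_dir():
        missing.append(root / "videos")
    return missing

if __name__ == "__main__":
    root = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("Emo-CFG")
    missing = check_layout(root)
    if missing:
        print("Missing directories:")
        for p in missing:
            print(f"  {p}")
    else:
        print(f"{root} matches the expected Emo-CFG layout.")
```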
To access the dataset, you must upload a signed End User License Agreement (EULA) via our HuggingFace repository:
⚠️ Note: The copyright of the videos remains with the original owners. If you find this work useful, please consider citing our paper and kindly acknowledging the related dataset resources.
To use the model weights, download them from Hugging Face:
TBD
TBD
Run the following command to perform inference.
Note: Ensure that the path variables (e.g., `${BASE_DATASET_DIR}`) are defined or replaced with your actual file paths before running.
VIDEO_MAX_PIXELS=100352 FPS_MAX_FRAMES=16 CUDA_VISIBLE_DEVICES=0 swift infer \
--val_dataset "${BASE_DATASET_DIR}/${TESTING_DATASET_NAME}" \
--ckpt_dir "${BASE_CKPT_DIR}/${TESTING_MODEL_NAME}" \
--result_path "${RESULT_PATH}" \
--infer_backend vllm \
--gpu_memory_utilization 0.85 \
--torch_dtype bfloat16 \
--max_new_tokens 2048 \
--streaming False \
--max_batch_size 4 \
--attn_impl flash_attn \
--limit_mm_per_prompt '{"image": 0, "video": 1}' \
--max_model_len 49152

For a complete batch processing script, please refer to `scripts/inference.sh`.
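If you want a quick look at what an inference run produced before moving on to evaluation, the short sketch below prints the fields of the first few records in the result file. It assumes the file written to `${RESULT_PATH}` is in JSON Lines format (one JSON object per line); if your output format differs, adjust accordingly.

```python
# peek_results.py -- optional helper to inspect an inference result file.
# Assumes the file at RESULT_PATH is JSON Lines (one JSON object per line).
import json
import sys

result_path = sys.argv[1] if len(sys.argv) > 1 else "results.jsonl"  # placeholder path

with open(result_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 3:  # only peek at the first three records
            break
        record = json.loads(line)
        print(f"record {i}: keys = {sorted(record.keys())}")
```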
To facilitate fair comparison and ensure alignment with our reported metrics, we provide the original inference outputs used in our paper. Please refer to the `results` folder.
Note on Evaluation: You may use your own GPT version/API key for evaluation. We have observed that while absolute scores may vary for open-form QA data, the relative ranking of the results remains consistent across different GPT versions.
eval
├─ config.py # GPT configuration
├─ eval_results.py # Evaluation scripts
├─ generate_table.py # CSV & table generator
└─ util.py # Utility functions
- Modify the LLM configuration in `config.py` (a hedged sketch of these settings is provided after this list):
  - Modify `API_KEY` to your API key.
  - Modify `BASE_URL` to your LLM's base URL.
  - We recommend setting `MODEL_NAME` to `gpt-4o-2024-08-06` to better align with the reported results.

- Execute the evaluation scripts:

  ```bash
  python -m eval.eval_results \
      --input_dir "Path/to/the/input/directory" \
      --method "method name, e.g. models--Qwen--Qwen2.5-VL-7B-Instruct" \
      --output_dir "Path/to/the/output/txt/directory" \
      --retry 50 \
      --max_concurrency <max_concurrent_requests>
  ```

  Both `--retry` (maximum retry number) and `--max_concurrency` (maximum concurrent requests) are optional. By default, this script evaluates all tasks defined in the `Tasks` class in `config.py`. Example usage for evaluating a specific task can be found in `eval_results.py`, line 348.

- Export the results to CSV files and generate tables:

  ```bash
  python -m eval.generate_table \
      --input_dir "Path/to/where/all/txt/files/stay" \
      --csv_file_dir "Path/to/the/target/directory/of/csv/file" \
      --table_file_dir "Path/to/the/target/directory/of/table/file"
  ```

  `--csv_file_dir` is optional and defaults to `input_dir`. This will generate an `output.csv` file under `csv_file_dir` and a `table.txt` file under `table_file_dir`.

- The QA evaluation relies on the ground-truth annotation file, which is defined in `config.py` under `Tasks.QA.gt_file`. Please also modify this path for a successful evaluation.
- To customize your own evaluation task, add another instance of `EvalTask` under the `Tasks` class located in `config.py` (see the sketch after this list).
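For orientation, below is a minimal, hypothetical sketch of the `config.py` pieces referenced above. The names `API_KEY`, `BASE_URL`, `MODEL_NAME`, `Tasks`, `EvalTask`, and `Tasks.QA.gt_file` come from this README, but the `EvalTask` fields shown here are assumptions, as is the stub class itself; check the real definitions in `eval/config.py` before editing.

```python
# Hypothetical sketch of the edits to eval/config.py described above.
# The EvalTask stub below is a stand-in; the real class in the repository
# likely carries more fields (prompt templates, metrics, output names, ...).
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalTask:  # stand-in only; use the repo's own EvalTask
    name: str
    gt_file: Optional[str] = None


# LLM configuration used by the evaluation scripts
API_KEY = "sk-..."                          # your API key
BASE_URL = "https://api.openai.com/v1"      # your LLM's base URL
MODEL_NAME = "gpt-4o-2024-08-06"            # recommended to align with reported results


class Tasks:
    # Point the QA task to your local ground-truth annotation file.
    QA = EvalTask(name="qa", gt_file="Path/to/qa/ground_truth.json")

    # To add a custom task, register another EvalTask instance here.
    MY_TASK = EvalTask(name="my_custom_task", gt_file="Path/to/my_task/ground_truth.json")
```

If the evaluation script indeed iterates over everything registered under `Tasks` (as the default behavior described above suggests), a newly registered task should then be picked up automatically.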
If you have any questions, please feel free to contact:
- Zhicheng Zhang: [email protected]
- Weicheng Wang: [email protected]
If you find this project useful, please consider citing:
@inproceedings{zhang2025VidEmo,
author = {Zhang, Zhicheng and Wang, Weicheng and Zhu, Yongjie and Qin, Wenyu and Wan, Pengfei and Zhang, Di and Yang, Jufeng},
title = {VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025}
}

This project stands on the shoulders of giants. We deeply appreciate the ms-swift library for its excellent codebase. Our dataset is constructed based on the following foundational resources in affective computing. We sincerely thank the authors of these datasets:
| AFEW | AffWild2 | CAER | CASME |
|---|---|---|---|
| CAS(ME)² | CASME2 | CelebV-HQ | CelebV-Text |
| DFEW | FERV39K | MAFW | MEAD |
| MELD | MER2023 | MOSEI | MOSI |
| PERR | RAVDESS | SIMS |
