VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models

Zhicheng Zhang¹,†, Weicheng Wang¹, Yongjie Zhu³,‡, Wenyu Qin³, Pengfei Wan³, Di Zhang³, Jufeng Yang¹,²,✉
¹Nankai University      ²Pengcheng Laboratory      ³Kuaishou Technology
†Work done at KlingAI      ‡Project Leader      ✉Corresponding Author

🎉 Accepted by NeurIPS 2025 🎉

arXiv · Website · GitHub · Awesome · HF Dataset: Emo-CFG 2.1M · HF Model: VidEmo Family

TL;DR: We present an emotion-centric video foundation model trained with fine-grained captions and rationales via affective-tree reasoning guidance, achieving high-level emotional intelligence for video understanding.

📈 1. News

  • 🔥2025-12-01: Pre-computed results, inference, and evaluation code released.
  • 🔥2025-12-01: Repository created.
  • 2025-09-18: Kling-VidEmo has been accepted to NeurIPS 2025!

⚒️ 2. Environment Setup

conda create -n VidEmo python=3.9
conda activate VidEmo
python -m pip install -r requirements.txt
cd ms-swift
python -m pip install -e .
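
A quick sanity check after installation (a minimal sketch; it assumes the standard distribution names torch and ms-swift, installed as above):

# Sanity check: these imports succeed only if the environment above was set up correctly.
from importlib.metadata import version

import torch
import swift  # ms-swift, installed in editable mode above

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("ms-swift:", version("ms-swift"))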

💾 3. Emo-CFG Datasets

🔐 3.1 Overview of dataset

The overview figure summarizes Emo-CFG in four panels:

  • (a) Data taxonomy: the dataset is organized into three primary face perception tasks: Emotion Intelligence, Expression Analysis, and Attribute Perception, covering a wide range of facial features and emotional attributes.
  • (b) Data distribution: the plots show the relative face area and video duration across the different source datasets, illustrating the diversity of video data in Emo-CFG.
  • (c) Annotation distribution: the breakdown of facial views (head, half, full) and video length, accompanied by a word cloud highlighting the most frequently annotated terms, such as “neutral”, “face”, and “expression”.
  • (d) Data statistics: a comparison of Emo-CFG with other emotion and video datasets, showing that Emo-CFG provides a richer set of annotations and label types, including fine-grained emotions, rationales, and comprehensive video data, making it a unique and valuable resource for emotion-centric research.

The dataset folder should be structured as follows:

Emo-CFG
├── jsons
│   ├── curation
│   │   ├── concat_receipt.py
│   │   ├── v1
│   │   │   └── source.txt
│   │   ├── v2
│   │   │   └── source.txt
│   │   ├── v3
│   │   │   └── source.txt
│   │   ├── v4
│   │   │   └── source.txt
│   │   └── v5
│   ├── test
│   │   ├── attribute
│   │   │   ├── full
│   │   │   └── sampled
│   │   ├── caption
│   │   │   ├── full
│   │   │   └── sampled
│   │   ├── emotion
│   │   │   ├── full
│   │   │   └── sampled
│   │   └── qa
│   │       ├── full
│   │       └── sampled
│   └── train
│       ├── attribute
│       │   ├── full
│       │   └── sampled
│       ├── caption
│       │   ├── full
│       │   └── sampled
│       ├── emotion
│       │   ├── full
│       │   └── sampled
│       ├── qa
│       │   ├── full
│       │   └── sampled
│       └── rationale
│           ├── full
│           └── sampled
└── videos
    ├── AFEW
    ├── AffWild2
    ├── CAER
    ├── CASME
    ├── CAS(ME)2
    ├── CASME2
    ├── CelebV-HQ
    ├── CelebV-Text
    ├── Dfew
    ├── FERV39K
    ├── MAFW
    ├── MEAD
    ├── MELD
    ├── Mer2023
    ├── MOSEI
    ├── MOSI
    ├── PERR
    ├── RAVDESS
    └── SIMS
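
To confirm a local copy matches this layout, a small check script can help (a sketch only; EMO_CFG_ROOT is a placeholder for your local path, and the expected names mirror the tree above):

# Minimal layout check for a local Emo-CFG copy (sketch; EMO_CFG_ROOT is a placeholder).
from pathlib import Path

EMO_CFG_ROOT = Path("Emo-CFG")  # change to your local dataset path

expected = ["jsons/train", "jsons/test", "videos"]
expected += [f"videos/{d}" for d in ["AFEW", "AffWild2", "CAER", "MELD", "RAVDESS"]]  # spot-check a few sources

missing = [p for p in expected if not (EMO_CFG_ROOT / p).is_dir()]
print("All expected folders found." if not missing else f"Missing folders: {missing}")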

🔐 3.2 Access & License

To access the dataset, you must upload a signed End User License Agreement (EULA) via our HuggingFace repository:

👉 Emo-CFG on HuggingFace

⚠️ Note: The copyright of the videos remains with the original owners. If you find this work useful, please consider citing our paper and kindly acknowledging the related dataset resources.

🔬 4. VidEmo Family

🧊 4.1 Model Collection

To use the model weights, download them from Hugging Face:
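
For example, with the huggingface_hub client (a minimal sketch; the repo_id below is a placeholder, substitute the actual model ID from the VidEmo Family collection):

# Download a VidEmo checkpoint from the Hugging Face Hub (sketch; repo_id is a placeholder).
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(
    repo_id="KlingTeam/VidEmo-7B",      # placeholder: use the actual ID from the collection
    local_dir="checkpoints/VidEmo-7B",  # where the weights will be stored
)
print("Checkpoint downloaded to:", ckpt_dir)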

🔮 4.2 Train

🧱 SFT Stage

TBD

🧱 RL Stage

TBD

🔮 4.3 Inference

📜 Scripts

Run the following command to perform inference.

Note: Ensure that the path variables (e.g., ${BASE_DATASET_DIR}) are defined or replaced with your actual file paths before running.

VIDEO_MAX_PIXELS=100352 FPS_MAX_FRAMES=16 CUDA_VISIBLE_DEVICES=0 swift infer \
    --val_dataset "${BASE_DATASET_DIR}/${TESTING_DATASET_NAME}" \
    --ckpt_dir "${BASE_CKPT_DIR}/${TESTING_MODEL_NAME}" \
    --result_path "${RESULT_PATH}" \
    --infer_backend vllm \
    --gpu_memory_utilization 0.85 \
    --torch_dtype bfloat16 \
    --max_new_tokens 2048 \
    --streaming False \
    --max_batch_size 4 \
    --attn_impl flash_attn \
    --limit_mm_per_prompt '{"image": 0, "video": 1}' \
    --max_model_len 49152

For a complete batch processing script, please refer to scripts/inference.sh
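
The file written to ${RESULT_PATH} can then be inspected programmatically. A minimal sketch, assuming a JSON-lines output where each record carries a "response" field (check your output file for the exact schema):

# Inspect the inference output (sketch; assumes JSON lines with a "response" field).
import json

result_path = "results/videmo_predictions.jsonl"  # placeholder: use your ${RESULT_PATH}

with open(result_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} predictions")
for rec in records[:3]:
    print(str(rec.get("response", "<no 'response' field>"))[:200])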

📊 Pre-computed VidEmo and SOTA Results

To facilitate fair comparison and ensure alignment with our reported metrics, we provide the original inference outputs used in our paper. Please refer to the results folder.

Note on Evaluation: You may use your own GPT version/API key for evaluation. We have observed that while absolute scores may vary for open-form QA data, the relative ranking of the results remains consistent across different GPT versions.

🔮 4.4 Evaluation

Demonstration
eval
├─ config.py # GPT configuration
├─ eval_results.py # Evaluation scripts
├─ generate_table.py # CSV & table generator
└─ util.py # Utility functions
  1. Modify the LLM configuration in config.py

    • Modify API_KEY to your API key
    • Modify BASE_URL to your LLM's base URL
    • We recommend setting MODEL_NAME to gpt-4o-2024-08-06 to better align with the reported results.
  2. Execute the evaluation scripts

    python -m eval.eval_results \
        --input_dir "Path/to/the/input/directory" \
        --method "method name, e.g. models--Qwen--Qwen2.5-VL-7B-Instruct" \
        --output_dir "Path/to/the/output/txt/directory" \
        --retry 50 \
        --max_concurrency <N>
    # --retry (optional): maximum number of retries
    # --max_concurrency (optional): maximum number of concurrent requests

    By default, this script evaluates all tasks defined in the Tasks class in config.py. Example usage for evaluating a specific task can be found in eval_results.py, line 348.

  3. Export the results to CSV files and generate tables

    python -m eval.generate_table \
        --input_dir "Path/to/where/all/txt/files/stay" \
        --csv_file_dir "Path/to/the/target/directory/of/csv/file" \
        --table_file_dir "Path/to/the/target/directory/of/table/file"
    # --csv_file_dir (optional): defaults to input_dir

    This will generate an output.csv file under csv_file_dir and a table.txt file under table_file_dir.
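
    If you want to post-process the exported results, the CSV can be read with the standard library (a small sketch; the column layout depends on the evaluated tasks, so no column names are assumed here):

    # Read the exported output.csv (sketch; prints whatever header and rows are present).
    import csv
    from pathlib import Path

    csv_path = Path("Path/to/the/target/directory/of/csv/file") / "output.csv"  # placeholder directory

    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            print(row)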

Notes
  1. The QA evaluation relies on the ground-truth annotation file, defined in config.py under Tasks.QA.gt_file. Please update this path as well before running the evaluation.
  2. To customize your own evaluation task, add another instance of EvalTask under the Tasks class in config.py, as illustrated in the sketch below.
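
A minimal sketch of what such a customization might look like (illustrative only; hypothetical field names are used, and the real EvalTask and Tasks definitions live in eval/config.py and may differ):

# Illustrative sketch only: hypothetical field names; the actual EvalTask and
# Tasks definitions live in eval/config.py and may differ from what is shown.
from dataclasses import dataclass

@dataclass
class EvalTask:          # stand-in for the class defined in eval/config.py
    name: str
    gt_file: str

# Settings referenced in step 1 above.
API_KEY = "your-api-key"
BASE_URL = "https://api.openai.com/v1"   # your LLM endpoint base URL
MODEL_NAME = "gpt-4o-2024-08-06"         # recommended for alignment with reported results

# Registering an additional evaluation task (Note 2 above).
class Tasks:
    # existing tasks (emotion, attribute, caption, qa, ...) would already be defined here
    MY_TASK = EvalTask(
        name="my_task",
        gt_file="path/to/ground_truth_annotations.json",
    )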

⭐ 5. Star History

Star History Chart

📫 6. Contact

If you have any questions, please feel free to contact:

🏷️ 7. Citation

If you find this project useful, please consider citing:

@inproceedings{zhang2025VidEmo,
  author = {Zhang, Zhicheng and Wang, Weicheng and Zhu, Yongjie and Qin, Wenyu and Wan, Pengfei and Zhang, Di and Yang, Jufeng},
  title = {VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025}
}

🥰 8. Acknowledgements

This project stands on the shoulders of giants. We deeply appreciate the ms-swift library and its excellent codebase. Our dataset is built upon the following foundational resources in affective computing, and we sincerely thank the authors of these datasets:

AFEW AffWild2 CAER CASME
CAS(ME)² CASME2 CelebV-HQ CelebV-Text
DFEW FERV39K MAFW MEAD
MELD MER2023 MOSEI MOSI
PERR RAVDESS SIMS
