A simple and efficient framework for change detection and captioning tasks.
Duowang Zhu1, Xiaohu Huang2, Haiyan Huang1, Hao Zhou3, and Zhenfeng Shao1*
1 Wuhan University 2 The University of Hong Kong 3 Bytedance
- Unified Framework: Supports multiple change detection and captioning tasks.
- Highly Efficient: Uses ~6–13% of the parameters and ~8–34% of the FLOPs compared to SOTA.
- SOTA Performance: Achieves SOTA performance without complex structures, offering an alternative to 2D models.
- [2025.03.25] We have released all the training code for Change3D!
- [2025.02.27] Change3D has been accepted by CVPR 2025! 🎉🎉
We present Change3D, a unified video-based framework for change detection and captioning. Unlike traditional methods that use separate image encoders and multiple change extractors, Change3D treats bi-temporal images as a short video with learnable perception frames. A video encoder enables direct interaction and difference detection, simplifying the architecture. Our approach supports various tasks, including binary change detection (BCD), semantic change detection (SCD), building damage assessment (BDA), and change captioning (CC). Evaluated on eight benchmarks, Change3D outperforms SOTA methods while using only ~6%–13% of the parameters and ~8%–34% of the FLOPs.
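The core idea can be sketched in a few lines of PyTorch: stack the two temporal images together with a learnable perception frame into a short "video" and pass it through a 3D encoder, so temporal interaction happens inside the backbone rather than in a separate change extractor. This is a minimal illustrative sketch, not the repository's actual modules (the class and parameter names below, and the single 3D conv standing in for the X3D backbone, are all hypothetical):

```python
import torch
import torch.nn as nn

class TinyVideoEncoder(nn.Module):
    """Illustrative stand-in for Change3D's video encoder (hypothetical)."""

    def __init__(self, channels=3, dim=16):
        super().__init__()
        # Learnable perception frame inserted between the T1 and T2 frames.
        self.perception = nn.Parameter(torch.zeros(1, channels, 1, 256, 256))
        # A single 3D conv stands in for the X3D video backbone.
        self.conv3d = nn.Conv3d(channels, dim, kernel_size=(3, 3, 3),
                                padding=(0, 1, 1))

    def forward(self, t1, t2):
        b = t1.size(0)
        frames = torch.stack([t1, t2], dim=2)          # (B, C, 2, H, W)
        p = self.perception.expand(b, -1, -1, -1, -1)  # broadcast to batch
        # Assemble the 3-frame video: T1, perception frame, T2.
        video = torch.cat([frames[:, :, :1], p, frames[:, :, 1:]], dim=2)
        feat = self.conv3d(video)   # temporal kernel of 3 collapses T to 1
        return feat.squeeze(2)      # (B, dim, H, W) change-aware feature

enc = TinyVideoEncoder()
t1 = torch.randn(2, 3, 256, 256)
t2 = torch.randn(2, 3, 256, 256)
out = enc(t1, t2)
print(out.shape)  # torch.Size([2, 16, 256, 256])
```

Because the perception frame sits between the two images, the 3D convolution's temporal receptive field covers both time steps at once, which is what removes the need for a separate difference module.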
Figure 1. Overall architectures of Change3D for Binary Change Detection, Semantic Change Detection, Building Damage Assessment, and Change Captioning.
We conduct extensive experiments on eight public datasets: LEVIR-CD, WHU-CD, CLCD, HRSCD, SECOND, xBD, LEVIR-CC, and DUBAI-CC.
conda create -n Change3D python=3.11.0
conda activate Change3D
pip install -r requirements.txt
Download the X3D-L weight and put it into the root directory.
- For BCD: Download the LEVIR-CD, WHU-CD, and CLCD datasets. Organize each dataset into the following structure and crop every image into 256x256 patches.
├─Train
│  ├─t1      jpg/png (input image of T1)
│  ├─t2      jpg/png (input image of T2)
│  └─label   jpg/png (binary change mask)
├─Val
│  ├─t1
│  ├─t2
│  └─label
└─Test
   ├─t1
   ├─t2
   └─label
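Cropping into non-overlapping 256x256 patches can be done with a short helper like the one below. This is an illustrative sketch (the repository may provide its own preprocessing script); it assumes images load as H x W x C NumPy arrays and silently drops any border remainder smaller than a patch:

```python
import numpy as np

def crop_into_patches(img, patch=256):
    """Split an H x W x C array into non-overlapping patch x patch tiles.
    Illustrative helper; border pixels that do not fill a full tile
    are discarded."""
    h, w = img.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append(img[y:y + patch, x:x + patch])
    return tiles

# e.g. a 1024x1024 image yields a 4x4 grid of 16 patches
tiles = crop_into_patches(np.zeros((1024, 1024, 3), dtype=np.uint8))
print(len(tiles))  # 16
```

The same helper would apply to the t1/t2 images and their masks, as long as every file in a sample is cropped with identical offsets so patches stay aligned across folders.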
- For SCD: Download the HRSCD and SECOND datasets. Organize each dataset into the following structure and crop every image into 256x256 patches.
├─Train
│  ├─t1        jpg/png (input image of T1)
│  ├─t2        jpg/png (input image of T2)
│  ├─label1    jpg/png (semantic mask of T1)
│  ├─label2    jpg/png (semantic mask of T2)
│  └─change    jpg/png (binary change mask)
├─...
└─Test
   ├─t1
   ├─t2
   ├─label1
   ├─label2
   └─change
- For BDA: Download the xBD dataset. Organize it into the following structure and crop every image into 256x256 patches.
├─Train
│  ├─t1        jpg/png (input image of T1)
│  ├─t2        jpg/png (input image of T2)
│  ├─label1    jpg/png (damage localization mask)
│  └─label2    jpg/png (damage level mask)
├─...
└─Test
   ├─t1
   ├─t2
   ├─label1
   └─label2
- For CC: Download the LEVIR-CC and DUBAI-CC datasets, then follow the preprocessing practice introduced in RSICCformer.
Training binary change detection with the LEVIR-CD dataset as an example:
python ./scripts/train_BCD.py --dataset LEVIR-CD \
    --file_root path/to/LEVIR-CD \
    --pretrained path/to/X3D_L.pyth \
    --save_dir ./exp \
    --gpu_id 0
Note: The above training script runs evaluation automatically after training.
This repository is mainly built upon pytorchvideo and RSICCformer. Thanks to those well-organized codebases.
If you have any issues while using the project, please feel free to contact me: [email protected].
Change3D is released under the CC BY-NC-SA 4.0 license.
If you find our work useful, please consider citing our paper:
@inproceedings{zhu2025change3d,
title={Change3D: Revisiting Change Detection and Captioning from A Video Modeling Perspective},
author={Zhu, Duowang and Huang, Xiaohu and Huang, Haiyan and Zhou, Hao and Shao, Zhenfeng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}