BMIP: Bi-directional Modality Interaction Prompt Learning for VLM
Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Ming Yang, Lan-Zhe Guo
Official implementation of the paper "BMIP: Bi-directional Modality Interaction Prompt Learning for VLM".
- (Aug 16, 2025) Paper accepted at IJCAI 2025 🎉
- (Aug 14, 2025) The repository also supports CoOp, Co-CoOp, Deep Vision Prompting, Deep Language Prompting, and Independent V-L Prompting architectures.
Abstract: Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for its ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called Bi-directional Modality Interaction Prompt (BMIP), which dynamically weights bi-modal information by learning information from the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization, complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.
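To give a flavor of the idea, below is a minimal, hypothetical PyTorch sketch of attention-conditioned bi-directional prompt aggregation. All names, shapes, and the gating function are illustrative assumptions rather than the authors' implementation; please refer to the paper and the trainer code in this repository for the actual method.

```python
import torch
import torch.nn as nn

class BiModalPromptInteraction(nn.Module):
    """Illustrative sketch only (not the official BMIP module).

    Couples vision and language prompts in both directions and aggregates
    them with weights predicted from attention-layer information, instead
    of a fixed or uni-directional coupling. Assumes the same number of
    prompt tokens `n` in both modalities.
    """

    def __init__(self, dim_v: int = 768, dim_t: int = 512):
        super().__init__()
        self.t2v = nn.Linear(dim_t, dim_v)  # language -> vision projection
        self.v2t = nn.Linear(dim_v, dim_t)  # vision -> language projection
        # Tiny gates mapping a per-prompt attention statistic to a weight in (0, 1).
        self.gate_v = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())
        self.gate_t = nn.Sequential(nn.Linear(1, 1), nn.Sigmoid())

    def forward(self, vis_prompts, txt_prompts, vis_attn, txt_attn):
        # vis_prompts: (n, dim_v); txt_prompts: (n, dim_t)
        # vis_attn / txt_attn: (n, 1) per-prompt attention statistics taken from
        # the current encoder layer (e.g. mean attention mass on each prompt token).
        w_v = self.gate_v(vis_attn)  # (n, 1) learned aggregation weights
        w_t = self.gate_t(txt_attn)
        new_vis = w_v * vis_prompts + (1.0 - w_v) * self.t2v(txt_prompts)
        new_txt = w_t * txt_prompts + (1.0 - w_t) * self.v2t(vis_prompts)
        return new_vis, new_txt

# Toy usage with random tensors.
layer = BiModalPromptInteraction()
v, t = layer(torch.randn(2, 768), torch.randn(2, 512),
             torch.rand(2, 1), torch.rand(2, 1))
```

For comparison, MaPLe couples prompts only from language to vision; the point of the sketch is simply to show where learned, attention-conditioned weights replace a fixed one-way coupling.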
- Novel Bi-directional Modality Interaction Technique
  - Enhances cross-modality alignment and paves the way for further exploration of information aggregation in other multi-modal models.
- New Evaluation Paradigm: Open-World Generalization
  - Facilitates more realistic evaluations and promotes related research.
- Flexible Integration with Other Methods
  - BMIP is flexible enough to combine with other prompt learning methods, consistently boosting their performance.
- State-of-the-Art Performance
  - BMIP achieves SOTA performance across all three evaluation paradigms.
Method | Paper | Configs | Training Scripts |
---|---|---|---|
BMIP | IJCAI 2025 | link | link |
MaPLe | CVPR 2023 | link | link |
CoOp | IJCV 2022 | link | link |
Co-CoOp | CVPR 2022 | link | link |
Deep Vision Prompting | - | link | link |
Deep Language Prompting | - | link | link |
Independent V-L Prompting | - | link | link |
Results reported below show accuracy on base and novel classes across 11 recognition datasets, averaged over 3 seeds. HM denotes the harmonic mean of base and novel accuracy.
Name | Base Acc. | Novel Acc. | HM | Epochs |
---|---|---|---|---|
CLIP | 69.34 | 74.22 | 71.70 | - |
CoOp | 82.69 | 63.22 | 71.66 | 200 |
CoCoOp | 80.47 | 71.69 | 75.83 | 10 |
MaPLe | 82.28 | 75.14 | 78.55 | 5 |
BMIP | 83.47 | 76.69 | 79.04 | 10 |
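As a quick sanity check on the HM column, the harmonic mean can be computed as follows:

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean (HM) of base-class and novel-class accuracy."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

# Example: the MaPLe row above.
print(round(harmonic_mean(82.28, 75.14), 2))  # 78.55
```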
For installation and other package requirements, please follow the instructions detailed in INSTALL.md.
Please follow the instructions at DATASETS.md to prepare all datasets.
Please refer to RUN.md for detailed instructions on training, evaluating, and reproducing the results using our pre-trained models.
If you use our work, please consider citing:
@misc{lv2025bmipbidirectionalmodalityinteraction,
  title={BMIP: Bi-directional Modality Interaction Prompt Learning for VLM},
  author={Song-Lin Lv and Yu-Yang Chen and Zhi Zhou and Ming Yang and Lan-Zhe Guo},
  year={2025},
  eprint={2501.07769},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2501.07769},
}
If you have any questions, please create an issue on this repository or contact us at [email protected].
Our code is based on the Co-CoOp and CoOp repositories. We thank the authors for releasing their code. If you use our model and code, please consider citing these works as well.