The official repo for "Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes", ECCV 2024

GeWu-Lab/Ref-AVS

Ref-AVS

>>> Introduction

In this paper, we propose a pixel-level segmentation task called Referring Audio-Visual Segmentation (Ref-AVS), which requires the network to densely predict whether each pixel corresponds to the given multimodal-cue expression, including dynamic audio-visual information.

  • The top-left of Fig. 1 highlights the distinctions between Ref-AVS and previous tasks. [Fig. 1: Teaser]

  • Fig. 2 shows the proposed baseline model for processing multimodal cues. [Fig. 2: Baseline]

  • Fig. 3 shows the statistics of the dataset. [Fig. 3: Statistics]
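
To make "densely predict whether each pixel corresponds to the expression" concrete, here is a minimal sketch of the final decision step for one frame. It is not the repository's model code: the function names, the 0.5 threshold, and the toy logits are assumptions for illustration only.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def logits_to_mask(logits, threshold=0.5):
    """Turn a 2-D grid of per-pixel logits into a binary mask.

    A pixel is marked 1 when its predicted probability of belonging to the
    referred object exceeds `threshold` (an assumed value), else 0.
    """
    return [[1 if sigmoid(v) > threshold else 0 for v in row] for row in logits]


# Toy 2x2 "frame" of hypothetical logits.
frame_logits = [[-2.0, 0.3], [1.5, -0.1]]
mask = logits_to_mask(frame_logits)
print(mask)  # [[0, 1], [1, 0]]
```

In a video setting the same per-pixel decision would be applied to every frame, giving one binary mask per frame.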

>>> Run

Run training and evaluation:

cd Ref_AVS
sh run.sh  # First update the path settings for your environment; see /configs/config.py for details.

You can download the checkpoint here.

Core dependencies:

transformers=4.30.2
towhee=1.1.3
towhee-models=1.1.3  # Towhee is used to extract VGGish audio features.
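
Before launching run.sh, it may help to confirm that the installed packages match the versions listed above. The following is a small sketch using only the Python standard library; the helper name `check_pins` is hypothetical, and it assumes the distribution names are exactly as written in the list.

```python
from importlib import metadata

# Versions copied from the "Core dependencies" list above.
PINNED = {
    "transformers": "4.30.2",
    "towhee": "1.1.3",
    "towhee-models": "1.1.3",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            problems[name] = (expected, installed)
    return problems


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If the script prints nothing, all three packages are installed at the listed versions.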

>>> FAQ

(1) Alternative Audio Feature Extraction

If Towhee is difficult to set up, consider using the following code with Google Colab instead: link.

Citation

If you find this work useful, please consider citing it:

@article{wang2024refavs,
  title={Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes},
  author={Wang, Yaoting and Sun, Peiwen and Zhou, Dongzhan and Li, Guangyao and Zhang, Honggang and Hu, Di},
  journal={IEEE European Conference on Computer Vision (ECCV)},
  year={2024},
}

@inproceedings{wang2024prompting,
  title={Prompting segmentation with sound is generalizable audio-visual source localizer},
  author={Wang, Yaoting and Liu, Weisong and Li, Guangyao and Ding, Jian and Hu, Di and Li, Xi},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={6},
  pages={5669--5677},
  year={2024}
}
