Visual Grounding/Localization

Resources: [yawenzeng/Awesome-Cross-Modal-Video-Moment-Retrieval], [Soldelli/Awesome-Temporal-Language-Grounding-in-Videos].

Survey

[2020 ICCST] A Survey of Temporal Activity Localization via Language in Untrimmed Videos, [paper], [bibtex].
[2021 ArXiv] A Survey on Natural Language Video Localization, [paper], [bibtex].
[2021 ArXiv] A Survey on Temporal Sentence Grounding in Videos, [paper], [bibtex].
[2022 ArXiv] The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions, [paper], [bibtex].

Temporal Video Grounding

[2017 ICCV] Localizing Moments in Video with Natural Language, [paper], [bibtex], sources: [LisaAnne/LocalizingMoments].
[2017 ICCV] TALL: Temporal Activity Localization via Language Query, [paper], [bibtex], sources: [jiyanggao/TALL].
[2018 EMNLP] Localizing Moments in Video with Temporal Language, [paper], [bibtex], [supplementary], sources: [LisaAnne/TemporalLanguageRelease].
[2018 EMNLP] Temporally Grounding Natural Sentence in Video, [paper], [bibtex].
[2018 ECCV] Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos, [paper], [bibtex], [homepage].
[2018 ECCV] Find and Focus: Retrieve and Localize Video Events with Natural Language Queries, [paper], [bibtex].
[2018 ACMMM] Cross-modal Moment Localization in Videos, [paper], [bibtex].
[2018 ArXiv] Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions, [paper], [bibtex], sources: [NeonKrypton/ASST].
[2018 ArXiv] Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning, [paper], [bibtex].
[2018 SIGIR] Attentive Moment Retrieval in Videos, [paper], [bibtex], [slides], [codes].
[2019 WACV] MAC: Mining Activity Concepts for Language-based Temporal Localization, [paper], [bibtex], sources: [runzhouge/MAC].
[2019 NAACL] ExCL: Extractive Clip Localization Using Natural Language Descriptions, [paper], [bibtex].
[2019 CVPR] MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment, [paper], [bibtex].
[2019 CVPR] Language-driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model, [paper], [bibtex].
[2019 AAAI] Localizing Natural Language in Videos, [paper], [bibtex].
[2019 AAAI] Multilevel Language and Vision Integration for Text-to-Clip Retrieval, [paper], [bibtex], sources: [VisionLearningGroup/Text-to-Clip_Retrieval].
[2019 AAAI] To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression, [paper], [bibtex].
[2019 AAAI] Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos, [paper], [bibtex].
[2019 AAAI] Semantic Proposal for Activity Localization in Videos via Sentence Query, [paper], [bibtex].
[2019 ACMMM] Exploiting Temporal Relationships in Video Moment Localization with Natural Language, [paper], [bibtex], sources: [Sy-Zhang/TCMN-Release].
[2019 SIGIR] Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos, [paper], [bibtex], sources: [ikuinen/CMIN_moment_retrieval].
[2019 EMNLP] DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization, [paper], [bibtex].
[2019 NeurIPS] Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos, [paper], [bibtex], sources: [yytzsy/SCDM].
[2020 BMVC] Tripping through time: Efficient Localization of Activities in Videos, [paper], [bibtex].
[2020 WACV] Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention, [paper], [bibtex], sources: [crodriguezo/TMLGA].
[2020 AAAI] Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language, [paper], [bibtex], sources: [microsoft/2D-TAN], [ChenJoya/2dtan].
[2020 AAAI] Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video, [paper], [bibtex], sources: [WuJie1010/TSP-PRL].
[2020 AAAI] Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction, [paper], [bibtex], sources: [JaywongWang/CBP].
[2020 AAAI] Rethinking the Bottom-Up Framework for Query-based Video Localization, [paper], [bibtex].
[2020 ACL] Span-based Localizing Network for Natural Language Video Localization, [paper], [bibtex], sources: [IsaacChanghau/VSLNet].
[2020 CVPR] Dense Regression Network for Video Grounding, [paper], [bibtex], [supplementary], sources: [Alvin-Zeng/DRN].
[2020 CVPR] Local-Global Video-Text Interactions for Temporal Grounding, [paper], [bibtex], sources: [JonghwanMun/LGI4temporalgrounding].
[2020 ECCV] Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos, [paper], [bibtex].
[2020 ACMMM] STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization, [paper], [bibtex].
[2020 ACMMM] Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization, [paper], [bibtex].
[2020 ACMMM] Dual Path Interaction Network for Video Moment Localization, [paper], [bibtex].
[2020 ACMMM] Fine-grained Iterative Attention Network for Temporal Language Localization in Videos, [paper], [bibtex].
[2020 ACMMM] Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization, [paper], [bibtex], sources: [liudaizong/CSMGAN].
[2020 COLING] Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network, [paper], [bibtex].
[2020 TPAMI] Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos, [paper], [bibtex], sources: [yytzsy/SCDM].
[2020 ArXiv] A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention, [paper], [bibtex].
[2020 ArXiv] Boundary-sensitive Pre-training for Temporal Localization in Videos, [paper], [bibtex].
[2020 ArXiv] Multi-Scale 2D Temporal Adjacent Networks for Moment Localization with Natural Language, [paper], [bibtex], sources: [microsoft/2D-TAN], [ChenJoya/2dtan].
[2020 ArXiv] VLG-Net: Video-Language Graph Matching Network for Video Grounding, [paper], [bibtex].
[2020 JNCA] Context-aware Network with Foreground Recalibration for Grounding Natural Language in Video, [paper], [bibtex].
[2021 WACV] DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video, [paper], [bibtex], sources: [crodriguezo/dori].
[2021 AAAI] Boundary Proposal Network for Two-Stage Natural Language Video Localization, [paper], [bibtex].
[2021 AAAI] Proposal-Free Video Grounding with Contextual Pyramid Network, [paper], [bibtex].
[2021 TMM] Frame-wise Cross-modal Match for Video Moment Retrieval, [paper], [bibtex], sources: [tanghaoyu258/ACRM-for-moment-retrieval].
[2021 TIP] Interaction-Integrated Network for Natural Language Moment Localization, [paper], [bibtex].
[2021 TPAMI] Natural Language Video Localization: A Revisit in Span-based Question Answering Framework, [paper], [bibtex].
[2021 CVPR] Context-aware Biaffine Localizing Network for Temporal Sentence Grounding, [paper], [bibtex], sources: [liudaizong/CBLN].
[2021 CVPR] Interventional Video Grounding with Dual Contrastive Learning, [paper], [bibtex], [supplementary], sources: [nanguoshun/IVG].
[2021 CVPR] Cascaded Prediction Network via Segment Tree for Temporal Video Grounding, [paper], [bibtex], [supplementary].
[2021 CVPR] Multi-stage Aggregated Transformer Network for Temporal Language Localization in Videos, [paper], [bibtex].
[2021 ACL] Parallel Attention Network with Sequence Matching for Video Grounding, [paper], [bibtex], sources: [IsaacChanghau/SeqPAN].
[2021 EMNLP] Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos, [paper], [bibtex].
[2021 EMNLP] Natural Language Video Localization with Learnable Moment Proposals, [paper], [bibtex].
[2021 EMNLP] Relation-aware Video Reading Comprehension for Temporal Language Grounding, [paper], [bibtex], sources: [Huntersxsx/RaNet].
[2021 ICME] Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval, [paper], [bibtex].
[2021 ICCV] Fast Video Moment Retrieval, [paper], [bibtex].
[2021 ArXiv] Progressive Localization Networks for Language-based Moment Localization, [paper], [bibtex].
[2021 ArXiv] Decoupled Spatial Temporal Graphs for Generic Visual Grounding, [paper], [bibtex].
[2021 ArXiv] VGNMN: Video-grounded Neural Module Network to Video-Grounded Language Tasks, [paper], [bibtex].
[2021 ArXiv] End-to-end Multi-modal Video Temporal Grounding, [paper], [bibtex].
[2021 ArXiv] QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries, [paper], [bibtex], sources: [jayleicn/moment_detr].
[2021 ArXiv] Hierarchical Deep Residual Reasoning for Temporal Moment Localization, [paper], [bibtex].
[2021 ArXiv] MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions, [paper], [bibtex], sources: [Soldelli/MAD].
[2022 Neurocomputing] STCM-Net: A Symmetrical One-stage Network for Temporal Language Localization in Videos, [paper], [bibtex].
[2022 ArXiv] Explore and Match: End-to-End Video Grounding with Transformer, [paper], [bibtex].

Weakly/Self/Semi/Un- Supervised Temporal Video Grounding

[2015 ICCV] Weakly-Supervised Alignment of Video With Text, [paper], [bibtex].
[2018 NeurIPS] Weakly Supervised Dense Event Captioning in Videos, [paper], [bibtex], sources: [XgDuan/WSDEC].
[2019 CVPR] Weakly Supervised Video Moment Retrieval From Text Queries, [paper], [bibtex], sources: [niluthpol/weak_supervised_video_moment].
[2019 EMNLP] WSLLN: Weakly Supervised Natural Language Localization Networks, [paper], [bibtex].
[2020 AAAI] Weakly-Supervised Video Moment Retrieval via Semantic Completion Network, [paper], [bibtex].
[2020 ArXiv] Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video, [paper], [bibtex].
[2020 ArXiv] Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos, [paper], [bibtex].
[2020 ACMMM] Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos, [paper], [bibtex], sources: [ikuinen/regularized_two-branch_proposal_network].
[2020 ACMMM] Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos, [paper], [bibtex].
[2021 WACV] LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval, [paper], [bibtex].
[2021 ACMMM] AsyNCE: Disentangling False-Positives for Weakly-Supervised Video Grounding, [paper], [bibtex].
[2021 ACMMM] Towards Bridging Video and Language by Caption Generation and Sentence Localization, [paper], [bibtex].
[2021 ACMMM] Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval, [paper], [bibtex].
[2021 CVPR] Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning, [paper], [bibtex], [supplementary].
[2021 ICCV] Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation, [paper], [bibtex].
[2021 ArXiv] Self-supervised Learning for Semi-supervised Temporal Language Grounding, [paper], [bibtex].
[2021 EMNLP] Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding, [paper], [bibtex].
[2022 AAAI] Unsupervised Temporal Video Grounding with Deep Semantic Clustering, [paper], [bibtex].

Bias in Temporal Video Grounding

[2020 BMVC] Uncovering Hidden Challenges in Query-Based Video Moment Retrieval, [paper], [bibtex], [homepage], sources: [mayu-ot/hidden-challenges-MR].
[2021 ArXiv] A Closer Look at Temporal Sentence Grounding in Videos: Datasets and Metrics, [paper], [bibtex], sources: [yytzsy/grounding_changing_distribution].
[2021 CVPR] Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding, [paper], [bibtex].
[2021 SIGIR] Deconfounded Video Moment Retrieval with Causal Intervention, [paper], [bibtex], sources: [Xun-Yang/Causal_Video_Moment_Retrieval].

Spatio-Temporal Video Grounding

[2019 ACL] Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video, [paper], [bibtex], [supplementary], sources: [JeffCHEN2017/WSSTG].
[2020 CVPR] Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences, [paper], [bibtex], sources: [Guaranteer/VidSTG-Dataset].
[2020 ArXiv] Human-centric Spatio-Temporal Video Grounding With Visual Transformers, [paper], [bibtex], sources: [tzhhhh123/HC-STVG].
[2020 ACMMM] Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos, [paper], [bibtex].
[2021 ICCV] STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding, [paper], [bibtex].
[2022 CVPR] TubeDETR: Spatio-Temporal Video Grounding with Transformers, [paper], [bibtex], [homepage], sources: [antoyang/TubeDETR].

Video Corpus Moment Retrieval (Video Retrieval + Moment Localization)

[2019 ICCV] Temporal Localization of Moments in Video Collections with Natural Language, [paper], [bibtex], sources: [escorciav/moments-retrieval-page].
[2020 ECCV] TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval, [paper], [bibtex], [homepage], sources: [jayleicn/TVRetrieval].
[2020 EMNLP] HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, [paper], [bibtex], sources: [linjieli222/HERO].
[2020 ArXiv] A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus, [paper], [bibtex].
[2021 SIGIR] Video Corpus Moment Retrieval with Contrastive Learning, [paper], [bibtex], sources: [IsaacChanghau/ReLoCLNet].
[2021 ACL] mTVR: Multilingual Moment Retrieval in Videos, [paper], [bibtex], sources: [jayleicn/mTVRetrieval].
