

Relation-aware Video Reading Comprehension for Temporal Language Grounding
Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, Bernard Ghanem

Abstract: Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework aims to select a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced by leveraging graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution.

Anthology ID: 2021.emnlp-main.324
Volume: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2021
Address: Online and Punta Cana, Dominican Republic
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 3978–3988
DOI: 10.18653/v1/2021.emnlp-main.324
Bibkey: gao-etal-2021-relation
Code: Huntersxsx/RaNet
Data: ActivityNet Captions, Charades, TACoS

Cite (ACL): Jialin Gao, Xin Sun, Mengmeng Xu, Xi Zhou, and Bernard Ghanem. 2021. Relation-aware Video Reading Comprehension for Temporal Language Grounding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3978–3988, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal): Relation-aware Video Reading Comprehension for Temporal Language Grounding (Gao et al., EMNLP 2021)
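The multi-choice formulation treats every candidate temporal span as one "answer choice" in a predefined set. A minimal sketch of how such an answer set could be enumerated over clip indices (the function name and the exhaustive start/end enumeration are illustrative assumptions, not taken from the RaNet code):

```python
# Sketch: enumerating a predefined answer set of moment "choices".
# Assumption: a video split into num_clips clips, with every (start, end)
# clip-index pair where start <= end forming one candidate moment.

def build_choice_set(num_clips):
    """Return all (start, end) clip-index pairs with start <= end."""
    return [(s, e) for s in range(num_clips)
                   for e in range(s, num_clips)]

choices = build_choice_set(4)
# 4 clips yield 4 + 3 + 2 + 1 = 10 candidate moments
```

Grounding then reduces to scoring these choices against the query and selecting the best one, rather than regressing boundaries directly.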

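The coarse-and-fine choice-query interaction described in the abstract matches each moment choice against the query at two granularities: the whole sentence and individual tokens. A toy sketch under simplifying assumptions (dot-product similarity, mean-pooled sentence vector, max over tokens; function and variable names are hypothetical):

```python
import numpy as np

def choice_query_scores(choice_feats, token_feats):
    """Score each moment choice against the query at two levels:
    coarse (sentence-moment) and fine (token-moment), then combine."""
    sent = token_feats.mean(axis=0)                    # coarse: pooled sentence vector
    coarse = choice_feats @ sent                       # sentence-moment similarity
    fine = (choice_feats @ token_feats.T).max(axis=1)  # best token-moment similarity
    return coarse + fine                               # combined matching score

rng = np.random.default_rng(0)
choice_feats = rng.standard_normal((10, 4))  # 10 moment choices, 4-dim features
token_feats = rng.standard_normal((5, 4))    # 5 query tokens, 4-dim features
scores = choice_query_scores(choice_feats, token_feats)
```

The paper's interactor learns this cross-modal matching rather than using fixed dot products, but the two-level (coarse/fine) structure is the same idea.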
Abstract Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence.
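The multi-choice relation constructor uses graph convolution to let moment choices exchange information before the best one is selected. A minimal, generic graph-convolution step in the spirit of that idea (mean-aggregation GCN layer; the adjacency, feature sizes, and function name are illustrative assumptions, not the paper's exact constructor):

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step over moment-choice nodes:
    add self-loops, mean-aggregate neighbor features,
    then apply a linear map and ReLU."""
    adj = adj + np.eye(adj.shape[0])      # self-loops so each node keeps its own feature
    deg = adj.sum(axis=1, keepdims=True)  # node degrees for mean normalization
    h = (adj / deg) @ feats @ weight      # aggregate neighbors, then project
    return np.maximum(h, 0.0)             # ReLU nonlinearity

# toy example: 3 moment choices connected in a chain, 2-dim features
rng = np.random.default_rng(0)
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
feats = rng.standard_normal((3, 2))
weight = rng.standard_normal((2, 2))
out = gcn_layer(adj, feats, weight)
```

Stacking such layers lets each choice's score reflect its overlapping or adjacent alternatives, which is the dependency-capturing role the abstract attributes to the relation constructor.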
