Local-enhanced Interaction for Temporal Moment Localization
Subjects
Computer Science
Artificial Intelligence
Computer Vision
Authors
Guoqiang Liang, Ji Shiyu, Yanning Zhang
Source
Venue: International Conference on Multimedia Retrieval · Date: 2021-08-24 · Pages: 201-209
Identifier
DOI:10.1145/3460426.3463616
Abstract
Temporal moment localization via language aims to localize the span in an untrimmed video that best matches a given natural language query. Most previous works either match the whole query feature against multiple moment proposals, or match a global video embedding against phrase- or word-level query features. However, such coarse interaction models become insufficient when the query-video pair contains more complex relationships. To address this issue, we propose a multi-branch interaction model for temporal moment localization. Specifically, the query sentence and the video are encoded into multiple feature embeddings over several semantic sub-spaces. Each phrase embedding then acts as a filter over the video features to generate an attention sequence, which is used to re-weight the video features. Moreover, a dynamic pointer decoder is developed to iteratively regress the temporal boundary, which prevents our model from falling into a local optimum. To validate the proposed method, we conduct extensive experiments on two popular benchmark datasets, Charades-STA and TACoS. Our performance surpasses other state-of-the-art methods, demonstrating the effectiveness of the proposed model.
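The phrase-guided re-weighting step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the general pattern the abstract names: a phrase embedding scores each video frame feature, the scores are normalized into an attention sequence, and the attention is used to re-weight the frame features. All function names, dimensions, and the dot-product similarity are illustrative assumptions.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def phrase_guided_reweight(phrase_emb, video_feats):
    """Re-weight per-frame video features by their similarity to a phrase embedding.

    phrase_emb  -- list[float], one phrase-level query embedding (assumed shape)
    video_feats -- list[list[float]], per-frame video features (T x d)
    Returns (attention weights over frames, re-weighted features).
    """
    # Dot-product similarity between the phrase and each frame (an assumption;
    # any learned similarity could be used here instead).
    scores = [sum(p * v for p, v in zip(phrase_emb, feat)) for feat in video_feats]
    attn = softmax(scores)
    # Scale each frame feature by its attention weight.
    reweighted = [[a * x for x in feat] for a, feat in zip(attn, video_feats)]
    return attn, reweighted
```

In a multi-branch model, one such attention sequence would be produced per semantic sub-space, so frames relevant to different phrases of the query are emphasized independently before boundary prediction.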