Computer science
Discriminative
Artificial intelligence
Feature learning
Moment (physics)
Feature (linguistics)
Leverage (statistics)
Feature vector
Pattern recognition (psychology)
Semantic gap
Embedding
Computer vision
Image retrieval
Image (mathematics)
Classical mechanics
Linguistics
Physics
Philosophy
Authors
Zheng Wang, Jingjing Chen, Yu-Gang Jiang
Identifier
DOI:10.1145/3474085.3475278
Abstract
Video moment retrieval aims to localize the video moment most relevant to a given text query. Weakly supervised approaches train on video-text pairs alone, without temporal annotations. Most current methods align the proposed video moment and the text in a joint embedding space. However, lacking temporal annotations, the semantic gap between the two modalities leads most methods to focus on learning a joint feature representation, with less emphasis on learning the visual feature representation itself. This paper aims to improve the visual feature representation with supervision in the visual domain, obtaining discriminative visual features for cross-modal learning. We observe that relevant video moments (i.e., moments sharing similar activities) from different videos are commonly described by similar sentences; the visual features of these relevant moments should therefore also be similar even though they come from different videos. To obtain more discriminative and robust visual features for video moment retrieval, we propose to align the visual features of relevant video moments from different videos that co-occur in the same training batch. In addition, a contrastive learning approach is introduced to learn the moment-level alignment of these videos. Through extensive experiments, we demonstrate that the proposed visual co-occurrence alignment learning method outperforms its cross-modal alignment learning counterpart and achieves promising results for video moment retrieval.
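The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of what a within-batch, moment-level contrastive alignment of this kind could look like: visual features of moments from different videos in the same batch are pulled together when their paired sentences are similar. The function name, the construction of pos_mask (positives inferred from sentence similarity), and the temperature value are illustrative assumptions, not the authors' code.

import torch
import torch.nn.functional as F

# Hypothetical sketch of within-batch, moment-level contrastive alignment;
# not the authors' released implementation.
def co_occurrence_contrastive_loss(moment_feats, pos_mask, temperature=0.07):
    """InfoNCE-style loss over moment features within a training batch.

    moment_feats: (B, D) visual feature of one proposed moment per video.
    pos_mask:     (B, B) bool; pos_mask[i, j] is True when moments i and j
                  are treated as relevant (e.g., their paired sentences are
                  similar), so their visual features are pulled together.
    """
    feats = F.normalize(moment_feats, dim=-1)      # cosine-similarity space
    sim = feats @ feats.t() / temperature          # (B, B) similarity logits
    # Exclude self-similarity from both positives and the normalizer.
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = pos_mask & ~eye
    # Average log-probability of the positives for each anchor that has any.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts)
    has_pos = pos_mask.any(dim=1)
    return loss[has_pos].mean() if has_pos.any() else sim.new_zeros(())

# Example: batch of 4 moments; moments 0 and 2 share similar query sentences.
feats = torch.randn(4, 512, requires_grad=True)
mask = torch.zeros(4, 4, dtype=torch.bool)
mask[0, 2] = mask[2, 0] = True
co_occurrence_contrastive_loss(feats, mask).backward()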