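Discriminative
Computer science
Artificial intelligence
Sample (material)
Video quality
Annotation
Coding (set theory)
Moment (physics)
Quality (philosophy)
Pattern recognition (psychology)
Metric (unit)
Chemistry
Operations management
Physics
Set (abstract data type)
Chromatography
Classical mechanics
Economics
Programming language
Philosophy
Epistemology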
Authors
Minghang Zheng, Yanjie Huang, Qing-Chao Chen, Yang Liu
Source
Journal: Proceedings of the ... AAAI Conference on Artificial Intelligence
[Association for the Advancement of Artificial Intelligence (AAAI)]
Date: 2022-06-28
Volume (issue): 36 (3): 3517-3525
Citations: 52
Identifier
DOI: 10.1609/aaai.v36i3.20263
Abstract
Video moment localization aims to localize the video segments most related to a given free-form natural language query. The weakly supervised setting, where only video-level descriptions are available during training, is attracting increasing attention because of its lower annotation cost. Prior weakly supervised methods mainly use sliding windows to generate temporal proposals, which are independent of video content and of low quality, and train the model to distinguish matched video-query pairs from unmatched ones collected from different videos, neglecting that what the model actually needs is to distinguish the unaligned segments within the same video. In this work, we propose a novel weakly supervised solution by introducing Contrastive Negative sample Mining (CNM). Specifically, we use a learnable Gaussian mask to generate positive samples, highlighting the video frames most related to the query, and treat the other frames of the video and the whole video as easy and hard negative samples, respectively. We then train our network with the Intra-Video Contrastive loss to make our positive and negative samples more discriminative. Our method has two advantages: (1) Our proposal generation process with a learnable Gaussian mask is more efficient and yields higher-quality positive samples. (2) The more difficult intra-video negative samples enable our model to distinguish highly confusing scenes. Experiments on two datasets show the effectiveness of our method. Code can be found at https://github.com/minghangz/cnm.
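The sample construction described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository linked above for that). The function names, the `sigma_scale` parameter, the choice of the complement of the mask for "other frames", the margin values, and the margin-based form of the loss are all assumptions made for illustration; the abstract does not specify them. The per-sample scores are taken as given numbers, where a lower score is assumed to mean a better match to the query.

```python
import math


def gaussian_mask(num_frames, center, width, sigma_scale=3.0):
    """Soft per-frame weights peaking at `center`.

    `center` and `width` are fractions of the video length in [0, 1].
    `sigma_scale` is a hypothetical hyperparameter, not taken from the paper.
    """
    sigma = max(width / sigma_scale, 1e-6)
    denom = max(num_frames - 1, 1)  # avoid division by zero for 1-frame videos
    return [
        math.exp(-((i / denom - center) ** 2) / (2 * sigma ** 2))
        for i in range(num_frames)
    ]


def build_samples(num_frames, center, width):
    """Return (positive, easy negative, hard negative) frame masks."""
    positive = gaussian_mask(num_frames, center, width)
    # Assumption: "other frames of the video" is modelled as the mask complement.
    easy_negative = [1.0 - m for m in positive]
    # The whole video: every frame weighted equally.
    hard_negative = [1.0] * num_frames
    return positive, easy_negative, hard_negative


def intra_video_contrastive_loss(score_pos, score_easy, score_hard,
                                 margin_easy=0.2, margin_hard=0.1):
    """Assumed hinge form: the positive score should be lower than each
    negative score by at least the corresponding margin."""
    hard_term = max(score_pos - score_hard + margin_hard, 0.0)
    easy_term = max(score_pos - score_easy + margin_easy, 0.0)
    return hard_term + easy_term


if __name__ == "__main__":
    pos, easy, hard = build_samples(num_frames=11, center=0.5, width=0.3)
    print([round(m, 3) for m in pos])
    print(intra_video_contrastive_loss(1.0, 2.0, 1.5))
```

In this sketch the hard negative differs from the positive only in that it also includes the unrelated frames, which is why it is harder to separate than the easy negative and is given the smaller margin.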