Authors
Tian-Zi Niu,Shan-Shan Dong,Zhen-Duo Chen,Xin Luo,Shanqing Guo,Zi Huang,Xin-Shun Xu
Abstract
Video captioning aims to automatically describe a video clip with informative sentences. At present, deep learning-based models have become the mainstream approach for this task and have achieved competitive results on public datasets. These methods usually leverage several types of features to generate sentences, e.g., semantic information and 2D or 3D visual features. However, some methods treat semantic information only as a complement to visual representations and cannot fully exploit it, while others ignore the relationships between different types of features. In addition, most of them sample multiple frames from a video with an equally spaced scheme, which introduces much redundant information. To address these issues, we present a novel video captioning framework, Semantic Enhanced video captioning with Multi-feature Fusion (SEMF for short). It optimizes the use of different types of features in three ways. First, a semantic encoder enhances meaningful semantic features through a semantic dictionary to boost performance. Second, a discrete selection module attends to important features and obtains different contexts at different decoding steps, reducing feature redundancy. Finally, a multi-feature fusion module uses a novel relation-aware attention mechanism to separate the common and complementary components of different features, providing more effective visual features for the next step. Moreover, the entire framework can be trained in an end-to-end manner. Extensive experiments are conducted on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets. The results demonstrate that SEMF achieves state-of-the-art results.