Closed captioning
Computer science
Encoder
Transformer (deep learning)
Modality
Artificial intelligence
Natural language processing
Speech recognition
Image (mathematics)
Quantum mechanics
Operating system
Physics
Voltage
Chemistry
Polymer chemistry
Authors
Yulai Xie, Jingjing Niu, Yang Zhang, Fang Ren
Identifier
DOI: 10.1109/TMM.2023.3307972
Abstract
Dense video captioning aims to detect all events in an untrimmed video and generate a textual caption for each event. Multi-modal information is essential to improving performance on this task, but existing methods rely mainly on single visual or dual audio-visual inputs while completely ignoring the text modality (subtitles). Since text data has a representation similar to that of video caption words, it can help improve video captioning performance. In this paper, we propose a novel framework, called the multi-stage fusion transformer network (MS-FTN), that realizes multi-modal dense video captioning by fusing text, audio, and visual features in stages. We present a multi-stage feature fusion encoder that first fuses the audio and visual modalities at a lower level and then fuses them with a global-shared text representation at a higher level to generate a set of complementary multi-modal context features. In addition, an anchor-free event proposal module is proposed to generate a set of event proposals efficiently, without complex anchor computation. Extensive experiments on subsets of the ActivityNet Captions dataset show that the proposed MS-FTN achieves superior performance with efficient computation. Moreover, ablation studies demonstrate that the global-shared text representation is better suited to multi-modal dense video captioning. Our code and data are available at https://github.com/xieyulai/GS-MS-FTN.
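The abstract describes two components that a short sketch can make concrete: a staged fusion encoder (audio-visual fusion first, then fusion with a global text representation) and an anchor-free event proposal head. The following is a minimal illustrative sketch, not the authors' implementation (see the linked repository for that); the module names `MultiStageFusionEncoder` and `AnchorFreeProposalHead`, the feature dimensions, and the choice of cross-attention as the fusion operator are all assumptions made here for clarity.

```python
# Hedged sketch of the staged-fusion idea from the abstract; names, dimensions,
# and the cross-attention fusion operator are assumptions, not the paper's code.
import torch
import torch.nn as nn


class MultiStageFusionEncoder(nn.Module):
    """Stage 1: fuse audio and visual features (lower level).
    Stage 2: fuse the result with a global-shared text vector (higher level)."""

    def __init__(self, d_model: int = 256, num_heads: int = 4):
        super().__init__()
        # Lower-level fusion: visual queries attend to audio keys/values.
        self.av_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Higher-level fusion: audio-visual features attend to the text representation.
        self.text_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, visual, audio, text_global):
        # visual: (B, Tv, D), audio: (B, Ta, D), text_global: (B, 1, D)
        av, _ = self.av_attn(query=visual, key=audio, value=audio)
        av = self.norm1(visual + av)              # residual audio-visual fusion
        ctx, _ = self.text_attn(query=av, key=text_global, value=text_global)
        return self.norm2(av + ctx)               # multi-modal context features


class AnchorFreeProposalHead(nn.Module):
    """Per-timestep event confidence and (start, end) offsets, avoiding the
    anchor enumeration used by anchor-based proposal generators."""

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.conf = nn.Conv1d(d_model, 1, kernel_size=3, padding=1)
        self.offsets = nn.Conv1d(d_model, 2, kernel_size=3, padding=1)

    def forward(self, ctx):
        x = ctx.transpose(1, 2)                    # Conv1d expects (B, D, T)
        conf = torch.sigmoid(self.conf(x)).squeeze(1)           # (B, T)
        offsets = torch.relu(self.offsets(x)).transpose(1, 2)   # (B, T, 2)
        return conf, offsets


if __name__ == "__main__":
    B, Tv, Ta, D = 2, 30, 40, 256
    enc = MultiStageFusionEncoder(d_model=D)
    head = AnchorFreeProposalHead(d_model=D)
    ctx = enc(torch.randn(B, Tv, D), torch.randn(B, Ta, D), torch.randn(B, 1, D))
    conf, offsets = head(ctx)
    print(ctx.shape, conf.shape, offsets.shape)
    # torch.Size([2, 30, 256]) torch.Size([2, 30]) torch.Size([2, 30, 2])
```

Under these assumptions, each temporal position directly regresses its event boundaries, which is what lets an anchor-free head skip the per-anchor score computation the abstract refers to.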