Keywords
Closed captioning
Computer science
Modality
Event
Encoder
Transformer
Natural language processing
Artificial intelligence
Generator
Task
Feature
Speech recognition
Authors
Jingjing Niu, Yulai Xie, Yang Zhang, Jinyu Zhang, Yanfei Zhang, Xiao Liu, Fang Ren
Identifier
DOI: 10.1142/s021800142255014x
Abstract
Multi-modal dense video captioning is the task of using multiple sources of information to detect all meaningful events in a video and generate a textual description for each event. Existing work on dense video captioning relies mainly on a single visual modality or on dual audio-visual modalities, while completely ignoring the text modality (subtitles). The text modality has a data structure similar to that of video captions and provides direct semantic information for describing a video's content. In this paper, we propose a novel framework, the Two-Stage Cross-Modal Encoding Transformer Network (TS-CMETN), which addresses the multi-modal dense video captioning task by fusing audio, visual, and text features. First, we design a two-stage feature fusion encoder that hierarchically performs intra- and inter-modal information interaction. Second, we propose an anchor-free temporal event proposal module that efficiently generates event proposals at each time step without complex anchor computation. Extensive experiments on the ActivityNet Captions dataset show that the proposed framework achieves strong performance. Moreover, our approach adaptively handles cases where the text modality is missing. Our code and data are available at https://github.com/xieyulai/TM-CMETN.
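The abstract names two components: a two-stage encoder (intra-modal interaction followed by inter-modal interaction) and an anchor-free proposal head that predicts event boundaries at every time step. The sketch below illustrates these two ideas in PyTorch. All module names, dimensions, and design details (the residual cross-attention fusion, the start/end-offset parameterization) are illustrative assumptions, not the paper's actual implementation; see the linked repository for that.

```python
# Minimal sketch of the two ideas in the abstract. Everything here is an
# assumption for illustration, not the TS-CMETN architecture itself.
import torch
import torch.nn as nn


class TwoStageFusionEncoder(nn.Module):
    """Stage 1: intra-modal self-attention per modality.
    Stage 2: inter-modal cross-attention, where each modality attends to
    the concatenated features of the other modalities."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # One self-attention encoder layer per modality (audio, visual, text).
        self.intra = nn.ModuleDict({
            m: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for m in ("audio", "visual", "text")
        })
        # Cross-attention: query = one modality, key/value = the others.
        self.inter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, feats):  # feats: dict of (B, T, d_model) tensors
        # Stage 1: intra-modal information interaction.
        feats = {m: self.intra[m](x) for m, x in feats.items()}
        # Stage 2: inter-modal information interaction.
        fused = []
        for m, x in feats.items():
            others = torch.cat([v for k, v in feats.items() if k != m], dim=1)
            out, _ = self.inter(x, others, others)
            fused.append(x + out)  # residual connection
        return torch.stack(fused).mean(0)  # (B, T, d_model)


class AnchorFreeProposalHead(nn.Module):
    """Predicts, at every time step, distances to the event's start and end
    plus a confidence score; no predefined anchors are required."""

    def __init__(self, d_model=256):
        super().__init__()
        self.bounds = nn.Linear(d_model, 2)  # (dist_to_start, dist_to_end)
        self.score = nn.Linear(d_model, 1)   # event confidence

    def forward(self, x):  # x: (B, T, d_model)
        offsets = torch.relu(self.bounds(x))          # non-negative distances
        conf = torch.sigmoid(self.score(x)).squeeze(-1)
        t = torch.arange(x.size(1), device=x.device).float()
        starts = t - offsets[..., 0]  # proposal start per time step
        ends = t + offsets[..., 1]    # proposal end per time step
        return starts, ends, conf


# Usage: fuse three modalities, then emit one proposal per time step.
B, T, D = 2, 100, 256
feats = {m: torch.randn(B, T, D) for m in ("audio", "visual", "text")}
fused = TwoStageFusionEncoder(D)(feats)
starts, ends, conf = AnchorFreeProposalHead(D)(fused)
print(starts.shape, ends.shape, conf.shape)  # torch.Size([2, 100]) each
```

A per-time-step boundary regression of this kind is one common way to realize "anchor-free" temporal proposals; the dict-based encoder input also suggests how a missing text modality could be handled by simply omitting that key, though the paper's actual mechanism may differ.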