Computer science
Closed captioning
Event (particle physics)
Benchmark (surveying)
Exploit
Fuse (electrical)
Encoding
Modal verb
Artificial intelligence
Process (computing)
Machine learning
Natural language processing
Image (mathematics)
Electrical engineering
Geodesy
Physics
Engineering
Operating system
Gene
Quantum mechanics
Chemistry
Biochemistry
Polymer chemistry
Geography
Computer security
Authors
Zhi Yong Chang, Dexin Zhao, Huilin Chen, Jingdan Li, Pengfei Liu
Identifier
DOI:10.1016/j.neunet.2021.11.017
Abstract
Dense video captioning aims to automatically describe the several events that occur in a given video, which most state-of-the-art models accomplish by locating and then describing multiple events in an untrimmed video. Despite much progress in this area, most current approaches encode only visual features in the event localization phase and neglect the relationships between events, which may degrade the consistency of the descriptions within the same video. Thus, in the present study, we exploit visual-audio cues to generate event proposals and enhance event-level representations by capturing their temporal and semantic relationships. Furthermore, to compensate for the limitation that multi-modal information is not fully utilized during the description process, we develop an attention-gating mechanism that dynamically fuses and regulates the multi-modal information. In summary, we propose an event-centric multi-modal fusion approach for dense video captioning (EMVC) that captures the relationships between events and effectively fuses multi-modal information. We conduct comprehensive experiments to evaluate the performance of EMVC on the benchmark ActivityNet Captions and YouCook2 datasets, and the results show that our model achieves impressive performance compared with state-of-the-art methods.
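The abstract describes the attention-gating fusion only at a high level. As a rough illustration of the general idea, the PyTorch sketch below shows one common way a gate can dynamically weight visual and audio streams before captioning. All class names, dimensions, and the exact gating form are assumptions for illustration, not the authors' EMVC implementation.

```python
import torch
import torch.nn as nn

class AttentionGatedFusion(nn.Module):
    """Minimal sketch of gated visual-audio fusion (hypothetical design).

    A sigmoid gate, computed from both modalities, decides per dimension
    how much of each projected modality enters the fused representation.
    """

    def __init__(self, visual_dim: int, audio_dim: int, hidden_dim: int):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, hidden_dim)
        self.proj_a = nn.Linear(audio_dim, hidden_dim)
        # Gate conditioned on the concatenation of both modalities.
        self.gate = nn.Linear(visual_dim + audio_dim, hidden_dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([visual, audio], dim=-1)))
        # Convex per-dimension combination of the two modality streams.
        return g * torch.tanh(self.proj_v(visual)) + (1 - g) * torch.tanh(self.proj_a(audio))

# Usage: fuse per-event visual (e.g. 2048-d) and audio (e.g. 128-d) features.
fusion = AttentionGatedFusion(visual_dim=2048, audio_dim=128, hidden_dim=512)
fused = fusion(torch.randn(4, 2048), torch.randn(4, 128))  # -> shape (4, 512)
```

Because the gate is recomputed from each input pair, the mixing ratio adapts per event, which matches the abstract's claim that the mechanism "dynamically fuses and regulates" the multi-modal information.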