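Closed captioning
Computer science
Transformer
Encoder
Artificial intelligence
Modality
Event (particle physics)
Multimodal learning
Speech recognition
Computer vision
Image (mathematics)
Voltage
Engineering
Physics
Sociology
Electrical engineering
Operating system
Quantum mechanics
Social science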
Authors
Yiwei Wei, Shuai Yuan, Meng Chen, Longbiao Wang
Identifier
DOI:10.1109/icassp49357.2023.10095659
Abstract
Dense video captioning aims to localize multiple events in an untrimmed video and generate a caption for each event. Fusing different modalities (e.g., RGB, flow, audio) with a transformer structure is a promising way to improve captioning performance. However, it is challenging for the cross-modal encoder to learn multimodal interactions because the modalities differ inherently in their feature distributions. In this paper, we propose a novel transformer structure with contrastive learning to align different modalities. Specifically, to avoid the limitations of small batch sizes and false contrastive targets, we design an event-aligned momentum augmentation strategy that applies contrastive learning to dense video captioning. Experimental results show that our proposed method outperforms all existing multimodal fusion methods for dense video captioning.
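The abstract does not give the loss or the update rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a standard InfoNCE contrastive loss in which features of the same event from two modalities form the positive pair, a queue of extra negatives to compensate for a small batch, and a MoCo-style exponential moving average for the momentum encoder. The names `info_nce`, `momentum_update`, `queue`, and `temperature` are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries, keys, queue=(), temperature=0.07):
    """Mean InfoNCE loss over a batch of events.

    Assumption: queries[i] and keys[i] are features of the same event from
    two different modalities (the positive pair). All other keys, plus any
    entries in `queue`, act as negatives.
    """
    queries = [l2_normalize(q) for q in queries]
    candidates = [l2_normalize(k) for k in list(keys) + list(queue)]
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


def momentum_update(online_params, momentum_params, m=0.999):
    """EMA update (assumed MoCo-style): the momentum encoder slowly
    tracks the online encoder instead of receiving gradients."""
    return [m * k + (1.0 - m) * q
            for q, k in zip(online_params, momentum_params)]


# Toy features for two events in two modalities (illustrative values only).
rgb = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(rgb, audio)
shuffled_loss = info_nce(rgb, audio[::-1])
print(aligned_loss < shuffled_loss)
```

In this toy setup the loss is lower when each RGB feature is paired with the audio feature of the same event than when the pairing is shuffled, which is the behavior a cross-modal alignment objective is meant to reward. Adding entries to `queue` enlarges the set of negatives without enlarging the batch.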