Keywords
Computer science, Pattern, Artificial intelligence, Sentiment analysis, Multimodal, Contrast (vision), Modality (human-computer interaction), Transformer, Modal verb, Image fusion, Multimodal learning, Natural language processing, Machine learning, Image (mathematics), Voltage, Social science, Chemistry, Physics, Quantum mechanics, Sociology, World Wide Web, Polymer chemistry
Authors
Daoming Zong, Chaoyue Ding, Baoxiang Li, Jiakui Li, Kunlei Zheng, Qunyan Zhou
Identifier
DOI:10.1145/3581783.3611974
Abstract
Multimodal Sentiment Analysis (MSA) is a popular research topic aimed at utilizing multimodal signals to understand human emotions. The primary approach to this task is to develop complex fusion techniques. However, the heterogeneity of and misalignment between modalities pose significant challenges to fusion. Additionally, existing methods give little consideration to the efficiency of modal fusion. To tackle these issues, we propose AcFormer, which contains two core ingredients: i) contrastive learning within and across modalities to explicitly align the different modality streams before fusion; and ii) pivot attention for multimodal interaction/fusion. The former encourages positive image-audio-text triplets to have similar representations in contrast to negative ones. The latter introduces attention pivots that serve as cross-modal information bridges and limit cross-modal attention to a fixed number of fusion pivot tokens. We evaluate AcFormer on multiple MSA tasks, including multimodal emotion recognition, humor detection, and sarcasm detection. Empirical evidence shows that AcFormer achieves optimal performance at minimal computation cost compared with previous state-of-the-art methods. Our code is publicly available at https://github.com/dingchaoyue/AcFormer.
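The abstract names two ingredients: contrastive alignment of the modality streams before fusion, and pivot attention that routes all cross-modal interaction through a small set of pivot tokens. The sketch below illustrates the general pattern only, under stated assumptions: the class and function names (PivotAttentionFusion, triplet_contrastive_loss), the use of PyTorch's nn.MultiheadAttention, and all hyperparameters are illustrative and are not taken from the AcFormer paper or repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PivotAttentionFusion(nn.Module):
    """Illustrative sketch: a small set of learnable pivot tokens acts as a
    cross-modal bridge. Pivots first gather information from every modality,
    then each modality reads the fused information back from the pivots, so
    no modality attends directly to another."""

    def __init__(self, dim=256, num_heads=4, num_pivots=8):
        super().__init__()
        self.pivots = nn.Parameter(torch.randn(1, num_pivots, dim) * 0.02)
        self.gather = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.broadcast = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text, audio, image):
        # text/audio/image: (batch, seq_len_m, dim) token sequences per modality
        b = text.size(0)
        pivots = self.pivots.expand(b, -1, -1)

        # Step 1: pivots attend over all modality tokens (information gathering).
        all_tokens = torch.cat([text, audio, image], dim=1)
        pivots, _ = self.gather(pivots, all_tokens, all_tokens)

        # Step 2: each modality attends only to the pivots (information broadcast).
        fused = [self.broadcast(x, pivots, pivots)[0] for x in (text, audio, image)]
        return fused, pivots


def triplet_contrastive_loss(z_text, z_audio, z_image, temperature=0.07):
    """InfoNCE-style alignment of per-sample modality embeddings: matching
    (text, audio, image) triplets are pulled together, while mismatched
    pairs within the batch serve as negatives."""
    loss = 0.0
    pairs = [(z_text, z_audio), (z_text, z_image), (z_audio, z_image)]
    for a, b in pairs:
        a = F.normalize(a, dim=-1)
        b = F.normalize(b, dim=-1)
        logits = a @ b.t() / temperature            # (batch, batch) similarities
        labels = torch.arange(a.size(0), device=a.device)
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))
    return loss / len(pairs)
```

In this sketch, cross-modal attention cost scales with the number of pivot tokens rather than with the product of the modality sequence lengths, which is one plausible way to read the abstract's claim about limiting cross-modal attention to a fixed number of fusion pivot tokens.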