Zero-shot Video Emotion Recognition via Multimodal Protagonist-aware Transformer Network
Computer Science
Discriminative
Transformer
Artificial Intelligence
Emotion Recognition
Semantic Space
Human-Computer Interaction
Speech Recognition
Authors
Qi Fan,Xiaoshan Yang,Changsheng Xu
Identifier
DOI:10.1145/3474085.3475647
Abstract
Recognizing human emotions from videos has attracted significant attention in numerous computer vision and multimedia applications, such as human-computer interaction and health care. It aims to understand humans' emotional responses, where the candidate emotion categories are generally defined by specific psychological theories. However, as psychological theories develop, emotion categories become increasingly diverse and fine-grained, and samples of rare emotions become increasingly difficult to collect. In this paper, we investigate a new task of zero-shot video emotion recognition, which aims to recognize rare unseen emotions. Specifically, we propose a novel multimodal protagonist-aware transformer network composed of two branches: one is equipped with a novel dynamic emotional attention mechanism and a visual transformer to learn better visual representations; the other is an acoustic transformer for learning discriminative acoustic representations. We align the visual and acoustic representations with semantic embeddings of fine-grained emotion labels by jointly mapping them into a common space under a noise contrastive estimation objective. Extensive experimental results on three datasets demonstrate the effectiveness of the proposed method.
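The zero-shot alignment described in the abstract can be sketched as an InfoNCE-style contrastive objective: video representations (fused visual and acoustic features) and semantic embeddings of emotion-label names are projected into a common space, and each video is pulled toward its ground-truth label embedding while being pushed away from the others. The function names, dimensions, and label-embedding source below are illustrative assumptions for a minimal sketch, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def nce_alignment_loss(video_emb, label_emb, targets, temperature=0.1):
    """Noise-contrastive (InfoNCE-style) alignment loss.

    video_emb : (B, d) fused visual+acoustic video representations
    label_emb : (C, d) semantic embeddings of emotion-label names
                (e.g. from a pretrained word-embedding model; assumption)
    targets   : (B,) index of the ground-truth label for each video
    """
    v = l2_normalize(video_emb)
    l = l2_normalize(label_emb)
    logits = v @ l.T / temperature                 # (B, C) scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the correct label per video, averaged.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: 4 videos, 6 emotion classes, 16-dim common space.
rng = np.random.default_rng(0)
videos = rng.normal(size=(4, 16))
labels = rng.normal(size=(6, 16))
targets = np.array([0, 2, 5, 1])
loss = nce_alignment_loss(videos, labels, targets)
print(f"contrastive alignment loss: {loss:.3f}")
```

At test time, zero-shot recognition then reduces to a nearest-neighbor lookup: embed an unseen emotion's label name into the same space and assign each video to the label with the highest cosine similarity, which is why the contrastive alignment (rather than a fixed classifier head) enables recognizing categories with no training samples.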