Computer science
Encoder
Artificial intelligence
Transformer
Robustness (evolution)
Facial expression
Pattern
Speech recognition
Pattern recognition (psychology)
Computer vision
Engineering
Gene
Operating system
Biochemistry
Electrical engineering
Sociology
Social science
Voltage
Chemistry
Authors
Xiaoqin Zhang, Min Li, Sheng Lin, Hang Xu, Guobao Xiao
Identifier
DOI: 10.1109/tcsvt.2023.3312858
Abstract
Dynamic expression recognition in the wild is a challenging task due to various obstacles, including low-light conditions, non-frontal faces, and facial occlusion. Purely vision-based approaches may not suffice to accurately capture the complexity of human emotions. To address this issue, we propose a Transformer-based Multimodal Emotional Perception (T-MEP) framework capable of effectively extracting multimodal information and achieving significant augmentation. Specifically, we design three transformer-based encoders to extract modality-specific features from audio, image, and text sequences, respectively. Each encoder is carefully designed to maximize its adaptation to the corresponding modality. In addition, we design a transformer-based multimodal information fusion module to model cross-modal representations among these modalities. The combination of self-attention and cross-attention in this module enhances the robustness of the integrated output features for encoding emotion. By mapping the information from audio and textual features to the latent space of visual features, this module aligns the semantics of the three modalities for cross-modal information augmentation. Finally, we evaluate our method on three popular datasets (MAFW, DFEW, and AFEW) through extensive experiments, which demonstrate its state-of-the-art performance. This research offers a promising direction for future studies to improve emotion recognition accuracy by exploiting the power of multimodal features.
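To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of a module in which visual tokens first refine themselves with self-attention and then cross-attend to audio and text tokens, pulling both auxiliary modalities into the visual latent space before classification. It assumes the three modality-specific encoders already output token sequences projected to a shared dimension; the class names, dimensions, layer counts, and the 7-class head are illustrative assumptions, not the authors' published T-MEP configuration.

```python
# Hypothetical sketch of self-attention + cross-attention fusion for
# multimodal emotion recognition. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class CrossModalFusionBlock(nn.Module):
    """Self-attention over visual tokens, then cross-attention to one
    auxiliary modality (audio or text)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, visual: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Self-attention refines the visual representation.
        v = self.norm1(visual)
        visual = visual + self.self_attn(v, v, v, need_weights=False)[0]
        # Cross-attention: visual tokens act as queries over the other
        # modality, mapping its information into the visual latent space.
        v = self.norm2(visual)
        visual = visual + self.cross_attn(v, other, other, need_weights=False)[0]
        return visual + self.ffn(self.norm3(visual))


class MultimodalEmotionFusion(nn.Module):
    """Fuses audio and text token sequences into visual tokens and
    predicts an emotion class from the fused representation."""

    def __init__(self, dim: int = 512, num_classes: int = 7):  # 7 classes assumed (DFEW-style)
        super().__init__()
        self.audio_block = CrossModalFusionBlock(dim)
        self.text_block = CrossModalFusionBlock(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual, audio, text):
        # visual/audio/text: (batch, tokens, dim) outputs of the
        # modality-specific encoders, assumed already projected to `dim`.
        fused = self.audio_block(visual, audio)
        fused = self.text_block(fused, text)
        return self.classifier(fused.mean(dim=1))  # average-pool tokens


if __name__ == "__main__":
    model = MultimodalEmotionFusion()
    logits = model(torch.randn(2, 16, 512),   # visual tokens
                   torch.randn(2, 20, 512),   # audio tokens
                   torch.randn(2, 12, 512))   # text tokens
    print(logits.shape)  # torch.Size([2, 7])
```

The sketch only illustrates the fusion direction described in the abstract (audio and text mapped into the visual latent space); the paper's actual encoders, attention layout, and training objective may differ.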