TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

计算机科学深度学习人工智能变压器多模式学习情感计算情绪识别深信不疑网络特征学习机器学习工程类电气工程电压

作者

Zhengdao Zhao,Yuhua Wang,Guang Shen,Yuezhu Xu,Jiayuan Zhang

出处

期刊：IEEE/ACM transactions on audio, speech, and language processing [Institute of Electrical and Electronics Engineers]
日期：2023-01-01 卷期号：31: 3771-3782 被引量：2

标识

DOI：10.1109/taslp.2023.3316458

摘要

As deep learning technology research continues to progress, artificial intelligence technology is gradually empowering various fields. To achieve a more natural human-computer interaction experience, how to accurately recognize emotional state of speech interactions has become a new research hotspot. Sequence modeling methods based on deep learning techniques have promoted the development of emotion recognition, but the mainstream methods still suffer from insufficient multimodal information interaction, difficulty in learning emotion-related features, and low recognition accuracy. In this paper, we propose a transformer-based deep-scale fusion network (TDFNet) for multimodal emotion recognition, solving the aforementioned problems. The multimodal embedding (ME) module in TDFNet uses pretrained models to alleviate the data scarcity problem by providing a priori knowledge of multimodal information to the model with the help of a large amount of unlabeled data. In addition, a mutual transformer (MT) module is introduced to learn multimodal emotional commonality and speaker-related emotional features to improve contextual emotional semantic understanding. In addition, we design a novel emotion feature learning method named the deep-scale transformer (DST), which further improves emotion recognition by aligning multimodal features and learning multiscale emotion features through GRUs with shared weights. To comparatively evaluate the performance of TDFNet, experiments are conducted with the IEMOCAP corpus under three reasonable data splitting strategies. The experimental results show that TDFNet achieves 82.08% WA and 82.57% UA in RA data splitting, which leads to 1.78% WA and 1.17% UA improvements over the previous state-of-the-art method, respectively. Benefiting from the attentively aligned mutual correlations and fine-grained emotion-related features, TDFNet successfully achieves significant improvements in multimodal emotion recognition.

求助该文献

最长约 10秒，即可获得该文献文件

TDFNet: Transformer-Based Deep-Scale Fusion Network for Multimodal Emotion Recognition

今日热心研友