Consistency (knowledge bases)
Emotion recognition
Computer science
Cognitive psychology
Natural language processing
Psychology
Artificial intelligence
Authors
Bingzhi Chen, Qi Zhi Cao, Mi-Xiao Hou, Zheng Zhang, Yao Lu, David Zhang
Source
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
[Institute of Electrical and Electronics Engineers]
Date: 2021-01-01
Volume/Pages: 29, pp. 3592-3603
Citations: 14
Identifiers
DOI: 10.1109/taslp.2021.3129331
Abstract
Automated multimodal emotion recognition has become an emerging but challenging research topic in the fields of affective learning and sentiment analysis. The existing works mainly focus on developing multimodal fusion strategies to incorporate different emotion-related features. However, they fail to explore the inherent contextual consistency to reconcile the emotional information across modalities. In this paper, we propose a novel Time and Semantic Interaction Network (TSIN), which concurrently incorporates the advantages of temporal and semantic consistency into the multimodal emotion recognition task. Specifically, a well-designed Speech and Text Embedding (STE) module is devoted to formulating the initial embedding spaces by respectively building the modality-specific representations of speech and text. Instead of separately learning or directly fusing the acoustic and textual features, we propose a well-defined Time and Semantic Interaction (TSI) module to conduct the emotional parsing and sentiment refining by performing fine-grained temporal alignment and cross-modal semantic interaction. Benefiting from temporal and semantic consistency constraints, both the speech and text embeddings can be interactively optimized and fine-tuned in the learning process. In this way, the learned acoustic and textual features can jointly and efficiently predict the final emotional state. Extensive experiments on the IEMOCAP dataset demonstrate the superiority of our TSIN framework in comparison with state-of-the-art baselines.
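The abstract does not include an implementation, but the cross-modal interaction it describes can be illustrated with a minimal sketch. The PyTorch snippet below shows one plausible reading of the TSI idea: each modality attends to the other before the refined features are fused to predict the emotional state. The module name, embedding dimension, the use of multi-head attention, and the 4-class output (a common IEMOCAP evaluation setup) are all illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of bidirectional cross-modal interaction between speech
# and text sequences, loosely inspired by the TSI module described in the
# abstract. All names, dimensions, and the attention mechanism are
# assumptions for illustration; the actual TSIN may differ substantially.
import torch
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    """Lets each modality attend to the other, then fuses both for classification."""

    def __init__(self, dim: int = 256, num_heads: int = 4, num_emotions: int = 4):
        super().__init__()
        # Speech queries attend over text keys/values, and vice versa.
        self.speech_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_speech = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_emotions)

    def forward(self, speech: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # speech: (batch, T_speech, dim) frame-level acoustic features;
        # text:   (batch, T_text, dim) token-level text embeddings.
        s_refined, _ = self.speech_to_text(speech, text, text)
        t_refined, _ = self.text_to_speech(text, speech, speech)
        # Pool each refined sequence over time and jointly predict the emotion.
        fused = torch.cat([s_refined.mean(dim=1), t_refined.mean(dim=1)], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    model = CrossModalInteraction()
    speech = torch.randn(2, 120, 256)  # dummy acoustic features
    text = torch.randn(2, 30, 256)     # dummy text embeddings
    print(model(speech, text).shape)   # torch.Size([2, 4])
```

Bidirectional attention is one common way to realize the cross-modal semantic interaction the abstract describes; the actual TSIN additionally performs fine-grained temporal alignment between speech frames and text tokens, which this sketch omits.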