Keywords
Computer Science
Modality
Multimodal Learning
Artificial Intelligence
Emotion Recognition
Feature Learning
Benchmark
Facial Expression
Transformer
Speech Recognition
Machine Learning
Pattern Recognition
Authors
Shamane Siriwardhana,Tharindu Kaluarachchi,Mark Billinghurst,Suranga Nanayakkara
Source
Journal: IEEE Access (Institute of Electrical and Electronics Engineers)
Date: 2020-01-01
Volume 8, pp. 176274-176285
Citations: 73
Identifier
DOI:10.1109/access.2020.3026823
Abstract
Emotion recognition is a challenging research area given its complex nature: humans express emotional cues across multiple modalities such as language, facial expressions, and speech. Representation and fusion of features are the most crucial tasks in multimodal emotion recognition research. Self-Supervised Learning (SSL) has become a prominent and influential direction in representation learning, and researchers now have access to pre-trained SSL models for different data modalities. For the first time in the literature, we represent the three input modalities of text, audio (speech), and vision with features extracted from independently pre-trained SSL models. Given the high-dimensional nature of SSL features, we introduce a novel Transformer- and attention-based fusion mechanism that combines multimodal SSL features and achieves state-of-the-art results on the task of multimodal emotion recognition. We benchmark and evaluate our approach, showing that our model is robust and outperforms state-of-the-art models on four datasets.
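The abstract describes attention-based fusion of per-modality SSL features. The sketch below is not the authors' architecture; it is a minimal, hypothetical illustration of the general idea in NumPy: pre-extracted feature sequences from three modalities (shapes and extractor names are assumptions) are concatenated, one modality attends over the joint sequence with scaled dot-product attention, and the result is pooled into a single utterance embedding.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: (n_q, d); k, v: (n_k, d). Returns (n_q, d).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # numerically stable softmax over the key axis
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def fuse_modalities(text, audio, vision):
    # Concatenate the three SSL feature sequences along the time axis,
    # let the text tokens attend over the joint sequence (cross-modal
    # attention), then mean-pool to a fixed-size embedding.
    joint = np.concatenate([text, audio, vision], axis=0)
    attended = scaled_dot_product_attention(text, joint, joint)
    return attended.mean(axis=0)

# Hypothetical feature shapes; real SSL extractors (e.g. a text model,
# a speech model, a vision model) would produce these sequences.
rng = np.random.default_rng(0)
d = 16
text = rng.standard_normal((5, d))    # 5 token features
audio = rng.standard_normal((8, d))   # 8 speech-frame features
vision = rng.standard_normal((4, d))  # 4 video-frame features

emb = fuse_modalities(text, audio, vision)
print(emb.shape)  # (16,)
```

In a trained model the queries, keys, and values would be learned linear projections and the pooled embedding would feed an emotion classifier; this sketch only shows the fusion mechanics.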