Computer science
Modality (human–computer interaction)
Representation (politics)
Perspective (graphical)
Artificial intelligence
Speech recognition
Pattern
Multimodal learning
Feature learning
Convolution (computer science)
Channel (broadcasting)
Recurrent neural network
Natural language processing
Artificial neural network
Sociology
Politics
Law
Social science
Computer network
Political science
Authors
Sijie Mai, Songlong Xing, Haifeng Hu
Source
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
[Institute of Electrical and Electronics Engineers]
Date: 2021-01-01
Volume/Issue: 29: 1424-1437
Citations: 28
Identifiers
DOI: 10.1109/taslp.2021.3068598
Abstract
Human emotion is always expressed in a multimodal manner. Analyzing multimodal human sentiment remains challenging because of the difficulty of interpreting inter-modality dynamics. Mainstream multimodal learning architectures tend to design various fusion strategies to learn inter-modality interactions, but they barely consider the fact that the language modality is far more important than the acoustic and visual modalities. In contrast, we learn inter-modality dynamics from a different perspective via acoustic- and visual-LSTMs in which language features play the dominant role. Specifically, inside each LSTM variant, a well-designed gating mechanism is introduced to enhance the language representation via the corresponding auxiliary modality. Furthermore, in the unimodal representation learning stage, instead of using RNNs, we introduce a `channel-aware' temporal convolution network to extract high-level representations for each modality, exploring both temporal and channel-wise interdependencies. Extensive experiments demonstrate that our approach achieves highly competitive performance compared to state-of-the-art methods on three widely used benchmarks for multimodal sentiment analysis and emotion recognition.
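To make the two architectural ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) an LSTM over language features whose hidden state is gated by an auxiliary (acoustic or visual) stream, and (b) a "channel-aware" temporal convolution block combining a dilated causal convolution with channel-wise re-weighting. The class names, the exact gating formula, and the squeeze-and-excitation style channel gate are illustrative assumptions, not the authors' published design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AuxiliaryGatedLSTM(nn.Module):
    """Sketch of an acoustic-/visual-LSTM: the language stream drives an
    LSTM cell, and a sigmoid gate computed from the auxiliary modality
    re-scales the language hidden state at every timestep (assumed design)."""

    def __init__(self, lang_dim, aux_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(lang_dim, hidden_dim)
        self.gate = nn.Sequential(
            nn.Linear(aux_dim + hidden_dim, hidden_dim),
            nn.Sigmoid(),
        )

    def forward(self, lang_seq, aux_seq):
        # lang_seq: (batch, time, lang_dim); aux_seq: (batch, time, aux_dim)
        batch = lang_seq.size(0)
        h = lang_seq.new_zeros(batch, self.cell.hidden_size)
        c = lang_seq.new_zeros(batch, self.cell.hidden_size)
        outputs = []
        for t in range(lang_seq.size(1)):
            h, c = self.cell(lang_seq[:, t], (h, c))
            g = self.gate(torch.cat([aux_seq[:, t], h], dim=-1))
            h = g * h  # auxiliary modality enhances the language representation
            outputs.append(h)
        return torch.stack(outputs, dim=1)  # (batch, time, hidden_dim)


class ChannelAwareTCNBlock(nn.Module):
    """One possible reading of a 'channel-aware' temporal convolution block:
    a dilated causal Conv1d models temporal dependencies, then a
    squeeze-and-excitation style gate models channel interdependencies."""

    def __init__(self, channels, kernel_size=3, dilation=1, reduction=4):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left padding keeps it causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.channel_gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, time)
        y = self.conv(F.pad(x, (self.pad, 0)))
        w = self.channel_gate(y.mean(dim=-1))       # per-channel weights
        return torch.relu(y * w.unsqueeze(-1) + x)  # gated output + residual
```

In this sketch, each unimodal sequence would first pass through stacked ChannelAwareTCNBlocks, after which the acoustic and visual streams would each feed an AuxiliaryGatedLSTM alongside the language features before a downstream sentiment or emotion classifier; how the paper actually wires and fuses these components is specified in the full text, not here.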