Journal: IEEE Transactions on Industrial Informatics [Institute of Electrical and Electronics Engineers]  Date: 2024-04-25  Volume/Issue: 20 (7): 9814-9824  Citations: 1
Identifier
DOI:10.1109/tii.2024.3388670
Abstract
With the popularity of short videos, analyzing human emotions is crucial for understanding individual attitudes and guiding public opinion. Consequently, multimodal sentiment analysis (MSA) has garnered significant attention in the field of human–computer interaction. The main challenge of MSA is to design a high-quality multimodal fusion framework, because different modalities contribute unevenly to sentiment prediction. However, most existing methods assume equal importance among modalities, leaving the main modality inadequately expressed. In addition, auxiliary modalities often contain redundant information, which hinders the multimodal fusion process. We therefore propose the multichannel cross-modal fusion network (MCFNet), which promotes multimodal fusion through a framework of three channels: the first channel obtains the multimodal representation, the second channel eliminates redundant information from the auxiliary modalities, and the third channel enhances the significance of the main modality. We then design a multichannel information fusion gate that integrates the feature representations from these three channels for the downstream sentiment classification task. Extensive experiments on three benchmark datasets, CMU-multimodal opinion sentiment intensity (MOSI), CMU-multimodal opinion sentiment and emotion intensity (MOSEI), and Twitter2019, show that MCFNet achieves significant improvements over recent state-of-the-art methods.
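To make the gated three-channel fusion idea concrete, the following is a minimal PyTorch sketch of how outputs from three such channels could be weighted and combined before classification. It is an illustration under assumed shapes only: the class name MultichannelFusionGate, the per-channel softmax weighting, and the feature dimension are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class MultichannelFusionGate(nn.Module):
    """Hypothetical sketch of a gated fusion over three channel outputs.

    Each channel is assumed to produce a feature vector of size `dim`;
    the gate learns per-channel weights and mixes the channels before a
    sentiment classifier. Illustration only, not the authors' implementation.
    """

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # One weight per channel, computed from the concatenated features.
        self.gate = nn.Sequential(
            nn.Linear(3 * dim, 3),
            nn.Softmax(dim=-1),
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, h_fused, h_aux, h_main):
        # h_fused : channel 1 (multimodal representation)
        # h_aux   : channel 2 (auxiliary modalities, redundancy removed)
        # h_main  : channel 3 (enhanced main modality)
        stacked = torch.stack([h_fused, h_aux, h_main], dim=1)  # (B, 3, dim)
        weights = self.gate(stacked.flatten(1)).unsqueeze(-1)   # (B, 3, 1)
        mixed = (weights * stacked).sum(dim=1)                  # (B, dim)
        return self.classifier(mixed)


if __name__ == "__main__":
    # Toy usage with random stand-ins for the three channel outputs.
    gate = MultichannelFusionGate(dim=128, num_classes=3)
    logits = gate(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
    print(logits.shape)  # torch.Size([4, 3])
```

The softmax gate here is only one plausible way to let the model weight the main modality more heavily than the auxiliary channels; the paper's actual fusion gate may differ in form.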