Cross-modal incongruity aligning and collaborating for multi-modal sarcasm detection

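Sarcasm; Computer science; Fuse (electrical); Modality (human–computer interaction); Artificial intelligence; Mode; Modal verb; Autoencoder; Process (computing); Natural language processing; Image fusion; Macro; Machine learning; Image (mathematics); Deep learning; Linguistics; Engineering; Social science; Philosophy; Chemistry; Electrical engineering; Sociology; Polymer chemistry; Programming language; Operating system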

Authors

Jie Wang, Yan Yang, Yongquan Jiang, Minbo Ma, Zhuyang Xie, Tianrui Li

Source

Journal: Information Fusion [Elsevier BV]
Date: 2023-11-08 Volume/Issue: 103: 102132-102132 Citations: 17

Identifiers

DOI: 10.1016/j.inffus.2023.102132

Abstract

Sarcasm embodies a linguistic phenomenon that highlights a significant incongruity between the literal meanings of words and intended attitudes. With the proliferation of image–text content on social media, the task of multi-modal sarcasm detection (MSD) has gained considerable attention recently. Tremendous progress has been made in developing better MSD models, primarily relying on a straightforward extract-then-fuse paradigm. However, such a setting encounters two potential challenges. First, the utilization of separately pre-trained unimodal models for extracting visual and textual features frequently lacks the fundamental alignment capabilities required for effective multimodal data integration. Second, the detrimental modality gaps between vision and language make it challenging to comprehensively integrate multi-modal information solely via diverse cross-modal fusion techniques. Consequently, this poses a prominent challenge in further capturing cross-modal incongruity and improving the effectiveness of MSD. In this paper, we propose a Multi-modal Mutual Learning (MuMu) network to tackle these issues. Specifically, we initialize the MuMu network with image and text encoders from the large-scale Contrastive Language-Image Pretraining model to enhance the underlying image–text correspondence. Moreover, to improve the capability of capturing cross-modal inconsistency during the fusion process, we design an align-fuse-collaborate mechanism to align disparate modalities before fusion and enhance the collaborative modeling ability between the two modalities with mutual learning after fusion. The proposed MuMu achieves new state-of-the-art results on a public dataset, demonstrating a substantial improvement of approximately 3% to 9% in terms of accuracy, micro-F1, and macro-F1 scores.
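
The abstract names two loss-style ingredients, alignment before fusion and mutual learning after fusion, without giving their formulas. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes the alignment step resembles a CLIP-style symmetric contrastive (InfoNCE) loss over paired image and text embeddings, and that mutual learning is a symmetric KL divergence between the class predictions of two branches. All function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_alignment_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Assumed form of the 'align' step: row i of image_embs is paired with
    row i of text_embs; every other row in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and text j
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
           for i in range(n)]
    # Image-to-text direction: each image should pick out its own text.
    loss_i2t = -sum(math.log(softmax(sim[i])[i]) for i in range(n)) / n
    # Text-to-image direction: each text should pick out its own image.
    loss_t2i = -sum(
        math.log(softmax([sim[i][j] for i in range(n)])[j]) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mutual_learning_loss(logits_a, logits_b):
    """Symmetric KL between two branches' predictions (assumed form of the
    'collaborate' step): each branch is pushed toward the other's output."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Toy usage: matched pairs should give a lower alignment loss than swapped pairs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_alignment_loss(image_embs, [[1.0, 0.1], [0.1, 1.0]])
misaligned = contrastive_alignment_loss(image_embs, [[0.1, 1.0], [1.0, 0.1]])
agree = mutual_learning_loss([2.0, -1.0], [2.0, -1.0])
disagree = mutual_learning_loss([2.0, -1.0], [-1.0, 2.0])
```

In this toy setup the alignment loss is smaller when each image embedding sits close to its own text embedding, and the mutual-learning term is zero when the two branches agree and positive when they disagree. How the paper weights or combines such terms is not stated in the abstract.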
