DOI:10.1145/3652583.3658015
Abstract
With the continuous emergence of various types of social media, which people often use to express their emotions in daily life, the multi-modal sarcasm detection (MSD) task has attracted increasing attention. However, owing to the unique nature of sarcasm itself, two main challenges remain on the way to robust MSD: 1) existing mainstream methods often fail to account for weak cross-modal correlation, thereby ignoring important sarcasm cues within each individual modality; 2) they model cross-modal interactions over unaligned multi-modal data inefficiently. This paper therefore proposes a multi-task jointly trained aggregation network (MTAN), which employs networks adapted to each modality according to its processing task. Specifically, we design a multi-task CLIP framework comprising a uni-modal text task, a uni-modal image task, and a cross-modal interaction task, which exploits sentiment cues from multiple tasks for multi-modal sarcasm detection. In addition, we design a global-local cross-modal interaction learning method that uses the discourse-level representation of each modality as a global multi-modal context that interacts with local uni-modal features. This not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also allows the global multi-modal context and the local uni-modal features to reinforce each other and improve progressively through multi-layer stacking. Extensive experiments and in-depth analysis show that our model achieves state-of-the-art performance on multi-modal sarcasm detection.
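To make the global-local interaction concrete, below is a minimal, hypothetical PyTorch sketch of the idea as described in the abstract: a small set of global context vectors (one discourse-level vector per modality) attends over all local uni-modal features, and the local features of each modality then attend back to that global context. Because every attention step pairs the short global sequence with the local tokens, the cost grows linearly with the number of local tokens rather than quadratically as in local-local cross-attention. All class, parameter, and variable names here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- not the authors' MTAN code. Shapes, names,
# and the global-context initialization are assumptions.
import torch
import torch.nn as nn

class GlobalLocalLayer(nn.Module):
    """One round of global-local cross-modal interaction."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.global_from_local = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.local_from_global = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, g, locals_):
        # g: (B, M, D) global context, one vector per modality.
        # locals_: list of per-modality local features, each (B, N_m, D).
        all_local = torch.cat(locals_, dim=1)          # (B, sum N_m, D)
        g_up, _ = self.global_from_local(g, all_local, all_local)
        g = g + g_up                                   # global context absorbs local cues
        refined = []
        for x in locals_:
            x_up, _ = self.local_from_global(x, g, g)  # locals read the global context
            refined.append(x + x_up)
        return g, refined

# Stacking several layers mirrors the "multi-layer" mutual refinement
# of global context and local features described in the abstract.
layers = nn.ModuleList(GlobalLocalLayer(dim=512) for _ in range(3))
text  = torch.randn(2, 40, 512)   # local text features (e.g., CLIP text tokens)
image = torch.randn(2, 49, 512)   # local image features (e.g., CLIP patches)
g = torch.stack([text.mean(dim=1), image.mean(dim=1)], dim=1)  # crude global init
for layer in layers:
    g, (text, image) = layer(g, [text, image])
# g / text / image would then feed the cross-modal and uni-modal task heads,
# whose losses are combined for multi-task joint training.
```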