Multimodal Reaction: Information Modulation for Cross-Modal Representation Learning


Authors

Ying Zeng, Sijie Mai, Wenjun Yan, Haifeng Hu

Source

Journal: IEEE Transactions on Multimedia [Institute of Electrical and Electronics Engineers]
Date: 2023-07-07; Volume: 26, pp. 2178-2191; Cited by: 6

Identifiers

DOI: 10.1109/tmm.2023.3293335

Abstract

In multimodal machine learning, proper handling of cross-modal information is essential for obtaining an ideal joint embedding. Despite the progress made by recent fusion strategies, we hold that before the fusion stage, the unimodal representations inevitably contain noise that may hinder the correct learning of cross-modal dynamics and adversely affect multimodal fusion. It is therefore worthwhile to investigate how the information is being utilized and how to make full use of it. Rethinking the process of leveraging multiple modalities for the joint embedding, multimodal learning can be regarded as a chemical reaction process in which two steps may benefit learning: 1) purification, to filter out impurities, and 2) a catalyst, to facilitate learning. In this paper, we propose a Multimodal Information Modulation (MIM) learning framework that modulates the contribution and utilization of cross-modal information by identifying and handling the ‘impurity’ and the ‘catalyst’ in multimodal learning. Specifically, a Unimodal Purification Network (UPN) is proposed to identify and explicitly filter out the impurity within each modality before fusion, which reduces the possibility of learning incorrect cross-modal dynamics. In addition, based on the intuition that useful information can guide model updating, such information can act as a catalyst that facilitates learning; this is achieved by the Knowledge Guidance Scheme (KGS), which considers both intra- and inter-modal scenarios. Unlike most works, which emphasize the role of useful information in the fusion and inference stages, KGS exploits its potential to assist the representation learning of weaker components. Moreover, KGS explicitly accounts for the modality dominance problem and sample variations during optimization. In short, MIM modulates useless and useful information to minimize and emphasize their contributions, respectively. Experimental results verify the effectiveness of the proposed method.
The code is available at https://github.com/zengy268/MIM.
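The abstract does not give the equations behind UPN or KGS, so the following is only a minimal, dependency-free sketch of the two general ideas it names, under our own assumptions: (1) gating out low-scoring feature dimensions of one modality before fusion, and (2) letting the best-performing modality on a given sample guide the weaker ones, with a per-sample weight. The function names `purify` and `guidance_loss`, the sigmoid gate, the 0.5 threshold, and the squared-error weighting are all hypothetical illustrations; they are not taken from the paper or from the MIM repository, and the sketch covers only the inter-modal case.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def purify(features: List[float], gate_logits: List[float],
           threshold: float = 0.5) -> List[float]:
    """Hypothetical purification step (not the paper's UPN).

    A sigmoid gate scores each feature dimension. Dimensions whose gate is
    below `threshold` are treated as 'impurity' and set to zero; the remaining
    dimensions are scaled by their gate value.
    """
    if len(features) != len(gate_logits):
        raise ValueError("features and gate_logits must have the same length")
    purified = []
    for value, logit in zip(features, gate_logits):
        gate = sigmoid(logit)
        purified.append(value * gate if gate >= threshold else 0.0)
    return purified


def guidance_loss(preds: Dict[str, float], target: float) -> Tuple[str, float]:
    """Hypothetical inter-modal guidance for a single sample (not the paper's KGS).

    The modality with the lowest squared error on this sample acts as the
    guide. Each weaker modality is pulled toward the guide's prediction,
    weighted by how much worse its own error is on this particular sample,
    so the weighting varies from sample to sample.
    """
    errors = {name: (p - target) ** 2 for name, p in preds.items()}
    guide = min(errors, key=errors.get)  # ties go to the first modality listed
    loss = 0.0
    for name, p in preds.items():
        if name == guide:
            continue
        weight = errors[name] - errors[guide]  # non-negative by construction
        loss += weight * (p - preds[guide]) ** 2
    return guide, loss


if __name__ == "__main__":
    # Toy example: the second dimension is gated out, the third is halved.
    print(purify([0.8, -1.2, 0.3], gate_logits=[2.0, -3.0, 0.0]))
    # Toy example: "text" is closest to the target and guides the others.
    print(guidance_loss({"text": 0.9, "audio": 0.2, "vision": 0.5}, target=1.0))
```

In a real model the gate logits would come from a learned layer and the guidance term would be added to the task loss, with the guide's prediction treated as a fixed target; both details are assumptions here, and the repository linked above is the authoritative source for how MIM actually does this.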
