Computer science
Artificial intelligence
Generality
Matching (statistics)
Focus (optics)
Modal verb
Image (mathematics)
Natural language processing
Space (punctuation)
Feature (linguistics)
Pattern recognition (psychology)
Linguistics
Mathematics
Psychology
Philosophy
Physics
Chemistry
Polymer chemistry
Optics
Operating system
Psychotherapist
Statistics
Authors
Xinfeng Dong,Longfei Han,Dingwen Zhang,Li Liu,Junwei Han,Huaxiang Zhang
Identifier
DOI:10.1145/3581783.3612103
Abstract
Image-text matching is a hot topic in multi-modal analysis. Existing image-text matching algorithms focus on bridging the heterogeneity gap and mapping features into a common space under a strong alignment assumption. However, these methods perform unsatisfactorily in the weak alignment scenario, where the text conveys more abstract information and the number of entities in the text is often smaller than the number of objects in the image. To the best of our knowledge, this is the first work to address image-text matching from the perspective of the information difference under weak alignment. To both narrow the cross-modal heterogeneity gap and balance the information discrepancy, we propose an imagination network that enriches the text modality on top of a pre-trained framework, which is helpful for image-text matching. The imagination network uses reinforcement learning to enhance the semantic information of the text modality, and an action refinement strategy is designed to constrain the freedom and divergence of the imagination. Experimental results based on two pre-trained models, CLIP and BLIP, on the two most frequently used datasets, MSCOCO and Flickr30K, show the superiority and generality of the proposed framework.
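The abstract only sketches the approach at a high level. As a rough illustration of the central mechanism it describes (a policy that "imagines" extra semantics for the text embedding, is rewarded by the resulting gain in image-text similarity, and is kept in check by a refinement constraint), the following is a minimal PyTorch sketch. It is not the authors' implementation: the embedding size, the Gaussian policy, the norm-clipping refinement, and the similarity-gain reward are all assumptions made for illustration; in the paper the features would come from a frozen pre-trained encoder such as CLIP or BLIP.

```python
# Minimal PyTorch sketch (not the authors' code) of the general idea:
# a policy proposes an "imagined" residual that enriches the text embedding,
# and is rewarded by the resulting gain in image-text similarity.
# Embedding size, policy form, and the norm-clipping constraint are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 512  # assumed dimensionality of frozen pre-trained (CLIP/BLIP-like) features


class ImaginationPolicy(nn.Module):
    """Gaussian policy over residuals added to the text embedding."""

    def __init__(self, dim: int = DIM, max_norm: float = 0.5):
        super().__init__()
        self.mu = nn.Linear(dim, dim)
        self.log_std = nn.Parameter(torch.full((dim,), -1.0))
        self.max_norm = max_norm  # stand-in for the "action refinement" bound

    def forward(self, text_emb: torch.Tensor):
        dist = torch.distributions.Normal(self.mu(text_emb), self.log_std.exp())
        raw = dist.sample()                      # sampled "imagination" action
        log_prob = dist.log_prob(raw).sum(-1)
        # Refinement: clip the residual's norm so the imagination stays
        # close to the original text and cannot diverge arbitrarily.
        norm = raw.norm(dim=-1, keepdim=True).clamp(min=1e-6)
        action = raw * torch.clamp(self.max_norm / norm, max=1.0)
        return action, log_prob


def reinforce_step(policy, optimizer, img_emb, txt_emb):
    """One REINFORCE update; reward = similarity gain of the enriched text."""
    img_emb = F.normalize(img_emb, dim=-1)
    base_sim = (F.normalize(txt_emb, dim=-1) * img_emb).sum(-1)
    action, log_prob = policy(txt_emb)
    enriched = F.normalize(txt_emb + action, dim=-1)
    reward = (enriched * img_emb).sum(-1) - base_sim   # improvement over plain text
    loss = -(log_prob * reward.detach()).mean()        # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    policy = ImaginationPolicy()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    # Random stand-ins for frozen features of paired images and weakly aligned texts.
    img_emb, txt_emb = torch.randn(32, DIM), torch.randn(32, DIM)
    for _ in range(200):
        gain = reinforce_step(policy, opt, img_emb, txt_emb)
    print(f"mean similarity gain after training: {gain:.3f}")
```

The norm clipping here is only a stand-in for the paper's action refinement strategy: its role is to keep the enriched embedding close to the original text so the policy cannot "imagine" arbitrary content that no longer matches the caption.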