Computer science
Ranking (information retrieval)
Modal verb
Artificial intelligence
Image (mathematics)
Task (project management)
Natural language processing
Construct (python library)
Image retrieval
Information retrieval
Pattern
Pattern recognition (psychology)
Chemistry
Management
Polymer chemistry
Economics
Programming language
Social science
Sociology
Authors
Xiaodan Wang, Lei Li, Zhixu Li, Xuwu Wang, Xiangru Zhu, Chengyu Wang, Jun Huang, Yanghua Xiao
Identifier
DOI: 10.1145/3539597.3570481
Abstract
Image-text retrieval is a challenging cross-modal task that attracts much attention. While traditional methods cannot break down the barriers between different modalities, Vision-Language Pre-trained (VLP) models greatly improve image-text retrieval performance by learning from massive image-text pairs. Nonetheless, VLP-based methods are still prone to produce retrieval results that are not aligned with entities across modalities. Recent efforts try to fix this problem at the pre-training stage, which is not only expensive but also impractical because the full pre-training datasets are unavailable. In this paper, we propose a novel, lightweight, and practical approach that aligns cross-modal entities for image-text retrieval on top of VLP models only at the fine-tuning and re-ranking stages. We employ external knowledge and tools to construct extra fine-grained image-text pairs, and then emphasize cross-modal entity alignment through contrastive learning and entity-level mask modeling during fine-tuning. In addition, two re-ranking strategies are proposed, including one specially designed for zero-shot scenarios. Extensive experiments with several VLP models on multiple Chinese and English datasets show that our approach achieves state-of-the-art results in nearly all settings.
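The abstract describes a fine-tuning objective that combines image-text contrastive learning with entity-level mask modeling over extra fine-grained pairs. The following is a minimal sketch of how such a combined loss could look for a generic CLIP-style VLP model; it is not the authors' code, and all names here (contrastive_loss, entity_mask_loss, lambda_ent, entity_token_mask, and the toy shapes) are hypothetical placeholders assumed for illustration.

```python
# Sketch: image-text contrastive loss + entity-level masked-modeling loss.
# Hypothetical names and shapes; not the paper's implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def entity_mask_loss(token_logits, token_labels, entity_token_mask):
    """Cross-entropy computed only on tokens belonging to masked entities."""
    vocab = token_logits.size(-1)
    per_token = F.cross_entropy(token_logits.view(-1, vocab),
                                token_labels.view(-1),
                                reduction="none")
    mask = entity_token_mask.view(-1).float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


def fine_tune_step(img_emb, txt_emb, token_logits, token_labels,
                   entity_token_mask, lambda_ent=0.5):
    """Weighted sum of the two objectives; lambda_ent is an assumed weight."""
    return (contrastive_loss(img_emb, txt_emb) +
            lambda_ent * entity_mask_loss(token_logits, token_labels,
                                          entity_token_mask))


if __name__ == "__main__":
    # Toy shapes: 4 pairs, 16 text tokens, 256-d embeddings, 1000-word vocab.
    B, T, D, V = 4, 16, 256, 1000
    loss = fine_tune_step(
        img_emb=torch.randn(B, D),
        txt_emb=torch.randn(B, D),
        token_logits=torch.randn(B, T, V),
        token_labels=torch.randint(0, V, (B, T)),
        entity_token_mask=(torch.rand(B, T) < 0.15),   # tokens covered by masked entities
    )
    print(float(loss))
```

In this reading, the contrastive term keeps global image-text alignment while the entity-masked term forces the text encoder to recover masked entity tokens from the paired image, which is one plausible way to emphasize entity-level cross-modal alignment at fine-tuning time as the abstract states.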