Computer science
Image retrieval
Discriminative model
Image (mathematics)
Consistency (knowledge bases)
Benchmark (surveying)
Automatic image annotation
Artificial intelligence
Visual word
Optics (focus)
Generative grammar
Modality (human–computer interaction)
Information retrieval
Feature (linguistics)
Pattern recognition (psychology)
Geography
Philosophy
Physics
Optics
Linguistics
Geodesy
Authors
Feifei Zhang,Mingliang Xu,Changsheng Xu
Abstract
Composing Text and Image to Image Retrieval (CTI-IR) is an emerging task in computer vision that retrieves images relevant to a query image together with text describing desired modifications to that image. Conventional cross-modal retrieval approaches typically take data of one modality as the query to retrieve relevant data of another modality. In contrast, in this article we propose an end-to-end trainable network for simultaneous image generation and CTI-IR. The proposed model is based on a Generative Adversarial Network (GAN) and enjoys several merits. First, it learns a generative and discriminative feature for the query (a query image with a text description) by jointly training a generative model and a retrieval model. Second, it automatically manipulates the visual features of the reference image according to the text description through adversarial learning between the synthesized image and the target image. Third, global-local collaborative discriminators and attention-based generators are exploited, allowing our approach to focus on both global and local differences between the query image and the target image. As a result, the semantic consistency and fine-grained details of the generated images are better enhanced. The generated image can also be used to interpret and empower the retrieval model. Quantitative and qualitative evaluations on three benchmark datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
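The abstract describes a joint objective: a retrieval loss that pulls the composed (image + text) query embedding toward the target image, plus adversarial losses from global and local discriminators on the synthesized image. The paper does not spell out the exact formulation here, so the following is a minimal numpy sketch under common assumptions: a hinge-based triplet loss for retrieval, a non-saturating generator loss for the adversarial terms, and a hypothetical weighting factor `lam`. All function names and the weighting are illustrative, not the authors' actual implementation.

```python
import numpy as np

def triplet_loss(query, positive, negative, margin=0.2):
    # Hinge-based retrieval loss: pull the composed query embedding
    # toward the target-image embedding (positive), push it away
    # from a non-matching embedding (negative).
    d_pos = np.linalg.norm(query - positive)
    d_neg = np.linalg.norm(query - negative)
    return max(0.0, margin + d_pos - d_neg)

def adversarial_g_loss(d_score_fake):
    # Non-saturating generator loss: encourage the discriminator to
    # score the synthesized image as real (score -> 1 gives loss -> 0).
    return -np.log(d_score_fake + 1e-8)

def joint_objective(query, positive, negative, d_global, d_local, lam=1.0):
    # Total loss = retrieval triplet loss + adversarial terms from the
    # global and local discriminators; `lam` is a hypothetical weight.
    l_ret = triplet_loss(query, positive, negative)
    l_adv = adversarial_g_loss(d_global) + adversarial_g_loss(d_local)
    return l_ret + lam * l_adv

# Toy usage with 2-D embeddings:
q = np.array([1.0, 0.0])   # composed query (image + text) embedding
p = np.array([0.9, 0.1])   # target-image embedding
n = np.array([0.0, 1.0])   # non-matching image embedding
loss = joint_objective(q, p, n, d_global=0.8, d_local=0.7)
```

The key design point reflected above is that the retrieval and generation branches share one objective, so gradients from the adversarial terms also shape the composed query representation used for retrieval.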