Computer science
Embedding
Discriminative model
Image (mathematics)
Natural language processing
Artificial intelligence
Matching (statistics)
Relation (database)
Salience
Sentence
Focus (optics)
Pattern
Modal
Task (project management)
Modality (human-computer interaction)
Sample (material)
Pattern recognition (psychology)
Mathematics
Data mining
Social science
Statistics
Physics
Chemistry
Management
Chromatography
Sociology
Polymer chemistry
Optics
Economics
Authors
Zheren Fu, Zhendong Mao, Yan Song, Yongdong Zhang
Identifier
DOI: 10.1109/cvpr52729.2023.01455
Abstract
Image-text matching, a bridge connecting images and language, is an important task that generally learns a holistic cross-modal embedding to achieve high-quality semantic alignment between the two modalities. However, previous studies focus only on capturing fragment-level relations within a sample from a particular modality, e.g., salient regions in an image or words in a sentence, and usually pay less attention to instance-level interactions among samples and modalities, e.g., multiple images and texts. In this paper, we argue that sample relations could help learn subtle differences for hard negative instances and transfer shared knowledge for infrequent samples, which should be promising for obtaining better holistic embeddings. Therefore, we propose a novel hierarchical relation modeling framework (HREM), which explicitly captures both fragment- and instance-level relations to learn discriminative and robust cross-modal embeddings. Extensive experiments on Flickr30K and MS-COCO show that our proposed method outperforms the state-of-the-art by 4%-10% in terms of rSum. Our code is available at https://github.com/CrossmodalGroup/HREM.
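To make the "holistic cross-modal embedding" and "hard negative instances" terminology above concrete, below is a minimal sketch of the standard image-text matching objective: a hinge-based triplet ranking loss with hardest in-batch negatives, computed on one embedding vector per image and per sentence. This is only an illustration of the general setup the abstract refers to, not the paper's HREM method; the function name, embedding dimension, batch size, and margin value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def triplet_loss_hardest_negative(img_emb, txt_emb, margin=0.2):
    """Hinge-based triplet ranking loss over a batch of paired
    image/text embeddings, using the hardest in-batch negatives.

    img_emb, txt_emb: (B, D) L2-normalized holistic embeddings,
    where row i of each tensor forms a matched image-text pair.
    """
    scores = img_emb @ txt_emb.t()      # (B, B) cosine similarities
    diag = scores.diag().view(-1, 1)    # similarity of matched pairs

    # cost of ranking a negative text above the matched text (per image)
    cost_txt = (margin + scores - diag).clamp(min=0)
    # cost of ranking a negative image above the matched image (per text)
    cost_img = (margin + scores - diag.t()).clamp(min=0)

    # mask out the positive pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    # keep only the hardest negative per image (row) and per text (column)
    return cost_txt.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()

# toy usage with random holistic embeddings standing in for encoder outputs
img = F.normalize(torch.randn(32, 1024), dim=-1)
txt = F.normalize(torch.randn(32, 1024), dim=-1)
loss = triplet_loss_hardest_negative(img, txt)
```

In this setup, the "instance-level relations" the abstract highlights would correspond to interactions across the rows of `img_emb` and `txt_emb` (multiple images and texts in a batch), rather than only the fragment-level relations inside each individual image or sentence.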