Computer science
Artificial intelligence
Visual word
Feature (linguistics)
Redundancy (engineering)
Closed captioning
Embedding
Visualization
Pooling
Image retrieval
Search engine indexing
Information retrieval
Filter (signal processing)
Pattern recognition (psychology)
Computer vision
Image (mathematics)
Linguistics
Operating system
Philosophy
Authors
Dongqing Wu,Huihui Li,Cang Gu,Hang Liu,Cuili Xu,Xinyuan Hou,Lei Guo
Identifier
DOI:10.1109/tmm.2023.3316077
Abstract
Current image-text retrieval methods mainly represent images with region features, which provide object-level information and make the retrieval results more accurate and interpretable. However, region features have several shortcomings: a lack of rich contextual information, loss of object details, and the risk of detection redundancy. Ideal visual features for image-text retrieval should have three characteristics: object-level, semantically-rich, and language-aligned. To this end, we propose a novel visual representation framework that captures more comprehensive and powerful visual features. Specifically, since the weaknesses of region features are precisely the strengths of grid features, we first build a two-step interaction model that explores the complex relationship between them from the spatial and semantic perspectives and integrates their complementary information, making the fused visual features both object-level and semantically-rich. Then, we design a text-integrated visual embedding module that uses textual information as guidance to filter redundant regions, further endowing the visual features with language-aligned capabilities. Finally, we develop a multi-attention pooling module to aggregate these enhanced visual features in a more fine-grained manner. Extensive experiments demonstrate that our proposed model achieves state-of-the-art performance on the benchmark datasets Flickr30K and MS-COCO.
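The abstract's two later stages, text-guided filtering of redundant regions and attention-based pooling of the surviving features, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the functions `text_guided_filter` and `attention_pool`, the cosine-similarity ranking, the `keep_ratio` parameter, and all shapes are illustrative assumptions standing in for the paper's learned modules.

```python
import numpy as np

def text_guided_filter(region_feats, text_feat, keep_ratio=0.5):
    """Keep the regions most similar (cosine) to the text embedding.

    region_feats: (n, d) array of per-region features.
    text_feat: (d,) sentence embedding used as guidance.
    Returns the top-k rows, k = ceil-free fraction of n (at least 1).
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sims = l2norm(region_feats) @ (text_feat / (np.linalg.norm(text_feat) + 1e-8))
    k = max(1, int(len(region_feats) * keep_ratio))
    keep = np.argsort(-sims)[:k]          # indices of most text-relevant regions
    return region_feats[keep]

def attention_pool(features, query, temperature=1.0):
    """Softmax-attention aggregation of feature vectors against a query.

    features: (n, d); query: (d,). Returns one pooled (d,) vector.
    """
    scores = features @ query / temperature
    scores -= scores.max()                # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ features

# Toy usage with random stand-ins for real embeddings.
rng = np.random.default_rng(0)
regions = rng.normal(size=(8, 16))        # 8 detected regions, 16-d features
text = rng.normal(size=16)                # sentence embedding (same space assumed)
filtered = text_guided_filter(regions, text, keep_ratio=0.5)
pooled = attention_pool(filtered, text)
print(filtered.shape, pooled.shape)
```

In the paper, the filtering criterion and the pooling weights are learned jointly with the retrieval objective; the fixed cosine ranking and single softmax query here only convey the data flow.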