Concept tags: Computer science; Embedding; Image retrieval; Benchmark (surveying); Feature (linguistics); Feature vector; Space (punctuation); Artificial intelligence; Language model; Image (mathematics); Information retrieval; Pattern recognition (psychology); Philosophy; Operating system; Linguistics; Geography; Geodesy
Authors
Hongguang Zhu, Chunjie Zhang, Yunchao Wei, Shujuan Huang, Yao Zhao
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology
[Institute of Electrical and Electronics Engineers]
Date: 2023-03-06
Volume/Issue: 33 (10): 6131-6143
Citations: 7
Identifier
DOI: 10.1109/tcsvt.2023.3253548
Abstract
Due to the large gap between the vision and language modalities, effective and efficient image-text retrieval remains an open problem. Recent work unilaterally pursues retrieval accuracy, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often leads to unacceptable growth in retrieval time when deployed on large-scale databases, while the latter relies heavily on extra corpora to learn better alignment in the feature space and obscures the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under attention along the spatial dimension. Building on this flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL), which expands the representation space of samples and optimizes the alignment of the embedding space. Extensive experiments demonstrate the effectiveness of our method on two benchmark datasets. With identical visual and textual backbones, our single model outperforms the ensemble models of comparable approaches, and our ensemble model widens the advantage further. Meanwhile, compared with a vision-language pre-training embedding-based method trained on 83× as many image-text pairs, our approach not only achieves higher accuracy but is also 3× faster at retrieval. Codes and pre-trained models are available at https://github.com/KevinLight831/ESA .
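The abstract only outlines the two ingredients: ESA (attention-weighted, element-wise aggregation of local features over the spatial dimension) and SEL (a triplet-style loss for aligning the two embedding spaces). The following is a minimal, hypothetical PyTorch sketch for orientation only: the class name ExternalSpaceAttention, the hidden size, and the helper hardest_negative_triplet_loss are illustrative assumptions, not the paper's implementation (see the linked repository for that), and the loss shown is the standard hinge-based triplet loss with in-batch hardest negatives that SEL builds on, not SEL itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExternalSpaceAttention(nn.Module):
    """Illustrative ESA-style pooling: score each local feature, then fuse all
    features element-wise with softmax attention over the spatial dimension.
    (Hypothetical sketch, not the paper's implementation.)"""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Small scoring MLP applied to each region/token feature (assumed design).
        self.score = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_regions_or_tokens, dim)
        attn = torch.softmax(self.score(feats), dim=1)   # (B, N, 1) spatial weights
        pooled = (attn * feats).sum(dim=1)               # weighted element-wise fusion
        return F.normalize(pooled, dim=-1)               # unit-norm global embedding


def hardest_negative_triplet_loss(img_emb: torch.Tensor,
                                  txt_emb: torch.Tensor,
                                  margin: float = 0.2) -> torch.Tensor:
    """Standard bidirectional hinge triplet loss with in-batch hardest
    negatives -- the common baseline SEL extends (SEL is not reproduced here)."""
    sims = img_emb @ txt_emb.t()                              # (B, B) cosine similarities
    pos = sims.diag()                                         # matched pairs on the diagonal
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    neg_i2t = sims.masked_fill(mask, -1.0).max(dim=1).values  # hardest caption per image
    neg_t2i = sims.masked_fill(mask, -1.0).max(dim=0).values  # hardest image per caption
    loss = F.relu(margin + neg_i2t - pos) + F.relu(margin + neg_t2i - pos)
    return loss.mean()


# Usage sketch: pool local features from each modality, then align the embeddings.
if __name__ == "__main__":
    esa_v = ExternalSpaceAttention(dim=1024)
    esa_t = ExternalSpaceAttention(dim=1024)
    img_regions = torch.randn(8, 36, 1024)   # e.g. 36 region features per image
    txt_tokens = torch.randn(8, 20, 1024)    # e.g. 20 word features per caption
    loss = hardest_negative_triplet_loss(esa_v(img_regions), esa_t(txt_tokens))
    print(loss.item())
```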