ESA: External Space Attention Aggregation for Image-Text Retrieval

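Computer science, Embedding, Image retrieval, Benchmark (surveying), Feature (linguistics), Feature vector, Space (punctuation), Artificial intelligence, Language model, Image (mathematics), Information retrieval, Pattern recognition (psychology), Philosophy, Operating system, Linguistics, Geography, Geodesy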
Authors
Hongguang Zhu, Chunjie Zhang, Yunchao Wei, Shujuan Huang, Yao Zhao
Source
Journal: IEEE Transactions on Circuits and Systems for Video Technology [Institute of Electrical and Electronics Engineers]
Volume (issue): 33 (10): 6131-6143 Citations: 7
Identifier
DOI:10.1109/tcsvt.2023.3253548
Abstract

Due to the large gap between the vision and language modalities, effective and efficient image-text retrieval remains an unsolved problem. Recent work has unilaterally pursued retrieval accuracy, either through entangled image-text interaction or through brute-force large-scale vision-language pre-training. However, the former often leads to an unacceptable explosion in retrieval time when deployed on large-scale databases. The latter relies heavily on an extra corpus to learn better alignment in the feature space, while obscuring the contribution of the network architecture. In this work, we investigate a trade-off that balances effectiveness and efficiency. To this end, on the premise of efficient retrieval, we propose the plug-and-play External Space attention Aggregation (ESA) module, which enables element-wise fusion of modal features under spatial dimensional attention. Based on this flexible spatial awareness, we further propose the Self-Expanding triplet Loss (SEL) to expand the representation space of samples and optimize the alignment of the embedding space. Extensive experiments on two benchmark datasets demonstrate the effectiveness of our method. With identical visual and textual backbones, our single model outperforms the ensemble models of similar methods, and our ensemble model further extends this advantage. Meanwhile, compared with a vision-language pre-training embedding-based method that used $83\times$ more image-text pairs than ours, our approach not only achieves better performance but is also $3\times$ faster at retrieval. Code and pre-trained models are available at https://github.com/KevinLight831/ESA .
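
The efficiency argument in the abstract rests on embedding-based retrieval: images and texts are encoded independently, so ranking reduces to comparing precomputed vectors. The sketch below illustrates only that general idea, together with a standard hinge triplet loss using the hardest negative. It is not the authors' ESA module or their SEL loss; the function names, the margin value, and the toy two-dimensional embeddings are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine_scores(query, gallery):
    """Cosine similarity between one query embedding and every gallery embedding."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery]

def retrieve(query, gallery, k=1):
    """Indices of the k gallery items most similar to the query, best first."""
    scores = cosine_scores(query, gallery)
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)[:k]

def hinge_triplet_loss(anchor, positive, negatives, margin=0.2):
    """Generic hinge triplet loss with the hardest negative (NOT the paper's SEL)."""
    pos = cosine_scores(anchor, [positive])[0]
    hardest_neg = max(cosine_scores(anchor, negatives))
    return max(0.0, margin + hardest_neg - pos)

# Toy data (hypothetical): three image embeddings and one text embedding.
image_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_embedding = [0.9, 0.1]

top = retrieve(text_embedding, image_embeddings, k=2)
loss = hinge_triplet_loss(text_embedding, image_embeddings[0], image_embeddings[1:])
```

Because the gallery embeddings do not depend on the query, they can be computed once and reused for every search, which is why this family of methods scales better than approaches that run a joint image-text interaction for each candidate pair.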