Security token
Computer science
Modal verb
Embedding
Visualization
Matching (statistics)
Artificial intelligence
Representation (politics)
Transformer
Construct (python library)
Similarity (geometry)
Natural language processing
Block (permutation group theory)
Pattern recognition (psychology)
Information retrieval
Image (mathematics)
Geometry
Programming language
Law
Polymer chemistry
Political science
Politics
Voltage
Quantum mechanics
Mathematics
Computer security
Physics
Statistics
Chemistry
Authors
Chen-Wei Xie, Jianmin Wu, Yun Zheng, Pan Pan, Xian-Sheng Hua
Identifier
DOI:10.1145/3503161.3548107
Abstract
Cross-modal retrieval has achieved significant progress in recent years with the help of token-embedding interaction methods. Most existing methods first extract an embedding for each token of the input image and text, then feed the token-level embeddings into a multi-modal transformer to learn a joint representation, which is used to predict the matching score between the input image and text. However, these methods do not explicitly supervise the alignment between visual and textual tokens. In this paper, we propose a novel Token Embeddings AlignMent (TEAM) block: it first explicitly aligns visual tokens with textual tokens, then produces token-level matching scores that measure the fine-grained similarity between the input image and text. TEAM achieves new state-of-the-art performance on commonly used cross-modal retrieval benchmarks. Moreover, TEAM is interpretable, and we provide visualization experiments to show how it works. Finally, we construct a new billion-scale Chinese vision-language pre-training dataset, the largest Chinese vision-language pre-training dataset to date. After pre-training on this dataset, our framework also achieves state-of-the-art performance on Chinese cross-modal retrieval benchmarks.
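The abstract only outlines the idea of aligning visual and textual tokens and aggregating token-level matching scores into a fine-grained image-text similarity. The sketch below illustrates that general idea under stated assumptions: cosine similarity between every visual/textual token pair, a max-over-visual-tokens alignment, and mean aggregation. The function name, tensor shapes, and aggregation choice are illustrative assumptions, not the paper's exact TEAM block.

```python
# Hypothetical sketch of token-level alignment and matching (not the authors'
# exact TEAM block): cosine similarity between all token pairs, soft alignment
# via max over visual tokens, mean aggregation into one image-text score.
import torch
import torch.nn.functional as F


def token_alignment_score(visual_tokens: torch.Tensor,
                          textual_tokens: torch.Tensor) -> torch.Tensor:
    """visual_tokens: (B, Nv, D), textual_tokens: (B, Nt, D) -> (B,) scores."""
    v = F.normalize(visual_tokens, dim=-1)   # unit-norm so dot product = cosine
    t = F.normalize(textual_tokens, dim=-1)
    sim = torch.bmm(v, t.transpose(1, 2))    # (B, Nv, Nt) token-pair similarities
    # Align each textual token with its best-matching visual token,
    # yielding one token-level matching score per textual token.
    token_scores, _ = sim.max(dim=1)         # (B, Nt)
    return token_scores.mean(dim=-1)         # aggregate into an image-text score


if __name__ == "__main__":
    vis = torch.randn(2, 49, 256)   # e.g. 7x7 image patch embeddings
    txt = torch.randn(2, 12, 256)   # e.g. 12 word-piece embeddings
    print(token_alignment_score(vis, txt))   # one fine-grained score per pair
```

In a retrieval setting, such a score would typically be computed for every image-text pair in a batch and trained with a contrastive or ranking loss; the paper's actual supervision and aggregation scheme may differ.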