Computer science
Artificial intelligence
Modality
Variation (astronomy)
Similarity (geometry)
Class (philosophy)
Generalization
Machine learning
Deep learning
Pattern recognition (psychology)
Image (mathematics)
Mathematics
Mathematical analysis
Chemistry
Physics
Astrophysics
Polymer chemistry
Authors
Shenshen Li,Xing Xu,Yang Yang,Fumin Shen,Yijun Mo,Y. Li,Heng Tao Shen
Identifier
DOI:10.1145/3581783.3612244
Abstract
Text-based person retrieval aims to search for a pedestrian image among multiple candidates using a textual description. The task is challenging because large intra-class variation leads to uncertain cross-modal alignments. Most existing approaches rely on various attention mechanisms and auxiliary information, yet still struggle with this alignment uncertainty, producing coarse retrieval results. To this end, we propose a novel framework termed Deep Cross-modal Evidential Learning (DCEL), which deploys evidential deep learning to account for cross-modal alignment uncertainty. Our DCEL model comprises three components: (1) Bidirectional Evidential Learning, which models alignment uncertainty to measure and mitigate the influence of large intra-class variation; (2) Multi-level Semantic Alignment, which leverages a proposed Semantic Filtration module and the image-text similarity distribution to facilitate cross-modal alignments; (3) Cross-modal Relation Learning, which reasons about latent correspondences between multi-level image and text tokens. Finally, we integrate the advantages of the three components to achieve reliable cross-modal alignments. Our DCEL method consistently outperforms more than ten state-of-the-art methods in supervised, weakly supervised, and domain generalization settings on three benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid.
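To make the evidential idea concrete, the sketch below shows one standard way to quantify alignment uncertainty with evidential deep learning (in the style of Sensoy et al., 2018), applied to text-to-image retrieval: in-batch image-text similarities are treated as logits over candidate "classes", mapped to non-negative Dirichlet evidence, and trained with the evidential cross-entropy (Bayes risk) loss, with the Dirichlet strength yielding a per-query uncertainty mass. This is a minimal illustration of the general technique, not the authors' released DCEL code; the softplus evidence function, the temperature `tau`, and the in-batch class construction are all assumptions.

```python
import torch
import torch.nn.functional as F

def evidential_retrieval_loss(img_emb: torch.Tensor,
                              txt_emb: torch.Tensor,
                              tau: float = 0.07):
    """Hedged sketch of evidential deep learning for retrieval.

    Each in-batch image is treated as one candidate class for a query
    text. Similarities become Dirichlet evidence; the loss is the
    standard evidential cross-entropy of Sensoy et al. (2018).
    `tau` is an assumed temperature, not a value from the DCEL paper.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = txt @ img.t() / tau                  # text-to-image similarities (B x B)
    evidence = F.softplus(logits)                 # non-negative evidence per candidate
    alpha = evidence + 1.0                        # Dirichlet concentration parameters
    S = alpha.sum(dim=-1, keepdim=True)           # Dirichlet strength per query
    K = logits.size(-1)                           # number of in-batch candidates
    uncertainty = (K / S).squeeze(-1)             # vacuity: total uncertainty mass in (0, 1]
    targets = torch.eye(K, device=logits.device)  # matched pairs lie on the diagonal
    # Bayes-risk form of cross-entropy under the Dirichlet posterior:
    # E[-log p_k] = digamma(S) - digamma(alpha_k)
    loss = (targets * (torch.digamma(S) - torch.digamma(alpha))).sum(-1).mean()
    return loss, uncertainty
```

Calling the same function on transposed embeddings (`evidential_retrieval_loss(txt_emb, img_emb)`) gives the image-to-text direction, so both directions can be combined in the spirit of the bidirectional evidential learning the abstract describes; queries with high `uncertainty` are the ones whose alignments the model deems unreliable.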