Computer science
Representation (politics)
Feature learning
Key (lock)
Encoder
Artificial intelligence
Feature (linguistics)
Benchmark (surveying)
Noise (video)
Subspace topology
Dependency (UML)
Closed captioning
Natural language processing
Machine learning
Image (mathematics)
Linguistics
Philosophy
Computer security
Geodesy
Politics
Political science
Jurisprudence
Geography
Operating system
Authors
Jiancheng Pan,Qing Ma,Cong Bai
Identifier
DOI:10.1145/3581783.3612374
Abstract
This paper presents a Prior Instruction Representation (PIR) framework for remote sensing image-text retrieval, which targets remote sensing vision-language understanding tasks and addresses the semantic noise problem. Its highlight is a paradigm that draws on prior knowledge to instruct the adaptive learning of vision and text representations. Concretely, two progressive attention encoder (PAE) structures, Spatial-PAE and Temporal-PAE, are proposed to model long-range dependencies and enhance key feature representation. For vision, Vision Instruction Representation (VIR), built on Spatial-PAE, exploits prior knowledge from remote sensing scene recognition by constructing a belief matrix that selects key features and reduces the impact of semantic noise. For text, Language Cycle Attention (LCA), built on Temporal-PAE, uses the previous time step to cyclically activate the current time step and strengthen text representation. A cluster-wise affiliation loss is proposed to constrain inter-class relations and shrink the semantic confusion zones in the common subspace. Comprehensive experiments demonstrate that prior knowledge instruction enhances vision and text representations and outperforms state-of-the-art methods on two benchmark datasets, RSICD and RSITMD. Code is available at https://github.com/Zjut-MultimediaPlus/PIR-pytorch.
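The belief-matrix idea in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' implementation (see the linked PIR-pytorch repository for that); the module name PriorGuidedSelector, the feature dimensions, and the use of softmax-normalized scene logits are assumptions made only to show how a scene-recognition prior could weight patch features and down-weight semantically noisy ones.

# Hypothetical sketch (not the PIR code): prior-guided selection of key
# visual features via a belief-style weighting, loosely following the
# Vision Instruction Representation idea described in the abstract.
import torch
import torch.nn as nn


class PriorGuidedSelector(nn.Module):
    """Weights patch features by a belief score derived from prior scene logits."""

    def __init__(self, feat_dim: int, num_scenes: int):
        super().__init__()
        # Projects prior scene probabilities into the feature space so they
        # can be compared against patch features.
        self.prior_proj = nn.Linear(num_scenes, feat_dim)

    def forward(self, patch_feats: torch.Tensor, scene_logits: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) patch-level visual features
        # scene_logits: (B, num_scenes) prior knowledge from a scene classifier
        prior = self.prior_proj(scene_logits.softmax(dim=-1))        # (B, D)
        # Belief score: similarity between each patch and the prior vector.
        belief = torch.einsum("bnd,bd->bn", patch_feats, prior)      # (B, N)
        weights = belief.softmax(dim=-1).unsqueeze(-1)               # (B, N, 1)
        # Down-weight patches that disagree with the scene prior (semantic noise).
        return (weights * patch_feats).sum(dim=1)                    # (B, D)


if __name__ == "__main__":
    selector = PriorGuidedSelector(feat_dim=512, num_scenes=30)
    feats = torch.randn(2, 49, 512)
    logits = torch.randn(2, 30)
    print(selector(feats, logits).shape)  # torch.Size([2, 512])

In this toy version a softmax over patches stands in for the belief matrix; the actual PIR framework combines this prior with Spatial-PAE attention and a cluster-wise affiliation loss, as described in the paper.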