Keywords: computer science; artificial intelligence; natural language processing; information retrieval; pattern recognition; data mining; semantics; image; matching; pairwise comparison; representation; relation; discriminative
Authors
Tao Yao, Yiru Li, Ying Li, Yingying Zhu, Gang Wang, Jun Yue
Abstract
Image-text matching plays an important role in cross-modal information processing. Since there are non-negligible semantic differences between heterogeneous pairwise data, a crucial challenge is how to learn a unified representation. Existing methods mainly rely on the alignment between regional image features and the corresponding entity words. However, regional image features tend to capture foreground entity information while ignoring the attributes of those entities and the relations between them, and how to effectively integrate entity-attribute alignment with relationship alignment has not been fully studied. We therefore propose a Cross-Modal Semantically Augmented Network for Image-Text Matching (CMSAN), which combines the relationships between entities in the image with the semantics of relational words in the text. CMSAN (1) introduces an adaptive word-type prediction model that classifies words into four types, i.e., entity words, attribute words, relation words, and unnecessary words, so that different image features can be aligned at multiple levels, and (2) designs a relationship alignment module and an entity-attribute alignment module that exploit the available semantic information, giving the model more discriminative power and further improving matching accuracy.
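To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' released implementation: a word-type head that predicts the four word types, plus a SCAN-style cross-attention alignment that scores entity/attribute words against region features and relation words against relation features. The module names, feature dimensions, softmax temperature, and the way type probabilities weight the two alignment branches are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WordTypePredictor(nn.Module):
    """Soft distribution over the four word types from the abstract:
    entity, attribute, relation, unnecessary. (Hypothetical head.)"""

    def __init__(self, dim=512, num_types=4):
        super().__init__()
        self.classifier = nn.Linear(dim, num_types)

    def forward(self, word_feats):            # (batch, n_words, dim)
        return F.softmax(self.classifier(word_feats), dim=-1)


def cross_attention_align(query, context, temperature=9.0):
    """SCAN-style alignment: each query word attends over visual features,
    then is scored against its attended summary by cosine similarity.
    `temperature` is an assumed hyperparameter."""
    q = F.normalize(query, dim=-1)            # (b, n_words, d)
    c = F.normalize(context, dim=-1)          # (b, n_visual, d)
    attn = torch.softmax(temperature * q @ c.transpose(1, 2), dim=-1)
    attended = attn @ context                 # visual summary per word
    return F.cosine_similarity(query, attended, dim=-1)  # (b, n_words)


def match_score(word_feats, region_feats, rel_feats, type_probs):
    """Combine entity/attribute alignment (against region features) with
    relation alignment (against relation features), weighted per word by
    its predicted type; unnecessary-word mass contributes ~0 weight."""
    ent_attr_sim = cross_attention_align(word_feats, region_feats)
    rel_sim = cross_attention_align(word_feats, rel_feats)
    w_ent = type_probs[..., 0] + type_probs[..., 1]   # entity + attribute
    w_rel = type_probs[..., 2]                        # relation
    score = w_ent * ent_attr_sim + w_rel * rel_sim    # (b, n_words)
    return score.mean(dim=-1)                         # (b,)


# Usage with random stand-in features:
words = torch.randn(2, 12, 512)    # text word features
regions = torch.randn(2, 36, 512)  # image region features
rels = torch.randn(2, 20, 512)     # region-relationship features
probs = WordTypePredictor()(words)
print(match_score(words, regions, rels, probs).shape)  # torch.Size([2])
```

Weighting each word's alignment branch by its predicted type is one plausible way to realize the multi-level alignment the abstract describes; the paper itself may combine the modules differently.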