Keywords
Computer science, Artificial intelligence, Embedding, Sentence, Natural language processing, Modality, Tree (set theory), Feature (linguistics), Tree structure, Semantics (computer science), Context (archaeology), Visual word, Information retrieval, Image retrieval, Pattern recognition (psychology), Image (mathematics), Data structure, Linguistics, Mathematics, Paleontology, Philosophy, Mathematical analysis, Chemistry, Polymer chemistry, Programming language, Biology
Authors
Xuri Ge,Fuhai Chen,Joemon M. Jose,Zhilong Ji,Zhongqin Wu,Xiao Liu
Identifier
DOI:10.1145/3474085.3475634
Abstract
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments, like regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, the retrieval performance remains unsatisfactory due to a lack of consistent representation in both semantics and structural spaces. In this work, we propose to address the above issue from two aspects: (i) constructing intrinsic structure (along with relations) among the fragments of respective modalities, e.g., "dog → play → ball" in semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities.
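The fragment-level alignment the abstract refers to can be illustrated with a small cross-modal attention sketch: each word attends over the image regions, and the pair score aggregates how well each word is explained by its attended regions. This is a minimal sketch of the general attention-based alignment baseline the abstract critiques, not the authors' proposed model; the feature dimension, temperature, and mean pooling are assumptions chosen for illustration.

import torch
import torch.nn.functional as F

def cross_modal_attention_score(regions, words):
    """Score one image-sentence pair by attending words over regions.

    regions: (n_regions, d) image fragment features (e.g. detector outputs)
    words:   (n_words, d) sentence fragment features (e.g. encoder states)
    Returns a scalar similarity; higher means a better match.
    """
    # Cosine-normalise both fragment sets so dot products are cosines.
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)

    # Fragment-level affinity matrix: (n_words, n_regions).
    sim = words @ regions.t()

    # Soft attention: each word gathers its most relevant regions.
    # The temperature 0.1 is an assumed hyperparameter, not from the paper.
    attn = F.softmax(sim / 0.1, dim=-1)
    attended = attn @ regions  # (n_words, d)

    # Per-word relevance to its attended region mix, pooled into one score.
    word_scores = F.cosine_similarity(words, attended, dim=-1)
    return word_scores.mean()

# Toy usage with random features (36 regions, 12 words, d = 256).
v = torch.randn(36, 256)
t = torch.randn(12, 256)
print(cross_modal_attention_score(v, t))

The paper's contribution, per the abstract, is to go beyond this flat fragment matching by imposing an intrinsic relational structure on each modality (e.g., "dog → play → ball") and aligning the two structures explicitly, rather than relying on attention weights alone.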