Computer science
Referent
Representation (politics)
Expression (computer science)
Granularity
Artificial intelligence
Benchmark (surveying)
Comprehension
Natural language
Natural language processing
Pattern recognition (psychology)
Programming language
Linguistics
Politics
Philosophy
Geodesy
Law
Geography
Political science
Authors
Liuwu Li,Yuqi Bu,Yi Cai
Identifier
DOI:10.1145/3474085.3475629
Abstract
In this paper, we propose a one-stage approach to improve referring expression comprehension (REC), which aims to ground the referent described by a natural language expression. We observe that humans understand referring expressions in a fine-to-coarse, bottom-up way, and bidirectionally exchange vision-language information between image and text. Inspired by this, we define the language granularity and the vision granularity. However, existing methods do not follow this way of human understanding of referring expressions. Motivated by our observation, and to address the limitations of existing methods, we propose a bottom-up and bidirectional alignment (BBA) framework. Our method constructs cross-modal alignment starting from fine-grained representations and proceeding to coarse-grained representations, and bidirectionally exchanges vision-language information between image and text. Building on the structure of BBA, we further propose a progressive visual attribute decomposition approach that decomposes visual proposals into several independent spaces to enhance the bottom-up alignment framework. Experiments on five benchmark datasets (RefCOCO, RefCOCO+, ReferItGame, RefCOCOg, and Flickr30K) show that our approach obtains improvements of +2.16%, +4.47%, +2.85%, +3.44%, and +2.91% over one-stage state-of-the-art approaches, which validates its effectiveness.
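The bottom-up, bidirectional alignment idea in the abstract can be illustrated with a minimal sketch: at each granularity level, text features attend to visual proposals while visual proposals simultaneously attend to the text, and the process repeats from fine-grained (words) up to coarse-grained (sentence) text representations. This is a simplified illustration under assumed feature shapes, not the paper's actual BBA implementation; all names (`bidirectional_align`, the dimensions) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_align(text, vision):
    """One bidirectional alignment step: text attends to vision and vice versa."""
    scores = text @ vision.T                       # (T, V) similarity matrix
    text_enriched = text + softmax(scores, axis=1) @ vision     # text gathers visual info
    vision_enriched = vision + softmax(scores.T, axis=1) @ text # vision gathers textual info
    return text_enriched, vision_enriched

rng = np.random.default_rng(0)
vision = rng.standard_normal((5, 8))  # 5 visual proposals, feature dim 8

# Fine-to-coarse text granularities: word-level, phrase-level, sentence-level
granularities = [rng.standard_normal((n, 8)) for n in (12, 4, 1)]

for text in granularities:            # bottom-up pass over granularities
    text, vision = bidirectional_align(text, vision)
```

The visual representation is refined once per granularity level, so by the final (sentence-level) step it has accumulated alignment signals from every level below it, mirroring the fine-to-coarse construction the abstract describes.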