Authors
Yiyi Zhou, Rongrong Ji, Gen Luo, Xiaoshuai Sun, Jinsong Su, Xinghao Ding, Chia-Wen Lin, Qi Tian
Abstract
Referring expression comprehension (REC) is an emerging research topic in computer vision, which refers to the detection of a target region in an image given a text description. Most existing REC methods follow a multistage pipeline, which is computationally expensive and greatly limits the applications of REC. In this article, we propose a one-stage model toward real-time REC, termed real-time global inference network (RealGIN). RealGIN addresses the issues of expression diversity and complexity of REC with two innovative designs: adaptive feature selection (AFS) and Global Attentive ReAsoNing (GARAN). Expression diversity concerns varying expression content, which includes information such as colors, attributes, locations, and fine-grained categories. To address this issue, AFS adaptively fuses features of different semantic levels to handle the changes in expression content. In contrast, expression complexity concerns the complex relational conditions in expressions that are used to identify the referent. To this end, GARAN uses the textual feature as a pivot to collect expression-aware visual information from all regions and then diffuses this information back to each region, which provides sufficient context for modeling the relational conditions in expressions. On five benchmark datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, ReferIT, and Flickr30k, the proposed RealGIN outperforms most existing methods and achieves very competitive performance against the most advanced one, i.e., MAttNet. More importantly, under the same hardware, RealGIN can boost the processing speed by 10-20 times over the existing methods.
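
The collect-then-diffuse idea attributed to GARAN above can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract gives no equations, so the dot-product scoring, the softmax weighting, the sigmoid gate, and the residual update are all assumptions chosen for illustration, and the function names (`collect`, `diffuse`, `collect_and_diffuse`) are hypothetical. Region and text features are plain lists of floats of equal dimension.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def collect(text, regions):
    """Collect step (assumed form): the text vector scores every region,
    and the regions are pooled into one expression-aware context vector."""
    weights = softmax([dot(text, r) for r in regions])
    dim = len(regions[0])
    context = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return weights, context


def diffuse(context, regions):
    """Diffuse step (assumed form): each region receives the shared context,
    scaled by a sigmoid gate on its similarity to that context."""
    updated = []
    for r in regions:
        gate = 1.0 / (1.0 + math.exp(-dot(r, context)))
        updated.append([x + gate * c for x, c in zip(r, context)])
    return updated


def collect_and_diffuse(text, regions):
    """Run both steps; returns the attention weights and the updated regions."""
    weights, context = collect(text, regions)
    return weights, diffuse(context, regions)


# Toy input: one 2-d text feature and three 2-d region features.
text = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
weights, updated = collect_and_diffuse(text, regions)
```

In this toy input the first region aligns best with the text vector, so it receives the largest attention weight. Every region is then updated with the same pooled context, which is how each location can gain information about the others in a single pass.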