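Automatic summarization
Computer science
Context (archaeology)
Matching (statistics)
Mode
Benchmark (surveying)
Artificial intelligence
Information retrieval
Natural language processing
Field (mathematics)
Paleontology
Geography
Pure mathematics
Sociology
Statistics
Biology
Social science
Mathematics
Geodesy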
Authors
Leigang Qu,Meng Liu,Da Cao,Liqiang Nie,Qi Tian
Source
Venue: ACM Multimedia
Date: 2020-10-12
Citations: 77
Identifier
DOI:10.1145/3394171.3413961
Abstract
Image-text matching is a vital yet challenging task in the field of multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite the significance and value of this task, most prior work is still confronted with a multi-view description challenge, i.e., how to align an image to multiple textual descriptions with semantic diversity. To this end, we present a novel context-aware multi-view summarization network to summarize context-enhanced visual region information from multiple views. To be more specific, we design an adaptive gating self-attention module to extract representations of visual regions and words. By controlling the internal information flow, we are able to adaptively capture context information. Afterwards, we introduce a summarization module with a diversity regularization to aggregate region-level features into image-level ones from different perspectives. Finally, we devise a multi-view matching scheme to match multi-view image features with the corresponding text features. To justify our work, we have conducted extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, and the results demonstrate the superiority of our model over several state-of-the-art baselines.
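The summarization and matching steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so everything below is an assumption made for illustration. The function names (`summarize`, `diversity_penalty`, `multi_view_similarity`), the attention-pooling form, the Frobenius-norm penalty on the attention matrix, and the max-over-views matching rule are all hypothetical, and the adaptive gating self-attention module is omitted entirely.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0


def summarize(regions, view_queries):
    # Assumption: each view has its own query vector that attends over
    # the region features, giving one image-level vector per view.
    attn = [softmax([dot(q, r) for r in regions]) for q in view_queries]
    dim = len(regions[0])
    views = [
        [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
        for weights in attn
    ]
    return views, attn


def diversity_penalty(attn):
    # Assumption: squared Frobenius norm of (A A^T - I), which is zero
    # when the views attend to disjoint regions and grows as they overlap.
    k = len(attn)
    total = 0.0
    for i in range(k):
        for j in range(k):
            target = 1.0 if i == j else 0.0
            total += (dot(attn[i], attn[j]) - target) ** 2
    return total


def multi_view_similarity(views, text_vec):
    # Assumption: a caption is scored against its best-matching view.
    return max(cosine(v, text_vec) for v in views)


# Toy data: three 2-D region features and two hypothetical view queries.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_queries = [[4.0, 0.0], [0.0, 4.0]]
views, attn = summarize(regions, view_queries)
score = multi_view_similarity(views, [1.0, 0.0])
```

In this toy setup the first view leans toward regions aligned with the first axis, so a caption vector along that axis matches it better than the second view. In a trained model the penalty would be added to the matching loss with some weight, which is again an assumption rather than something the abstract states.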