Authors
Meihuizi Jia,Xin Shen,Lei Shen,Jinhui Pang,Lejian Liao,Yang Song,Meng Chen,Xiaodong He
Identifier
DOI:10.1145/3503161.3548427
Abstract
Multimodal named entity recognition (MNER) is a vision-language task where the system is required to detect entity spans and their corresponding entity types given a sentence-image pair. Existing methods capture text-image relations with various attention mechanisms that obtain only implicit alignments between entity types and image regions. To locate regions more accurately and better model cross-/within-modal relations, we propose a machine reading comprehension (MRC) based framework for MNER, namely MRC-MNER. By utilizing queries in MRC, our framework can provide prior information about entity types and image regions. Specifically, we design two stages, Query-Guided Visual Grounding and Multi-Level Modal Interaction, to align fine-grained type-region information and to model text-image and inner-text interactions, respectively. For the former, we train a visual grounding model via transfer learning to extract region candidates that can be further integrated into the second stage to enhance token representations. For the latter, we design text-image and inner-text interaction modules along with three sub-tasks for MRC-MNER. To verify the effectiveness of our model, we conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MRC-MNER outperforms the current state-of-the-art models on Twitter2017 and yields competitive results on Twitter2015.
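To make the MRC formulation concrete, the sketch below shows how named entity recognition can be recast as query-based span extraction: one natural-language query per entity type is paired with the sentence, and a BERT-style encoder with start/end heads extracts answer spans. This is a minimal illustration of the general MRC-for-NER idea only; the query templates, the choice of bert-base-uncased, and the span head are assumptions, and the paper's query-guided visual grounding and multi-level modal interaction stages are not replicated here.

```python
# Minimal sketch of MRC-style NER: one query per entity type, span extraction
# over [CLS] query [SEP] sentence [SEP]. Query wording, encoder choice, and
# span head are illustrative assumptions, not the exact MRC-MNER design
# (which additionally grounds image regions; omitted here).

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Hypothetical queries, one per entity type; in the MRC formulation each
# query injects prior information about the type being extracted.
TYPE_QUERIES = {
    "PER": "Find person names, including first and last names.",
    "LOC": "Find locations such as countries, cities, and landmarks.",
    "ORG": "Find organizations such as companies and institutions.",
}

class MRCSpanExtractor(nn.Module):
    """Predicts per-token start/end logits for a (query, sentence) pair."""
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.start_head = nn.Linear(hidden, 1)  # start-of-span logit per token
        self.end_head = nn.Linear(hidden, 1)    # end-of-span logit per token

    def forward(self, input_ids, attention_mask, token_type_ids):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        h = out.last_hidden_state                        # (batch, seq, hidden)
        return self.start_head(h).squeeze(-1), self.end_head(h).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MRCSpanExtractor()

sentence = "Kevin Durant enters Madison Square Garden."
# One query-sentence pair per entity type, encoded in a single batch.
batch = tokenizer([TYPE_QUERIES[t] for t in TYPE_QUERIES],
                  [sentence] * len(TYPE_QUERIES),
                  padding=True, return_tensors="pt")
start_logits, end_logits = model(**batch)
# token_type_ids == 1 marks sentence tokens, so decoding can restrict
# predicted spans to the context rather than the query.
print(start_logits.shape, end_logits.shape)
```

One forward pass is made per entity type, which is what lets type-specific priors in the query guide extraction; in MRC-MNER the same query additionally guides which image regions are grounded.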