Computer science
Inference
Information extraction
Domain adaptation
Adaptation
Context
Information retrieval
Quality
Named entity recognition
Artificial intelligence
Domain
Transfer learning
Natural language processing
Data mining
Machine learning
Task
Classifier
Authors
Minh-Tien Nguyen,Nguyễn Hồng Sơn,Le Thai Linh
Identifier
DOI: 10.1016/j.eswa.2022.119274
Abstract
Information extraction (IE) is a vital step of digitization that reduces paperwork in offices. However, adapting common IE systems to actual business cases faces two issues. First, the number of training samples is small (i.e. 100–200 examples). Second, span extraction models based on a question-answering formulation require a long time for training and inference. To overcome these issues, we introduce a new query-based model for extracting information from business documents. To address the data limitation, the model employs transfer learning, which adapts the knowledge of pre-trained language models (e.g. BERT) to specific domains. To do that, we design a new CNN layer that adapts the model to specific domains. For speed, unlike the encoding of standard span extraction methods (BERT-QA), the proposed model encodes short tags and context documents in two parallel channels, which speeds up training and inference. Information from the short tags is fused with context representations learned by the CNN using attention to predict the start and end positions of extracted spans. Promising results on five domain-specific datasets in English and Japanese indicate that the proposed model produces high-quality outputs and can be applied in business scenarios.
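The abstract outlines a two-channel architecture: tags and context are encoded in parallel, a CNN layer adapts the context representations, and attention over the tag tokens is fused with the context to score start and end positions. The paper's actual layer sizes, fusion function, and encoders are not given here, so the following is only a minimal NumPy sketch of that data flow under assumed shapes (random vectors stand in for BERT embeddings; all weights are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8           # toy hidden size (a BERT encoder would use 768)
n, m = 12, 3    # context length, tag length

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv1d(x, w):
    """Same-padded 1-D convolution over the sequence axis.
    x: (n, d_in), w: (k, d_in, d_out)."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[i:i + k], w, axes=([0, 1], [0, 1]))
                     for i in range(x.shape[0])])

# Channel 1: context token embeddings (stand-ins for encoder outputs)
context = rng.normal(size=(n, d))
# Channel 2: short-tag embeddings, encoded independently (in parallel)
tags = rng.normal(size=(m, d))

# CNN layer adapting the context representations to the target domain
w_conv = rng.normal(size=(3, d, d)) * 0.1      # kernel width 3 (assumed)
context_cnn = np.maximum(conv1d(context, w_conv), 0.0)  # ReLU

# Attention: each context position attends over the tag tokens
scores = context_cnn @ tags.T / np.sqrt(d)     # (n, m)
attn = softmax(scores, axis=-1)
fused = np.concatenate([context_cnn, attn @ tags], axis=-1)  # (n, 2d)

# Two linear heads score the start / end positions of the span
w_start = rng.normal(size=(2 * d,))
w_end = rng.normal(size=(2 * d,))
start = int(np.argmax(fused @ w_start))
end = int(np.argmax(fused @ w_end))
print(start, end)  # indices of the predicted span boundaries
```

Because the tag channel is short and encoded separately from the document channel, the expensive context encoding does not grow with the query, which is the source of the speed-up the abstract claims over concatenated BERT-QA inputs.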