Computer science
Artificial intelligence
word2vec
Natural language processing
Named entity recognition
Word (group theory)
Word embedding
Deep learning
Relation extraction
Transfer learning
Machine learning
Authors
Mercedes Arguello-Casteleiro,Nava Maroto,Chris Wroe,Carlos Sevillano Torrado,Cory Henson,Julio Des-Diz,M.J. Fernandez-Prieto,TJ Furmston,Diego Maseda Fernandez,Mohak Kulshrestha,Robert Stevens,John Keane,Simon Peters
Identifier
DOI:10.1007/978-3-030-91100-3_14
Abstract
Deep learning for natural language processing acquires dense vector representations for n-grams from large-scale unstructured corpora. Converting static embeddings of n-grams into a dataset of interlinked concepts with explicit contextual semantic dependencies provides the foundation for acquiring reusable knowledge. However, validating this knowledge requires cross-checking against ground truths that may be unavailable in an actionable or computable form. This paper presents a novel approach from the new field of explainable active learning that combines methods for learning static embeddings (word2vec models) with methods for learning dynamic contextual embeddings (transformer-based BERT models). We created a dataset for named entity recognition (NER) and relation extraction (REX) for the Coronavirus Disease 2019 (COVID-19). The COVID-19 dataset contains 2,212 associations captured by 11 word2vec models, with additional examples of use from the biomedical literature. We propose interpreting the NER and REX tasks for COVID-19 as Question Answering (QA), incorporating general medical knowledge within the question, e.g. “does ‘cough’ (n-gram) belong to ‘clinical presentation/symptoms’ for COVID-19?”. We evaluated biomedical-specific pre-trained language models (BioBERT, SciBERT, ClinicalBERT, BlueBERT, and PubMedBERT) against general-domain pre-trained language models (BERT and RoBERTa) for transfer learning with the COVID-19 dataset, i.e. task-specific fine-tuning treating NER as a sequence-level task. Using 2,060 QA pairs for training (associations from 10 word2vec models) and 152 QA pairs for validation (associations from 1 word2vec model), BERT obtained an F-measure of 87.38%, with precision = 93.75% and recall = 81.82%. SciBERT achieved the highest F-measure of 94.34%, with precision = 98.04% and recall = 90.91%.
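To make the abstract's setup concrete, below is a minimal sketch (not the authors' released code) of how an association can be cast as a yes/no question and a pretrained biomedical BERT fine-tuned for binary sequence-level classification via Hugging Face Transformers. The checkpoint name, the toy QA pairs, and all hyperparameters are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: yes/no QA framing of NER/REX associations, fine-tuned as
# binary sequence classification. Model name and hyperparameters below are
# assumptions for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "allenai/scibert_scivocab_uncased"  # SciBERT; swap in BioBERT, etc.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Each association becomes a question that embeds general medical knowledge,
# labelled 1 (association holds) or 0 (it does not). Toy examples only.
train_qa = [
    ("does 'cough' belong to 'clinical presentation/symptoms' for COVID-19?", 1),
    ("does 'cough' belong to 'treatment' for COVID-19?", 0),
]

enc = tokenizer([q for q, _ in train_qa], padding=True, truncation=True,
                return_tensors="pt")
labels = torch.tensor([y for _, y in train_qa])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):                 # task-specific fine-tuning
    optimizer.zero_grad()
    out = model(**enc, labels=labels)  # sequence-level classification head
    out.loss.backward()
    optimizer.step()

# Inference on a held-out association.
model.eval()
with torch.no_grad():
    q = tokenizer("does 'fever' belong to 'clinical presentation/symptoms' "
                  "for COVID-19?", return_tensors="pt")
    pred = model(**q).logits.argmax(dim=-1).item()  # 1 = yes, 0 = no
```

In the paper's evaluation scheme, the held-out QA pairs correspond to the associations from one word2vec model, on which precision, recall, and F-measure are computed.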