Named Entity Recognition and Relation Extraction for COVID-19: Explainable Active Learning with Word2vec Embeddings and Transformer-Based BERT Models

Computer Science · Artificial Intelligence · Word2vec · Natural Language Processing · Named Entity Recognition · Word Embedding · Deep Learning · Relation Extraction · Transfer Learning · Machine Learning
Authors
Mercedes Arguello-Casteleiro, Nava Maroto, Chris Wroe, Carlos Sevillano Torrado, Cory Henson, Julio Des-Diz, M.J. Fernandez-Prieto, TJ Furmston, Diego Maseda Fernandez, Mohak Kulshrestha, Robert Stevens, John Keane, Simon Peters
Source
Journal: Lecture Notes in Computer Science, pp. 158-163. Cited by: 3
Identifier
DOI: 10.1007/978-3-030-91100-3_14
Abstract

Deep learning for natural language processing acquires dense vector representations for n-grams from large-scale unstructured corpora. Converting static embeddings of n-grams into a dataset of interlinked concepts with explicit contextual semantic dependencies provides the foundation for acquiring reusable knowledge. However, validating this knowledge requires cross-checking against ground truths that may be unavailable in an actionable or computable form. This paper presents a novel approach from the new field of explainable active learning that combines methods for learning static embeddings (word2vec models) with methods for learning dynamic contextual embeddings (transformer-based BERT models). We created a dataset for named entity recognition (NER) and relation extraction (REX) for Coronavirus Disease 2019 (COVID-19). The COVID-19 dataset contains 2,212 associations captured by 11 word2vec models, with additional examples of use from the biomedical literature. We propose interpreting the NER and REX tasks for COVID-19 as Question Answering (QA) that incorporates general medical knowledge within the question, e.g., "does 'cough' (n-gram) belong to 'clinical presentation/symptoms' for COVID-19?". We evaluated biomedical-domain pre-trained language models (BioBERT, SciBERT, ClinicalBERT, BlueBERT, and PubMedBERT) against general-domain pre-trained language models (BERT and RoBERTa) for transfer learning with the COVID-19 dataset, i.e., task-specific fine-tuning that treats NER as a sequence-level task. Using 2,060 QA pairs for training (associations from 10 word2vec models) and 152 QA pairs for validation (associations from the remaining word2vec model), BERT obtained an F-measure of 87.38% (precision = 93.75%, recall = 81.82%). SciBERT achieved the highest F-measure of 94.34% (precision = 98.04%, recall = 90.91%).
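As a consistency check on the reported scores, the F-measure is the harmonic mean of precision and recall, F1 = 2PR / (P + R): for BERT, 2 × 0.9375 × 0.8182 / (0.9375 + 0.8182) ≈ 0.8738, and for SciBERT, 2 × 0.9804 × 0.9091 / (0.9804 + 0.9091) ≈ 0.9434, matching the reported 87.38% and 94.34%.

The QA formulation above can be sketched as binary sequence classification: a question built from an n-gram and a candidate category is scored as an association that either holds or does not. The minimal sketch below assumes the HuggingFace transformers library and the public SciBERT checkpoint (allenai/scibert_scivocab_uncased); the question template, the 0/1 label convention, and the hyperparameters are illustrative assumptions, not the authors' exact configuration.

# Minimal sketch of NER-as-QA via binary sequence classification.
# Assumptions (not from the paper): HuggingFace transformers, the public
# SciBERT checkpoint, and an illustrative label scheme (1 = holds, 0 = not).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "allenai/scibert_scivocab_uncased"  # swap in BioBERT, BERT, RoBERTa, ...
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

def make_question(ngram: str, category: str) -> str:
    # Question template mirroring the abstract's example; exact wording assumed.
    return f"does '{ngram}' belong to '{category}' for COVID-19?"

# One toy training example taken from the abstract's illustration.
enc = tokenizer(
    make_question("cough", "clinical presentation/symptoms"),
    return_tensors="pt", truncation=True,
)
labels = torch.tensor([1])  # the association holds

# One fine-tuning step: cross-entropy loss over the two classes.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
optimizer.zero_grad()
loss = model(**enc, labels=labels).loss
loss.backward()
optimizer.step()

# Inference: probability that the association holds.
model.eval()
with torch.no_grad():
    p_holds = torch.softmax(model(**enc).logits, dim=-1)[0, 1].item()
print(f"P(association holds) = {p_holds:.3f}")

In practice each of the 2,060 training QA pairs would be batched through such a loop for several epochs, with the 152 validation pairs used to compute the reported precision, recall, and F-measure.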