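Computer science
Entity linking
Information retrieval
Baseline (sea)
Pipeline (software)
Named-entity recognition
End-to-end principle
Hyperlink
Pipeline transport
Benchmark (surveying)
Information extraction
MIT License
Graph
Data mining
License
World Wide Web
Artificial intelligence
Web page
Task (project management)
Knowledge base
Geology
Engineering
Operating system
Economics
Oceanography
Environmental engineering
Theoretical computer science
Management
Programming language
Geography
Geodesy
Authors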
Pierre-Yves Genest,Pierre-Édouard Portier,Előd Egyed-Zsigmond,Martino Lovisetto
Identifier
DOI:10.1145/3539618.3591912
Abstract
Information Extraction (IE) pipelines aim to extract meaningful entities and relations from documents and structure them into a knowledge graph that can then be used in downstream applications. Training and evaluating such pipelines requires a dataset annotated with entities, coreferences, relations, and entity-linking labels. However, existing datasets either lack entity-linking labels, are too small, are not diverse enough, or are automatically annotated (that is, without a strong guarantee that the annotations are correct). Therefore, we propose Linked-DocRED, to the best of our knowledge the first manually annotated, large-scale, document-level IE dataset. We enhance the existing and widely used DocRED dataset with entity-linking labels generated through a semi-automatic process that guarantees high-quality annotations. In particular, we use hyperlinks in Wikipedia articles to provide disambiguation candidates. We also propose a complete framework of metrics to benchmark end-to-end IE pipelines, and we define an entity-centric metric to evaluate entity-linking. The evaluation of a baseline shows promising results while highlighting the challenges of an end-to-end IE pipeline. Linked-DocRED, the source code for the entity-linking, the baseline, and the metrics are distributed under an open-source license and can be downloaded from a public repository.