Authors
Jiageng Wu, Xiaocong Liu, Minghui Li, Wanxin Li, Zichang Su, Shixu Lin, Lucas Garay, Zhiyun Zhang, Yujie Zhang, Qingcheng Zeng, Jie Shen, Changzheng Yuan, Jie Yang
Abstract
Privacy and ethical considerations limit access to large-scale clinical datasets, particularly clinical text data, which contain extensive and diverse information and serve as the foundation for building clinical large language models (LLMs). This limited accessibility impedes the development of clinical artificial intelligence systems and hampers research participation from resource-poor regions and medical institutions, thereby exacerbating health care disparities. We conduct a global review to identify publicly available clinical text datasets and to describe their accessibility, diversity, and usability for clinical LLMs. We screened 3962 papers across medical databases (PubMed and MEDLINE) and a computational linguistics database (the Association for Computational Linguistics Anthology), as well as 239 tasks from prevalent medical natural language processing (NLP) challenges, such as the National NLP Clinical Challenges (n2c2). We identified 192 unique clinical text datasets that claimed to be publicly available. Following an institutional review board–approved data-requesting pipeline, access was granted to fewer than half (91 of 192 [47.4%]) of the identified datasets; an additional 14 (7.3%) datasets were available under regulated access, and 87 (45.3%) remained inaccessible. The publicly available datasets cover nine languages from 14 countries and contain over 10 million clinical text records. Most of them (88 [95.7%]) originated from the Americas, Europe, and Asia, and none originated from Oceania or Africa, leaving these regions significantly underrepresented. Coverage was also uneven across clinical contexts and supported NLP tasks. Intensive care unit (18 [16.8%]), respiratory disease (13 [12.1%]), and cardiovascular disease (11 [10.3%]) were the clinical contexts that received the most attention, and named entity recognition (23 [21.7%]), text classification (22 [20.8%]), and event extraction (12 [11.3%]) were the most explored NLP tasks. To our knowledge, this is the first systematic review to characterize publicly available clinical text datasets, the foundation of clinical LLMs. It highlights the difficulty of obtaining access, the underrepresentation of regions and languages, and the challenges posed by LLMs. Sharing diverse, large-scale clinical text data, with appropriate protection, is necessary to promote health care research.