Rapid identification of chronic kidney disease in electronic health record database using computable phenotype combining a common data model

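Keywords: Identification (biology); Kidney disease; Medicine; Phenotype; Disease; Electronic health record; Database; Computer science; Bioinformatics; Data mining; Internal medicine; Biology; Genetics; Gene; Botany; Health care; Economics; Economic growth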
Authors
Huaiyu Wang, Juan Du, Yu Yang, Hongbo Lin, Bin Bao, Guohui Ding, Chao Yang, Guilan Kong, Luxia Zhang
Source
Journal: Chinese Medical Journal [Ovid Technologies (Wolters Kluwer)]
Volume (issue): 136 (7): 874-876 Citations: 1
Identifier
DOI:10.1097/cm9.0000000000002168
Abstract

To the Editor: Chronic kidney disease (CKD) is a global public health burden. The global prevalence of CKD exceeds 10%, whereas awareness of the disease is only around 10%.[1] In the era of big data, improving the identification of CKD with informatics tools is important. A computable phenotype has proven to be an efficient tool for identifying patients from electronic health record (EHR) data. It is an automated algorithm that identifies the target population through objective criteria expressed as logic statements. Effective implementation of a computable phenotype depends on valid mapping of raw data to a standard set of data and definitions. Previous studies developed computable phenotypes for CKD identification in English by using Logical Observation Identifiers Names and Codes (LOINC) and International Classification of Diseases (ICD) codes.[2,3] Because these codes are not universally used and because of the language barrier, implementing these computable phenotypes in non-English settings and/or in the absence of an identical coding system is difficult.

A common data model (CDM) has been reported as a solution for data standardization and the localization of computable phenotypes.[4] The core of a CDM is extraction, transformation, and loading (ETL): key elements are extracted, transformed into a standard terminology, and loaded into a standard schema. Currently, various CDMs with different original aims, such as the Observational Medical Outcomes Partnership CDM, the Sentinel CDM, and the Patient-Centered Outcomes Research Network CDM, have been widely used and have successfully facilitated the standardization of EHR data. Sentinel previously posted coding trend analyses on kidney disease, but only ICD-9 and ICD-10 codes were included. A CDM for CKD characterization is still lacking.

Confirmation of CKD takes at least 3 months. This requirement hinders timely diagnosis and increases missed diagnoses of CKD in clinical practice, especially for patients who seek health care at different institutes.[5] An EHR database collects healthcare data continuously across institutes and updates them in real time. Monitoring and identifying patients with CKD by using an informatics tool based on such a database is promising. Taken together, it is reasonable to speculate that a computable phenotype combined with a CDM might facilitate CKD-related data extraction and CKD identification from EHR data.

Yinzhou is a district with a population of 1.6 million people located in Ningbo, Zhejiang Province, China. The Regional Health Information System (RHIS) in Yinzhou collects the EHRs of residents and updates the database in real time. In this database, a unique identity code (PERSONKEY) was generated from personal ID, sex, date of birth, and name and was used to recognize the same person, link health profiles across sub-databases, and generate complete EHRs. The EHRs of 976,409 adults with medical records were extracted as the raw data for the following analyses [Supplementary Figure 1, https://links.lww.com/CM9/B73]. This study was approved by the ethics committee of Peking University First Hospital.

The CDM for CKD characterization was designed in accordance with the principles described in The Book of OHDSI: Observational Health Data Sciences and Informatics. In accordance with the Kidney Disease: Improving Global Outcomes (KDIGO) clinical guidelines for CKD (2012), the key elements for CKD identification were defined as age, sex, kidney function, and urine abnormality.[6] Hence, the data domains of the CDM for CKD identification were designed as demographics, laboratory tests, and diagnosis. The standard terminology of the data domains was defined in accordance with the KDIGO CKD clinical guidelines and ICD-10 codes in English and in Chinese.
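The ETL idea described above (extract key elements, transform them into a standard terminology, load them into a standard schema keyed by PERSONKEY) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which was written as SQL statements on Hive/Spark. The field names (`item_name`, `result`, `date`), the local test names, and the contents of `TERM_MAP` are hypothetical placeholders; the real mapping rules were produced by manual annotation and are not reproduced here.

```python
from dataclasses import dataclass
from datetime import date, datetime

# Hypothetical mapping from local (original) vocabulary to standard terminology.
# Assumption: the real rules came from manual annotation and differ from these.
TERM_MAP = {
    "血肌酐": "serum creatinine",
    "Scr": "serum creatinine",
    "尿白蛋白/肌酐": "urine albumin-to-creatinine ratio",
    "UACR": "urine albumin-to-creatinine ratio",
}


@dataclass(frozen=True)
class LabRecord:
    """One standardized laboratory record in the illustrative CDM table."""
    personkey: str
    term: str
    value: float
    test_date: date


def extract(raw_rows):
    """Extract: keep only rows whose local test name has a mapping rule."""
    return [row for row in raw_rows if row["item_name"] in TERM_MAP]


def transform(row):
    """Transform: standardize the term, the numeric value, and the date format."""
    return LabRecord(
        personkey=row["personkey"],
        term=TERM_MAP[row["item_name"]],
        value=float(row["result"]),
        test_date=datetime.strptime(row["date"], "%Y%m%d").date(),
    )


def load(records, table):
    """Load: append standardized records to a per-person table (a plain dict here)."""
    for rec in records:
        table.setdefault(rec.personkey, []).append(rec)
    return table


# Made-up raw rows from two hypothetical sub-databases.
raw = [
    {"personkey": "P001", "item_name": "血肌酐", "result": "132", "date": "20190105"},
    {"personkey": "P001", "item_name": "UACR", "result": "45.0", "date": "20190420"},
    {"personkey": "P002", "item_name": "血糖", "result": "5.1", "date": "20190301"},
]
cdm_table = load([transform(r) for r in extract(raw)], {})
```

In this toy run, the two records for `P001` are linked under one key with English standard terms, while the unmapped glucose row for `P002` is dropped at the extraction step.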
Forms containing demographics (age, sex), laboratory tests (kidney function, albuminuria, proteinuria, hematuria), and diagnosis (ICD-10 codes and texts) in the EHR database were integrated by PERSONKEY. Altogether, 10,981,723 medical records of 976,409 individuals in the EHR database were prepared for the extraction of original vocabularies [Supplementary Figure 1, https://links.lww.com/CM9/B73]. The mapping rules between the original vocabularies and the standard terminology were established through manual annotation and format conversion. Two nephrologists independently conducted the annotation, and one informaticist performed the mapping [Figure 1].

Figure 1: Process of the development of the CDM for CKD characterization and the computable phenotype for CKD identification. CDM: Common data model; CKD: Chronic kidney disease; eGFR: Estimated glomerular filtration rate; EHR: Electronic health record; ICD: International Classification of Diseases.

The algorithm of the computable phenotype for CKD identification was designed in accordance with the KDIGO clinical guidelines for CKD[6] [Figure 1]. On the basis of the standard terminology of the CDM, patients showing at least one of the following manifestations lasting for >3 months were defined as having CKD: (1) reduced kidney function: estimated glomerular filtration rate (eGFR) <60 mL/min/1.73 m²; (2) albuminuria: urine albumin-to-creatinine ratio ≥30 mg/g or urine albumin concentration ≥20 mg/L; (3) proteinuria: urine protein-to-creatinine ratio ≥150 mg/g, or 24 h proteinuria ≥150 mg/24 h, or urinalysis protein ≥+1; (4) hematuria (urine red blood cells ≥3 cells/HPF [or >28 cells/μL] or urine occult blood ≥+2) in the absence of non-CKD-related causes, including urologic neoplasms, urinary tract infection, and injury; (5) CKD-related diagnosis, including primary, secondary, or congenital kidney disease, renal vascular disease, maintenance dialysis, and recipient/donor of kidney transplantation [Supplementary Table 1, https://links.lww.com/CM9/B73]. Patients who received re-tests over a period of 3 months and were confirmed to have none of the abovementioned manifestations were defined as normal cases. Patients who presented these manifestations for ≤3 months or did not receive any re-test were defined as cases to be addressed, to be processed in the next iteration of CKD identification [Figure 1].

On the basis of the number of individuals with EHRs and the diversity of EHR infrastructures and data sources, seven institutes were selected from the 42 healthcare institutes in Yinzhou to implement the computable phenotype based on the CDM: three tertiary general hospitals, two specialty hospitals (a maternity and children's hospital and an orthopedic hospital), one secondary general hospital, and one community health center.

The performance of the computable phenotype was validated through manual review. Cases identified as with or without CKD were randomly selected, and their original records of demographics, diagnosis, and laboratory tests were manually reviewed by two nephrologists. For those without CKD, all diagnoses and CKD-related laboratory tests in the database were extracted and manually reviewed. For those with CKD, all diagnoses and laboratory tests from the date of presentation of CKD to the endpoint of the database were extracted and manually reviewed. A panel discussion was held when the reviewers' opinions differed. Review by nephrologists was defined as the gold standard for CKD identification.

The data processing and computation in the RHIS were based on the Hadoop framework.
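The criteria and the three outcome labels described above (CKD, normal, to be addressed) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' SQL implementation: only four numeric tests are included, "3 months" is approximated as 90 days, persistence is read as the same manifestation being abnormal on dates more than 90 days apart, and the exclusion of non-CKD causes of hematuria and the diagnosis-based criterion are omitted. The test names used as dictionary keys are hypothetical.

```python
from datetime import date

# Abnormality thresholds taken from the criteria in the text (numeric tests only).
# The keys are hypothetical standard-term identifiers.
ABNORMAL = {
    "egfr": lambda v: v < 60,            # mL/min/1.73 m^2
    "uacr": lambda v: v >= 30,           # mg/g
    "upcr": lambda v: v >= 150,          # mg/g
    "urine_rbc_hpf": lambda v: v >= 3,   # cells/HPF
}

PERSISTENCE_DAYS = 90  # assumption: "3 months" approximated as 90 days


def classify(observations):
    """Classify one person from a list of (test_name, value, test_date) tuples.

    Returns "CKD", "normal", or "to be addressed".
    """
    abnormal_dates = {}  # test name -> dates on which that test was abnormal
    all_dates = []       # dates of every recognized test, normal or not
    for name, value, when in observations:
        if name not in ABNORMAL:
            continue
        all_dates.append(when)
        if ABNORMAL[name](value):
            abnormal_dates.setdefault(name, []).append(when)

    # CKD: the same manifestation is abnormal on dates more than 3 months apart.
    for dates in abnormal_dates.values():
        if (max(dates) - min(dates)).days > PERSISTENCE_DAYS:
            return "CKD"

    # Normal: re-tested over more than 3 months with no abnormal result at all.
    if all_dates and not abnormal_dates:
        if (max(all_dates) - min(all_dates)).days > PERSISTENCE_DAYS:
            return "normal"

    # Everything else waits for the next iteration of identification.
    return "to be addressed"


persistent = classify([("egfr", 52, date(2019, 1, 1)), ("egfr", 55, date(2019, 6, 1))])
retested_normal = classify([("egfr", 95, date(2019, 1, 1)), ("uacr", 10, date(2019, 6, 1))])
single_test = classify([("egfr", 52, date(2019, 1, 1))])
```

Returning "to be addressed" as the default mirrors the iterative design described in the text: a person with a single abnormal result is neither confirmed nor cleared until a later re-test arrives.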
The computing engine was Spark, and the data warehouse was Hive, which provided support for structured query language (SQL) (The Apache Software Foundation, Wakefield, MA, USA). The ETL process of the CDM and the implementation of the computable phenotype were conducted using SQL statements. The demographic and clinical characteristics of CKD-identified patients were analyzed. The stages of CKD-identified patients were evaluated in terms of the levels of eGFR and presented as G1–G5. Continuous and categorical variables were presented as mean ± standard deviation and frequency, respectively. The performance of the computable phenotype was evaluated in terms of sensitivity, specificity, and accuracy and analyzed using MedCalc 15.8 (MedCalc Software Ltd., Ostend, Belgium).

The standard terminology for CKD characterization is shown in Figure 1. The bilingual terminology is presented in Supplementary Table 2, https://links.lww.com/CM9/B73. A total of 617 original vocabularies for laboratory tests were found and standardized by processing 10,981,723 medical records of 976,409 individuals from 42 medical institutes. The formats of dates, categorical data, and test units were converted. By manual annotation, 111 types of diagnosis (corresponding to 171 types of ICD-10 codes in the English and Chinese versions), including primary, secondary, and congenital kidney disease, renal vascular disease, and uremia-related diagnosis, were reorganized as CKD-related diagnoses [Supplementary Table 1, https://links.lww.com/CM9/B73].

By scanning 21,474,008 records of laboratory tests and diagnoses of 557,719 individuals in seven medical institutes, 64,036 (11.5%) patients with CKD were identified by the computable phenotype. In China, patients commonly seek health care across different institutes. Thus, the EHRs of more than half of the residents in the whole database were extracted from the seven representative institutes. Among the identified patients with CKD, 55,682 (87.0%) had received serum creatinine tests.
The majority of patients were in early stages (G1: 33,315 cases [59.8%]; G2: 12,980 cases [23.3%]). Patients in G1 were the youngest (53.7 ± 14.0 years), whereas patients in G4 were the oldest (82.3 ± 14.6 years). The highest proportions of hematuria and of albuminuria/proteinuria were observed in G1 (17,187 cases [51.6%]) and G5 (417 cases [51.3%]), respectively. The frequency of patients labeled with a CKD-related ICD-10 code increased from G1 (16,795 cases [50.4%]) to G5 (737 cases [90.7%]) [Supplementary Table 3, https://links.lww.com/CM9/B73].

In total, the EHRs of 50 CKD-identified cases and 50 cases without CKD were randomly sampled and reviewed by two nephrologists. All 50 CKD-identified cases were confirmed as having the disease, and three of the cases identified as without CKD were judged to be misclassified because they did not meet the criterion of re-testing over 3 months. The sensitivity, specificity, and accuracy of the computable phenotype for CKD identification were 94.3%, 100.0%, and 97.0%, respectively [Supplementary Table 4, https://links.lww.com/CM9/B73].

Compared with previous models, the present computable phenotype particularly considered the utilization of existing non-uniform data and its capacity for localization across databases with different settings. Nadkarni et al[3] developed a computable phenotype to identify patients with CKD in a population with diabetes and/or hypertension based on the eMERGE network. Their algorithm mainly relied on ICD-9 codes. Hence, the performance of their computable phenotype was influenced by the rate of missing diagnosis records and/or disease awareness. Norton et al[2] developed the National Kidney Disease Education Program (NKDEP) e-phenotype for CKD identification using laboratory tests extracted through LOINC. The NKDEP e-phenotype effectively avoided the influence of the diagnosis rate, but its dependence on LOINC limited its localization to databases without LOINC.
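As a check on the validation figures reported above, the three metrics can be recomputed from a 2 × 2 table. The counts below are inferred from the text rather than taken from Supplementary Table 4, which is not reproduced here: all 50 phenotype-positive cases were confirmed (50 true positives, 0 false positives), and 3 of the 50 phenotype-negative cases were judged misclassified (3 false negatives, 47 true negatives). The function name is illustrative.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # confirmed CKD that the phenotype caught
    specificity = tn / (tn + fp)                 # confirmed non-CKD that the phenotype cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all correct calls over all reviewed cases
    return sensitivity, specificity, accuracy


# Assumed counts, inferred from the validation paragraph (see lead-in).
sens, spec, acc = diagnostic_metrics(tp=50, fp=0, tn=47, fn=3)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
# prints: sensitivity=94.3% specificity=100.0% accuracy=97.0%
```

Under these assumed counts the results match the reported 94.3%, 100.0%, and 97.0%, which suggests the three misclassified cases were counted as false negatives.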
The algorithm of the present computable phenotype combined CKD-related diagnostic records and laboratory tests to improve data utilization and the identification rate. The terminology of the CDM preferred standard descriptions over a coding system, so as to preserve the potential for further expansion to foreign databases lacking an identical coding system. In the present implementation, EHR data from different levels of healthcare institutes were scanned successfully, and the prevalence of CKD and the characteristics of the identified patients with CKD were consistent with a previous nationally representative study.[7] This finding demonstrated the effectiveness of embedding a CDM into the computable phenotype.

The present study established a reproducible paradigm for the design and construction of a CDM and a computable phenotype in other fields and databases. First, slightly expanding the criteria for disease identification beyond the standard definition of the disease is allowable to balance data utilization and the identification rate. Second, embedding a CDM into the computable phenotype can improve the efficiency of its implementation across different databases. Third, a CDM containing non-monotonic terminology will increase the potential for localization. Finally, the correspondence between the English and Chinese terminologies can serve as an interface that links data in Chinese with existing resources and techniques in English. This strategy may also be feasible for promoting data extraction and information exchange in other languages.

The present study is the first to establish a computable phenotype for CKD identification based on a CDM with a bilingual terminology for CKD characterization.
This study develops an efficient tool for CKD identification based on a real-world EHR database and provides a potential interface, the CDM, for generalizing the computable phenotype across English and Chinese database settings.

Funding

This study was supported by grants from the National Natural Science Foundation of China (Nos. 82100741, 82003529, 91846101, 81771938, 81900665, 82090021), Beijing Municipal Science and Technology Commission (Grant No. 7212201), the University of Michigan Health System-Peking University Health Science Center Joint Institute for Translational and Clinical Research (Nos. BMU2020JI011, BMU2019JI005, BMU2018JI012), Beijing Nova Programme Interdisciplinary Cooperation Project (No. Z191100001119008), National Key R&D Program of the Ministry of Science and Technology of China (No. 2019YFC2005000), the National Key Research and Development Program of China (No. 2018AAA0102100), PKU-Baidu Fund (Nos. 2020BD005, 2019BD017), and CAMS Innovation Fund for Medical Sciences (No. 2019-I2M-5-046).

Conflicts of interest

None.