Data extraction
Concordance
Comparability
Computer science
Meta-analysis
Systematic review
Statistics
Data mining
Artificial intelligence
Medicine
MEDLINE
Mathematics
Internal medicine
Biology
Biochemistry
Combinatorics
Authors
Jin K. Kim, Michael Chua, Tian Li, Mandy Rickard, Armando J. Lorenzo
Identifier
DOI: 10.1136/bmjebm-2024-113066
Abstract
Objective: To assess the performance of custom GPT-4 models in extracting and evaluating data from medical literature to assist in the systematic review (SR) process.
Design: A proof-of-concept comparative study assessing the accuracy and precision of custom GPT-4 models against human-performed reviews of randomised controlled trials (RCTs).
Setting: Four custom GPT-4 models were developed, each specialising in one of the following tasks: (1) extraction of study characteristics, (2) extraction of outcomes, (3) extraction of bias assessment domains and (4) evaluation of risk of bias using the results of the third model. Model outputs were compared against data from four SRs conducted by human authors. The evaluation focused on accuracy of data extraction, precision in replicating outcomes and agreement levels in risk of bias assessments.
Participants: From the four SRs chosen, 43 studies were retrieved for the data extraction evaluation. Additionally, 17 RCTs were selected for comparison of risk of bias assessments, with both the human comparator SRs and an analogous SR providing assessments for comparison.
Intervention: Custom GPT-4 models were deployed to extract data and evaluate risk of bias from the selected studies, and their outputs were compared with those generated by human reviewers.
Main outcome measures: Concordance rates between GPT-4 outputs and human-performed SRs in data extraction, comparability of pooled effect sizes and inter-/intra-rater agreement in risk of bias assessments.
Results: When the automatically extracted data were compared with the first table of study characteristics in the published review, GPT-4 showed 88.6% concordance with the original review, with <5% of discrepancies attributable to inaccuracies or omissions; in 2.5% of instances it exceeded human accuracy. Study outcomes were extracted and pooled, yielding effect sizes comparable to the comparator SRs. Risk of bias assessment using GPT-4 showed fair to moderate but significant intra-rater agreement (ICC=0.518, p<0.001) and inter-rater agreement with both the human comparator SR (weighted kappa=0.237) and the analogous SR (weighted kappa=0.296). In contrast, agreement between the two human-performed SRs was poor (weighted kappa=0.094).
Conclusion: Custom GPT-4 models perform well in extracting precise data from medical literature, with potential for use in risk of bias assessment. While the evaluated tasks are simpler than the broader range of SR methodologies, they provide an important initial assessment of GPT-4's capabilities.
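The agreement statistics reported in the Results (weighted kappa for inter-rater agreement, intraclass correlation coefficient for intra-rater agreement) can be computed on ordinal risk-of-bias ratings with standard tooling. The sketch below is illustrative only: the example ratings, the linear weighting scheme and the one-way ICC formulation are assumptions for demonstration and are not taken from the study.

```python
# Illustrative sketch: agreement statistics of the kind reported in the abstract.
# Ratings are made-up ordinal risk-of-bias judgements: 0 = low, 1 = some concerns, 2 = high.
import numpy as np
from sklearn.metrics import cohen_kappa_score

gpt4_ratings  = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
human_ratings = [0, 1, 1, 1, 0, 2, 2, 0, 2, 0]

# Linearly weighted kappa penalises larger ordinal disagreements more heavily.
kappa = cohen_kappa_score(gpt4_ratings, human_ratings, weights="linear")
print(f"weighted kappa = {kappa:.3f}")

def icc_oneway(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1,1) for an items x repeated-ratings matrix
    (a simple intra-rater consistency measure; dedicated packages such as
    pingouin.intraclass_corr are usually preferable for real analyses)."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    ms_between = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum() / (n - 1)
    ms_within = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Two repeated ratings of the same ten items by one rater (illustrative values).
repeat_ratings = np.array([[0, 0], [1, 1], [2, 1], [1, 1], [0, 0],
                           [2, 2], [1, 2], [0, 0], [2, 2], [1, 1]])
print(f"ICC(1) = {icc_oneway(repeat_ratings):.3f}")
```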