When It Comes to Benchmarks, Humans Are the Only Way
Subjects: Computer Science, Computer Security, Business
Authors
Adam Rodman, Laura Zwaan, Andrew Olson, Arjun K. Manrai
Identifier
DOI: 10.1056/aie2500143
Abstract
The improved performance of large language models (LLMs) on traditional reasoning assessments has led to benchmark saturation. This has spurred efforts to develop new benchmarks, including synthetic computational simulations of clinical practice involving multiple AI agents. We argue that it is crucial to ground such efforts in extensive human validation, and we conclude with four recommendations for researchers seeking to better evaluate LLMs for clinical practice.