Benchmark (surveying)
Medical education
Computer science
Psychology
Medicine
Geography
Cartography
Authors
Inioluwa Deborah Raji, Roxana Daneshjou, Emily Alsentzer
Abstract
Medical licensing examinations, such as the United States Medical Licensing Examination, have become the default benchmarks for evaluating large language models (LLMs) in health care. Performance on these benchmarks is frequently cited as evidence of progress and used to justify the deployment of LLMs into clinical settings. However, we argue that these benchmarks are fundamentally limited as signals for assessing true clinical utility.