Computer science
Interpretability
Suite
Baseline (sea)
Consistency (knowledge base)
Task (project management)
Set (abstract data type)
Artificial intelligence
Fluency
Machine learning
Measure (data warehouse)
Correctness
Natural language processing
Data mining
Programming language
Economics
Archaeology
Philosophy
Management
Geology
Oceanography
History
Linguistics
Authors
Olga Golovneva, Moya Chen, Spencer Poff, Martin Corredor, Luke Zettlemoyer, Maryam Fazel-Zarandi, Aslı Çelikyılmaz
Source
Journal: Cornell University - arXiv
Date: 2022-01-01
Citations: 22
Identifier
DOI:10.48550/arxiv.2212.07919
Abstract
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independently of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end-task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits, by leveraging properties of step-by-step rationales. We empirically verify the strength of our metrics on five human-annotated and six programmatically perturbed diagnostic datasets covering a diverse set of tasks that require reasoning skills, and show that ROSCOE consistently outperforms baseline metrics.
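The abstract's core idea is that a reasoning chain can be scored step by step against its source problem, rather than only judging the final answer. Below is a minimal sketch of one such embedding-based step-to-source alignment score, in the spirit of ROSCOE's semantic-alignment metrics; it is not the authors' released implementation. The encoder name ("all-MiniLM-L6-v2"), the helper function step_alignment, and the example chain are illustrative assumptions, while the (1 + cos) / 2 normalization loosely follows the alignment scores described in the paper.

```python
# Sketch of a ROSCOE-style step-to-source alignment score (illustrative only).
# Assumes the sentence-transformers package; the model choice is arbitrary.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def step_alignment(steps, source_sentences):
    """Score each reasoning step by its best cosine match to the source.

    Returns per-step scores in [0, 1] and their mean over the chain.
    A low per-step score flags a step with no support in the source.
    """
    step_emb = model.encode(steps, convert_to_tensor=True)
    src_emb = model.encode(source_sentences, convert_to_tensor=True)
    sims = util.cos_sim(step_emb, src_emb)       # |steps| x |sources| matrix
    per_step = (1 + sims.max(dim=1).values) / 2  # best source match per step
    return per_step.tolist(), per_step.mean().item()

# Hypothetical worked example: a two-step chain for a small word problem.
steps = [
    "There are 3 boxes with 4 apples each.",
    "3 * 4 = 12, so there are 12 apples in total.",
]
source = [
    "Tom has 3 boxes.",
    "Each box contains 4 apples.",
    "How many apples does Tom have?",
]
scores, chain_score = step_alignment(steps, source)
print(scores, chain_score)
```

Such a score is unsupervised in the sense the abstract uses: it needs no labeled judgments of reasoning quality, only a pretrained sentence encoder, which is what makes it applicable across the diverse reasoning datasets the paper evaluates on.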