Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Subjects: Process (computing), Computer Science, Automated Reasoning, Management Science, Natural Language Processing, Artificial Intelligence, Programming Languages, Engineering
Authors
Liangchen Luo, Zhiwen Yu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi
Source
Journal: Cornell University - arXiv · Cited by: 1
Identifier
DOI: 10.48550/arxiv.2406.06592
Abstract

Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a standard inference-time technique aimed at enhancing the reasoning performance of LLMs. However, this still proves insufficient for reasoning tasks with a lengthy or multi-hop reasoning chain, where the intermediate outcomes are neither properly rewarded nor penalized. Process supervision addresses this limitation by assigning intermediate rewards during the reasoning process. To date, the methods used to collect process supervision data have relied on either human annotation or per-step Monte Carlo estimation, both prohibitively expensive to scale, thus hindering the broad application of this technique. In response to this challenge, we propose a novel divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named OmegaPRM for the efficient collection of high-quality process supervision data. This algorithm swiftly identifies the first error in the Chain of Thought (CoT) with binary search and balances the positive and negative examples, thereby ensuring both efficiency and quality. As a result, we are able to collect over 1.5 million process supervision annotations to train a Process Reward Model (PRM). Utilizing this fully automated process supervision alongside the weighted self-consistency algorithm, we have enhanced the instruction-tuned Gemini Pro model's math reasoning performance, achieving a 69.4% success rate on the MATH benchmark, a 36% relative improvement from the 51% base model performance. Additionally, the entire process operates without any human intervention, making our method both financially and computationally cost-effective compared to existing methods.
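The abstract's core idea — binary-searching the reasoning chain, with Monte Carlo rollouts as the correctness signal, to locate the first erroneous step — can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: `completer` (samples an LLM completion from a step prefix) and `checker` (verifies the final answer) are hypothetical stand-ins, and the sketch assumes the empty prefix is solvable while the full chain yields a wrong answer.

```python
def mc_value(steps, k, completer, checker, num_rollouts=8):
    """Monte Carlo estimate of a prefix's quality: the fraction of
    completions sampled from steps[:k] that reach a correct answer."""
    wins = sum(checker(completer(steps[:k])) for _ in range(num_rollouts))
    return wins / num_rollouts

def first_error_step(steps, completer, checker, num_rollouts=8):
    """Binary search for the earliest step whose prefix can no longer be
    completed to a correct answer (Monte Carlo value drops to zero).

    Invariant: a prefix of length lo still admits correct completions;
    a prefix of length hi does not.
    """
    lo, hi = 0, len(steps)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if mc_value(steps, mid, completer, checker, num_rollouts) > 0:
            lo = mid  # prefix still recoverable; error lies later
        else:
            hi = mid  # prefix already doomed; error lies at or before mid
    return hi  # 1-based index of the first erroneous step

# Toy demo: steps 1-4 are sound, step 5 introduces an unrecoverable error.
toy_steps = ["s1", "s2", "s3", "s4", "bad", "s6"]
toy_completer = lambda prefix: "ok" if "bad" not in prefix else "wrong"
toy_checker = lambda answer: 1 if answer == "ok" else 0
print(first_error_step(toy_steps, toy_completer, toy_checker))  # -> 5
```

Compared with naive per-step Monte Carlo estimation, which needs rollouts at every one of the n steps, this localizes the first error with rollouts at only about log2(n) prefixes, which is the efficiency gain the abstract attributes to the divide-and-conquer design.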