Automatic summarization
Computer science
Reinforcement learning
Preference
Scalability
Artificial intelligence
Quality (philosophy)
Machine learning
Baseline (sea)
Mathematics
Philosophy
Statistics
Oceanography
Epistemology
Database
Geology
Authors
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Cărbune, Abhinav Rastogi
Source
Journal: Cornell University - arXiv
Date: 2023-01-01
Citations: 10
Identifiers
DOI: 10.48550/arxiv.2309.00267
Abstract
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences. However, gathering high-quality human preference labels can be a time-consuming and expensive endeavor. RL from AI Feedback (RLAIF), introduced by Bai et al., offers a promising alternative that leverages a powerful off-the-shelf LLM to generate preferences in lieu of human annotators. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, RLAIF achieves comparable or superior performance to RLHF, as rated by human evaluators. Furthermore, RLAIF demonstrates the ability to outperform a supervised fine-tuned baseline even when the LLM preference labeler is the same size as the policy. In another experiment, directly prompting the LLM for reward scores achieves superior performance to the canonical RLAIF setup, where LLM preference labels are first distilled into a reward model. Finally, we conduct extensive studies on techniques for generating aligned AI preferences. Our results suggest that RLAIF can achieve human-level performance, offering a potential solution to the scalability limitations of RLHF.
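The abstract describes replacing human annotators with an off-the-shelf LLM that labels which of two candidate responses is preferred. The sketch below illustrates one possible shape of that labeling step; the prompt wording and the query_llm helper are illustrative assumptions, not the prompts or APIs used in the paper.

```python
# Minimal sketch of AI preference labeling, assuming a summarization task:
# an off-the-shelf LLM is prompted to compare two candidate summaries and
# its answer is used in place of a human preference label.
# `query_llm` is a hypothetical placeholder, not an API from the paper.

from dataclasses import dataclass


@dataclass
class PreferenceExample:
    context: str      # text to be summarized
    response_a: str   # candidate summary A
    response_b: str   # candidate summary B
    preferred: str    # "A" or "B", filled in by the AI labeler


PROMPT_TEMPLATE = """\
A good summary is concise, accurate, and covers the key points.

Text: {context}

Summary A: {response_a}
Summary B: {response_b}

Which summary is better? Answer with a single letter, A or B."""


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to an off-the-shelf LLM."""
    # A real implementation would send `prompt` to a hosted model and
    # return its completion; a fixed answer keeps the sketch runnable.
    return "A"


def label_preference(context: str, response_a: str, response_b: str) -> PreferenceExample:
    """Ask the AI labeler which of two candidate responses it prefers."""
    prompt = PROMPT_TEMPLATE.format(
        context=context, response_a=response_a, response_b=response_b
    )
    answer = query_llm(prompt).strip().upper()
    preferred = "A" if answer.startswith("A") else "B"
    return PreferenceExample(context, response_a, response_b, preferred)


if __name__ == "__main__":
    example = label_preference(
        context="The quick brown fox jumps over the lazy dog.",
        response_a="A fox jumps over a dog.",
        response_b="An animal story.",
    )
    print(example.preferred)  # AI-chosen label, used in place of a human one
```

Labels collected this way can either be distilled into a reward model (the canonical RLAIF setup) or, as the abstract notes, the LLM can be prompted directly for reward scores during RL training.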