Keywords
Hyperparameter, Monotonic function, Trust region, Variety (cybernetics), Scheme (mathematics), Nonlinear system, Computer science, Mathematical optimization, Artificial neural network, Deep neural network, Artificial intelligence, Reinforcement learning, Optimization algorithm, Mathematics, Mathematical analysis, Physics, Quantum mechanics, Radius, Computer security
Authors
John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, Philipp Moritz
Source
Venue: International Conference on Machine Learning
Date: 2015-07-06
Pages: 1889-1897
Citations: 2928
Abstract
In this article, we describe a method for optimizing control policies with guaranteed monotonic improvement. By making several approximations to the theoretically justified scheme, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
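For reference, the "trust region" the abstract refers to is a KL-divergence constraint on the policy update. A brief sketch of the constrained problem solved at each TRPO iteration, written in the paper's notation (policy parameters \theta, old parameters \theta_{\mathrm{old}}):

\[
\max_{\theta} \; \mathbb{E}\!\left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, A_{\theta_{\mathrm{old}}}(s, a) \right]
\quad \text{subject to} \quad
\mathbb{E}\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\big\|\, \pi_{\theta}(\cdot \mid s) \big) \right] \le \delta,
\]

where \pi_{\theta} is the parameterized policy, A_{\theta_{\mathrm{old}}} is the advantage function under the current policy, and \delta is the trust-region radius, the main hyperparameter that the abstract says requires little tuning.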