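Markov decision process
Computer science
Task (project management)
Reinforcement learning
Function (biology)
Artificial intelligence
Machine learning
Markov process
Process (computing)
Mathematics
Engineering
Statistics
Systems engineering
Evolutionary biology
Biology
Operating system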
Authors
Pieter Abbeel,Andrew Y. Ng
Source
Venue: International Conference on Machine Learning
Date: 2004-01-01
Citations: 2753
Identifiers
DOI:10.1145/1015330.1015430
Abstract
We consider learning in a Markov decision process where we are not explicitly given a reward function, but where instead we can observe an expert demonstrating the task that we want to learn to perform. This setting is useful in applications (such as the task of driving) where it may be difficult to write down an explicit reward function specifying exactly how different desiderata should be traded off. We think of the expert as trying to maximize a reward function that is expressible as a linear combination of known features, and give an algorithm for learning the task demonstrated by the expert. Our algorithm is based on using "inverse reinforcement learning" to try to recover the unknown reward function. We show that our algorithm terminates in a small number of iterations, and that even though we may never recover the expert's reward function, the policy output by the algorithm will attain performance close to that of the expert, where here performance is measured with respect to the expert's unknown reward function.
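The abstract's modelling assumption, a reward that is a linear combination of known features, can be illustrated with a small sketch. If the reward is R(s) = w · φ(s), then the discounted value of a policy equals w · μ, where μ is that policy's discounted feature expectation, so a policy whose feature expectations are close to the expert's performs close to the expert under the unknown reward. The Python below is only an illustration under stated assumptions: the function names, the toy trajectories, the discount `gamma`, and the weight vector `w` are all hypothetical, and it does not implement the paper's iterative inverse reinforcement learning procedure.

```python
import math
from typing import List

Trajectory = List[List[float]]  # one feature vector phi(s_t) per time step


def feature_expectations(trajectories: List[Trajectory], gamma: float) -> List[float]:
    """Empirical estimate of mu = E[sum_t gamma^t * phi(s_t)], averaged over trajectories."""
    k = len(trajectories[0][0])
    mu = [0.0] * k
    for traj in trajectories:
        for t, phi in enumerate(traj):
            discount = gamma ** t
            for i in range(k):
                mu[i] += discount * phi[i]
    return [m / len(trajectories) for m in mu]


def linear_value(w: List[float], mu: List[float]) -> float:
    """Discounted value under the linear reward R(s) = w . phi(s), i.e. w . mu."""
    return sum(wi * mi for wi, mi in zip(w, mu))


# Toy demonstrations (made up for illustration): two features per state.
expert_trajs = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
]
learner_trajs = [
    [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
]
gamma = 0.5       # hypothetical discount factor
w = [0.6, -0.8]   # hypothetical reward weights, unknown to the learner in practice

mu_expert = feature_expectations(expert_trajs, gamma)
mu_learner = feature_expectations(learner_trajs, gamma)

# Performance gap measured under the (normally unknown) reward weights.
gap = linear_value(w, mu_expert) - linear_value(w, mu_learner)

# Cauchy-Schwarz bound: |w . (mu_E - mu_pi)| <= ||w||_2 * ||mu_E - mu_pi||_2
w_norm = math.sqrt(sum(wi * wi for wi in w))
mu_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(mu_expert, mu_learner)))
bound = w_norm * mu_dist

print(mu_expert, mu_learner, gap, bound)
```

The bound on the last lines holds for any weight vector, which is why matching feature expectations is enough to guarantee near-expert performance without ever identifying the expert's true reward.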