Markov decision process
Reinforcement learning
Computer science
Mathematical optimization
Markov chain
Function (biology)
Markov process
Function approximation
Zero (linguistics)
Algorithm
Approximation algorithm
Path (computing)
Mathematics
Artificial intelligence
Machine learning
Artificial neural network
Biology
Statistics
Evolutionary biology
Philosophy
Linguistics
Programming language
Authors
Anna Winnicki, R. Srikant
Identifier
DOI: 10.1109/cdc49753.2023.10384061
Abstract
Until recently, efficient, provably convergent policy iteration algorithms for zero-sum Markov games were unknown. As a result, model-based RL (reinforcement learning) algorithms for such problems could not use policy iteration in their planning modules. In an earlier paper, we showed that a convergent policy iteration algorithm can be obtained by using lookahead, a technique commonly used in RL. However, that algorithm could be applied in the function approximation setting only in the special case of linear MDPs (Markov decision processes). In this paper, we obtain performance bounds for policy-based RL algorithms in general settings, including one where policy evaluation is performed using noisy samples of (state, action, reward) triplets from a single sample path of a given policy.
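The abstract combines two ingredients: approximate policy evaluation from a single sample path of noisy (state, action, reward) triplets, and a policy improvement step that uses multi-step lookahead rather than a one-step greedy update. The sketch below is a simplified, single-agent illustration of those two ingredients on a small randomly generated MDP; it is not the authors' algorithm for zero-sum Markov games, and the MDP (P, R), the TD(0) evaluator, the lookahead depth, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9

# Hypothetical MDP used only for illustration: P[s, a, s'] transition
# probabilities and R[s, a] expected rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def td0_evaluate(policy, n_steps=20000, alpha=0.05):
    """Estimate V^pi from a single sample path of noisy (state, action, reward) triplets."""
    V = np.zeros(n_states)
    s = 0
    for _ in range(n_steps):
        a = policy[s]
        s_next = rng.choice(n_states, p=P[s, a])
        r = R[s, a] + rng.normal(scale=0.1)             # noisy reward observation
        V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update
        s = s_next
    return V

def lookahead_improve(V, depth=3):
    """Greedy policy w.r.t. a depth-step lookahead, bootstrapped with the estimate V."""
    def lv(s, d):
        # d-step lookahead value: apply the Bellman optimality operator d times,
        # then fall back to the (approximate) value estimate V at the leaves.
        if d == 0:
            return V[s]
        return max(R[s, a] + gamma * sum(P[s, a, s2] * lv(s2, d - 1)
                                         for s2 in range(n_states))
                   for a in range(n_actions))
    return np.array([
        int(np.argmax([R[s, a] + gamma * sum(P[s, a, s2] * lv(s2, depth - 1)
                                             for s2 in range(n_states))
                       for a in range(n_actions)]))
        for s in range(n_states)
    ])

policy = np.zeros(n_states, dtype=int)
for _ in range(5):                      # a few approximate policy-iteration rounds
    V = td0_evaluate(policy)
    policy = lookahead_improve(V)
print("greedy lookahead policy:", policy)
```

Per the abstract, it is the lookahead in the improvement step, rather than a plain one-step greedy update, that yields convergence of policy iteration in the earlier paper's setting; the sketch only shows where that step sits relative to sample-path policy evaluation.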