Keywords
Monte Carlo tree search, Regret, Markov decision process, Logarithm, Tree (set theory), Computer science, Mathematical optimization, Reinforcement learning, Context (archaeology), Bellman equation, Monte Carlo method, Multi-armed bandit, Markov process, Mathematics, Combinatorics, Artificial intelligence, Machine learning, Statistics, Biology, Mathematical analysis, Paleontology
Authors
Devavrat Shah, Qiaomin Xie, Zhi Xu
Identifier
DOI: 10.1145/3393691.3394202
Abstract
In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo Tree Search (MCTS), in the context of infinite-horizon discounted cost Markov Decision Process (MDP) with deterministic transitions. While MCTS is believed to provide an approximate value function for a given state with enough simulations, cf. [Kocsis and Szepesvari 2006; Kocsis et al. 2006], the claimed proof of this property is incomplete. This is due to the fact that the variant of MCTS, the Upper Confidence Bound for Trees (UCT), analyzed in prior works utilizes a "logarithmic" bonus term for balancing exploration and exploitation within the tree-based search, following the insights from the stochastic multi-armed bandit (MAB) literature, cf. [Agrawal 1995; Auer et al. 2002]. In effect, such an approach assumes that the regret of the underlying recursively dependent non-stationary MABs concentrates around its mean exponentially in the number of steps, which is unlikely to hold as pointed out in [Audibert et al. 2009], even for stationary MABs.
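For context, the "logarithmic" bonus term mentioned above is the UCB1-style exploration bonus used in the UCT rule of Kocsis and Szepesvari [2006]. A standard form of the node-level selection rule is sketched below; the notation (exploration constant c, visit counts N) is the conventional one from that literature, not taken from this paper's own statement:

$$
a_t \;=\; \arg\max_{a}\;\left[\, \hat{Q}(s,a) \;+\; c\,\sqrt{\frac{\ln N(s)}{N(s,a)}} \,\right],
$$

where $N(s)$ is the number of visits to node $s$, $N(s,a)$ is the number of times action $a$ has been selected at $s$, and $\hat{Q}(s,a)$ is the empirical mean of the returns observed after taking $a$ at $s$. The abstract's point is that the $\ln N(s)$ bonus is justified by regret-concentration arguments that need not hold for the non-stationary, recursively dependent bandits arising inside the search tree.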