Reinforcement learning
Bellman equation
Embedding
Computer science
Function (biology)
Variance (accounting)
Mathematical optimization
Value (mathematics)
Mathematics
Artificial intelligence
Machine learning
Economics
Evolutionary biology
Biology
Accounting
Authors
Jingliang Duan, Yang Guan, Shengbo Eben Li, Yangang Ren, Qi Sun, Bo Cheng
Source
Journal: IEEE Transactions on Neural Networks and Learning Systems
[Institute of Electrical and Electronics Engineers]
Date: 2021-06-09
Volume/Issue: 33 (11): 6584-6598
Citations: 119
Identifier
DOI: 10.1109/tnnls.2021.3082568
Abstract
In reinforcement learning (RL), function approximation errors are known to easily lead to Q-value overestimation, which greatly reduces policy performance. This article presents a distributional soft actor-critic (DSAC) algorithm, an off-policy RL method for continuous control settings, that improves policy performance by mitigating Q-value overestimation. We first show in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimation because it adaptively adjusts the update step size of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution while keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
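To make the abstract's idea concrete, below is a minimal sketch of a Gaussian return-distribution critic and its negative-log-likelihood loss, with the predicted standard deviation and the training target clamped so the variance stays within a reasonable range, as the abstract describes. The class name GaussianCritic, the clamp bounds, and the three-sigma target clipping are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of a distributional (Gaussian) critic; details are assumptions.
import torch
import torch.nn as nn


class GaussianCritic(nn.Module):
    """Maps (state, action) to the mean and log-std of the return distribution."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs [mean, log_std]
        )

    def forward(self, state, action):
        mean, log_std = self.net(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        # Bound the predicted log-std so gradients neither explode nor vanish
        # (the abstract's "reasonable range"; the bounds here are assumed).
        log_std = torch.clamp(log_std, min=-5.0, max=2.0)
        return mean, log_std


def critic_loss(critic, state, action, target_return, clip_sigmas: float = 3.0):
    """Negative log-likelihood of a clipped target return under the predicted
    Gaussian return distribution; the clipping radius is an assumption."""
    mean, log_std = critic(state, action)
    std = log_std.exp()
    # Clip the target so a single outlier return cannot blow up the variance.
    target = torch.clamp(
        target_return,
        min=(mean - clip_sigmas * std).detach(),
        max=(mean + clip_sigmas * std).detach(),
    )
    dist = torch.distributions.Normal(mean, std)
    return -dist.log_prob(target).mean()


if __name__ == "__main__":
    critic = GaussianCritic(state_dim=3, action_dim=1)
    s, a = torch.randn(32, 3), torch.randn(32, 1)
    y = torch.randn(32, 1)  # stand-in soft-return targets
    loss = critic_loss(critic, s, a, y)
    loss.backward()
    print(float(loss))
```

In a full maximum-entropy setup, the stand-in targets above would be replaced by soft-return targets (reward plus the discounted, entropy-regularized value of the next state-action pair); the sketch only isolates how learning a return distribution, rather than a point Q-value, lets the effective update step adapt to the predicted uncertainty.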