Reinforcement learning
Computer science
Regularization (linguistics)
Kullback-Leibler divergence
Artificial intelligence
Softmax function
Entropy (arrow of time)
Mathematical optimization
Machine learning
Mathematics
Deep learning
Physics
Quantum mechanics
Authors
Zhiwei Shang, Renxing Li, Chunhua Zheng, Huiyun Li, Yunduan Cui
Source
Journal: IEEE Transactions on Neural Networks and Learning Systems
[Institute of Electrical and Electronics Engineers]
Date: 2024-01-01
Pages: 1-11
Citations: 1
Identifiers
DOI: 10.1109/tnnls.2023.3329513
Abstract
In this article, a novel reinforcement learning (RL) approach, continuous dynamic policy programming (CDPP), is proposed to tackle the issues of both learning stability and sample efficiency in current RL methods with continuous actions. The proposed method naturally extends relative entropy regularization from the value-function-based framework to the actor-critic (AC) framework of deep deterministic policy gradient (DDPG) to stabilize the learning process in continuous action spaces. It tackles the intractable softmax operation over continuous actions in the critic by Monte Carlo estimation and explores the practical advantages of the Mellowmax operator. A Boltzmann sampling policy is proposed to guide the actor's exploration following the relative-entropy-regularized critic for superior learning capability, exploration efficiency, and robustness. Evaluated on several benchmark and real-robot-based simulation tasks, the proposed method demonstrates the positive impact of relative entropy regularization, including efficient exploration behavior and stable policy updates in RL with continuous action spaces, and outperforms the related baseline approaches in both sample efficiency and learning stability.
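The abstract names two concrete ingredients, the Mellowmax operator and a Monte Carlo treatment of the softmax over continuous actions, without giving their form here. As a rough, non-authoritative sketch only (not the paper's implementation), the two ideas might combine as follows; `q_fn`, `action_sampler`, and the toy critic are hypothetical placeholders introduced for illustration:

```python
import numpy as np

def mellowmax(q_values, omega=5.0):
    """Mellowmax operator: (1/omega) * log(mean(exp(omega * Q))).
    A smooth alternative to the hard max, computed with a stable log-mean-exp."""
    scaled = omega * np.asarray(q_values, dtype=float)
    m = scaled.max()
    return (m + np.log(np.exp(scaled - m).mean())) / omega

def mc_soft_value(q_fn, state, action_sampler, n_samples=64, omega=5.0):
    """Monte Carlo estimate of a soft state value over a continuous action space:
    sample candidate actions, evaluate the critic on each, aggregate with Mellowmax.
    q_fn and action_sampler are hypothetical placeholders, not the paper's API."""
    actions = action_sampler(n_samples)              # (n_samples, action_dim)
    q = np.array([q_fn(state, a) for a in actions])  # critic evaluations
    return mellowmax(q, omega)

# Toy usage: a quadratic critic on a 1-D action space with uniform action samples.
rng = np.random.default_rng(0)
toy_q = lambda s, a: -(a[0] - 0.3) ** 2              # peaks at a = 0.3
sampler = lambda n: rng.uniform(-1.0, 1.0, size=(n, 1))
print(mc_soft_value(toy_q, state=None, action_sampler=sampler))
```

Under the same assumptions, the Boltzmann sampling policy mentioned in the abstract would reweight such sampled actions with probabilities proportional to exp(omega * Q); the exact estimator and operator variant used in CDPP are specified in the paper itself.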