Journal: IEEE Transactions on Artificial Intelligence [Institute of Electrical and Electronics Engineers] · Date: 2024-03-20 · Volume/Issue: 5 (7): 3607-3623 · Citations: 1
Identifier
DOI: 10.1109/tai.2024.3379109
Abstract
Reinforcement Learning (RL) algorithms combined with deep learning architectures have achieved tremendous success in many practical applications. However, the policies obtained by many Deep Reinforcement Learning (DRL) algorithms are seen to suffer from high variance, making them less useful in safety-critical applications. In general, it is desirable to have algorithms that give low iterate variance while providing a high long-term reward. In this work, we consider the Actor-Critic paradigm, where the critic is responsible for evaluating the policy, while the feedback from the critic is used by the actor for updating the policy. The updates of both the critic and the actor in the standard Actor-Critic procedure are run concurrently until convergence. It has been previously observed that updating the actor once after every L > 1 steps of the critic reduces the iterate variance. In this paper, we address the question of what optimal L-value to use in the recursions and propose a data-driven L-update rule that runs concurrently with the actor-critic algorithm, with the objective of minimizing the variance of the infinite-horizon discounted reward. This update is based on a random search (discrete) parameter optimization procedure that incorporates smoothed functional (SF) estimates. We prove the convergence of our proposed multi-timescale scheme to the optimal L and optimal policy pair. Subsequently, through numerical evaluations on benchmark RL tasks, we demonstrate the advantages of our proposed algorithm over multiple state-of-the-art algorithms in the literature.
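To make the structural idea concrete, the sketch below illustrates an actor-critic loop in which the critic performs a TD(0) update on every step while the actor is updated only once every L critic steps. This is a minimal illustration on a hypothetical toy tabular MDP; the environment, dimensions, and step sizes are assumptions for demonstration, and the paper's data-driven, smoothed-functional rule for adapting L is not reproduced here.

```python
import numpy as np

# Hypothetical toy MDP (sizes and names are illustrative, not from the paper):
# tabular critic (state values) and softmax actor on a small random MDP.
rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
gamma = 0.95

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

V = np.zeros(n_states)                   # critic: state-value estimates
theta = np.zeros((n_states, n_actions))  # actor: softmax policy parameters

def policy(s):
    """Softmax policy over actions in state s."""
    logits = theta[s] - theta[s].max()
    p = np.exp(logits)
    return p / p.sum()

def step(s, a):
    """Sample the next state and reward of the toy MDP."""
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

L = 10         # number of critic updates per actor update (fixed here)
alpha_c = 0.1  # critic step size (faster timescale)
alpha_a = 0.01 # actor step size (slower timescale)

s = 0
for t in range(20000):
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    s_next, r = step(s, a)

    # Critic: TD(0) update runs on every step.
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha_c * td_error

    # Actor: policy-gradient-style update only once every L critic steps.
    if t % L == L - 1:
        grad_log = -p
        grad_log[a] += 1.0  # gradient of log softmax policy at (s, a)
        theta[s] += alpha_a * td_error * grad_log

    s = s_next
```

In the proposed algorithm, L itself would additionally be tuned online on a separate timescale; the sketch keeps L fixed to show only the "actor updates once per L critic steps" structure discussed in the abstract.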