Regret
Bounded function
Bounded rationality
Theory (learning stability)
Computer science
Game theory
Mathematical economics
Set (abstract data type)
Rational agent
Mathematics
Artificial intelligence
Machine learning
Mathematical analysis
Programming language
Authors
Stefanos Leonardos, Georgios Piliouras
Identifier
DOI:10.1016/j.artint.2021.103653
Abstract
Exploration-exploitation is a powerful and practical tool in multi-agent learning (MAL); however, its effects are far from understood. To make progress in this direction, we study a smooth analogue of Q-learning. We start by showing that our learning model has strong theoretical justification as an optimal model for studying exploration-exploitation. Specifically, we prove (1) that smooth Q-learning has bounded regret in arbitrary games for a cost model that explicitly balances game-rewards and exploration-costs, i.e., costs from testing potentially suboptimal actions, and (2) that it always converges to the set of quantal-response equilibria (QRE), the standard solution concept for games with bounded rationality, in arbitrary weighted potential games. In our main task, we then turn to measure the effect of exploration on collective system performance. We characterize the geometry of the QRE surface in low-dimensional MAL systems and link our findings with catastrophe (bifurcation) theory. In particular, as the exploration hyperparameter evolves over time, the system undergoes phase transitions where the number and stability of equilibria can change radically given an infinitesimal change to the exploration parameter. Based on this, we provide a formal theoretical treatment of how tuning the exploration parameter can provably lead to equilibrium selection with both positive as well as negative (and potentially unbounded) effects on system performance.
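To make the abstract's setup concrete, the following is a minimal sketch of Boltzmann ("smooth") Q-learning in a 2x2 coordination game, a simple potential game. It is not the authors' exact dynamics from the paper; the payoff matrix, temperature `T`, learning rate `lr`, and iteration count are illustrative choices. At a logit quantal-response equilibrium, each player's mixed strategy is a fixed point of the softmax of its expected payoffs, which the sketch checks numerically.

```python
import numpy as np

def softmax(q, temp):
    """Boltzmann distribution over Q-values at temperature `temp`."""
    z = np.exp((q - q.max()) / temp)  # shift by max for numerical stability
    return z / z.sum()

# Hypothetical 2x2 coordination game (a weighted potential game):
# both players prefer matching actions; action 0 pays more.
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])  # row player's payoffs
B = A.T                     # column player's payoffs (symmetric game)

T = 0.5    # exploration temperature (illustrative)
lr = 0.1   # learning rate (illustrative)

qx = np.zeros(2)  # row player's Q-values
qy = np.zeros(2)  # column player's Q-values
for _ in range(5000):
    x = softmax(qx, T)            # mixed strategies via Boltzmann exploration
    y = softmax(qy, T)
    qx += lr * (A @ y - qx)       # move Q toward expected payoff vs. opponent
    qy += lr * (B @ x - qy)

x = softmax(qx, T)
y = softmax(qy, T)

# QRE (logit equilibrium) fixed-point condition: x = softmax(A @ y / T).
residual = np.abs(x - softmax(A @ y, T)).max()
```

In this run the dynamics settle near the smoothed payoff-dominant equilibrium; per the abstract, varying `T` can change the number and stability of such QRE fixed points.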