Convergent multiple-timescales reinforcement learning algorithms in normal form games

David S. Leslie and E. J. Collins
We consider reinforcement learning algorithms in normal form games. Using two-timescales stochastic approximation we introduce a model-free algorithm which is asymptotically equivalent to smooth fictitious play, since both result in asymptotic pseudotrajectories to the flow defined by the smooth best response dynamics. Both of these algorithms are shown to converge almost surely to the Nash distribution in two-player zero-sum games and N-player partnership games. However there are simple games for which these, and most other adaptive processes, fail to converge - in particular we consider the N-player matching pennies game and Shapley's variant of the rock-scissors-paper game. By extending stochastic approximation results to multiple timescales we can allow each player to learn at a different rate. This extension will converge for two-player zero-sum games and two-player partnership games. It will also converge for the two special cases we consider.

Some key words:
Stochastic approximation, reinforcement learning, repeated normal form games, best response dynamics.