Reward Biased Maximum Likelihood Estimation for Reinforcement Learning
The principle of Reward-Biased Maximum Likelihood Estimation (RBMLE) for adaptive control, proposed in Kumar and Becker (1982), is an alternative to the Upper Confidence Bound (UCB) approach (Lai and Robbins, 1985) for implementing what is now known as "optimism in the face of uncertainty" (Auer et al., 2002). It employs a modified maximum likelihood estimate that is biased towards those Markov Decision Process (MDP) models yielding a higher average reward. However, its regret performance has never been analyzed for reinforcement learning (RL; Sutton and Barto, 1998) tasks involving the optimal control of unknown MDPs. We show that RBMLE attains a learning regret of O(log T), where T is the time horizon, matching state-of-the-art algorithms. It thus provides an alternative general-purpose method for solving RL problems.
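To make the principle concrete, below is a minimal, hypothetical sketch of RBMLE in the simplest RL setting, a Bernoulli multi-armed bandit. The function names, the grid-search maximization, and the log t bias schedule alpha(t) are illustrative assumptions, not the paper's algorithm (which treats general unknown MDPs): each arm's index is the gain in log-likelihood obtained by biasing that arm's estimate towards higher reward, and the arm with the largest gain is played.

```python
import numpy as np

def rbmle_index(successes, pulls, alpha):
    """RBMLE index for one Bernoulli arm: the increase in log-likelihood
    achievable when the arm's estimate is biased towards higher reward.
    A grid search keeps the sketch transparent; a closed form also exists."""
    if pulls == 0:
        return np.inf  # force initial exploration of unplayed arms
    p_hat = successes / pulls
    grid = np.linspace(1e-6, 1 - 1e-6, 1000)
    log_lik = successes * np.log(grid) + (pulls - successes) * np.log(1 - grid)
    biased = log_lik + alpha * grid  # reward-biased log-likelihood
    mle_ll = (successes * np.log(max(p_hat, 1e-6))
              + (pulls - successes) * np.log(max(1 - p_hat, 1e-6)))
    return biased.max() - mle_ll  # bias gain for this arm (nonnegative)

def rbmle_bandit(true_means, horizon, rng=np.random.default_rng(0)):
    """Run RBMLE on a Bernoulli bandit with the assumed alpha(t) = log(t+1)."""
    k = len(true_means)
    successes = np.zeros(k)
    pulls = np.zeros(k)
    for t in range(1, horizon + 1):
        alpha = np.log(t + 1)  # slowly growing reward bias
        indices = [rbmle_index(successes[i], pulls[i], alpha) for i in range(k)]
        arm = int(np.argmax(indices))
        reward = rng.random() < true_means[arm]
        successes[arm] += reward
        pulls[arm] += 1
    return pulls

# Most pulls should concentrate on the best arm (mean 0.7) as T grows.
print(rbmle_bandit([0.3, 0.5, 0.7], horizon=2000))
```

The key design point, under these assumptions, is that the bias term alpha(t) grows slowly relative to the data: the likelihood term eventually dominates for well-sampled arms, so the bias induces just enough optimism to keep exploring, which is the mechanism behind the O(log T) regret claimed above.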