Reinforcement Learning: a Comparison of UCB Versus Alternative Adaptive Policies

09/13/2019
by Wesley Cowan, et al.

In this paper we consider the basic version of Reinforcement Learning (RL) that involves computing optimal data-driven (adaptive) policies for Markov decision processes with unknown transition probabilities. We provide a brief survey of the state of the art in the area, and we compare the performance of the classic UCB policy of bkmdp97 with a new policy developed herein, which we call MDP-Deterministic Minimum Empirical Divergence (MDP-DMED), and with a method based on posterior sampling (MDP-PS).
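To make the UCB idea concrete, here is a minimal sketch of a UCB1-style index policy in the simpler multi-armed bandit setting (not the MDP variant studied in the paper): each arm is scored by its empirical mean plus an optimism bonus that shrinks as the arm is sampled more. The Bernoulli arm means and the horizon below are illustrative assumptions, not values from the paper.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run a UCB1-style policy on Bernoulli arms; return per-arm play counts.

    arm_means: true success probabilities (used only to simulate rewards).
    horizon:   total number of pulls.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # times each arm was pulled
    sums = [0.0] * n_arms   # cumulative reward per arm

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull each arm once
        else:
            # index = empirical mean + exploration bonus sqrt(2 ln t / n_a)
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = ucb1([0.3, 0.7], horizon=2000)
```

The bonus term guarantees every arm keeps a chance of being revisited, while play concentrates on empirically better arms as evidence accumulates; the MDP policies compared in the paper apply analogous optimism or sampling ideas to state-action pairs rather than independent arms.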
