Robust Batch Policy Learning in Markov Decision Processes

11/09/2020
by   Zhengling Qi, et al.
0

We study the sequential decision making problem in Markov decision process (MDP) where each policy is evaluated by a set containing average rewards over different horizon lengths and with different initial distributions. Given a pre-collected dataset of multiple trajectories generated by some behavior policy, our goal is to learn a robust policy in a pre-specified policy class that can maximize the smallest value of this set. Leveraging the semi-parametric efficiency theory from statistics, we develop a policy learning method for estimating the defined robust optimal policy that can efficiently break the curse of horizon under mild technical conditions. A rate-optimal regret bound up to a logarithmic factor is established in terms of the number of trajectories and the number of decision points.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset