Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes

08/22/2019
by Nathan Kallus, et al.

Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of q-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness efficiently.
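The Python sketch below shows one way the two estimated components could be combined. It assumes a finite-horizon, undiscounted MDP and a doubly robust estimate of the form (1/n) Σ_i Σ_t [ μ_t (r_t − q_t(s_t, a_t)) + μ_{t−1} v_t(s_t) ], with the convention μ_{−1} = 1. Here μ_t is the marginalized density ratio of (s_t, a_t) and v_t(s) = Σ_a π_e(a | s) q_t(s, a). This exact form is an assumption for illustration and is not quoted from the abstract. Cross-fold estimation is modeled with a hypothetical user-supplied callback `fit_and_predict` that fits q and μ on the other folds. All function names are illustrative.

```python
import numpy as np


def drl_per_trajectory(rewards, q_sa, v_s, mu):
    """Per-trajectory terms of a DRL-style estimate (illustrative sketch).

    All inputs have shape (n, H): n logged trajectories, H time steps.
      rewards[i, t]: observed reward r_t
      q_sa[i, t]:    estimated q_t(s_t, a_t) of the evaluation policy
      v_s[i, t]:     estimated v_t(s_t) = sum_a pi_e(a | s_t) * q_t(s_t, a)
      mu[i, t]:      estimated marginal density ratio of (s_t, a_t), evaluation
                     vs. behavior state-action distribution at step t
    """
    rewards, q_sa, v_s, mu = (
        np.asarray(x, dtype=float) for x in (rewards, q_sa, v_s, mu)
    )
    n = rewards.shape[0]
    # Shift mu by one step with the convention mu_{-1} = 1, so the t = 0
    # term contributes v_0(s_0), the direct-method baseline.
    mu_prev = np.concatenate([np.ones((n, 1)), mu[:, :-1]], axis=1)
    # Sum over t of mu_t * (r_t - q_t) + mu_{t-1} * v_t for each trajectory.
    return (mu * (rewards - q_sa) + mu_prev * v_s).sum(axis=1)


def drl_estimate(rewards, q_sa, v_s, mu):
    """Average of the per-trajectory terms: the value estimate."""
    return float(drl_per_trajectory(rewards, q_sa, v_s, mu).mean())


def cross_fit_drl(rewards, fit_and_predict, n_folds=2, seed=0):
    """Cross-fold version: each fold's nuisances are fit on the other folds.

    fit_and_predict(train_idx, eval_idx) is a hypothetical callback that must
    return (q_sa, v_s, mu) for the rows in eval_idx, fitting q and mu using
    only the trajectories in train_idx. Requires n_folds >= 2.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = rewards.shape[0]
    # Randomly partition trajectory indices into n_folds disjoint folds.
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    per_traj = np.empty(n)
    for k, eval_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        q_sa, v_s, mu = fit_and_predict(train_idx, eval_idx)
        per_traj[eval_idx] = drl_per_trajectory(rewards[eval_idx], q_sa, v_s, mu)
    return float(per_traj.mean())
```

The two limiting cases show the doubly robust structure. Setting q and v to zero leaves the marginalized importance-sampling estimate, the mean of Σ_t μ_t r_t. Setting μ to zero leaves the direct-method estimate, the mean of v_0(s_0). The full expression corrects each of these with the other.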

09/12/2019

Efficiently Breaking the Curse of Horizon: Double Reinforcement Learning in Infinite-Horizon Processes

Off-policy evaluation (OPE) in reinforcement learning is notoriously dif...
09/08/2022

Double Q-Learning for Citizen Relocation During Natural Hazards

Natural disasters can cause substantial negative socio-economic impacts ...
12/12/2016

Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes

This paper presents a new method to learn online policies in continuous ...
06/09/2019

Intrinsically Efficient, Stable, and Bounded Off-Policy Evaluation for Reinforcement Learning

Off-policy evaluation (OPE) in both contextual bandits and reinforcement...
02/08/2016

Data-Efficient Reinforcement Learning in Continuous-State POMDPs

We present a data-efficient reinforcement learning algorithm resistant t...
10/14/2021

Learning When and What to Ask: a Hierarchical Reinforcement Learning Framework

Reliable AI agents should be mindful of the limits of their knowledge an...
04/14/2020

Extrapolation in Gridworld Markov-Decision Processes

Extrapolation in reinforcement learning is the ability to generalize at ...
