Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning

02/19/2022
by Nathan Kallus et al.

Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where experimentation is necessarily limited. OPE/L is nonetheless sensitive to discrepancies between the data-generating environment and the environment where policies are deployed. Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting, whose regret rates may deteriorate if propensities are estimated and whose variance is suboptimal even when they are known. For vanilla OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR^2OPE) and prove its semiparametric efficiency under weak product rate conditions. Notably, thanks to a localization technique, LDR^2OPE only requires fitting a small number of regressions, just like DR methods for vanilla OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR^2OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of 𝒪(N^-1/2) even when unknown propensities are nonparametrically estimated. We further extend our results to general f-divergence uncertainty sets. We illustrate the advantage of our algorithms in simulations.
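For orientation, the KL-constrained worst-case value that DROPE targets admits a standard dual representation: the worst-case expected reward over distributions within KL radius delta of the logging distribution equals sup over alpha > 0 of -alpha * log E[w * exp(-R/alpha)] - alpha * delta, where w is the importance weight of the target policy over the logging policy. The sketch below is a plain inverse-propensity-weighted plug-in estimator of that dual, i.e., the baseline approach the paper improves upon, not the proposed LDR^2OPE/CDR^2OPL estimators. The function name, the search bounds on alpha, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dro_value_ipw(rewards, prop_behavior, prop_target, delta):
    """IPW plug-in estimate of the KL-constrained worst-case policy value.

    Uses the standard KL-DRO dual (a sketch, not the paper's DR estimator):
        V_delta = sup_{alpha > 0} -alpha * log E[w * exp(-R/alpha)] - alpha * delta,
    with importance weights w = pi(a|x) / pi_0(a|x) and KL radius delta.
    """
    w = np.asarray(prop_target, dtype=float) / np.asarray(prop_behavior, dtype=float)
    r = np.asarray(rewards, dtype=float)

    def neg_dual(alpha):
        # Numerically stable log E[w * exp(-r/alpha)] via log-sum-exp.
        log_terms = np.log(w) - r / alpha
        m = log_terms.max()
        log_mean = m + np.log(np.mean(np.exp(log_terms - m)))
        # Negate the dual objective so that minimization maximizes it.
        return -(-alpha * log_mean - alpha * delta)

    # Bounded one-dimensional search over alpha (bounds are an assumption).
    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

# Usage on synthetic logged bandit data (hypothetical numbers):
rng = np.random.default_rng(0)
n = 5000
rewards = rng.uniform(0.0, 1.0, size=n)
prop_behavior = np.full(n, 0.5)              # logging-policy propensities pi_0(a|x)
prop_target = rng.uniform(0.3, 0.7, size=n)  # target-policy propensities pi(a|x)
print(dro_value_ipw(rewards, prop_behavior, prop_target, delta=0.1))
```

Because this estimator relies solely on estimated propensities, its accuracy degrades when those estimates are noisy, which is precisely the motivation for the doubly robust constructions described in the abstract.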
