Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm
In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^2 > 1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q \in [\gamma^2, 1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$, and any algorithm requires $\Omega\left(\frac{d}{\gamma^2(q-\gamma^2)\varepsilon^2}\exp\left(\Theta\left(d\gamma^2\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that this lower bound on the sample complexity is exponential in $d$. If $q = \gamma^2$, even an infinite amount of data does not suffice. Under a low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{\frac{\|\theta^{\pi}\|_2^4}{\varepsilon^4}\log\frac{d}{\delta},\ \frac{1}{\varepsilon^2}\left(d+\log\frac{1}{\delta}\right)\right\}\right)$ samples, where $\theta^{\pi}$ is the parameter vector of the target policy's value function under the linear function approximation, and that it approximates the value function up to an additive error $\varepsilon$ with probability at least $1-\delta$.
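The abstract does not spell out the algorithm, so the snippet below is only an illustrative sketch (not the estimator analyzed in the paper) of a standard LSTD-style least-squares approach to off-policy evaluation with linear features. The names `phi`, `pi`, and the transition format are hypothetical placeholders, the ridge term merely keeps the $d \times d$ system invertible, and the sample complexity quoted above additionally relies on the low distribution shift assumption.

```python
# Illustrative sketch only: a generic LSTD-style least-squares estimator for
# off-policy evaluation with linear function approximation. This is NOT the
# algorithm analyzed in the paper; `phi`, `pi`, and the dataset format are
# hypothetical placeholders.
import numpy as np

def lstd_ope(transitions, phi, pi, d, gamma, ridge=1e-6):
    """Estimate theta so that Q^pi(s, a) is approximated by phi(s, a) @ theta.

    transitions : list of (s, a, r, s_next) tuples collected offline
    phi         : feature map, phi(s, a) -> np.ndarray of shape (d,)
    pi          : target policy, pi(s) -> iterable of (action, probability)
    d           : feature dimension
    gamma       : discount factor in [0, 1)
    ridge       : small regularizer keeping the linear system well-posed
    """
    A = ridge * np.eye(d)
    b = np.zeros(d)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        # Expected next-state-action feature under the target policy pi.
        f_next = sum(p * phi(s_next, a_next) for a_next, p in pi(s_next))
        # Empirical version of the Bellman-consistency condition in feature
        # space: E[phi (phi - gamma * phi')^T] theta = E[r phi].
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)

def value_estimate(theta, phi, pi, s):
    """V^pi(s) = sum_a pi(a|s) * phi(s, a) @ theta."""
    return sum(p * (phi(s, a) @ theta) for a, p in pi(s))
```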