Adaptive Lambda Least-Squares Temporal Difference Learning

12/30/2016
by Timothy A. Mann, et al.

Temporal Difference learning, TD(λ), is a fundamental algorithm in the field of reinforcement learning. However, setting TD's λ parameter, which controls the timescale of TD updates, is generally left up to the practitioner. We formalize the λ selection problem as a bias-variance trade-off where the solution is the value of λ that leads to the smallest Mean Squared Value Error (MSVE). To solve this trade-off, we propose applying Leave-One-Trajectory-Out Cross-Validation (LOTO-CV) to search the space of λ values. Unfortunately, a naïve implementation of this approach is too computationally expensive for most practical applications. For Least-Squares TD (LSTD), we show that LOTO-CV can be implemented efficiently to automatically tune λ, and we apply function optimization methods to search the space of λ values. The resulting algorithm, ALLSTD, is parameter-free, and our experiments demonstrate that it is significantly faster than the naïve LOTO-CV implementation while achieving similar performance.
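To make the λ selection problem concrete, here is a minimal sketch of the naïve LOTO-CV baseline the abstract refers to: fit LSTD(λ) on all trajectories but one, score the held-out trajectory's value predictions, and pick the λ with the smallest held-out error. This is illustrative code, not the paper's efficient ALLSTD algorithm; the function names (`lstd_lambda`, `loto_cv_lambda`), the feature map `phi`, and the use of Monte Carlo returns as held-out value targets are our assumptions.

```python
import numpy as np

def lstd_lambda(trajectories, phi, gamma, lam, ridge=1e-6):
    """Fit LSTD(lambda) weights. Each trajectory is a list of
    (state, reward, next_state) tuples; phi maps a state to a
    feature vector (and should return zeros for terminal states)."""
    d = len(phi(trajectories[0][0][0]))
    A = ridge * np.eye(d)              # small ridge term keeps A invertible
    b = np.zeros(d)
    for traj in trajectories:
        z = np.zeros(d)                # eligibility trace, reset per trajectory
        for s, r, s_next in traj:
            f, f_next = phi(s), phi(s_next)
            z = gamma * lam * z + f    # accumulate the trace
            A += np.outer(z, f - gamma * f_next)
            b += z * r
    return np.linalg.solve(A, b)       # weights w with V(s) ~= phi(s) @ w

def loto_cv_lambda(trajectories, phi, gamma, lambdas, returns):
    """Naive LOTO-CV: refit LSTD(lambda) with each trajectory held
    out and score predictions against that trajectory's observed
    returns (returns[i][t] = Monte Carlo return from step t)."""
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        err = 0.0
        for i in range(len(trajectories)):
            train = trajectories[:i] + trajectories[i + 1:]
            w = lstd_lambda(train, phi, gamma, lam)
            for (s, _, _), g in zip(trajectories[i], returns[i]):
                err += (phi(s) @ w - g) ** 2
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

Note the cost that motivates the paper: with K trajectories and a grid of m candidate λ values, this baseline performs K × m full LSTD solves, which is exactly the expense ALLSTD is designed to avoid.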
