Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds
We study regret guarantees for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. We identify a key property of the EntRM, the monotonicity-preserving property, which enables a risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, one model-free and one model-based. We prove that both attain an 𝒪̃(((exp(|β|H)-1)/(|β|H)) · H√(HS^2AT)) regret upper bound, where S is the number of states, A the number of actions, H the time horizon, and T the total number of time steps. This matches the bound of RSVI2 proposed in <cit.>, with a much simpler regret analysis. To the best of our knowledge, this is the first regret analysis of DRL, and it bridges DRL and RSRL in terms of sample complexity. Finally, we improve the existing lower bound by proving a tighter bound of Ω(((exp(βH/6)-1)/(βH)) · H√(SAT)) for the case β>0, which recovers the tight lower bound Ω(H√(SAT)) in the risk-neutral setting.
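The EntRM objective referenced above is the standard entropic risk measure, EntRM_β(X) = (1/β) log E[exp(βX)]. As a minimal numerical sketch (not the paper's algorithm), the snippet below estimates this quantity from Monte Carlo samples of the return; the function name `entropic_risk` and the sampling setup are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def entropic_risk(returns, beta):
    """Empirical entropic risk measure (EntRM) of sampled returns.

    EntRM_beta(X) = (1/beta) * log E[exp(beta * X)], estimated from samples.
    beta < 0 is risk-averse, beta > 0 is risk-seeking, and beta -> 0
    recovers the ordinary expectation (risk-neutral objective).
    """
    returns = np.asarray(returns, dtype=float)
    if beta == 0.0:
        return returns.mean()  # risk-neutral limit
    n = returns.size
    # log-sum-exp for numerical stability: log((1/n) * sum_i exp(beta * x_i))
    return (logsumexp(beta * returns) - np.log(n)) / beta

# Example: under a risk-averse beta, a high-variance return distribution
# is valued strictly below a low-variance one with the same mean.
rng = np.random.default_rng(0)
safe = rng.normal(1.0, 0.1, size=10_000)
risky = rng.normal(1.0, 2.0, size=10_000)
print(entropic_risk(safe, beta=-1.0))   # close to 1.0
print(entropic_risk(risky, beta=-1.0))  # noticeably below 1.0
```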