Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward

05/04/2023
by Washim Uddin Mondal, et al.

We investigate an infinite-horizon average-reward Markov Decision Process (MDP) with delayed, composite, and partially anonymous reward feedback. Delay and compositeness mean that the reward generated by taking an action at a given state is fragmented into components that are realized sequentially at delayed time instances. Partial anonymity means that, for each state, the learner observes only the aggregate of past reward components that were generated by different actions taken at that state and happen to be realized at the observation instant; it cannot attribute the components to the actions that produced them. We propose an algorithm named DUCRL2 that obtains a near-optimal policy for this setting and show that it achieves a regret bound of $\tilde{\mathcal{O}}(DS\sqrt{AT} + d(SA)^3)$, where S and A are the sizes of the state and action spaces, respectively, D is the diameter of the MDP, d is a parameter upper bounded by the maximum reward delay, and T denotes the time horizon. The bound is order-optimal in T, and the delay enters only as an additive term.
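To make the feedback model concrete, here is a minimal Python sketch of how delayed, composite, partially anonymous rewards might be simulated. Everything here is an illustrative assumption rather than the paper's construction: the random fragmentation of the reward, the delay bound MAX_DELAY, and the placeholder policy and transitions.

```python
import random
from collections import defaultdict

N_STATES, N_ACTIONS = 3, 2  # toy sizes, purely for illustration
MAX_DELAY = 5               # assumed bound on reward delay; the paper's
                            # parameter d is upper bounded by the max delay

def composite_reward(t, rng):
    """Fragment an (unobserved) instantaneous reward into delayed components."""
    total = rng.random()  # stand-in for the true reward of (state, action)
    n_parts = rng.randint(1, MAX_DELAY)
    cuts = sorted(rng.random() for _ in range(n_parts - 1))
    fractions = [b - a for a, b in zip([0.0] + cuts, cuts + [1.0])]
    # each component is realized at a delayed future instant t + delay
    return [(t + rng.randint(0, MAX_DELAY), f * total) for f in fractions]

def run(horizon=10, seed=0):
    rng = random.Random(seed)
    # pending[(s, t)] accumulates components generated at state s (by any
    # action) that will be realized at time t
    pending = defaultdict(float)
    state = 0
    for t in range(horizon):
        action = rng.randrange(N_ACTIONS)  # placeholder policy
        for t_realized, part in composite_reward(t, rng):
            pending[(state, t_realized)] += part
        # partially anonymous feedback: per state, only the aggregate of
        # components realized now is observed, not which actions produced them
        observed = {s: pending.pop((s, t), 0.0) for s in range(N_STATES)}
        print(f"t={t} observed per-state aggregates: "
              + " ".join(f"{s}:{r:.3f}" for s, r in observed.items()))
        state = rng.randrange(N_STATES)  # placeholder transition

if __name__ == "__main__":
    run()
```

The key point the sketch captures is that the learner's observation at time t is a per-state sum of components whose generating actions (and generation times) are hidden, which is exactly what makes credit assignment harder than in the standard delayed-reward setting.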
