Delayed Feedback in Generalised Linear Bandits Revisited

07/21/2022
by   Benjamin Howson, et al.

The stochastic generalised linear bandit is a well-understood model for sequential decision-making, with many algorithms achieving near-optimal regret guarantees under immediate feedback. However, in many real-world settings the requirement that the reward be observed immediately does not hold, and in this setting standard algorithms are no longer theoretically understood. We study the phenomenon of delayed rewards theoretically by introducing a stochastic delay between selecting an action and receiving the reward. We then show that an algorithm based on the optimistic principle improves on existing approaches for this setting: it eliminates the need for prior knowledge of the delay distribution and relaxes assumptions on both the decision set and the delays. This also improves the regret guarantee from O(√(dT)√(d + 𝔼[τ])) to O(d√(T) + d^(3/2)𝔼[τ]), where 𝔼[τ] denotes the expected delay, d the dimension, and T the time horizon, with logarithmic terms suppressed. We verify our theoretical results through experiments on simulated data.
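To illustrate the setting the abstract describes, the following sketch simulates an optimistic (LinUCB-style) learner whose rewards arrive after a stochastic delay, in the linear special case of the model. All names, constants, and the geometric delay distribution are illustrative assumptions, not the paper's construction or its exact algorithm.

```python
import numpy as np

def delayed_linucb(T=500, d=3, K=10, mean_delay=5.0, lam=1.0, beta=1.0, seed=0):
    """Sketch: optimism with delayed feedback in a linear bandit.

    The reward for the action chosen at round t arrives after a
    geometric delay; the learner updates its least-squares estimate
    only with rewards that have already arrived. Returns cumulative
    regret. All parameter values here are illustrative.
    """
    rng = np.random.default_rng(seed)
    theta_star = rng.normal(size=d)
    theta_star /= np.linalg.norm(theta_star)

    V = lam * np.eye(d)   # regularised design matrix
    b = np.zeros(d)       # sum of x * r over arrived rewards
    pending = []          # (arrival_round, action, reward)
    regret = 0.0

    for t in range(T):
        # incorporate rewards whose delay has elapsed
        arrived = [p for p in pending if p[0] <= t]
        pending = [p for p in pending if p[0] > t]
        for _, x, r in arrived:
            V += np.outer(x, x)
            b += r * x

        theta_hat = np.linalg.solve(V, b)
        V_inv = np.linalg.inv(V)

        # a fresh random decision set of unit vectors each round
        arms = rng.normal(size=(K, d))
        arms /= np.linalg.norm(arms, axis=1, keepdims=True)

        # optimistic index: estimated reward plus exploration bonus
        bonus = np.sqrt(np.einsum("kd,de,ke->k", arms, V_inv, arms))
        x = arms[np.argmax(arms @ theta_hat + beta * bonus)]

        # reward is queued and only observed after a random delay
        reward = x @ theta_star + 0.1 * rng.normal()
        delay = rng.geometric(1.0 / mean_delay)  # E[delay] = mean_delay
        pending.append((t + delay, x, reward))

        regret += np.max(arms @ theta_star) - x @ theta_star

    return regret
```

Note that the learner needs no knowledge of the delay distribution: it simply applies updates as rewards arrive, which mirrors the delay-agnostic property the abstract highlights.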
