Minimax Regret for Cascading Bandits

03/23/2022
by Daniel Vial, et al.

Cascading bandits model the task of learning to rank K out of L items over n rounds of partial feedback. For this model, the minimax (i.e., gap-free) regret is poorly understood; in particular, the best known lower and upper bounds are Ω(√(nL/K)) and Õ(√(nLK)), respectively. We improve the lower bound to Ω(√(nL)) and show CascadeKL-UCB (which ranks items by their KL-UCB indices) attains it up to log terms. Surprisingly, we also show CascadeUCB1 (which ranks via UCB1) can suffer suboptimal Ω(√(nLK)) regret. This sharply contrasts with standard L-armed bandits, where the corresponding algorithms both achieve the minimax regret √(nL) (up to log terms), and the main advantage of KL-UCB is only to improve constants in the gap-dependent bounds. In essence, this contrast occurs because Pinsker's inequality is tight for hard problems in the L-armed case but loose (by a factor of K) in the cascading case.
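
To make the setting concrete, below is a minimal Python sketch of the cascading bandit protocol with the two index policies the abstract compares. It is an illustration under common conventions from the cascading-bandits literature, not the paper's exact specification: the UCB1 confidence radius √(1.5 log t / T_i) and the plain log t exploration budget for KL-UCB are standard choices whose constants may differ from the paper's, and all helper names (kl_ucb_index, ucb1_index, run_cascade) are hypothetical.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, pulls, t):
    """KL-UCB index: largest q with pulls * kl(mu_hat, q) <= log t, via bisection."""
    if pulls == 0:
        return 1.0
    budget = np.log(max(t, 2))
    lo, hi = mu_hat, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2.0
        if pulls * kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def ucb1_index(mu_hat, pulls, t):
    """UCB1 index with a sqrt(1.5 * log t / pulls) confidence radius."""
    if pulls == 0:
        return 1.0
    return min(1.0, mu_hat + np.sqrt(1.5 * np.log(max(t, 2)) / pulls))

def run_cascade(w, K, n, index_fn, seed=0):
    """Simulate n rounds of a cascading bandit; return cumulative expected regret.

    w: length-L vector of item attraction probabilities.
    Each round: show the K highest-index items; the user scans top-down and
    clicks the first attractive item, so only the prefix up to (and including)
    the first click yields Bernoulli feedback -- the model's partial feedback.
    """
    rng = np.random.default_rng(seed)
    L = len(w)
    pulls = np.zeros(L)
    means = np.zeros(L)
    opt = 1.0 - np.prod(1.0 - np.sort(w)[::-1][:K])  # best list's click probability
    regret = 0.0
    for t in range(1, n + 1):
        idx = np.array([index_fn(means[i], pulls[i], t) for i in range(L)])
        ranked = np.argsort(-idx)[:K]            # rank the K highest-index items
        attract = rng.random(K) < w[ranked]      # attraction draws down the list
        first = int(np.argmax(attract)) if attract.any() else K - 1
        for pos in range(first + 1):             # observed prefix of the list
            i = ranked[pos]
            pulls[i] += 1
            means[i] += (attract[pos] - means[i]) / pulls[i]
        regret += opt - (1.0 - np.prod(1.0 - w[ranked]))
    return regret

# Example: L = 10 items, K = 2 positions, means clustered near 1/2.
w = np.array([0.50, 0.50] + [0.45] * 8)
for fn in (kl_ucb_index, ucb1_index):
    print(fn.__name__, run_cascade(w, K=2, n=20000, index_fn=fn))
```

The only difference between the two policies is the index: CascadeUCB1 inflates the empirical mean by a Hoeffding-type radius, while CascadeKL-UCB inverts the exact Bernoulli KL via bisection. Recall Pinsker's inequality, kl(p, q) ≥ 2(p − q)²; it is near-tight for means around 1/2 but loose near 0 and 1, where the exact-KL intervals are much narrower, which is the mechanism behind the factor-of-K gap the abstract describes.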
