Minimax Regret for Cascading Bandits

03/23/2022
by Daniel Vial, et al.

Cascading bandits model the task of learning to rank K out of L items over n rounds of partial feedback. For this model, the minimax (i.e., gap-free) regret is poorly understood; in particular, the best known lower and upper bounds are Ω(√(nL/K)) and Õ(√(nLK)), respectively. We improve the lower bound to Ω(√(nL)) and show CascadeKL-UCB (which ranks items by their KL-UCB indices) attains it up to log terms. Surprisingly, we also show CascadeUCB1 (which ranks via UCB1) can suffer suboptimal Ω(√(nLK)) regret. This sharply contrasts with standard L-armed bandits, where the corresponding algorithms both achieve the minimax regret √(nL) (up to log terms), and the main advantage of KL-UCB is only to improve constants in the gap-dependent bounds. In essence, this contrast occurs because Pinsker's inequality is tight for hard problems in the L-armed case but loose (by a factor of K) in the cascading case.
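
To make the setting concrete, below is a minimal Python sketch of the cascading bandit protocol with the two index policies the abstract compares. It is an illustration under common conventions from the cascading-bandits literature, not the paper's exact specification: the UCB1 confidence radius √(1.5 log t / T_i) and the plain log t exploration budget for KL-UCB are standard choices whose constants may differ from the paper's, and all helper names (kl_ucb_index, ucb1_index, run_cascade) are hypothetical.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, pulls, t):
    """KL-UCB index: largest q with pulls * kl(mu_hat, q) <= log t, via bisection."""
    if pulls == 0:
        return 1.0
    budget = np.log(max(t, 2))
    lo, hi = mu_hat, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2.0
        if pulls * kl_bernoulli(mu_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def ucb1_index(mu_hat, pulls, t):
    """UCB1 index with a sqrt(1.5 * log t / pulls) confidence radius."""
    if pulls == 0:
        return 1.0
    return min(1.0, mu_hat + np.sqrt(1.5 * np.log(max(t, 2)) / pulls))

def run_cascade(w, K, n, index_fn, seed=0):
    """Simulate n rounds of a cascading bandit; return cumulative expected regret.

    w: length-L vector of item attraction probabilities.
    Each round: show the K highest-index items; the user scans top-down and
    clicks the first attractive item, so only the prefix up to (and including)
    the first click yields Bernoulli feedback -- the model's partial feedback.
    """
    rng = np.random.default_rng(seed)
    L = len(w)
    pulls = np.zeros(L)
    means = np.zeros(L)
    opt = 1.0 - np.prod(1.0 - np.sort(w)[::-1][:K])  # best list's click probability
    regret = 0.0
    for t in range(1, n + 1):
        idx = np.array([index_fn(means[i], pulls[i], t) for i in range(L)])
        ranked = np.argsort(-idx)[:K]            # rank the K highest-index items
        attract = rng.random(K) < w[ranked]      # attraction draws down the list
        first = int(np.argmax(attract)) if attract.any() else K - 1
        for pos in range(first + 1):             # observed prefix of the list
            i = ranked[pos]
            pulls[i] += 1
            means[i] += (attract[pos] - means[i]) / pulls[i]
        regret += opt - (1.0 - np.prod(1.0 - w[ranked]))
    return regret

# Example: L = 10 items, K = 2 positions, means clustered near 1/2.
w = np.array([0.50, 0.50] + [0.45] * 8)
for fn in (kl_ucb_index, ucb1_index):
    print(fn.__name__, run_cascade(w, K=2, n=20000, index_fn=fn))
```

The only difference between the two policies is the index: CascadeUCB1 inflates the empirical mean by a Hoeffding-type radius, while CascadeKL-UCB inverts the exact Bernoulli KL via bisection. Recall Pinsker's inequality, kl(p, q) ≥ 2(p − q)²; it is near-tight for means around 1/2 but loose near 0 and 1, where the exact-KL intervals are much narrower, which is the mechanism behind the factor-of-K gap the abstract describes.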
