Thompson Sampling for Combinatorial Semi-Bandits

03/13/2018
by   Siwei Wang, et al.
0

We study the application of the Thompson Sampling (TS) methodology to the stochastic combinatorial multi-armed bandit (CMAB) framework. We analyze the standard TS algorithm for the general CMAB, and obtain the first distribution-dependent regret bound of O(m T / Δ_) for TS under general CMAB, where m is the number of arms, T is the time horizon, and Δ_ is the minimum gap between the expected reward of the optimal solution and any non-optimal solution. We also show that one can not use an approximate oracle in TS algorithm for even MAB problems. Then we expand the analysis to matroid bandit, a special case of CMAB. Finally, we use some experiments to show the comparison of regrets of CUCB and CTS algorithms.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset