Improved Optimistic Algorithm For The Multinomial Logit Contextual Bandit
We consider a dynamic assortment selection problem where the goal is to offer a sequence of assortments of cardinality at most K, out of N items, to minimize the expected cumulative regret (loss of revenue). The feedback is given by a multinomial logit (MNL) choice model. This sequential decision making problem is studied under the MNL contextual bandit framework. The existing algorithms for MNL contexual bandit have frequentist regret guarantees as Õ(κ√(T)), where κ is an instance dependent constant. κ could be arbitrarily large, e.g. exponentially dependent on the model parameters, causing the existing regret guarantees to be substantially loose. We propose an optimistic algorithm with a carefully designed exploration bonus term and show that it enjoys Õ(√(T)) regret. In our bounds, the κ factor only affects the poly-log term and not the leading term of the regret bounds.
READ FULL TEXT