Vector Optimization with Stochastic Bandit Feedback
We introduce vector optimization problems with stochastic bandit feedback, which extends the best arm identification problem to vector-valued rewards. We consider K designs, with multi-dimensional mean reward vectors, which are partially ordered according to a polyhedral ordering cone C. This generalizes the concept of Pareto set in multi-objective optimization and allows different sets of preferences of decision-makers to be encoded by C. Different than prior work, we define approximations of the Pareto set based on direction-free covering and gap notions. We study the setting where an evaluation of each design yields a noisy observation of the mean reward vector. Under subgaussian noise assumption, we investigate the sample complexity of the naïve elimination algorithm in an (ϵ,δ)-PAC setting, where the goal is to identify an (ϵ,δ)-PAC Pareto set with the minimum number of design evaluations. In particular, we identify cone-dependent geometric conditions on the deviations of empirical reward vectors from their mean under which the Pareto front can be approximated accurately. We run experiments to verify our theoretical results and illustrate how C and sampling budget affect the Pareto set, returned (ϵ,δ)-PAC Pareto set and the success of identification.
READ FULL TEXT