Asymptotically Optimal Algorithms for Multiple Play Bandits with Partial Feedback

We study a variant of the multi-armed bandit problem with multiple plays in which the user wishes to sample the m out of k arms with the highest expected rewards, but at any given time can only sample ell≤m arms. When ell=m, Thompson sampling was recently shown to be asymptotically efficient. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our new setting where ell may be less than m. We then establish the asymptotic optimality of Thompson sampling for Bernoulli rewards, where our proof technique differs from earlier methods even when ell=m. We also prove the asymptotic optimality of an algorithm based on upper confidence bounds, KL-CUCB, for single-parameter exponential families and bounded, finitely supported rewards, a result which is new for all values of ell.

Référence Bibliographique:
Alexander Luedtke, Emilie Kaufmann, Antoine Chambaz