## Faster rates for policy learning

This article improves on existing proven rates of regret decay in optimal policy estimation. We give a margin-free result showing that, for empirical risk minimizers over Donsker classes, the regret for estimating a within-class optimal policy decays at a second-order rate, faster than the standard error of an efficient estimator of the value of an optimal policy. We also give a result, drawn from the classification literature, showing that faster regret decay is possible via plug-in estimation provided a margin condition holds. Four examples are considered. In these examples, the regret is expressed in terms of either the mean value or the median value; the number of possible actions is either two or finitely many; and the sampling scheme is either independent and identically distributed or sequential, the latter representing a contextual bandit sampling scheme.

Bibliographic reference: https://hal.archives-ouvertes.fr/hal-01511409. Authors: Alexander R. Luedtke and Antoine Chambaz.

## On the estimation of the mean of a random vector

We study the problem of estimating the mean of a multivariate distribution based on independent samples. The main result is a proof of the existence of an estimator with non-asymptotic sub-Gaussian performance for all distributions satisfying some mild moment assumptions.
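The estimator analyzed in the paper is more delicate than this, but the underlying median-of-means device can be sketched as follows; the block splitting, the `n_blocks` parameter, and the coordinate-wise median are illustrative choices, not the paper's actual multivariate construction:

```python
import numpy as np

def median_of_means(x, n_blocks=8):
    """Coordinate-wise median-of-means: split the sample into blocks,
    average within each block, then take the coordinate-wise median
    of the block means."""
    x = np.asarray(x)
    blocks = np.array_split(np.arange(x.shape[0]), n_blocks)
    block_means = np.stack([x[idx].mean(axis=0) for idx in blocks])
    return np.median(block_means, axis=0)

rng = np.random.default_rng(0)
# Heavy-tailed sample (Student t with 3 degrees of freedom) in R^2; true mean is 0
sample = rng.standard_t(df=3, size=(2000, 2))
est = median_of_means(sample, n_blocks=10)
```

Taking a median of block means trades a constant factor in accuracy for robustness: a few wild blocks cannot drag the estimate away, which is the mechanism behind sub-Gaussian-type deviations under weak moment assumptions.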

Bibliographic reference: Emilien Joly, Gábor Lugosi, and Roberto Imbuzeiro Oliveira, On the estimation of the mean of a random vector, Electron. J. Statist., 11(1):440–451, 2017. Authors: Emilien Joly, Gábor Lugosi, Roberto Imbuzeiro Oliveira.

## A Minkowski Theorem for Quasicrystals

The aim of this paper is to generalize Minkowski’s theorem. This theorem is usually stated for a centrally symmetric convex body and a lattice both included in R^n. In some situations, one may replace the lattice by a more general set for which a notion of density exists. In this paper, we prove a Minkowski theorem for quasicrystals, which bounds from below the frequency of differences appearing in the quasicrystal and belonging to a centrally symmetric convex body. The last part of the paper is devoted to quite natural applications of this theorem to Diophantine approximation and to discretization of linear maps.
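For context, the classical result being generalized can be stated as follows (the lattice case, reproduced here from standard references rather than from the paper):

```latex
% Classical Minkowski theorem for the lattice Z^n
\textbf{Theorem (Minkowski).}\quad
Let $C \subset \mathbb{R}^n$ be a centrally symmetric convex body with
volume $\operatorname{vol}(C) > 2^n$. Then $C$ contains at least one
nonzero point of the lattice $\mathbb{Z}^n$.
```

The paper's quasicrystal version replaces $\mathbb{Z}^n$ by a set with a well-defined density and bounds from below the frequency of differences falling in $C$.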

Bibliographic reference: Pierre-Antoine Guihéneuf and Emilien Joly, A Minkowski Theorem for Quasicrystals, Discrete Comput. Geom. (2017). Authors: Pierre-Antoine Guihéneuf, Emilien Joly.

## Targeted sequential design for targeted learning inference of the optimal treatment rule and its mean reward

This article studies the targeted sequential inference of an optimal treatment rule (TR) and its mean reward in the non-exceptional case, i.e., assuming that there is no stratum of the baseline covariates where treatment is neither beneficial nor harmful, and under a companion margin assumption. Our pivotal estimator, whose definition hinges on the targeted minimum loss estimation (TMLE) principle, actually infers the mean reward under the current estimate of the optimal TR. This data-adaptive statistical parameter is worthy of interest on its own. Our main result is a central limit theorem which enables the construction of confidence intervals on both mean rewards, under the current estimate of the optimal TR and under the optimal TR itself. The asymptotic variance of the estimator takes the form of the variance of an efficient influence curve at a limiting distribution, allowing us to discuss the efficiency of inference. As a by-product, we also derive confidence intervals on two cumulated pseudo-regrets, a key notion in the study of bandit problems. A simulation study illustrates the procedure. One of the cornerstones of the theoretical study is a new maximal inequality for martingales with respect to the uniform entropy integral.

Bibliographic reference: to appear in Ann. Statist. Authors: Antoine Chambaz, Wenjing Zheng, and M. J. van der Laan.

## Refined Lower Bounds for Adversarial Bandits

We provide new lower bounds on the regret that must be suffered by adversarial bandit algorithms. The new results show that recent upper bounds that either (a) hold with high probability, (b) depend on the total loss of the best arm, or (c) depend on the quadratic variation of the losses, are close to tight. Besides this, we prove two impossibility results. First, the existence of a single arm that is optimal in every round cannot improve the regret in the worst case. Second, the regret cannot scale with the effective range of the losses. In contrast, both improvements are possible in the full-information setting.

Bibliographic reference: Neural Information Processing Systems 30, Dec. 2016, arXiv:1605.07416. Authors: Sébastien Gerchinovitz, Tor Lattimore.

## Conditional quantile sequential estimation for stochastic codes

This paper is devoted to the estimation of a conditional quantile, more precisely the quantile of the output of a real-valued stochastic code whose inputs lie in R^d. To this end, we introduce a stochastic algorithm combining the Robbins-Monro algorithm with k-nearest-neighbor ideas. We propose conditions on the code under which the algorithm converges and study the non-asymptotic rate of convergence of the mean squared error. Finally, we give optimal parameters of the algorithm to obtain the best rate of convergence.
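Stripped of the k-nearest-neighbor conditioning step, the Robbins-Monro recursion for a plain (unconditional) quantile might look like this; the step-size constants `gamma` and `power` are illustrative choices, not the paper's tuned parameters:

```python
import numpy as np

def robbins_monro_quantile(stream, alpha, gamma=1.0, power=0.75):
    """Robbins-Monro recursion for the alpha-quantile of a stream:
    theta_{n+1} = theta_n - gamma_n * (1{Y_n <= theta_n} - alpha),
    with step sizes gamma_n = gamma / n^power."""
    theta = 0.0
    for n, y in enumerate(stream, start=1):
        step = gamma / n ** power
        theta -= step * ((y <= theta) - alpha)
    return theta

rng = np.random.default_rng(1)
ys = rng.normal(size=50000)
est = robbins_monro_quantile(ys, alpha=0.9)
# The true 0.9-quantile of N(0, 1) is about 1.2816
```

The recursion needs only one observation per step, which is what makes it suitable for expensive stochastic codes; the paper's algorithm localizes it with the k nearest neighbors of the queried input.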

Bibliographic reference: arXiv:1508.06505, hal-01187329. Authors: Tatiana Labopin-Richard, Aurélien Garivier, Fabrice Gamboa.

## On Explore-Then-Commit Strategies

We study the problem of minimising regret in two-armed bandit problems with Gaussian rewards. Our objective is to use this simple setting to illustrate that strategies based on an exploration phase (up to a stopping time) followed by exploitation are necessarily suboptimal. The results hold regardless of whether or not the difference in means between the two arms is known. Besides the main message, we also refine existing deviation inequalities, which allow us to design fully sequential strategies with finite-time regret guarantees that are (a) asymptotically optimal as the horizon grows and (b) order-optimal in the minimax sense. Furthermore, we provide empirical evidence that the theory also holds in practice and discuss extensions to the non-Gaussian and multiple-armed cases.
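A minimal explore-then-commit strategy with a fixed (rather than stopping-time-based) exploration phase can be sketched as follows; the arm means, exploration length, and horizon are made-up values for illustration:

```python
import numpy as np

def explore_then_commit(rng, means, n_explore, horizon):
    """Explore-then-commit for a two-armed Gaussian bandit:
    pull each arm n_explore times, then commit to the empirically
    better arm for the remaining rounds. Returns (committed arm,
    expected regret of the realized allocation)."""
    rewards = [rng.normal(means[a], 1.0, size=n_explore) for a in (0, 1)]
    commit = int(np.argmax([r.mean() for r in rewards]))
    pulls = np.array([n_explore, n_explore])
    pulls[commit] += horizon - 2 * n_explore
    regret = max(means) * horizon - pulls @ np.array(means)
    return commit, regret

rng = np.random.default_rng(2)
arm, regret = explore_then_commit(rng, means=(0.0, 0.5), n_explore=200, horizon=10000)
```

The suboptimality the paper proves concerns exactly this family: any strategy that separates exploration from exploitation, even with a data-driven stopping time, pays an unavoidable factor over fully sequential strategies.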

Bibliographic reference: Neural Information Processing Systems 30, Dec. 2016, arXiv:1605.08988, hal-01322906. Authors: Aurélien Garivier, Emilie Kaufmann, Tor Lattimore.

## Explore First, Exploit Next: The True Shape of Regret in Bandit Problems

We revisit lower bounds on the regret in the case of multi-armed bandit problems. We obtain non-asymptotic, distribution-dependent bounds and provide straightforward proofs based only on well-known properties of Kullback-Leibler divergences. These bounds show in particular that in an initial phase the regret grows almost linearly, and that the well-known logarithmic growth of the regret only holds in a final phase. The proof techniques distill the essence of the information-theoretic arguments used and are free of unnecessary complications.
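For Bernoulli arms, the Kullback-Leibler divergence driving such bounds is the binary KL; a minimal implementation (the clamping constant `eps` is an illustrative numerical guard):

```python
import math

def bernoulli_kl(p, q):
    """Kullback-Leibler divergence kl(p, q) between Bernoulli(p) and
    Bernoulli(q), the central quantity in distribution-dependent
    regret lower bounds."""
    eps = 1e-12  # guard against log(0) at the boundary
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```

In Lai-Robbins-type bounds, a suboptimal arm with mean μ must be pulled roughly log(T)/kl(μ, μ*) times, which produces the final logarithmic regret phase; the near-linear initial phase appears while the horizon T is too small for these divergences to bite.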

Bibliographic reference: arXiv:1602.07182, hal-01276324. Authors: Aurélien Garivier, Gilles Stoltz, Pierre Ménard.

## Maximin Action Identification: A New Bandit Framework for Games

We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists in identifying the best action in a game, when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower- and upper-confidence bounds; and Maximin-Racing, which operates by successively eliminating the sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically. We sketch a lower-bound analysis and possible connections to an optimal algorithm.

Bibliographic reference: Conference On Learning Theory 29, Jun. 2016, arXiv:1602.04676, hal-01273842. Authors: Aurélien Garivier, Emilie Kaufmann, Wouter M. Koolen.

## Optimal Best Arm Identification with Fixed Confidence

We provide a complete characterization of the complexity of best-arm identification in one-parameter bandit problems. We prove a new, tight lower bound on the sample complexity. We propose the 'Track-and-Stop' strategy, which is proved to be asymptotically optimal. It combines a new sampling rule, which tracks the optimal proportions of arm draws highlighted by the lower bound, with a stopping rule named after Chernoff, for which we give a new analysis.

Bibliographic reference: Conference On Learning Theory 29, Jun. 2016, arXiv:1602.04589, hal-01273838. Authors: Aurélien Garivier, Emilie Kaufmann.

## Practical targeted learning from large data sets by survey sampling

We address the practical construction of asymptotic confidence intervals for smooth (i.e., pathwise differentiable), real-valued statistical parameters by targeted learning from independent and identically distributed data, in contexts where the sample size is so large that it poses computational challenges. We observe some summary measure of all data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is carried out on the easier-to-handle sub-sample. We derive a central limit theorem for the targeted minimum loss estimator (TMLE) which enables the construction of the confidence intervals. The inclusion probabilities can be optimized to reduce the asymptotic variance of the TMLE. We illustrate the procedure with two examples where the parameters of interest are variable importance measures of an exposure (binary or continuous) on an outcome. We also conduct a simulation study and comment on its results.
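A simplified sketch of the sub-sampling step: plain Poisson sampling (each unit kept independently with its own inclusion probability) followed by Horvitz-Thompson reweighting. The paper uses the rejective variant, which conditions on the realized sample size, and the summary measure below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100000
y = rng.exponential(scale=2.0, size=N)      # outcome of interest
summary = np.sqrt(1.0 + y)                  # cheap summary measure of each unit

# Unequal inclusion probabilities proportional to the summary measure,
# scaled so the expected sub-sample size is about n = 5000.
n = 5000
pi = np.minimum(1.0, n * summary / summary.sum())

included = rng.random(N) < pi               # Poisson sampling: independent coin flips

# Horvitz-Thompson estimator of the population mean from the sub-sample alone
ht_mean = np.sum(y[included] / pi[included]) / N
```

Weighting each retained unit by the inverse of its inclusion probability keeps the estimator unbiased for the full-data mean, while all expensive computation runs only on the roughly 5% of units that were kept.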

Bibliographic reference: https://hal.archives-ouvertes.fr/hal-01339538. Authors: Patrice Bertail, Antoine Chambaz, Emilien Joly.

## Asymptotically Optimal Algorithms for Multiple Play Bandits with Partial Feedback

We study a variant of the multi-armed bandit problem with multiple plays, in which the user wishes to sample the m out of k arms with the highest expected rewards, but at any given time can only sample ℓ ≤ m arms. When ℓ = m, Thompson sampling was recently shown to be asymptotically efficient. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our new setting where ℓ may be less than m. We then establish the asymptotic optimality of Thompson sampling for Bernoulli rewards, where our proof technique differs from earlier methods even when ℓ = m. We also prove the asymptotic optimality of an algorithm based on upper confidence bounds, KL-CUCB, for single-parameter exponential families and bounded, finitely supported rewards, a result which is new for all values of ℓ.

Bibliographic reference: https://hal.archives-ouvertes.fr/hal-01338733. Authors: Alexander Luedtke, Emilie Kaufmann, Antoine Chambaz.

## Inference of a non-parametric covariate-adjusted variable importance measure of a continuous exposure

We consider a setting where a real-valued cause X affects a real-valued effect Y in the presence of a context variable W. The objective is to assess to what extent (X, W) influences Y while making as few assumptions as possible on the unknown distribution of (W, X, Y). Based on a user-supplied marginal structural model, our new variable importance measure is non-parametric and context-adjusted. It generalizes the variable importance measure introduced by Chambaz et al. [4]. We show how to infer it by targeted minimum loss estimation (TMLE), conduct a simulation study, and present an illustration of its use.

Bibliographic reference: https://hal.archives-ouvertes.fr/hal-01336324. Author: Cabral Chanang Tondji.

## Data-adaptive Inference of the Optimal Treatment Rule and its Mean Reward. The Masked Bandit

This article studies the data-adaptive inference of an optimal treatment rule (TR). A TR is an individualized treatment strategy in which treatment assignment for a patient is based on her measured baseline covariates. Eventually, a reward is measured on the patient. We also infer the mean reward under the optimal TR. We do so in the so-called non-exceptional case, i.e., assuming that there is no stratum of the baseline covariates where treatment is neither beneficial nor harmful, and under a companion margin assumption. Our pivotal estimator, whose definition hinges on the targeted minimum loss estimation (TMLE) principle, actually infers the mean reward under the current estimate of the optimal TR. This data-adaptive statistical parameter is worthy of interest on its own. Our main result is a central limit theorem which enables the construction of confidence intervals on both mean rewards, under the current estimate of the optimal TR and under the optimal TR itself. The asymptotic variance of the estimator takes the form of the variance of an efficient influence curve at a limiting distribution, allowing us to discuss the efficiency of inference. As a by-product, we also derive confidence intervals on two cumulated pseudo-regrets, a key notion in the study of bandit problems. Seen as two additional data-adaptive statistical parameters, they compare the sum of the rewards actually received during the course of the experiment with either the sum of the means of the rewards, or the counterfactual rewards we would have obtained had we used the current estimate of the optimal TR to assign treatment from the start. A simulation study illustrates the procedure. One of the cornerstones of the theoretical study is a new maximal inequality for martingales with respect to the uniform entropy integral.

Bibliographic reference: https://hal.archives-ouvertes.fr/hal-01301297. Authors: Antoine Chambaz, Wenjing Zheng, Mark J. van der Laan.

## A Chaining Algorithm for Online Nonparametric Regression

We consider the problem of online nonparametric regression with arbitrary deterministic sequences. Using ideas from the chaining technique, we design an algorithm that achieves a Dudley-type regret bound similar to the one obtained in a non-constructive fashion by Rakhlin and Sridharan (2014). Our regret bound is expressed in terms of the metric entropy in the sup norm, which yields optimal guarantees when the metric and sequential entropies are of the same order of magnitude. In particular, our algorithm is the first one that achieves optimal rates for online regression over Hölder balls. In addition, we show for this example how to adapt our chaining algorithm to achieve reasonable computational efficiency with similar regret guarantees (up to a log factor).

Bibliographic reference: Proceedings of the 28th Conference on Learning Theory, pp. 764–796, 2015. Authors: Pierre Gaillard, Sébastien Gerchinovitz.

## Empirical ϕ∗-discrepancies and quasi-empirical likelihood: exponential bounds

We review some recent extensions of the so-called generalized empirical likelihood method, when the Kullback distance is replaced by some general convex divergence. We propose to use, instead of empirical likelihood, some regularized form or quasi-empirical likelihood method, corresponding to a convex combination of Kullback and χ² discrepancies. We show that for some adequate choice of the weight in this combination, the corresponding quasi-empirical likelihood is Bartlett-correctable. We also establish some non-asymptotic exponential bounds for the confidence regions obtained by using this method. These bounds are derived via bounds for self-normalized sums in the multivariate case obtained in a previous work by the authors. We also show that results of this kind may be extended to process-valued infinite-dimensional parameters. In this case, some known results about self-normalized processes may be used to control the behavior of generalized empirical likelihood.

Bibliographic reference: to appear in ESAIM: Proc. Authors: Patrice Bertail, Emmanuelle Gautherat, Hugo Harari-Kermadec.

## Drawing valid targeted inference when covariate-adjusted response-adaptive RCT meets data-adaptive loss-based estimation, with an application to the LASSO

Adaptive clinical trial design methods have garnered growing attention in recent years, in large part due to their greater flexibility compared with their traditional counterparts. One such design is the so-called covariate-adjusted, response-adaptive (CARA) randomized controlled trial (RCT). In a CARA RCT, the treatment randomization schemes are allowed to depend on the patient's pre-treatment covariates, and the investigators have the opportunity to adjust these schemes during the course of the trial based on accruing information (including previous responses), in order to meet a pre-specified optimality criterion, while preserving the validity of the trial in learning its primary study parameter. In this article, we propose a new group-sequential CARA RCT design and a corresponding analytical procedure that admits the use of flexible data-adaptive techniques. The proposed design framework can target general adaptation optimality criteria that may not have a closed-form solution, thanks to a loss-based approach in defining and estimating the unknown optimal randomization scheme. Both in predicting the conditional response and in constructing the treatment randomization schemes, this framework uses loss-based data-adaptive estimation over general classes of functions (which may change with sample size). Since the randomization adaptation is response-adaptive, this flexibility potentially translates into more effective adaptation towards the optimality criterion. To target the primary study parameter, the proposed analytical method provides robust inference of the parameter, despite arbitrarily mis-specified response models, under the most general settings.
Specifically, we establish that, under appropriate entropy conditions on the classes of functions, the resulting sequence of randomization schemes converges to a fixed scheme, and the proposed treatment effect estimator is consistent (even under a mis-specified response model), asymptotically Gaussian, and gives rise to valid confidence intervals of given asymptotic levels. Moreover, the limiting randomization scheme coincides with the unknown optimal randomization scheme when, simultaneously, the response model is correctly specified and the optimal scheme belongs to the limit of the user-supplied classes of randomization schemes. We illustrate the applicability of these general theoretical results with a LASSO-based CARA RCT. In this example, both the response model and the optimal treatment randomization are estimated using a sequence of LASSO logistic models that may increase with sample size. It follows immediately from our general theorems that this LASSO-based CARA RCT converges to a fixed design and yields consistent and asymptotically Gaussian effect estimates, under minimal conditions on the smoothness of the basis functions in the LASSO logistic models. We exemplify the proposed methods with a simulation study.

Bibliographic reference: https://hal.archives-ouvertes.fr/hal-01180719. Authors: Wenjing Zheng, Antoine Chambaz, M. J. van der Laan.

## On the Complexity of Best Arm Identification in Multi-Armed Bandit Models

The stochastic multi-armed bandit model is a simple abstraction that has proven useful in many different contexts in statistics and machine learning. Whereas the achievable limit in terms of regret minimization is now well known, our aim is to contribute to a better understanding of the performance in terms of identifying the m best arms. We introduce generic notions of complexity for the two dominant frameworks considered in the literature: the fixed-budget and fixed-confidence settings. In the fixed-confidence setting, we provide the first known distribution-dependent lower bound on the complexity that involves information-theoretic quantities and holds when m ≥ 1 under general assumptions. In the specific case of two-armed bandits, we derive refined lower bounds in both the fixed-confidence and fixed-budget settings, along with matching algorithms for Gaussian and Bernoulli bandit models. These results show in particular that the complexity of the fixed-budget setting may be smaller than the complexity of the fixed-confidence setting, contradicting the familiar behavior observed when testing fully specified alternatives. In addition, we also provide improved sequential stopping rules that have guaranteed error probabilities and shorter average running times. The proofs rely on two technical results that are of independent interest: a deviation lemma for self-normalized sums and a novel change-of-measure inequality for bandit models.

Bibliographic reference: Journal of Machine Learning Research, to appear. Authors: Emilie Kaufmann, Olivier Cappé, Aurélien Garivier.

## Empirical ϕ∗-Divergence Minimizers for Hadamard Differentiable Functionals

We study some extensions of the empirical likelihood method, when the Kullback distance is replaced by some general convex divergence or phi-discrepancy. We show that this generalized empirical likelihood method is asymptotically valid for general Hadamard differentiable functionals.

Bibliographic reference: Topics in Nonparametric Statistics, Proceedings of the First Conference of the International Society for Nonparametric Statistics, M. G. Akritas, S. N. Lahiri, D. N. Politis, editors, Springer Proceedings in Mathematics & Statistics (74), Chapter 3, 2014. Authors: Patrice Bertail, Emmanuelle Gautherat, Hugo Harari-Kermadec.

## Targeted Covariate-Adjusted Response-Adaptive LASSO-Based Randomized Controlled Trials

We present a new covariate-adjusted, response-adaptive randomized controlled trial design and an inferential procedure built on top of it. The procedure is targeted in the sense that (i) the sequence of randomization schemes is group-sequentially determined by targeting a user-specified optimal randomization design based on accruing data, and (ii) our estimator of the user-specified parameter of interest, seen as the value of a functional evaluated at the true, unknown distribution of the data, is targeted toward it by following the paradigm of targeted minimum loss estimation. For clarity, we focus on the case where the parameter of interest is the marginal effect of a binary treatment and the targeted optimal design is the Neyman allocation, in an effort to produce an estimator with smaller asymptotic variance. Also for clarity, we consider the case where the estimator of the conditional outcome given treatment and baseline covariates, a key element of the procedure, is obtained by LASSO regression. Under mild assumptions, the resulting sequence of randomization schemes converges to a limiting design, and the TMLE estimator is consistent and asymptotically Gaussian. Its asymptotic variance can be estimated as well. Thus we can build valid confidence intervals of given asymptotic levels. A simulation study confirms our theoretical results.

Bibliographic reference: Modern Adaptive Randomized Clinical Trials: Statistical and Practical Aspects, edited by Oleksandr Sverdlov, Chapman & Hall/CRC Biostatistics Series, Chapter 16, 2015. Authors: Antoine Chambaz, Mark J. van der Laan, Wenjing Zheng.
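The Neyman allocation targeted by this design assigns treatment with probability proportional to the per-arm outcome standard deviations, which minimizes the asymptotic variance of the difference-in-means estimator; a minimal sketch:

```python
def neyman_allocation(sigma_treat, sigma_control):
    """Neyman allocation: the treatment-assignment probability that
    minimizes the variance of the difference-in-means estimator,
    proportional to the per-arm outcome standard deviations."""
    return sigma_treat / (sigma_treat + sigma_control)
```

In the trial, these standard deviations are unknown and are estimated from accruing data (via the LASSO-based outcome model), which is why the randomization scheme evolves group-sequentially rather than being fixed in advance.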