What is the most statistically efficient way to do off-policy optimization with batch data from bandit feedback? For log data generated by contextual bandit algorithms, we consider offline estimators of the expected reward from a counterfactual policy. Our estimators are shown to have the lowest variance within a wide class of estimators, achieving variance reduction relative to standard estimators. We then apply our estimators to improve advertisement design by a major advertisement company. Consistent with the theoretical result, our estimators allow us to improve on the existing bandit algorithm with greater statistical confidence than a state-of-the-art benchmark.
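The abstract does not spell out the estimator itself, so the following is only a minimal sketch of the standard off-policy value estimators that such work builds on and is compared against: a plain inverse-propensity-weighted (IPW) estimate of a counterfactual policy's value, and a doubly robust variant that uses a fitted reward model as a control variate to reduce variance. The names (ipw_value, dr_value, qhat_logged, qhat_target_expectation) are illustrative assumptions, not the authors' estimator.

import numpy as np

def ipw_value(rewards, logging_propensities, target_action_probs):
    # Standard inverse-propensity-weighted (IPW) estimate of the target policy's value.
    # rewards[i]              : observed reward on logged round i
    # logging_propensities[i] : probability the logging policy gave to the logged action
    # target_action_probs[i]  : probability the counterfactual policy gives to that action
    weights = target_action_probs / logging_propensities
    return np.mean(weights * rewards)

def dr_value(rewards, logging_propensities, target_action_probs,
             qhat_logged, qhat_target_expectation):
    # Doubly robust estimate: a regression baseline plus an IPW correction term.
    # qhat_logged[i]             : fitted reward model at (context i, logged action)
    # qhat_target_expectation[i] : expected fitted reward under the target policy at context i
    # The baseline acts as a control variate, so the variance is typically lower than
    # plain IPW when the reward model is reasonably accurate.
    weights = target_action_probs / logging_propensities
    return np.mean(qhat_target_expectation + weights * (rewards - qhat_logged))

For example, with equal-length NumPy arrays r, p0, p1, the call ipw_value(r, p0, p1) is unbiased for the counterfactual policy's expected reward whenever the logged propensities p0 are correct and bounded away from zero; dr_value typically has lower variance and remains consistent as long as either the propensities or the reward model is well specified.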
In this dissertation we present recent contributions to the problem of optimization under bandit fee...
This manuscript deals with the estimation of the optimal rule and its mean reward in a simple ban...
Inspired by advertising markets, we consider large-scale sequential decision making problems in whic...
We develop a learning principle and an efficient algorithm for batch learning from logged bandit fee...
This paper identifies a severe problem of the counterfactual risk estimator typically used in batch...
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log...
We investigate boosted ensemble models for off-policy learning from logged bandit feedback. Toward t...
Contextual bandit algorithms are essential for solving many real-world interac...
We study the problem of batch learning from bandit feedback in the setting of extremely large action...
Machine and Statistical Learning techniques are used in almost all online advertisement systems. The...
Recent advances in reinforcement learning (RL) provide exciting potential for making agents...
We address a practical problem ubiquitous in modern marketing campaigns, in which a central agent tr...