We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method – calle...
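The propensity-weighted empirical risk and the variance term mentioned in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `ips_risk_terms` and `crm_objective`, the optional weight clipping, the penalty weight `lam`, and the toy numbers are not taken from the paper, and the objective shown (empirical mean plus a scaled square root of sample variance over n) is only one common way to write a variance-penalized risk.

```python
import math
from statistics import mean, variance


def ips_risk_terms(losses, logged_propensities, new_policy_probs, clip=None):
    """Per-sample propensity-weighted losses.

    Each term is loss_i * pi_new(y_i | x_i) / pi_0(y_i | x_i), where pi_0 is
    the logging policy. `clip` (hypothetical parameter) caps the weight.
    """
    terms = []
    for loss, p0, p_new in zip(losses, logged_propensities, new_policy_probs):
        if p0 <= 0:
            raise ValueError("logging propensities must be positive")
        weight = p_new / p0
        if clip is not None:
            weight = min(weight, clip)
        terms.append(loss * weight)
    return terms


def crm_objective(terms, lam):
    """Empirical risk plus a variance penalty: mean + lam * sqrt(var / n)."""
    n = len(terms)
    if n < 2:
        raise ValueError("need at least two samples to estimate variance")
    return mean(terms) + lam * math.sqrt(variance(terms) / n)


# Toy logged data (made up): observed losses, logging-policy propensities,
# and the probability the candidate policy assigns to the logged action.
losses = [1.0, 0.0, 1.0, 0.0]
logged_propensities = [0.5, 0.5, 0.25, 0.25]
new_policy_probs = [0.25, 0.75, 0.5, 0.5]

terms = ips_risk_terms(losses, logged_propensities, new_policy_probs)
clipped = ips_risk_terms(losses, logged_propensities, new_policy_probs, clip=1.5)

print("IPS risk estimate:", mean(terms))
print("clipped IPS risk estimate:", mean(clipped))
print("variance-penalized objective:", crm_objective(terms, lam=1.0))
```

With `lam=0` the objective reduces to the plain propensity-weighted risk estimate; a larger `lam` favors policies whose weighted losses vary less across the logged samples.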
We study how to adapt to smoothly-varying (‘easy’) environments in well-known online learning proble...
This paper introduces the Banditron, a variant of the Perceptron [Rosenblatt, 1958], for the multic...
We present methods for online linear optimization that take advantage of benign (as opposed to worst...
We develop a learning principle and an efficient algorithm for batch learning from logged bandit fee...
This paper identifies a severe problem of the counterfactual risk estimator typically used in batch...
What is the most statistically efficient way to do off-policy optimization with batch data from band...
Off-policy evaluation (OPE) attempts to predict the performance of counterfactual policies using log...
Inspired by advertising markets, we consider large-scale sequential decision making problems in whic...
Interactive systems that interact with and learn from user behavior are ubiquitous today. Machine le...
Counterfactual reasoning from logged data has become increasingly important for many applications suc...
We study the problem of batch learning from bandit feedback in the setting of extremely large action...
In this thesis we address the multi-armed bandit (MAB) problem with stochastic rewards and correlate...
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a ...