We investigate boosted ensemble models for off-policy learning from logged bandit feedback. Toward this goal, we propose a new boosting algorithm that directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied. We further show how the base learner reduces to standard supervised learning problems. Experiments indicate that our algorithm can outperform deep off-policy learning and methods that simply regress on the observed rewards, thereby demonstrating the benefits of both boosting and choosing the right learning objective.
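To make the high-level description concrete, the following is a minimal sketch of the general recipe the abstract outlines: functional gradient boosting on an off-policy estimate of a softmax policy's value, here the standard inverse-propensity-scoring (IPS) estimate. The synthetic data, the linear least-squares base learner, and all variable names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 3                       # samples, features, actions

# Logged bandit feedback: context x_i, logged action a_i, reward r_i,
# and logging propensity mu_i = mu(a_i | x_i). All synthetic.
X = rng.normal(size=(n, d))
A = rng.integers(0, k, size=n)
R = (A == (X[:, 0] > 0).astype(int)).astype(float)   # toy reward signal
MU = np.full(n, 1.0 / k)                  # uniform logging policy

def softmax_probs(F):
    """Row-wise softmax over per-action scores F (shape n x k)."""
    Z = np.exp(F - F.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def ips_value(F):
    """IPS estimate of the value of the softmax policy induced by F."""
    P = softmax_probs(F)
    return np.mean(R * P[np.arange(n), A] / MU)

# Boosting: each round, the base learner fits the functional gradient of
# the IPS objective by least squares (a plain supervised regression, one
# output per action), and the ensemble takes a small step along the fit.
F = np.zeros((n, k))
eta = 0.5                                 # shrinkage (assumed value)
for _ in range(50):
    P = softmax_probs(F)
    W = R / MU                            # importance weights
    Pa = P[np.arange(n), A]               # prob. of the logged action
    # Gradient of the IPS value w.r.t. scores F[i, j]:
    #   w_i * p_{i,a_i} * (1[j = a_i] - p_{i,j})
    G = -P * (W * Pa)[:, None]
    G[np.arange(n), A] += W * Pa
    # Base learner: least-squares fit of the gradient targets from X.
    coef, *_ = np.linalg.lstsq(X, G, rcond=None)
    F = F + eta * (X @ coef)

# Gradient ascent on the IPS objective should improve the estimated value
# over the uniform starting policy F = 0.
assert ips_value(F) > ips_value(np.zeros((n, k)))
```

Because each boosting round only asks the base learner to regress on gradient targets, this illustrates the reduction to standard supervised learning mentioned above; swapping the linear fit for trees or any other regressor leaves the outer loop unchanged.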