We investigate boosted ensemble models for off-policy learning from logged bandit feedback. Toward this goal, we propose a new boosting algorithm that directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied. We further show how the base learner reduces to standard supervised learning problems. Experiments indicate that our algorithm can outperform deep off-policy learning and methods that simply regress on the observed rewards, thereby demonstrating the benefits of both boosting and choosing the right learning objective.
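To make the high-level description concrete, the following is a minimal sketch of the general recipe the abstract outlines: functional gradient boosting on an off-policy estimate of a softmax policy's value, here the standard inverse-propensity-scoring (IPS) estimate. The synthetic data, the linear least-squares base learner, and all variable names are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 3                       # samples, features, actions

# Logged bandit feedback: context x_i, logged action a_i, reward r_i,
# and logging propensity mu_i = mu(a_i | x_i). All synthetic.
X = rng.normal(size=(n, d))
A = rng.integers(0, k, size=n)
R = (A == (X[:, 0] > 0).astype(int)).astype(float)   # toy reward signal
MU = np.full(n, 1.0 / k)                  # uniform logging policy

def softmax_probs(F):
    """Row-wise softmax over per-action scores F (shape n x k)."""
    Z = np.exp(F - F.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def ips_value(F):
    """IPS estimate of the value of the softmax policy induced by F."""
    P = softmax_probs(F)
    return np.mean(R * P[np.arange(n), A] / MU)

# Boosting: each round, the base learner fits the functional gradient of
# the IPS objective by least squares (a plain supervised regression, one
# output per action), and the ensemble takes a small step along the fit.
F = np.zeros((n, k))
eta = 0.5                                 # shrinkage (assumed value)
for _ in range(50):
    P = softmax_probs(F)
    W = R / MU                            # importance weights
    Pa = P[np.arange(n), A]               # prob. of the logged action
    # Gradient of the IPS value w.r.t. scores F[i, j]:
    #   w_i * p_{i,a_i} * (1[j = a_i] - p_{i,j})
    G = -P * (W * Pa)[:, None]
    G[np.arange(n), A] += W * Pa
    # Base learner: least-squares fit of the gradient targets from X.
    coef, *_ = np.linalg.lstsq(X, G, rcond=None)
    F = F + eta * (X @ coef)

# Gradient ascent on the IPS objective should improve the estimated value
# over the uniform starting policy F = 0.
assert ips_value(F) > ips_value(np.zeros((n, k)))
```

Because each boosting round only asks the base learner to regress on gradient targets, this illustrates the reduction to standard supervised learning mentioned above; swapping the linear fit for trees or any other regressor leaves the outer loop unchanged.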