The most relevant problems in discounted reinforcement learning involve estimating the mean of a function under the stationary distribution of a Markov reward process, such as the expected return in policy evaluation or the policy gradient in policy optimization. In practice, these estimates are produced through finite-horizon episodic sampling, which neglects the mixing properties of the Markov process. It is largely unclear how this mismatch between the practical and the ideal setting affects the estimation, and the literature lacks a formal study of the pitfalls of episodic sampling and of how to perform it optimally. In this paper, we present a minimax lower bound on the discounted mean estimation problem that explicitly connects the estimat...
Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balanc...
Reinforcement learning means finding the optimal course of action in Markovian environment...
We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for ...
At the working heart of policy iteration algorithms commonly used and studied in the discounted sett...
We study learning algorithms for the classical Markovian bandit problem with discount. We explain ho...
In Reinforcement Learning (RL), an agent acts in an unknown environment to maximize the expected cum...
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finit...
We consider the problem of learning the optimal action-value function in disco...
We consider the batch (off-line) policy learning problem in the infinite horizon Markov Decision Pro...
We consider the problem of learning the optimal action-value function in discounted-reward...
We consider the problem of learning the optimal action-value function in the d...
We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning usi...
In this paper, we propose an extension to the policy gradient algorithms by allowing st...
We consider the infinite-horizon linear Markov Decision Processes (MDPs), where the transition proba...