We settle the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $\widetilde O(|S||A|t_{\text{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A|t_{\text{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces, respectively, $t_{\text{mix}}$ is a uniform upper bound on the total-variation mixing times, and $\epsilon$ denotes the error tolerance. A gap of a factor of $t_{\text{mix}}$ therefore remains to be bridged. Our primary contribution is to ...
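To make the stated gap explicit, dividing the quoted upper bound by the lower bound isolates the remaining factor (up to the logarithmic terms hidden in $\widetilde O$); this is only an illustrative restatement of the two bounds above, not an additional result:
$$\frac{\widetilde O\!\left(|S||A|\,t_{\text{mix}}^{2}\,\epsilon^{-2}\right)}{\Omega\!\left(|S||A|\,t_{\text{mix}}\,\epsilon^{-2}\right)} \;=\; \widetilde O\!\left(t_{\text{mix}}\right).$$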
We investigate the classical active pure exploration problem in Markov Decisio...
We consider the problem of designing sample efficient learning algorithms for infinite horizon disco...
We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogen...
We consider the problem of learning the optimal action-value function in disco...
We consider the problem of learning the optimal action-value function in the d...
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finit...
We consider the constrained Markov Decision Process (CMDP) problem, where an agent interacts with ...
In high-stakes applications, active experimentation may be considered too risky and thus data are oft...
We consider an agent interacting with an environment in a single stream of act...
We consider the problem of learning the optimal action-value function in discounted-reward...
We consider the batch (off-line) policy learning problem in the infinite horizon Markov Decision Pro...
We consider the problem of learning a policy for a Markov decision process consistent with data capt...
Several recent works have proposed instance-dependent upper bounds on the number of episodes needed ...
In contrast to the advances in characterizing the sample complexity for solving Markov decision proc...
In this paper we consider the problem of computing an $\epsilon$-optimal policy of a discounted Mark...