In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages on importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergenc...
We propose a novel hybrid stochastic pol-icy gradient estimator by combining an un-biase...
Policy gradient methods are reinforcement learning algorithms that adapt a pa-rameterized policy by ...
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by f...
In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic varian...
In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic varian...
In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic varian...
International audienceIn this paper, we propose a novel reinforcement-learning algorithm consisting ...
International audienceIn this paper, we propose a novel reinforcement-learning algorithm consisting ...
In a reinforcement learning task an agent must learn a policy for performing actions so as to perfo...
We present an in-depth survey of policy gradient methods as they are used in the machine learning co...
We present an in-depth survey of policy gradient methods as they are used in the machine learning co...
The Markov decision process (MDP) formulation used to model many real-world sequential decision maki...
The Markov decision process (MDP) formulation used to model many real-world sequential decision maki...
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper propo...
Optimizing via stochastic gradients is a powerful and exible technique ubiquitously used in machine ...
We propose a novel hybrid stochastic pol-icy gradient estimator by combining an un-biase...
Policy gradient methods are reinforcement learning algorithms that adapt a pa-rameterized policy by ...
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by f...
In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic varian...
In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic varian...
In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic varian...
International audienceIn this paper, we propose a novel reinforcement-learning algorithm consisting ...
International audienceIn this paper, we propose a novel reinforcement-learning algorithm consisting ...
In a reinforcement learning task an agent must learn a policy for performing actions so as to perfo...
We present an in-depth survey of policy gradient methods as they are used in the machine learning co...
We present an in-depth survey of policy gradient methods as they are used in the machine learning co...
The Markov decision process (MDP) formulation used to model many real-world sequential decision maki...
The Markov decision process (MDP) formulation used to model many real-world sequential decision maki...
Improving sample efficiency has been a longstanding goal in reinforcement learning. This paper propo...
Optimizing via stochastic gradients is a powerful and exible technique ubiquitously used in machine ...
We propose a novel hybrid stochastic pol-icy gradient estimator by combining an un-biase...
Policy gradient methods are reinforcement learning algorithms that adapt a pa-rameterized policy by ...
Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by f...