An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss measured by the Kullback–Leibler (KL) divergence at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of f-divergences, and more concretely α-dive...
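To make the divergences named in the abstract concrete, the following minimal Python sketch computes one common parameterization of the α-divergence between two discrete action distributions. The function name `alpha_divergence`, the example distributions, and the parameterization itself (normalized so that α → 1 recovers the KL divergence) are assumptions for illustration; the paper's own definition may differ.

```python
import math


def alpha_divergence(p, q, alpha):
    """Alpha-divergence D_alpha(p || q) for discrete distributions.

    Assumes p and q are strictly positive and each sums to 1.
    Parameterization (an assumption, not taken from the paper):
        D_alpha = (sum_i p_i^alpha * q_i^(1 - alpha) - 1) / (alpha * (alpha - 1))
    with the limits alpha -> 1 giving KL(p || q) and alpha -> 0 giving KL(q || p).
    """
    if abs(alpha - 1.0) < 1e-12:
        # Limit alpha -> 1: the usual KL divergence KL(p || q).
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    if abs(alpha) < 1e-12:
        # Limit alpha -> 0: the reverse KL divergence KL(q || p).
        return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q))
    total = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (total - 1.0) / (alpha * (alpha - 1.0))


# Hypothetical new and old policies over three actions in a single state.
new_policy = [0.5, 0.3, 0.2]
old_policy = [0.4, 0.4, 0.2]

kl = alpha_divergence(new_policy, old_policy, 1.0)
half_pearson = alpha_divergence(new_policy, old_policy, 2.0)
```

In this parameterization, α = 2 gives half the Pearson chi-squared divergence, so varying α moves between differently shaped penalties on how far a policy update may deviate from the previous policy.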
We consider the problem of estimating the policy gradient in Partially Observable Markov Decision Pr...
We consider an agent interacting with an environment in a single stream of actions, observations, an...
We consider the problem of minimizing the long term average expected regret of an agent in an online...
ICML 2019. Many recent successful (deep) reinforcement learning algorithms make ...
Trajectory-Centric Reinforcement Learning and Trajectory Optimization methods optimize a sequence of...
What are the functionals of the reward that can be computed and optimized exactly in Markov Decision...
We consider the problem of learning a policy for a Markov decision process consistent with data capt...
Policy regularization methods such as maximum entropy regularization are widely used in reinforcemen...
Reinforcement learning (RL) is an important field of research in machine learning that is increasing...
We consider an MDP setting in which the reward function is allowed to change during each time step o...
A new approach to computation of optimal policies for MDP (Markov decision pro...
Actor-critic algorithms are amongst the most well-studied reinforcement learning algorithms...
Policy Mirror Descent (PMD) is a general family of algorithms that covers a wide range of novel and ...