We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates, a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time T is within C(P) log T of the reward obtained by the optimal policy, where C(P) is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it...
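The abstract's key step, optimistically maximizing estimated future reward over a set of transition probabilities close to the empirical estimates, can be sketched as follows. This is an illustrative reconstruction, not OLP itself: the function name, the L1-ball confidence set, and the greedy closed-form solution of the inner linear program (standard in optimistic-RL analyses) are all assumptions made for the sketch.

```python
def optimistic_transition(p_hat, u, eps):
    """Maximize sum_j p[j]*u[j] over distributions p with ||p - p_hat||_1 <= eps.

    This inner maximization is a linear program; the greedy procedure
    below is its well-known closed-form solution: shift up to eps/2 of
    probability mass onto the most valuable next state, taking the
    excess away from the least valuable ones.
    """
    n = len(p_hat)
    p = list(p_hat)
    best = max(range(n), key=lambda j: u[j])
    # Put as much mass as the confidence set allows on the best state.
    p[best] = min(1.0, p_hat[best] + eps / 2.0)
    # Remove the excess mass from states in increasing order of value.
    excess = sum(p) - 1.0
    for j in sorted(range(n), key=lambda j: u[j]):
        if excess <= 0:
            break
        if j == best:
            continue
        cut = min(p[j], excess)
        p[j] -= cut
        excess -= cut
    return p
```

An optimistic action choice would then pick, for each action, the reward estimate plus the optimistic value `sum(p[j] * u[j])` under the returned `p`, and act greedily with respect to those optimistic values.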
We consider an agent interacting with an environment in a single stream of actions, observations, an...
In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust o...
In this paper, a mapping is developed between the ‘multichain’ and ‘unichain’ linear programs for ave...
The precise specification of reward functions for Markov decision processes (MDPs) is often extremel...
We consider the problem of minimizing the long term average expected regret of an agent in an online...
© 2017 AI Access Foundation. All rights reserved. Markov Decision Processes (MDPs) are an effective ...
Markov decision processes (MDPs) have proven to be a useful model for sequential decision- theoretic...
We study the role of the representation of state-action value functions in reg...
Bounded parameter Markov Decision Processes (BMDPs) address the issue of dealing with uncertainty in...
We consider a class of sequential decision making problems in the presence of uncertainty, which bel...
We study reinforcement learning (RL) with linear function approximation. For episodic time-inhomogen...
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer ...
We consider Markov decision processes (MDPs) with multiple limit-average (or mean-payoff) objectives...
We consider reinforcement learning in a discrete, undiscounted, infinite-horiz...