We consider reinforcement learning in a discrete, undiscounted, infinite-horizon Markov Decision Problem (MDP) under the average reward criterion, and focus on minimizing the regret with respect to an optimal policy when the learner knows neither the rewards nor the transitions of the MDP. In light of their success at regret minimization in multi-armed bandits, popular bandit strategies, such as the optimistic UCB, KL-UCB or the Bayesian Thompson sampling strategy, have been extended to the MDP setup. Despite some key successes, existing strategies for solving this problem either fail to be provably asymptotically optimal, or suffer from a prohibitive burn-in phase and computational complexity when implemented ...
We study the regret of reinforcement learning from offline data generated by a fixed behavior policy...
We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average ...
Reinforcement learning policies face the exploration versus exploitation dilemma, i.e. the...
We consider a class of sequential decision making problems in the presence of uncertainty, which bel...
This thesis investigates sequential decision making tasks that fall in the framework of reinforcemen...
We consider the problem of minimizing the long term average expected regret of an agent in an online...
The problem of reinforcement learning in an unknown and discrete Markov Decisi...
Reinforcement learning (RL) has gained an increasing interest in recent years, being expected to del...
We consider an agent interacting with an environment in a single stream of actions, observations, an...
Information-directed sampling (IDS) has revealed its potential as a data-efficient algorithm for rei...
We study online reinforcement learning in linear Markov decision processes with adversarial losses a...
We consider the restless Markov bandit problem, in which the state of each arm evolves according to ...
Any reinforcement learning algorithm that applies to all Markov decision processes (MDPs) will suffer ...