We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback. In this sequential decision-making problem, the learner cannot directly observe its rewards, but instead sees the ones obtained by an unknown behavior policy run in parallel. Instead of a standard exploration-exploitation dilemma, the learner has to face another challenge in this setting: due to limited observations outside of its control, the learner may not be able to estimate the value of each policy equally well. To address this issue, we propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior pol...
We consider the problem of learning to play a repeated multi-agent game with an unknown reward funct...
The greedy algorithm has been extensively studied in the field of combinatorial optimization for decades....
We consider online learning in finite stochastic Markovian environments where in each time ...
We study online reinforcement learning in linear Markov decision processes with adversarial losses a...
We consider online learning in finite stochastic Markovian environments where ...
Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this pape...
We introduce and study a partial-information model of online learning, where a decision maker repeat...
We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds...
Evaluating the performance of an ongoing policy plays a vital role in many areas such as medicine an...
We consider online learning problems under a partial observability model cap...
Online learning or sequential decision making is formally defined as a repeated game between an adve...
In this paper, we study the problem of efficient online reinforcement learning in the infinite horiz...
Online learning algorithms are designed to learn even when their input is generated by an adversary....
We consider an adversarial online learning setting where a decision maker can choose an action in ev...
We study a partial-information online-learning problem where actions are restricted to noisy...