We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are infeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm in which exploration is induced by perturbing the least-squares approximation of the action-value function. Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$, where $d$ is the feature dimension, $H$ is the horizon, an...
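The core mechanism described above — fitting a regularized least-squares estimate of the action-value weights and perturbing it with Gaussian noise to drive exploration — can be sketched as follows. This is a minimal illustration of vanilla RLSVI on a toy low-rank MDP; all names (`rlsvi_weights`, `lam`, `sigma`) and problem sizes are hypothetical, and the optimistic initialization from the abstract's variant is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy low-rank MDP (all sizes hypothetical): S states, A actions,
# horizon H, and a feature map phi(s, a) in R^d.
S, A, H, d = 5, 3, 4, 6
phi = rng.normal(size=(S, A, d))            # feature map phi(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))  # transition kernel P(.|s, a)
R = rng.uniform(size=(S, A))                # mean rewards

def rlsvi_weights(history, lam=1.0, sigma=1.0):
    """One backward pass of randomized least-squares value iteration.

    history[h] holds (s, a, r, s') transitions observed at step h.
    Returns perturbed weights w so that Q_h(s, a) ~= phi(s, a) @ w[h].
    """
    w = np.zeros((H + 1, d))  # w[H] = 0: value beyond the horizon
    for h in reversed(range(H)):
        Lam = lam * np.eye(d)  # regularized Gram matrix
        b = np.zeros(d)
        for (s, a, r, s2) in history[h]:
            f = phi[s, a]
            Lam += np.outer(f, f)
            # Regression target: reward plus greedy next-step value.
            b += f * (r + (phi[s2] @ w[h + 1]).max())
        w_hat = np.linalg.solve(Lam, b)  # least-squares estimate
        # RLSVI's key step: perturb the estimate with Gaussian noise whose
        # covariance is the inverse Gram matrix, inducing exploration.
        w[h] = rng.multivariate_normal(w_hat, sigma**2 * np.linalg.inv(Lam))
    return w[:H]

# Run a few episodes, acting greedily in the perturbed Q-function.
history = [[] for _ in range(H)]
for _ in range(50):
    w = rlsvi_weights(history)
    s = 0
    for h in range(H):
        a = int(np.argmax(phi[s] @ w[h]))  # greedy action under noisy Q
        r = R[s, a] + 0.1 * rng.normal()   # noisy reward observation
        s2 = int(rng.choice(S, p=P[s, a]))
        history[h].append((s, a, r, s2))
        s = s2
```

The noise scale `sigma` plays the role that optimism plays in bonus-based methods: directions of the feature space with little data have large inverse-Gram covariance, so the perturbed weights occasionally overestimate their value and the greedy policy is drawn to explore them.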