Temporal difference (TD) learning is one of the main foundations of modern reinforcement learning. This thesis studies the use of TD(0), a canonical TD algorithm, to estimate the value function of a given evaluation policy from a batch of data. In this batch setting, we show that TD(0) may converge to an inaccurate value function because the update following an action is weighted according to the number of times that action occurred in the batch -- not the true probability of the action under the evaluation policy. To address this limitation, we introduce policy sampling error corrected-TD(0) (PSEC-TD(0)). PSEC-TD(0) first estimates the empirical distribution of actions in each state in the batch and then uses importance sampling to correct...
Policy evaluation is an essential step in most reinforcement learning approaches. It yields a value ...
We derive an equation for temporal difference learning from statistical principles. Specifically, w...
We provide analytical expressions governing changes to the bias and variance of the lookup table est...
Since the invention of temporal difference (TD) learning (Sutton, 1988), many new algorithms for mod...
Temporal difference (TD) methods constitute a class of methods for learning predictions in multi-ste...
In reinforcement learning, the updating of the value functions determines the information spreading a...
We consider the off-policy evaluation problem in Markov decision processes with function approximati...
A central challenge to applying many off-policy reinforcement learning algorithms to real world prob...
TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, there is ...
In this paper we introduce the idea of improving the performance of parametric temporal-difference (...
Many reinforcement learning approaches rely on temporal-difference (TD) learning to learn a critic. ...
Temporal difference (TD) methods are used by reinforcement learning algorithms for predicting future...
The field of reinforcement learning has long sought to design methods that will reliably learn contro...
A key aspect of artificial intelligence is the ability to learn from experience. If examples of corr...