13
$\begingroup$

I have been reading Reinforcement Learning: An Introduction by Sutton and Barto (2012) and I have come across the batch learning method. Unfortunately, this method is not described in much detail in the book, and the scientific articles on batch learning are still too advanced for me.

Could anyone please elaborate on the method and provide an example pseudo-algorithm for it?

Batch updating is mentioned in Sutton and Barto on page 137 in the chapter "Optimality of TD(0)"; here is the relevant paragraph:

Suppose there is available only a finite amount of experience, say 10 episodes or 100 time steps. In this case, a common approach with incremental learning methods is to present the experience repeatedly until the method converges upon an answer [......] We call this batch updating because updates are made only after processing each complete batch of training data.

After searching for some articles on Google, I tried to understand this one, but was not very successful.

$\endgroup$
1
  • $\begingroup$ Could you give a reference or quote where you saw the term with more detail (the Sutton and Barto book is hundreds of pages, and the 2012 edition is not easily accessible; the most recent is June 2017)? Also, a reference to one of the scientific papers you struggled with? This will help readers identify which batch learning is being referred to. I suspect it is not a reinforcement learning method, but a reference to supervised learning. $\endgroup$ Commented Aug 13, 2017 at 19:48

2 Answers

10
$\begingroup$

The reference to batch updating is not about any new or undescribed reinforcement learning method, but just a subtle re-ordering of how the experience and updates interact. You can use batch updates where experience is in short supply (as opposed to computation time).

Essentially you can just use the standard TD update mechanism, but instead of taking each piece of experience once as it is observed, you store and re-play the trajectories that you have seen so far (i.e. without selecting new actions). You can keep repeating them, almost like policy evaluation in Dynamic Programming (except only using your sampled trajectories), until the value estimates only change by a small amount, or maybe keep repeating for, say, 10 seconds until the next pieces of experience arrive.

Obviously this approach has limitations. It can only evaluate actions that have been taken so far; it cannot explore further. It will probably suffer from sampling bias in the value estimates. However, it will represent in some ways the best estimates of state values seen so far, and provided the stochastic parts of the MDP or policy are not too wild, it may be a good choice where gaining experience is costly or time-consuming.

In terms of implementation, you could do something like this (for SARSA(0)):

Initialize $Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$

Repeat:

$\qquad$ Initialize $S$

$\qquad$ Choose $A$ from $S$ using policy derived from $Q$ (e.g., ε-greedy)

$\qquad$ Repeat (for each step of episode):

$\qquad\qquad$ Take action $A$, observe $R, S'$

$\qquad\qquad$ Choose $A'$ from $S'$ using policy derived from $Q$

$\qquad\qquad$ Store five sampled variables $S, A, R, S', A'$ in experience table

$\qquad\qquad$ If experience table is large enough, repeat:

$\qquad\qquad\qquad$ For each $S, A, R, S', A'$ in experience table:

$\qquad\qquad\qquad\qquad Q(S, A) \leftarrow Q(S, A) + \alpha[R + \gamma Q(S', A') - Q(S, A)]$

$\qquad\qquad$ Until computation time runs out, or max changes in Q are small

$\qquad\qquad$ If experience table was used, then clear it (as policy will have changed)

$\qquad\qquad S \leftarrow S'; A \leftarrow A'$

$\qquad$ until $S$ is terminal
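
If it helps to see this in runnable form, here is a minimal Python sketch of that pseudocode (my own, not from the book). It assumes a small Gym-style environment (reset() returns a state, step(action) returns (next_state, reward, done, info)) with discrete, hashable states and `n_actions` discrete actions, and it uses a fixed number of replay sweeps in place of the "max changes in $Q$ are small" test:

```python
import random
from collections import defaultdict

def batch_sarsa(env, n_actions, num_episodes=100, alpha=0.1, gamma=0.95,
                epsilon=0.1, batch_size=500, replay_sweeps=20):
    """Tabular SARSA(0) with batch replay of stored transitions."""
    Q = defaultdict(float)   # Q[(state, action)]; unseen/terminal entries default to 0
    experience = []          # stored (S, A, R, S', A', done) tuples

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(next_state)
            experience.append((state, action, reward, next_state, next_action, done))

            # Batch step: once enough experience is stored, replay it repeatedly.
            if len(experience) >= batch_size:
                for _ in range(replay_sweeps):
                    for s, a, r, s2, a2, d in experience:
                        target = r if d else r + gamma * Q[(s2, a2)]
                        Q[(s, a)] += alpha * (target - Q[(s, a)])
                experience.clear()   # the policy has changed, so discard old samples

            state, action = next_state, next_action
    return Q
```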

There are obviously a few variations of this: you can interleave the experience replay/batch in different places, which might work better for episodic or continuing problems. You could keep the online learning updates too, possibly with a different learning rate. If you have an episodic problem, then your experience table really should contain at least one episode end, and definitely so if $\gamma = 1$. If you are performing policy evaluation of a fixed policy (as opposed to control, where you adjust the policy), then you can keep the older experience; you don't need to cull it.

Furthermore, this is very similar to the experience replay approaches that are useful when using estimators such as neural networks. In that case your experience table can be sampled to create mini-batch updates for supervised-style training of the estimator. In an off-policy scenario like Q-learning you can drop the storage of $A'$ and substitute the maximising action at the time of replay, which also allows you to keep more history.
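
For example, a single replay sweep for that off-policy case might look like the following sketch (hypothetical function name; it assumes the same tabular `defaultdict`-style $Q$ table as in the sketch above, with transitions stored without $A'$):

```python
from collections import defaultdict

def q_learning_replay_sweep(Q, experience, n_actions, alpha=0.1, gamma=0.95):
    """One replay sweep of off-policy (Q-learning) updates over stored
    (S, A, R, S', done) transitions; A' is not stored, and the maximising
    action is taken at replay time instead."""
    for s, a, r, s2, done in experience:
        best_next = max(Q[(s2, a2)] for a2 in range(n_actions))
        target = r if done else r + gamma * best_next
        Q[(s, a)] += alpha * (target - Q[(s, a)])

# Example usage with the same Q-table convention as above:
# Q = defaultdict(float); q_learning_replay_sweep(Q, experience, n_actions=4)
```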

$\endgroup$
2
  • $\begingroup$ Thank you very much. The experience replay seems quite similar to the Action replay process (ARP) in the proof of convergence of Q-learning by Watkins (1989). So is it that I sample some transitions from the environment and then replay them to the learning algorithm, to obtain as good an estimate of state (or state-action) values as possible given the experience I have? $\endgroup$
    – Jan Vainer
    Commented Aug 14, 2017 at 19:37
  • $\begingroup$ @LordofLuck: yes, the goal is to squeeze the maximum learning out of the available experience. For the batch learning in Sutton & Barto I think they mean something like the pseudo-code in my answer. But in practice with neural-network estimators in recent developments like DQN, the approach is to sample a number of transitions from previous experience and train with them on each step. $\endgroup$ Commented Aug 14, 2017 at 19:59
5
$\begingroup$

This repository provides implementations of the algorithms from the Sutton book:

https://github.com/dennybritz/reinforcement-learning

As for your query about the differences between batch and online learning, perhaps you can refer to a source other than Sutton for the answer, for example here:

How are weights updated in the batch learning method in neural networks?

But my understanding is that, in general, online learning has much better behavior:

https://visualstudiomagazine.com/Articles/2014/08/01/Batch-Training.aspx?p=1


https://en.wikipedia.org/wiki/Online_machine_learning
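
To make the batch vs. online distinction concrete, here is a minimal sketch (my own, not taken from any of the linked articles) of the two update schedules for a least-squares linear model trained by gradient descent:

```python
import numpy as np

def online_sgd(X, y, lr=0.01, epochs=50):
    """Online (per-example) training: the weights change after every sample."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            w += lr * (y_i - x_i @ w) * x_i   # gradient step for a single example
    return w

def batch_gd(X, y, lr=0.01, epochs=50):
    """Batch training: one weight change per full pass over the data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (y - X @ w) / len(y)     # average gradient over the whole batch
        w += lr * grad
    return w
```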

$\endgroup$
1
  • 2
    $\begingroup$ Thank you very much for the link to the exercises! I will definitely make use of them. Unfortunately, I do not think that batch updating in reinforcement learning is necessarily connected to neural networks. I have edited the question with the exact pages where batch updating is mentioned. $\endgroup$
    – Jan Vainer
    Commented Aug 14, 2017 at 18:18
