
I'm working with a very basic feedforward neural language model: 3 words of context, word vectors of size 100, one hidden layer of size 200, a vocabulary of 1000 words, and a softmax output layer predicting the next word.
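
For concreteness, the forward pass I have in mind is roughly the following (a minimal numpy sketch; the names E, W1, W2 and the tanh nonlinearity are only illustrative, not necessarily my exact implementation):

    import numpy as np

    # Illustrative dimensions matching the description above.
    V, D, H, N = 1000, 100, 200, 3        # vocabulary, vector size, hidden size, context words

    rng = np.random.default_rng(0)
    E  = rng.normal(0, 0.01, size=(V, D))       # word vectors
    W1 = rng.normal(0, 0.01, size=(N * D, H))   # concatenated context -> hidden
    b1 = np.zeros(H)
    W2 = rng.normal(0, 0.01, size=(H, V))       # hidden -> output scores
    b2 = np.zeros(V)

    def predict_next(context_ids):
        # context_ids: N word indices; returns a softmax distribution over the vocabulary.
        x = E[context_ids].reshape(-1)          # concatenate the 3 context vectors, shape (N*D,)
        h = np.tanh(x @ W1 + b1)                # hidden layer of size 200
        scores = h @ W2 + b2
        scores -= scores.max()                  # subtract max for numerical stability
        p = np.exp(scores)
        return p / p.sum()

    p = predict_next(np.array([12, 7, 431]))    # probabilities for each of the 1000 next words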

Previously I have trained similar models with stochastic (online) gradient descent, updating the parameters after each training example. This time I wanted to try batch learning to speed up the matrix operations. However, the system takes a big hit in accuracy (a rise in perplexity) going from batch size 1 to 2, and it gets even worse with bigger batches.

Training on 200K words for 10 epochs, the perplexities are as follows:

Batch size    Perplexity
1             61.13
2             189.69
10            566.53

Batch sizes 1 and 2 converge to these values; batch size 10 actually diverges. I also tried halving the learning rate after each epoch, which gave 415.92 with batch size 10.

Initially I was hoping that I simply had a bug somewhere, but as far as I can tell that's not the case. I even created batches filled with copies of the same example, dividing the learning rate by the batch size, and the system gave the same result as stochastic training.

I expected batch training to give a small increase in perplexity, since the parameters are updated less frequently, but I also thought this would be offset by the benefit of averaging the parameter updates before applying them.
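
To be explicit about the two regimes I'm comparing, here is the kind of update I mean, shown on a toy squared-loss model rather than the LM itself (eta, B and the grad function are purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # toy inputs, stand-ins for the LM's examples
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

    def grad(w, xi, yi):
        # Gradient of the squared error 0.5 * (xi @ w - yi)**2 for a single example.
        return (xi @ w - yi) * xi

    eta, B = 0.05, 10

    # Stochastic / online: one parameter update per example.
    w_online = np.zeros(5)
    for xi, yi in zip(X, y):
        w_online -= eta * grad(w_online, xi, yi)

    # Mini-batch: average the per-example gradients, then apply a single update per batch.
    w_batch = np.zeros(5)
    for start in range(0, len(X), B):
        xb, yb = X[start:start + B], y[start:start + B]
        g = np.mean([grad(w_batch, xi, yi) for xi, yi in zip(xb, yb)], axis=0)
        w_batch -= eta * g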

Has anyone compared batch vs. stochastic training on a feedforward neural network LM? Does such a result make sense on this task? Is there something that should be handled differently when training in batches? Or do you think I still have a sneaky bug somewhere? Many thanks for any suggestions!

2 Answers

Whenever you change the model configuration, you'll probably have to change the learning rate, any momentum parameters, and the number of epochs you use for training. I wouldn't be surprised if the best learning rate, the best momentum parameters, and the best number of epochs were all different for each batch size.

Additionally, if your objective function is a sum of losses instead of a mean of losses, increasing the batch size has the implicit effect of increasing the learning rate. See: Mean or sum of gradients for weight updates in SGD
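
A rough sketch of that effect, with made-up per-example gradients (nothing here is taken from your code):

    import numpy as np

    eta, B = 0.1, 10
    g = np.random.default_rng(0).normal(size=(B, 5))   # toy per-example gradients in one batch

    step_mean = eta * g.mean(axis=0)    # objective = mean of the B losses
    step_sum  = eta * g.sum(axis=0)     # objective = sum of the B losses

    # The sum-based step is exactly B times larger: an effective learning rate of B * eta.
    assert np.allclose(step_sum, B * step_mean)

With a mean objective the step size stays comparable across batch sizes; with a sum objective you would typically divide the learning rate by the batch size to compensate.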


Theoretically, that should not be the case. If you have 100 patterns, for example, why don't you run your batch code with a batch size of 1, repeated for all 100 patterns? Will you get the same results as with online training? If yes, then your code is sound.
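
For instance, a check along these lines on a toy squared-loss model (purely illustrative, not your LM code):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))            # 100 toy "patterns"
    y = X @ rng.normal(size=5)

    def batch_grad(w, xb, yb):
        # Mean gradient of 0.5 * (x @ w - y)**2 over a (possibly size-1) batch.
        return ((xb @ w - yb) @ xb) / len(xb)

    eta = 0.05

    # Online training: one update per pattern.
    w_online = np.zeros(5)
    for xi, yi in zip(X, y):
        w_online -= eta * batch_grad(w_online, xi[None, :], np.array([yi]))

    # The batch code run with a batch size of 1, visiting the patterns in the same order.
    w_b1 = np.zeros(5)
    for start in range(len(X)):
        w_b1 -= eta * batch_grad(w_b1, X[start:start + 1], y[start:start + 1])

    # If the batch implementation is sound, the two runs match (up to floating point).
    assert np.allclose(w_online, w_b1)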

  • The results I showed above are produced with the same code, actually: the first experiment uses a batch size of 1, which is the same as online training. I have an alternative LM implementation that I used to check that this result is valid, but unfortunately I don't have an alternative batch-learning implementation.
    – Marek
    Commented Dec 9, 2014 at 15:36
  • At least it's not obvious that the code is unsound in that case.
    – danijar
    Commented May 21, 2016 at 1:31
