16
$\begingroup$

I am new to Keras and need your help.

I am training a neural net in Keras and my loss function is the squared difference between the net's output and the target value.

I want to optimize this using gradient descent. After going through some links on the net, I have come to know that there are three variants of gradient descent in general use:

  1. Single-sample (stochastic) gradient descent: Here, the gradient is computed from only one sample per iteration --> the gradient estimate can be noisy.
  2. Batch gradient descent: Here, the gradient is the average of the gradients computed from ALL the samples in the dataset --> the gradient is more stable, but computing it becomes too expensive for huge datasets.
  3. Mini-batch gradient descent: Similar to batch GD, but instead of the entire dataset, only a few samples (determined by batch_size) are used to compute the gradient in each iteration --> not very noisy and computationally tractable --> best of both worlds (update rules sketched below).
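
To make this concrete for myself, here are the three update rules as I understand them (my own notation: $\theta$ are the weights, $\eta$ the learning rate, $\ell$ the per-sample loss, $N$ the dataset size, $B$ a mini-batch):

$$\text{single sample:}\quad \theta \leftarrow \theta - \eta \, \nabla_\theta \, \ell(x_i, y_i; \theta)$$

$$\text{batch:}\quad \theta \leftarrow \theta - \eta \, \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \, \ell(x_i, y_i; \theta)$$

$$\text{mini-batch:}\quad \theta \leftarrow \theta - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \, \ell(x_i, y_i; \theta), \qquad |B| = \texttt{batch\_size}$$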

Questions:

  1. I would like to perform Mini-batch Gradient Descent in Keras. How can I do this? Should I use the SGD optimizer?
  2. If SGD is to be used, how do I set the batch_size? There doesn't seem to be a parameter to the SGD function to set batch_size.

    optimizer = keras.optimizers.SGD(lr=0.01, decay=0.1, momentum=0.1, nesterov=False)
    
  3. There is a batch_size parameter in model.fit() in Keras.

    history = model.fit(x, y, nb_epoch=num_epochs, batch_size=20, verbose=0, validation_split=0.1)
    

    Is this the same as the batch size in mini-batch gradient descent? If not, what exactly does it mean to train on a batch of inputs? Does it mean that batch_size threads run in parallel and update the model weights in parallel?

If it helps, here's the Python code snippet I have written so far.

$\endgroup$

2 Answers

13
$\begingroup$

Yes, you are right. In Keras, batch_size refers to the batch size in mini-batch gradient descent. If you want to run batch gradient descent, you need to set batch_size to the number of training samples. Your code looks fine, except that I don't understand why you store the return value of model.fit in an object called history.
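
For illustration, here is a minimal sketch of how I would set it up, using the same Keras 1.x-style API as your snippet (x_train, y_train and the layer sizes are placeholders, not your actual data or architecture):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    # Placeholder data: 100 samples, 10 features, 1 regression target.
    x_train = np.random.rand(100, 10)
    y_train = np.random.rand(100, 1)

    model = Sequential()
    model.add(Dense(32, input_dim=10, activation='relu'))
    model.add(Dense(1))

    # The optimizer only defines the update rule; the batch size is NOT set here.
    sgd = SGD(lr=0.01, decay=0.1, momentum=0.1, nesterov=False)
    model.compile(optimizer=sgd, loss='mse')  # squared-difference loss

    # Mini-batch gradient descent: 20 samples per weight update.
    history = model.fit(x_train, y_train, nb_epoch=50, batch_size=20, verbose=0)

    # Batch gradient descent: the whole training set per update.
    # history = model.fit(x_train, y_train, nb_epoch=50,
    #                     batch_size=len(x_train), verbose=0)

    # The History object returned by fit records the per-epoch loss values.
    print(history.history['loss'][:3])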

$\endgroup$
1
  • 10
    $\begingroup$ He stores it in a history object because in Keras, the "fit" function does not return the trained model itself; it returns a History object that holds a reference to the model, the per-epoch training history, and other details. There are examples like that among the Keras examples. $\endgroup$ Commented Oct 17, 2016 at 22:42
4
$\begingroup$

Setting theoretical considerations aside, given a real-life dataset and the size of a typical modern neural network, it would usually take unreasonably long to train on batches of size one, and you won't have enough RAM and/or GPU memory to train on the whole dataset at once. So the question is usually not *whether* mini-batches should be used, but *what size* of batches you should use. The batch_size argument is the number of observations to train on in a single step; smaller sizes often work better because they have a regularizing effect. Moreover, people often use more elaborate optimizers (e.g. Adam, RMSprop) and other regularization tricks, which makes the relation between model performance, batch size, learning rate and computation time more complicated.
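
If you want to see the trade-off in practice, here is a rough sketch (placeholder data and a hypothetical build_model helper; the numbers will of course depend on your problem) that trains the same architecture with a few batch sizes and compares the final validation loss:

    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import Adam

    # Placeholder data: 200 samples, 10 features, 1 regression target.
    x_train = np.random.rand(200, 10)
    y_train = np.random.rand(200, 1)

    def build_model():
        # Build a fresh model for each run so every comparison starts
        # from new random weights.
        m = Sequential()
        m.add(Dense(32, input_dim=10, activation='relu'))
        m.add(Dense(1))
        m.compile(optimizer=Adam(lr=0.001), loss='mse')
        return m

    for bs in [16, 64, 256]:
        h = build_model().fit(x_train, y_train, nb_epoch=20, batch_size=bs,
                              validation_split=0.1, verbose=0)
        print('batch_size=%d  final val_loss=%.4f'
              % (bs, h.history['val_loss'][-1]))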

$\endgroup$
5
  • $\begingroup$ Thank you for clarifying this. But how would I do data normalization in this way? I know I could normalize all the training and testing data together, but then the mini-batches fed into the optimization process would no longer be normalized. $\endgroup$
    – Mr.Robot
    Commented Jul 31, 2019 at 17:08
  • $\begingroup$ @Mr.Robot why would you assume that each batch needs to be independently normalized? $\endgroup$
    – Tim
    Commented Jul 31, 2019 at 18:18
  • $\begingroup$ I previously read a post saying that training and testing data should be handled separately (i.e. do the train-test split first, then process each part individually). I thought this would also apply when optimizing with mini-batches. I suspected this might be good for convergence? $\endgroup$
    – Mr.Robot
    Commented Jul 31, 2019 at 18:54
  • $\begingroup$ @Mr.Robot reductio ad absurdum: with mini-batches of size 1, would you pass as your data only 0's? You can ask it as separate question if you wish. $\endgroup$
    – Tim
    Commented Jul 31, 2019 at 19:44
  • $\begingroup$ Thank you for pointing this out! Now I know the issues with my reasoning. $\endgroup$
    – Mr.Robot
    Commented Jul 31, 2019 at 20:06
