I am new to Keras and need your help.
I am training a neural net in Keras, and my loss function is the squared difference between the net's output and the target value (i.e., mean squared error).
I want to optimize this using gradient descent. From what I have read online, three variants of gradient descent are commonly used:
- Stochastic (single-sample) gradient descent: the gradient is computed from only one sample per iteration --> the gradient can be noisy.
- Batch gradient descent: the gradient is the average of the gradients computed over ALL the samples in the dataset --> the gradient is far less noisy, but computing it is intractable for huge datasets.
- Mini-batch gradient descent: like batch GD, except that only a small subset of the samples (determined by batch_size) is used to compute the gradient in each iteration --> not very noisy and computationally tractable --> the best of both worlds (see the sketch just after this list).
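If I understand correctly, all three variants map onto the same Keras call and differ only in batch_size. A minimal sketch of that reading, assuming a compiled Keras model named model and training arrays x and y (the same names as in my fit() call further below):

model.fit(x, y, epochs=10, batch_size=1)       # single-sample (stochastic) GD
model.fit(x, y, epochs=10, batch_size=len(x))  # full-batch GD
model.fit(x, y, epochs=10, batch_size=20)      # mini-batch GD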
Questions:
- I would like to perform mini-batch gradient descent in Keras. How can I do this? Should I use the SGD optimizer?
If SGD is the right choice, how do I set the batch size? There doesn't seem to be a batch_size parameter on the SGD constructor:
optimizer = keras.optimizers.SGD(lr=0.01, decay=0.1, momentum=0.1, nesterov=False)
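For reference, this is roughly how I plug that optimizer in. This is a sketch, not my real network: the two Dense layers and the input dimension are made up, and I'm using Keras's built-in mean_squared_error for my squared-difference loss:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# toy stand-in for my actual network (layer sizes are hypothetical)
model = Sequential()
model.add(Dense(10, activation='relu', input_dim=5))
model.add(Dense(1))

sgd = SGD(lr=0.01, decay=0.1, momentum=0.1, nesterov=False)
# 'mean_squared_error' implements the squared-difference loss
model.compile(optimizer=sgd, loss='mean_squared_error')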
However, there is a batch_size parameter in model.fit():
# note: epochs was called nb_epoch in Keras 1
history = model.fit(x, y, epochs=num_epochs, batch_size=20, verbose=0, validation_split=0.1)
Is this the same as the batch size in mini-batch gradient descent? If not, what exactly does it mean to train on a batch of inputs? Does it mean that batch_size threads run in parallel and update the model weights in parallel?
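My current understanding, which I'd like confirmed, is that each batch yields one gradient (averaged over its batch_size samples) and one weight update, so an epoch performs ceil(N / batch_size) updates rather than spawning parallel threads. A small sketch of that reading (model, x, y as above; train_on_batch is Keras's single-update call):

import math

batch_size = 20
num_samples = len(x)  # x, y as in the fit() call above
# if my reading is right: one weight update per batch
updates_per_epoch = math.ceil(num_samples / batch_size)  # e.g. 1000 samples -> 50 updates

# hand-rolled equivalent of one epoch of fit(..., batch_size=20), under that assumption
# (fit() also shuffles the data by default, which I've omitted here)
for i in range(0, num_samples, batch_size):
    model.train_on_batch(x[i:i + batch_size], y[i:i + batch_size])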
If it helps, here's the Python code snippet I have written so far.