I am new to Keras and need your help.
I am training a neural net in Keras, and my loss function is the squared difference between the net's output and the target value (i.e., mean squared error).
I want to optimize this using gradient descent. From what I have read online, three variants of gradient descent are commonly used:
- Stochastic (single-sample) gradient descent: the gradient is computed from only one sample per iteration --> the gradient can be noisy.
- Batch gradient descent: the gradient is the average of the gradients computed over ALL the samples in the dataset --> the gradient is far less noisy, but computing it is intractable for huge datasets.
- Mini-batch gradient descent: like batch GD, except that only a small subset of the samples (determined by batch_size) is used to compute the gradient in each iteration --> not very noisy and computationally tractable --> the best of both worlds (see the sketch just after this list).
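If I understand correctly, all three variants map onto the same Keras call and differ only in batch_size. A minimal sketch of that reading, assuming a compiled Keras model named model and training arrays x and y (the same names as in my fit() call further below):

model.fit(x, y, epochs=10, batch_size=1)       # single-sample (stochastic) GD
model.fit(x, y, epochs=10, batch_size=len(x))  # full-batch GD
model.fit(x, y, epochs=10, batch_size=20)      # mini-batch GD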
Questions:
- I would like to perform mini-batch gradient descent in Keras. How can I do this? Should I use the SGD optimizer?
If SGD is the right choice, how do I set the batch size? There doesn't seem to be a batch_size parameter on the SGD constructor:
optimizer = keras.optimizers.SGD(lr=0.01, decay=0.1, momentum=0.1, nesterov=False)
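For reference, this is roughly how I plug that optimizer in. This is a sketch, not my real network: the two Dense layers and the input dimension are made up, and I'm using Keras's built-in mean_squared_error for my squared-difference loss:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

# toy stand-in for my actual network (layer sizes are hypothetical)
model = Sequential()
model.add(Dense(10, activation='relu', input_dim=5))
model.add(Dense(1))

sgd = SGD(lr=0.01, decay=0.1, momentum=0.1, nesterov=False)
# 'mean_squared_error' implements the squared-difference loss
model.compile(optimizer=sgd, loss='mean_squared_error')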
However, there is a batch_size parameter in model.fit():
# note: epochs was called nb_epoch in Keras 1
history = model.fit(x, y, epochs=num_epochs, batch_size=20, verbose=0, validation_split=0.1)
Is this the same as the batch size in mini-batch gradient descent? If not, what exactly does it mean to train on a batch of inputs? Does it mean that batch_size threads run in parallel and update the model weights in parallel?
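My current understanding, which I'd like confirmed, is that each batch yields one gradient (averaged over its batch_size samples) and one weight update, so an epoch performs ceil(N / batch_size) updates rather than spawning parallel threads. A small sketch of that reading (model, x, y as above; train_on_batch is Keras's single-update call):

import math

batch_size = 20
num_samples = len(x)  # x, y as in the fit() call above
# if my reading is right: one weight update per batch
updates_per_epoch = math.ceil(num_samples / batch_size)  # e.g. 1000 samples -> 50 updates

# hand-rolled equivalent of one epoch of fit(..., batch_size=20), under that assumption
# (fit() also shuffles the data by default, which I've omitted here)
for i in range(0, num_samples, batch_size):
    model.train_on_batch(x[i:i + batch_size], y[i:i + batch_size])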
If it helps, here's the Python code snippet I have written so far.