
I'm new to machine learning, and am confused about some aspects of stochastic gradient descent.

I've read in several places that, when using vectorized code, the reason mini-batching is needed in stochastic gradient descent is to avoid the performance degradation that occurs when memory limits are hit by too-large mini-batches. But in my code, long before I see any such degradation in processing rate (s/epoch), I see a completely different effect: a dramatic decrease in the rate of learning.

[figure: learning speed falling off as mini-batch size grows]

This is no surprise: if the processing rate is nearly flat, the number of gradient-descent updates per second is inversely proportional to batch size, since s/mini-batch must be increasing. And since (for fixed $\eta$) it is this update rate that determines the speed of learning, larger mini-batches must, barring some dramatic, unexpected, and unobserved increase in processing rate with mini-batch size, always learn more slowly than smaller ones.

[figure: processing rate roughly flat while updates per second fall with mini-batch size]
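To make this concrete, here is a minimal benchmark sketch (a toy stand-in, not my actual training code: the data shapes and the single matrix multiply per "update" are made up for illustration). It measures both quantities: samples/s levels off once vectorization kicks in, while updates/s falls roughly as $1/m$:

```python
import time
import numpy as np

# Toy stand-in for one gradient update: a single matrix multiply per batch.
rng = np.random.default_rng(0)
X = rng.standard_normal((60_000, 784)).astype(np.float32)
W = rng.standard_normal((784, 30)).astype(np.float32) * 0.01

def train_step(batch):
    return batch @ W  # placeholder for a real forward/backward pass

for m in (1, 10, 100, 1_000, 10_000):
    t0 = time.perf_counter()
    n_updates = 0
    for i in range(0, len(X), m):
        train_step(X[i:i + m])
        n_updates += 1
    dt = time.perf_counter() - t0
    print(f"m={m:>6}: {len(X)/dt:>12.0f} samples/s, {n_updates/dt:>10.0f} updates/s")
```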

(Moreover, any per-epoch analysis of performance, especially on the training set, will add additional time that grows with mini-batch size.)

It seems, then, that choosing a mini-batch size comes down to selecting a value just large enough for vectorization to "kick in" (very small batches perform poorly in that regard) and going no further, except perhaps to reduce learning noise.

Is this, in fact, how one chooses a batch size in practice: by picking the lowest possible "well behaved" batch size (e.g. perhaps where the "level off" point is for processing rate) for one’s algorithm, and using that for all networks and training sets? If so, will that number remain the same for my code on all hardware, or is it an unpredictable function of hardware and software? And — more importantly — what of the story that the reason for mini-batching is to avoid memory issues?
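If that heuristic is right, it could even be automated from a benchmark like the one above; here is a minimal sketch (the tolerance and the measurements are made up for illustration):

```python
# Hypothetical heuristic: from measured throughput per batch size, pick the
# smallest batch whose samples/s is within some tolerance of the plateau.
def smallest_good_batch(batch_sizes, samples_per_s, tol=0.10):
    peak = max(samples_per_s)
    for m, rate in sorted(zip(batch_sizes, samples_per_s)):
        if rate >= (1 - tol) * peak:
            return m
    return max(batch_sizes)

# made-up measurements for illustration:
print(smallest_good_batch([1, 10, 100, 1000], [5e4, 4e5, 9.5e5, 1.0e6]))  # -> 100
```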


This is consistent with what I see across the board, for both training and test accuracies, for a range of network sizes (the number is the number of nodes in the hidden layer of a simple 3-layer "beginner" network):

[figures: training and test accuracy curves for a range of mini-batch sizes and hidden-layer sizes]

  • I would add one observation. You say "other than to reduce learning noise". I believe this is not a factor at all. If you split a batch into two smaller batches and lower the learning rate appropriately, you will on average get slightly better learning from the two batches: this principle extends right down to single-sample batches. It is only the advantage of faster parallel processing outweighing this that makes it useful to use mini-batches of size greater than 1.
    – Elroch
    Commented May 25, 2018 at 12:33

1 Answer


Deciding on a batch size, or more generally, deciding between full-batch learning and breaking the training data into mini-batches, comes down to the questions of how fast the model converges overall and how biased or over-fitted it ends up being. Hardware limitation is of course another factor: if you have 64 GB of training data, full-batch learning on an ordinary machine is not a good option; you get much better processing performance training on mini-batches that fit in memory, without having to shuttle data back and forth through RAM on every iteration. This question explains it very well.
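As a concrete sketch of the memory point (the file name and shapes are made up; memory-mapping via NumPy is one way, among others, to page batches in from disk on demand):

```python
import numpy as np

# Assumed setup: "train.npy" holds a float32 array far larger than RAM.
# With mmap_mode="r", NumPy memory-maps the file, so each mini-batch slice
# is paged in from disk on demand instead of the whole array being loaded.
X = np.load("train.npy", mmap_mode="r")

batch_size = 100
for i in range(0, X.shape[0], batch_size):
    batch = np.asarray(X[i:i + batch_size])  # copies only this slice into RAM
    # ... run one gradient update on `batch` here ...
```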

You have not mentioned which training data and which cost-minimization algorithm you used in your analysis, but all of these factors shape how any given regime performs.

But limitations aside, you could look at this problem in a slightly different way. What gives larger batches their edge is the simple fact that your implementation can vectorize the training set and perform the matrix calculations in parallel; to the extent that your processing unit can parallelize these operations, you get shorter epochs (NumPy in Python, for example, delegates to implementations that can do this). Here your algorithm and processing capability define the optimum batch size.
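A quick sketch of that effect (a toy layer, not your network; timings will vary with hardware and BLAS build):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 30)).astype(np.float32)
batch = rng.standard_normal((100, 784)).astype(np.float32)

t0 = time.perf_counter()
out_loop = np.stack([x @ W for x in batch])  # one small multiply per example
t1 = time.perf_counter()
out_vec = batch @ W                          # one multiply for the whole batch
t2 = time.perf_counter()

assert np.allclose(out_loop, out_vec, atol=1e-4)
print(f"per-example loop: {t1 - t0:.5f}s, vectorized batch: {t2 - t1:.5f}s")
```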

In stochastic learning you could also achieve distributed processing simply by spreading the mini-batches across multiple processing units. Additionally, in stochastic learning we randomly pick the examples in each iteration to avoid a biased model. It can be shown that this regime often converges faster than batch learning (in fewer passes over the data), since the parameters are updated after each small batch rather than once per epoch.
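A minimal sketch of that regime (linear least squares as a stand-in model; the per-epoch reshuffle is the "random pick" described above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20))
y = X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(10_000)

w = np.zeros(20)
eta, m = 0.01, 100
for epoch in range(5):
    order = rng.permutation(len(X))       # random order -> unbiased mini-batches
    for i in range(0, len(X), m):
        idx = order[i:i + m]
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / m
        w -= eta * grad                   # update after every mini-batch
    print(epoch, np.mean((X @ w - y) ** 2))
```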
