
When sampling the data, either one point at a time (as in online learning) or in mini-batches, gradient descent methods can sample either with or without replacement.
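
To make the two schemes concrete, here is a minimal NumPy sketch (the data size and batch size are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 1000, 32

# With replacement: each index is drawn independently and uniformly,
# so the same point can appear more than once within or across batches,
# and some points may never be drawn in a given epoch.
batch_with = rng.choice(n, size=batch_size, replace=True)

# Without replacement: shuffle once per epoch and walk through the
# permutation in consecutive chunks, so every point is used exactly once.
perm = rng.permutation(n)
batches_without = [perm[i:i + batch_size] for i in range(0, n, batch_size)]
```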

For Mini-Batch Gradient Descent, why do the methods that sample with replacement work? Why don't we care, for example, that the same data point could be sampled multiple times, or that some data points from the training set may never get sampled at all?

  • Very good question (+1). Asking for a friend: could you provide a reference for mini-batch with replacement?
    – Jim
    Commented Sep 17, 2020 at 21:14

1 Answer


It works (and we don't care about sampling some points multiple times) because the mini-batch gradient is an unbiased estimator of the full gradient.

The gradient distributes over sums (and expectations), so the expected value of a mini-batch gradient, taken over all possible mini-batches, equals the full gradient.
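
Writing the full-batch loss as $L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta)$ and drawing the mini-batch indices $i_1, \dots, i_b$ independently and uniformly from $\{1, \dots, n\}$ (i.e., with replacement), the argument reads

$$
\mathbb{E}\!\left[\frac{1}{b}\sum_{k=1}^{b}\nabla \ell_{i_k}(\theta)\right]
= \frac{1}{b}\sum_{k=1}^{b}\mathbb{E}\!\left[\nabla \ell_{i_k}(\theta)\right]
= \frac{1}{b}\sum_{k=1}^{b}\frac{1}{n}\sum_{i=1}^{n}\nabla \ell_{i}(\theta)
= \nabla L(\theta),
$$

so the fact that an individual point may be sampled several times, or not at all, does not bias the estimate.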

More details are in Léon Bottou's paper Stochastic Gradient Descent Tricks. Section 2 discusses SGD as an unbiased estimator, and the same argument carries over to the mini-batch estimator.
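
As a quick numerical sanity check (a toy least-squares problem with synthetic data, chosen only for illustration), averaging many with-replacement mini-batch gradients recovers the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 500, 3, 32

# Toy least-squares problem: L(w) = mean_i (x_i . w - y_i)^2 / 2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def grad(idx):
    """Gradient of the loss averaged over the rows in idx."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

full_grad = grad(np.arange(n))

# Average many with-replacement mini-batch gradients; the mean
# approaches the full gradient, illustrating the unbiasedness argument.
estimates = [grad(rng.choice(n, size=batch_size, replace=True))
             for _ in range(20000)]
print(np.mean(estimates, axis=0))  # close to full_grad
print(full_grad)
```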

  • But usually, when we implement mini-batch GD, we just shuffle the entire data set once at each epoch. This means that a sample can appear in only one batch per epoch. Does this approach lead to a biased estimator?
    – ado sar
    Commented Oct 11, 2023 at 8:51
