
When sampling the data, either one point at a time (as in online learning) or in mini-batches, gradient descent methods can sample either with or without replacement.
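
To make the two schemes concrete, here is a minimal NumPy sketch (the data size and batch size are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch_size = 1000, 32

# With replacement: each index is drawn independently and uniformly,
# so the same point can appear more than once within or across batches,
# and some points may never be drawn in a given epoch.
batch_with = rng.choice(n, size=batch_size, replace=True)

# Without replacement: shuffle once per epoch and walk through the
# permutation in consecutive chunks, so every point is used exactly once.
perm = rng.permutation(n)
batches_without = [perm[i:i + batch_size] for i in range(0, n, batch_size)]
```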

For Mini-Batch Gradient Descent, why do the methods that sample with replacement work? Why don't we care, for example, that the same data point could be sampled multiple times, or that some data points from the training set may never get sampled at all?

  • Very good question (+1). Asking for a friend: could you provide a reference for mini-batch with replacement?
    – Jim
    Commented Sep 17, 2020 at 21:14

1 Answer


It works (and we don't care about sampling some points multiple times) because the mini-batch gradient is an unbiased estimator of the full gradient.

The gradient distributes over sums (and expectations), so the expected value of a mini-batch gradient, taken over all possible mini-batches, equals the full gradient.
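
Writing the full-batch loss as $L(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(\theta)$ and drawing the mini-batch indices $i_1, \dots, i_b$ independently and uniformly from $\{1, \dots, n\}$ (i.e., with replacement), the argument reads

$$
\mathbb{E}\!\left[\frac{1}{b}\sum_{k=1}^{b}\nabla \ell_{i_k}(\theta)\right]
= \frac{1}{b}\sum_{k=1}^{b}\mathbb{E}\!\left[\nabla \ell_{i_k}(\theta)\right]
= \frac{1}{b}\sum_{k=1}^{b}\frac{1}{n}\sum_{i=1}^{n}\nabla \ell_{i}(\theta)
= \nabla L(\theta),
$$

so the fact that an individual point may be sampled several times, or not at all, does not bias the estimate.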

More details are in Léon Bottou's paper Stochastic Gradient Descent Tricks. Section 2 discusses SGD as an unbiased estimator, and the same argument carries over to the mini-batch estimator.
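
As a quick numerical sanity check (a toy least-squares problem with synthetic data, chosen only for illustration), averaging many with-replacement mini-batch gradients recovers the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch_size = 500, 3, 32

# Toy least-squares problem: L(w) = mean_i (x_i . w - y_i)^2 / 2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def grad(idx):
    """Gradient of the loss averaged over the rows in idx."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

full_grad = grad(np.arange(n))

# Average many with-replacement mini-batch gradients; the mean
# approaches the full gradient, illustrating the unbiasedness argument.
estimates = [grad(rng.choice(n, size=batch_size, replace=True))
             for _ in range(20000)]
print(np.mean(estimates, axis=0))  # close to full_grad
print(full_grad)
```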

  • But usually, when we implement mini-batch GD, we just shuffle the entire data set once at each epoch. This means that a sample can appear in only one batch per epoch. Does this approach lead to a biased estimator?
    – ado sar
    Commented Oct 11, 2023 at 8:51
