When sampling the training data, whether one point at a time (as in online learning) or in mini-batches, gradient descent methods can draw samples either with replacement or without replacement.
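To make the distinction concrete, here is a minimal sketch of the two sampling schemes as I understand them (the function names and batch sizes are just for illustration):

```python
import numpy as np

def minibatches_with_replacement(n_samples, batch_size, n_batches, rng):
    # Each batch draws indices i.i.d. uniformly; a point may appear
    # more than once, and some points may never be drawn at all.
    for _ in range(n_batches):
        yield rng.integers(0, n_samples, size=batch_size)

def minibatches_without_replacement(n_samples, batch_size, rng):
    # One epoch: shuffle once, then walk through the permutation,
    # so every point is used exactly once per epoch.
    perm = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield perm[start:start + batch_size]

rng = np.random.default_rng(0)
print(next(iter(minibatches_with_replacement(10, 4, 1, rng))))
print(list(minibatches_without_replacement(10, 4, rng)))
```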
For Mini-Batch Gradient Descent, why do the methods that sample with replacement work? Why don't we care, for example, that the same data point could be sampled multiple times, or that some data points from the training set may never get sampled at all?