I've seen a similar conclusion in many discussions: as the minibatch size gets larger, the convergence of SGD actually gets harder/worse; see for example this paper and this answer. I've also heard of people using tricks such as small learning rates or small batch sizes in the early stages of training to address this difficulty with large batch sizes.
However, this seems counter-intuitive, as the average loss over a minibatch can be thought of as an approximation to the expected loss over the data distribution, $$\frac{1}{|X|}\sum_{x\in X} l(x,w)\approx E_{x\sim p_{data}}[l(x,w)]$$ and the larger the batch size, the more accurate this approximation is supposed to be. Why is this not the case in practice?
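To make the intuition above concrete, here is a minimal sketch that checks how the spread of the minibatch-average loss shrinks with batch size. It assumes a fixed $w$ and uses synthetic per-example losses drawn from an exponential distribution (an arbitrary choice for illustration, not data from any real model); `minibatch_mean_std` is a hypothetical helper name.

```python
import numpy as np

def minibatch_mean_std(losses, batch_size, n_batches, rng):
    """Empirical std of the minibatch-average loss over n_batches random batches."""
    # Sample example indices with replacement, so batches are i.i.d. draws.
    idx = rng.integers(0, len(losses), size=(n_batches, batch_size))
    return losses[idx].mean(axis=1).std()

rng = np.random.default_rng(0)
# Hypothetical per-example losses l(x, w) at a fixed w (assumed distribution).
losses = rng.exponential(scale=1.0, size=100_000)

stds = {b: minibatch_mean_std(losses, b, 2_000, rng) for b in (1, 16, 256)}
for b, s in stds.items():
    # If the std scales like 1/sqrt(b), the last column stays roughly constant.
    print(b, s, s * np.sqrt(b))
```

Under these assumptions the standard deviation of the minibatch average falls roughly like $1/\sqrt{|X|}$, so the estimate of the expected loss does become more accurate with larger batches. The puzzle is why that extra accuracy does not translate into better convergence.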
Here are some of my (probably wrong) attempts at an explanation.
The parameters of the model depend heavily on each other. When the batch gets too large, each update affects too many parameters at once, so that it is hard for the parameters to settle into a stable mutual dependency? (like the internal covariate shift problem mentioned in the batch normalization paper)
Or, when nearly all the parameters are responsible in every iteration, they tend to learn redundant implicit patterns, which reduces the capacity of the model? (I mean, say, for digit classification some patterns should be responsible for dots and some for edges, but when this happens every pattern tries to be responsible for all shapes.)
Or is it because, when the batch size gets close to the size of the training set, the minibatches can no longer be seen as i.i.d. samples from the data distribution, since there is a high probability of correlated minibatches?
Update
As pointed out in Benoit Sanchez's answer, one important reason is that large minibatches require more computation to complete one update, and most analyses use a fixed number of training epochs for comparison.
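A quick back-of-the-envelope sketch of that point: with a fixed epoch budget, the number of parameter updates shrinks in proportion to the batch size. The dataset size, epoch count, and the helper `num_updates` below are hypothetical, and the sketch assumes the last partial batch of each epoch is kept rather than dropped.

```python
import math

def num_updates(n_examples, batch_size, epochs):
    """Total SGD updates when every epoch makes one pass over the data."""
    # ceil: the final, smaller batch of each epoch still counts as one update.
    return epochs * math.ceil(n_examples / batch_size)

n_examples, epochs = 50_000, 10  # assumed values, for illustration only
small = num_updates(n_examples, 32, epochs)
large = num_updates(n_examples, 1024, epochs)
print(small, large)  # 15630 490
```

So at equal epochs the batch-1024 run takes about 32 times fewer steps than the batch-32 run, and each of its steps would have to be correspondingly more useful for the two runs to end up in the same place.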
However, this paper (Wilson and Martinez, 2003) shows that a larger batch size is still slightly disadvantageous even given a sufficient number of training epochs. Is that generally the case?