4
$\begingroup$

In simple neural network back-propagation, we normally use one round of forward and backward propagation in every iteration. Let's assume we have one training example of arbitrary dimension and some initial weights. Using forward propagation, we calculate the predicted output. This predicted output is then used to calculate the total error, which is back-propagated to re-calculate the weights. After re-calculating the weights for all the layers, we update the weights of all the layers at once. It's not that we first update the weights of one layer and then the next; instead, we first re-calculate the weights of all layers (layer by layer) and then update them all at once. We can conclude that:

"The weights are re-calculated layer by layer, and then the weights of all the layers are updated with the re-calculated values all at once." Does this make sense? Is this the right way to update the weights using back-propagation?
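
A minimal sketch of that procedure, assuming a two-layer sigmoid network with squared-error loss (the names `W1`, `W2`, `train_step` and the shapes are illustrative, not from the question):

```python
import numpy as np

# Illustrative shapes: x is a d-vector, W1 is h x d, W2 is k x h, y is a k-vector.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, W1, W2, lr=0.1):
    # Forward pass, layer by layer.
    a1 = sigmoid(W1 @ x)                       # hidden activations
    a2 = sigmoid(W2 @ a1)                      # predicted output

    # Backward pass: compute the gradients for ALL layers first,
    # using only the current (old) weights.
    delta2 = (a2 - y) * a2 * (1 - a2)          # output-layer error term
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # hidden-layer error term
    grad_W2 = np.outer(delta2, a1)
    grad_W1 = np.outer(delta1, x)

    # Only now update all the weights, all at once.
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1
    return W1, W2
```

The point the sketch illustrates is that both gradients are computed from the old weights before either weight matrix is changed.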

Now let's assume I have "m" examples instead of just one. In the case of "m" examples, each small gradient-descent step will be taken after one back-propagation iteration over all "m" examples.

I am confused about whether, in the case of "m" examples, back-propagation works on the examples one by one: it first takes the first example and updates the weights, then takes the second example and calculates the weights again, then the third, and so on. Only at the end, after it has run over all the examples, does it take a single step towards the optimum point. If that is the case, is there any relation between the weights for one example and the weights for another example, since back-propagation is re-calculating the weights for each example in sequence?
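
To make the batch case concrete, here is a sketch (reusing the illustrative two-layer notation from the sketch above, not any specific library) of one full-batch gradient-descent step: the gradient for every example is computed from the same current weights, the gradients are accumulated, and only then is a single step taken.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_step(X, Y, W1, W2, lr=0.1):
    """One gradient-descent step over all m examples (full-batch gradient descent)."""
    m = len(X)
    acc_W1 = np.zeros_like(W1)
    acc_W2 = np.zeros_like(W2)
    for x, y in zip(X, Y):
        # Forward and backward pass for this example; the weights stay fixed here.
        a1 = sigmoid(W1 @ x)
        a2 = sigmoid(W2 @ a1)
        delta2 = (a2 - y) * a2 * (1 - a2)
        delta1 = (W2.T @ delta2) * a1 * (1 - a1)
        acc_W2 += np.outer(delta2, a1)
        acc_W1 += np.outer(delta1, x)

    # A single step towards the optimum, using the gradient averaged over the batch.
    W2 -= lr * acc_W2 / m
    W1 -= lr * acc_W1 / m
    return W1, W2
```

In this sketch every example's gradient is computed from the same weights within one step; only the averaged gradient changes them.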

$\endgroup$
1
  • $\begingroup$ Can we avoid sequentially going through all the examples by using a vectorized or matrix-based implementation? If yes, how would that avoid it? We still need to compute the result for each example. $\endgroup$
    – Stupid420
    Commented Sep 11, 2017 at 6:33
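
(Regarding the vectorization point raised in the comment above: a matrix-based implementation still performs the per-example arithmetic, but it does so for all m examples inside a single matrix multiplication rather than in an explicit Python-level loop. A minimal sketch with illustrative shapes, not from the question:)

```python
import numpy as np

# Illustrative shapes: X is m x d (one row per example), W1 is d x h, W2 is h x k.
def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_all(X, W1, W2):
    A1 = sigmoid(X @ W1)   # m x h: hidden activations for every example at once
    A2 = sigmoid(A1 @ W2)  # m x k: predictions for every example at once
    return A2
```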

3 Answers

3
$\begingroup$

A batch of data is taken for the feed-forward pass, and back-propagation is performed on the examples in that batch. The weights and biases are updated on the basis of the average error over that batch. The weight changes are then applied to the previous weights before performing the feed-forward pass on the next batch of data. A detailed explanation is given in the following book:

http://neuralnetworksanddeeplearning.com/chap2.html
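
A rough sketch of the batch-update loop described above (the helper `grads(weights, x, y)`, which returns one gradient array per weight matrix via back-propagation, is hypothetical and not the book's actual code):

```python
import random
import numpy as np

def train(weights, data, grads, batch_size=32, lr=0.1, epochs=10):
    for _ in range(epochs):
        random.shuffle(data)                       # data is a list of (x, y) pairs
        for k in range(0, len(data), batch_size):
            batch = data[k:k + batch_size]
            # Accumulate the gradients over this batch only.
            acc = [np.zeros_like(w) for w in weights]
            for x, y in batch:
                for a, g in zip(acc, grads(weights, x, y)):
                    a += g
            # Apply the averaged change BEFORE feeding forward the next batch.
            weights = [w - lr * a / len(batch) for w, a in zip(weights, acc)]
    return weights
```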

$\endgroup$
0
$\begingroup$

If we look at the error function of batch gradient descent, it calculates the error over all "m" examples.

It's not that it takes one example and calculates the error, then takes another example and calculates the error, and so on, updating after each one; the latter would be the case of stochastic gradient descent.
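
For concreteness, with generic notation that is not from the answer ($h_w$ for the network's output, $\eta$ for the learning rate, and a squared-error loss), the batch-gradient-descent cost and its single update step look like

$$J(w) = \frac{1}{2m}\sum_{i=1}^{m}\left\| h_w\!\left(x^{(i)}\right) - y^{(i)} \right\|^2, \qquad w \leftarrow w - \eta\,\nabla_w J(w),$$

so one update uses the error summed over all $m$ examples, whereas stochastic gradient descent would update $w$ from the error of a single example at a time.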

$\endgroup$
0
$\begingroup$

There is an error in the question: "Only at the end, after it has run over all the examples, does it take a single step towards the optimum point." Actually, regardless of the choice of batch size, be it a single example, a mini-batch, or the entire available set of examples, whenever the weights are modified the value of the loss function also changes, and 'it' takes a step 'towards the optimum point'.

What you are seeking is the weight-update strategy based on the error. There are several different strategies, but they all aim to provide stable convergence towards globally optimal weights at the end of the optimization routine. Remember that we want the weights to have good bias-variance properties; in other words, they should generalize and not over-fit the training data. Bearing that in mind, it is inadvisable to update the weights for a single training sample at a time: doing so leads to a very unstable learning routine, i.e. the optimization routine may fail to converge. Instead, we take an aggregate error over multiple training samples and update the weights progressively.

My best guess is that the source of your confusion is a misinterpretation of what 'example' means. An example here is a batch of 'm*n' training data points, NOT just one data point. The 'm' samples are 'm' batches, each of 'n' training points. The idea behind taking 'm' different random sub-samples of the data with replacement is better management of data bias during optimization.
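
As a sketch of the sampling scheme this describes, here is one way to draw 'm' mini-batches of 'n' points each with replacement (the function name and the NumPy-array assumptions are illustrative, not from the answer):

```python
import numpy as np

# X and Y are assumed to be NumPy arrays with one training point per row.
rng = np.random.default_rng(seed=0)

def sample_batches(X, Y, m, n):
    """Yield m random mini-batches of n points each, sampled with replacement."""
    for _ in range(m):
        idx = rng.integers(0, len(X), size=n)   # indices drawn with replacement
        yield X[idx], Y[idx]
```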

$\endgroup$
