
As an exercise, we are supposed to write a small neural network with the following structure:

[network architecture diagram: an input layer, two hidden layers, and an output layer]

Additionally, there should be a bias for each layer, and the sigmoid should be used as the activation function.

The relevant method is the backward method, which implements backpropagation. It receives the results saved from the forward pass (X is the input, hid_1 / hid_2 are the results of the hidden layers after applying the sigmoid, predictions is the output), the targets, and the parameters (weights and bias for each layer). It should return the gradient of the loss function with respect to each parameter. I included the rest of the code as well in case it is needed.

import numpy as np

def initialize(input_dim, hidden1_dim, hidden2_dim, output_dim, batch_size):
    W1 = np.random.randn(hidden1_dim, input_dim) * 0.01
    b1 = np.zeros((hidden1_dim,))
    W2 = np.random.randn(hidden2_dim, hidden1_dim) * 0.01
    b2 = np.zeros((hidden2_dim,))
    W3 = np.random.randn(output_dim, hidden2_dim) * 0.01
    b3 = np.zeros((output_dim,))

    parameters = [W1, b1, W2, b2, W3, b3]
    x = np.random.rand(input_dim, batch_size)
    y = np.random.randn(output_dim, batch_size)
    return parameters, x, y

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def squared_loss(predictions, targets):
    return np.mean(0.5 * np.linalg.norm(predictions - targets, axis=0, ord=2)**2, axis=0)

def deriv_squared_loss(predictions, targets):
    return (predictions - targets) / targets.shape[-1]

def forward(parameters, X):
    W1, b1, W2, b2, W3, b3 = parameters

    hid_1 = sigmoid(W1 @ X + b1[:, np.newaxis])
    hid_2 = sigmoid(W2 @ hid_1 + b2[:, np.newaxis])
    outputs = W3 @ hid_2 + b3[:, np.newaxis]

    return [X, hid_1, hid_2, outputs]

def backward(activations, targets, parameters):
    X, hid_1, hid_2, predictions = activations
    W1, b1, W2, b2, W3, b3 = parameters

    batch_size = X.shape[-1]

    #  ∂L/∂prediction
    dL_dPredictions = deriv_squared_loss(predictions, targets)
    #  ∂L/∂b3 =  ∂L/∂prediction * ∂prediction/∂b3 = ∂L/∂prediction * 1
    dL_db3 = np.dot(dL_dPredictions, np.ones((batch_size,)))
    #  ∂L/∂W3 =  ∂L/∂prediction * ∂prediction/∂W3 = ∂L/∂prediction * hid_2
    dL_dW3 = np.dot(dL_dPredictions, hid_2.T)

    #  ∂L/∂z2 where z2 = W2*hid_1 + b2 and hid_2 = sig(z2):
    #  ∂L/∂z2 = ∂L/∂prediction * ∂prediction/∂hid_2 * ∂hid_2/∂z2 = ∂L/∂prediction * W3 * hid_2 * (1 - hid_2)
    dL_hid_2 = np.dot(W3.T, dL_dPredictions) * hid_2 * (1 - hid_2)
    #  ∂L/∂b2 = ∂L/∂z2 * ∂z2/∂b2 = ∂L/∂z2 * 1
    dL_db2 = np.dot(dL_hid_2, np.ones((batch_size,)))
    #  ∂L/∂W2 = ∂L/∂z2 * ∂z2/∂W2 = ∂L/∂z2 * hid_1
    dL_dW2 = np.dot(dL_hid_2, hid_1.T)

    #  same pattern one layer down, with z1 = W1*X + b1 and hid_1 = sig(z1)
    dL_hid_1 = np.dot(W2.T, dL_hid_2) * hid_1 * (1 - hid_1)
    dL_db1 = np.dot(dL_hid_1, np.ones((batch_size,)))
    dL_dW1 = np.dot(dL_hid_1, X.T)

    return [dL_dW1, dL_db1, dL_dW2, dL_db2, dL_dW3, dL_db3]

parameters, X, Y = initialize(input_dim=3, hidden1_dim=4, hidden2_dim=4, output_dim=2, batch_size=5)
activations = forward(parameters, X)
grads = backward(activations, Y, parameters)

The main problem with this is that backpropagation is quite easy in the one-dimensional case. You can basically look at each node locally, compute a local gradient, and multiply it with the global gradient calculated for the previous node. I put these formulas in my code above. But in the multidimensional case it gets more involved: because matrix multiplication is not commutative, you sometimes need to switch the order or transpose the matrices, so the code no longer follows these simple formulas. You can figure this out by calculating the derivatives with the multidimensional chain rule. Perhaps if you're good you might know the derivatives by heart, but I think they are not too obvious. One example is np.dot(W3.T, dL_dPredictions), where you switch the order and transpose the matrix compared to the one-dimensional case.
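
To make that concrete with the shapes from my example (predictions = W3 @ hid_2 + b3, where W3 is 2×4, hid_2 is 4×5 and predictions is 2×5), the multidimensional chain rule gives

∂L/∂hid_2 = W3ᵀ · ∂L/∂predictions

which is a (4×2)·(2×5) = 4×5 matrix, the only arrangement whose shape matches hid_2, and exactly what np.dot(W3.T, dL_dPredictions) computes.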

So the question is: how can I improve the code so that it is easier to understand and verify? If I look at my own code five minutes later, I have no clue why I was doing exactly these calculations and have to do the math again. I suppose there is no way to understand the code with zero knowledge of what is happening mathematically, but I hope I can improve it so that it is readable for someone who is generally familiar with the math but maybe doesn't know every single formula by heart (i.e. me, five minutes later).

  • I'm not sure this question is on topic for this site, because you don't seem to have mastered your code. If I can give you a tip, however: write the shape of the matrices used on each line in a comment, to see how the matrix operations change the shapes over time. This will help you understand what's happening (don't keep these comments in production code, though, unless it's really helpful).
    – IEatBagels
    Commented May 3, 2023 at 15:04
  • @IEatBagels Well, I know why the code works; I've done the math for it on paper (no guarantees that it is 100% correct, but in general I know how it works). But I think it's not very intuitive. If you have z = Wa + b, you can of course calculate the derivatives with respect to the parameters W and b and the input a, and see that the code should be all right. But I wouldn't say those derivatives are so straightforward that the above code makes total sense the first time you see it. So my question is basically: how do I make this more intuitive?
    – Leon0402
    Commented May 3, 2023 at 17:22
  • In my post I described how backpropagation is taught in the one-dimensional case, where everything is commutative and you therefore get this simple multiplication / chain-rule pattern. In the multidimensional case it seems to be the same thing at first glance, but then you need these small, non-obvious changes like transposing and changing the order of multiplications, because matrix multiplication is not commutative. And this makes the code really hard to read, in my opinion.
    – Leon0402
    Commented May 3, 2023 at 17:26
  • @IEatBagels I modified my original post to better reflect this, making (hopefully) clear that what my code fails at is making the underlying math easy to understand.
    – Leon0402
    Commented May 3, 2023 at 17:42

1 Answer


I really like the motivation behind this question.

Is the code hard to read? No! Looks good.

Is the math behind it clear? We feel there's room to improve. Ok, fair enough, let's see if we can make a dent in that.


Moving back to code details for a moment...

def forward(parameters, X):
    W1, b1, W2, b2, W3, b3 = parameters

# call site:
... forward(parameters, X)

This is perfectly clear. Two items:

  1. Consider using a 7-arg signature, so the caller says: forward(*parameters, X)
  2. Consider using a 4-arg signature, so each (weight, bias) tuple is grouped together; a sketch follows just below. The unpacks within backward seem similarly on the awkward side.
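
Here is one possible shape for the second item. It is only a sketch, assuming parameters is regrouped as a list of (weight, bias) pairs; nothing about the naming is canonical:

def forward(layer1, layer2, layer3, X):
    (W1, b1), (W2, b2), (W3, b3) = layer1, layer2, layer3

    hid_1 = sigmoid(W1 @ X + b1[:, np.newaxis])
    hid_2 = sigmoid(W2 @ hid_1 + b2[:, np.newaxis])
    outputs = W3 @ hid_2 + b3[:, np.newaxis]

    return [X, hid_1, hid_2, outputs]

# call site, with parameters built as [(W1, b1), (W2, b2), (W3, b3)]:
activations = forward(*parameters, X)

This keeps each layer's weight and bias next to each other at the call site, which also makes the unpack inside backward read more naturally.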

Reproducible seeding in initialize would be more convenient if we used an rng generator object.

Also, please define a named variable for the 0.01 magic number in the random init.
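
For instance, a sketch only; the optional seed parameter and the INIT_SCALE name are my suggestions, not anything required:

INIT_SCALE = 0.01  # scale of the random weight initialization

def initialize(input_dim, hidden1_dim, hidden2_dim, output_dim, batch_size, seed=None):
    # Passing the same seed reproduces both the weights and the random data.
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((hidden1_dim, input_dim)) * INIT_SCALE
    b1 = np.zeros((hidden1_dim,))
    W2 = rng.standard_normal((hidden2_dim, hidden1_dim)) * INIT_SCALE
    b2 = np.zeros((hidden2_dim,))
    W3 = rng.standard_normal((output_dim, hidden2_dim)) * INIT_SCALE
    b3 = np.zeros((output_dim,))

    parameters = [W1, b1, W2, b2, W3, b3]
    x = rng.random((input_dim, batch_size))
    y = rng.standard_normal((output_dim, batch_size))
    return parameters, x, y

A caller that wants reproducibility just passes, say, initialize(..., seed=0).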


Ok, on to the math.

    dL_dPredictions = deriv_squared_loss(predictions, targets)
    dL_db3 = np.dot(dL_dPredictions, np.ones((batch_size,)))
    dL_dW3 = np.dot(dL_dPredictions, hid_2.T)

    dL_hid_2 = np.dot(W3.T, dL_dPredictions) * hid_2 * (1 - hid_2)
    dL_db2 = np.dot(dL_hid_2, np.ones((batch_size,)))
    dL_dW2 = np.dot(dL_hid_2, hid_1.T)

    dL_hid_1 = np.dot(W2.T, dL_hid_2) * hid_1 * (1 - hid_1)
    dL_db1 = np.dot(dL_hid_1, np.ones((batch_size,)))
    dL_dW1 = np.dot(dL_hid_1, X.T)

Honestly, it just doesn't look that bad. Computing dL_dPredictions is the odd man out, and after that we have uniform processing at each layer.

The formula comments are helpful, thank you. A textbook citation wouldn't hurt.

I have two suggestions.

  1. Extract a per-layer helper that you call repeatedly (a sketch follows this list). Consider generalizing to N hidden layers.
  2. Verify numerically that the per-layer computation took us where we want to go.
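
A rough sketch of the first suggestion, reusing the formulas you already have; backward_layer and its argument names are made up here, not anything standard:

def backward_layer(delta_next, W_next, activation, activation_prev):
    # delta_next is ∂L/∂z of the layer above, W_next its weight matrix.
    # Returns (∂L/∂z, ∂L/∂W, ∂L/∂b) for this sigmoid layer.
    delta = np.dot(W_next.T, delta_next) * activation * (1 - activation)
    dW = np.dot(delta, activation_prev.T)
    db = np.sum(delta, axis=1)  # sum over the batch, same as the np.dot(..., np.ones(...)) trick
    return delta, dW, db

def backward(activations, targets, parameters):
    X, hid_1, hid_2, predictions = activations
    W1, b1, W2, b2, W3, b3 = parameters

    # The output layer is linear, so its delta is just the loss derivative.
    delta3 = deriv_squared_loss(predictions, targets)
    dW3 = np.dot(delta3, hid_2.T)
    db3 = np.sum(delta3, axis=1)

    delta2, dW2, db2 = backward_layer(delta3, W3, hid_2, hid_1)
    delta1, dW1, db1 = backward_layer(delta2, W2, hid_1, X)

    return [dW1, db1, dW2, db2, dW3, db3]

Every hidden layer is the same three lines, so generalizing to N hidden layers is just a loop over backward_layer.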

For the second suggestion, set a verify flag during debugging, propose a computed result, and check that the forward error at this level will be "small" once we adopt that result.

You could iterate through several random epsilon perturbations of the input, and verify that the calculation always sends us back within a small neighborhood of the right place. Or examine the top eigenvector directions.

If the helper can see what happened this time and the previous time, it could verify that convergence is progressing as we anticipate.

I am essentially advocating this sort of approach:

def newton_raphson(n: float, epsilon: float = 1e-12) -> float:
    """Return sqrt(n) for n > 0."""
    root = max(n, 1.0)  # any starting guess >= sqrt(n) converges
    while abs(root * root - n) > epsilon * n:  # loop until convergence
        root = (root + n / root) / 2
    assert abs(root * root - n) <= epsilon * n
    return root

The looping part might not be the easiest thing to reason about. But there's a property we desire for the answer, and it's easy to verify that property by comparing to a constant or perhaps through relative error.
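
Applied to your backward, the analogous property check is a central finite-difference comparison. A sketch, assuming the forward, squared_loss and backward functions from the question; the name check_gradients, the number of probes and the tolerances are my own choices:

def check_gradients(parameters, X, Y, eps=1e-6, tol=1e-4):
    """Compare backward() against a central finite-difference estimate."""
    grads = backward(forward(parameters, X), Y, parameters)
    for param, grad in zip(parameters, grads):
        for _ in range(5):  # probe a few random entries of each parameter
            idx = tuple(np.random.randint(s) for s in param.shape)
            saved = param[idx]
            param[idx] = saved + eps
            loss_plus = squared_loss(forward(parameters, X)[-1], Y)
            param[idx] = saved - eps
            loss_minus = squared_loss(forward(parameters, X)[-1], Y)
            param[idx] = saved  # restore the original entry
            numeric = (loss_plus - loss_minus) / (2 * eps)
            assert abs(numeric - grad[idx]) < tol, (idx, numeric, grad[idx])

Call it once on the toy shapes from initialize while your verify flag is set; if an assert fires, you know exactly which parameter's gradient needs to be re-derived.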


This code achieves its design goals. Extracting a per-layer helper is likely to improve readability and aid the verification of numeric stability.

I would be willing to delegate or accept maintenance tasks on this codebase.

