If the concepts of backpropagation and gradient descent are still not entirely clear to you, you’re in the right place. In this article we’ll start with a brief explanation of what they are and what they are used for, and then we’ll work through an example of backpropagation and gradient descent with concrete calculations that I hope will clear up any remaining doubts.

The fundamental characteristic of a neural network is its capacity to learn. This is made possible by continuously optimizing the weights that compose it. The optimization is achieved by applying the gradient descent algorithm which, being based on derivatives, makes it possible to adjust the weights so as to reduce the prediction error.

In particularly large networks it is not practical to write out by hand the gradient of the cost function that describes the network, hence the need for an algorithm that computes it systematically.

Backpropagation is an algorithm for calculating the gradient of the cost function in feedforward networks. It is based on the chain rule, which tells us that the derivative of a composite function can be calculated in several steps:

$\frac{d}{dx}[f(g(x))] = \frac{df}{dg} \cdot \frac{dg}{dx}$

So the derivative of a composite function, however complex, can be calculated as a product of simpler derivatives. As we shall see, this is done starting from the output of the network and proceeding backwards to the input layer.
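As a quick sanity check of the chain rule, here is a minimal sketch that compares the derivative obtained by multiplying the two factors with a finite-difference estimate. The functions `f` and `g` are arbitrary choices made only for this illustration and do not come from the network discussed later.

```python
def g(x):
    # inner function (arbitrary choice for illustration): g(x) = 3x + 1
    return 3.0 * x + 1.0


def f(u):
    # outer function (arbitrary choice for illustration): f(u) = u^2
    return u ** 2


def chain_rule_derivative(x):
    # df/dg evaluated at g(x), multiplied by dg/dx
    df_dg = 2.0 * g(x)
    dg_dx = 3.0
    return df_dg * dg_dx


def numerical_derivative(x, h=1e-6):
    # central finite difference of the composition f(g(x))
    return (f(g(x + h)) - f(g(x - h))) / (2.0 * h)


x = 2.0
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(x)
print("chain rule:       ", analytic)
print("finite difference:", numeric)
```

The two values should agree up to rounding error, which is exactly what the chain rule promises.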

## A Simple Neural Network

Nothing clarifies ideas better than starting from an example.

Let’s imagine that we have a neural network composed of an input layer, a hidden layer and an output layer.

For the sake of simplicity, the network has been purposely stripped to the bone: 3 layers, each consisting of only one node. Each node has its own activation function $\phi(\cdot)$, and a weight $w$ connects each pair of adjacent nodes.

Let’s imagine initializing these weights in the following way:

$\begin{bmatrix}w_{i,h} = 0.15 \\w_{h,o} = 0.3\end{bmatrix}$

In order for a neural network to learn even in complex situations, the activation function of each neuron must be non-linear. The sigmoid is a common choice:

$\phi(x) = \frac{1}{1+e^{-x}}$

In this example, in order to simplify the calculations as much as possible and to concentrate solely on the algorithm itself, we will instead use a linear activation function, namely the identity:

$\phi(x) = x$
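To make the two choices concrete, here is a minimal sketch of both activation functions in Python. The function names `sigmoid` and `linear` are my own labels for the two formulas above.

```python
import math


def sigmoid(x):
    # non-linear activation: squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))


def linear(x):
    # identity activation used in the rest of this example
    return x


for x in (-2.0, 0.0, 2.0):
    print(f"x = {x:+.1f}  sigmoid = {sigmoid(x):.4f}  linear = {linear(x):+.1f}")
```

With the identity, the output of each node is simply its weighted input, which is what keeps the calculations that follow so short.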

Now, imagine that we want to train the network to flip the sign of its input, i.e.:

$X = 1 \rightarrow Y = -1$

The ultimate goal would be to learn how to negate any number:

$f(x) = -1 \cdot x$

## Let’s Calculate the Output

Having defined the structure and the initial parameters, it is now time to calculate the so-called forward propagation: starting from the input $X$, we follow the flow of the network to obtain its output $\phi(a_o)$.

This calculation is very simple to perform. From left to right we have:

$\phi(a_i) = X$

$\phi(a_h) = \phi(a_i) \cdot w_{i,h}$

$\phi(a_o) = \phi(a_h) \cdot w_{h,o}$

If we now substitute the values of the weights defined above, we quickly realize that the network produces an outcome very different from the one we intended:

$\phi(a_i) = X = 1$

$\phi(a_h) = \phi(a_i) \cdot w_{i,h} = 1 \cdot 0.15 = 0.15$

$\phi(a_o) = \phi(a_h) \cdot w_{h,o} = 0.15 \cdot 0.3 = 0.045$

Looking at the value of $\phi(a_o)$, it is clear that we are far from returning the correct output (0.045 instead of -1). It is therefore necessary to optimize the weights.
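The forward propagation just described can be sketched in a few lines of Python. The names `phi` and `forward` are hypothetical helpers introduced for this illustration; the weights and the input are the ones defined earlier.

```python
def phi(x):
    # linear (identity) activation chosen for this example
    return x


def forward(x, w_ih, w_ho):
    # follow the network from left to right: input -> hidden -> output
    out_i = phi(x)
    out_h = phi(out_i * w_ih)
    out_o = phi(out_h * w_ho)
    return out_i, out_h, out_o


out_i, out_h, out_o = forward(1.0, w_ih=0.15, w_ho=0.3)
print("input node: ", out_i)
print("hidden node:", out_h)
print("output node:", out_o)
```

Up to floating-point rounding, the printed values match the ones computed by hand above.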

## The Backpropagation

When it comes to optimization, as we have seen, backprop comes to our aid. First, we need a rigorous way to represent the error.

For this purpose, in this example, we will use the mean squared error (MSE) which, with a single training example and an extra factor of $\frac{1}{2}$, reduces to:

$E = \frac{1}{2} \left ( \phi (a_o) - Y\right )^{2}$

Of course, there are various other cost functions for measuring the prediction error; the mean squared error is simply one that is used quite frequently.

The MSE calculates the square of the distance between our desired output $Y$ and the actual result produced by the network $\phi(a_o)$.

The constant factor $\frac{1}{2}$ was added only because it cancels the exponent when we take the derivative, as we will do shortly.
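To put numbers on the error, here is a small sketch that evaluates the cost function and its derivative with respect to the network output for the values obtained in the forward pass. The function names `half_squared_error` and `error_derivative` are my own.

```python
def half_squared_error(output, target):
    # E = 1/2 * (output - target)^2
    return 0.5 * (output - target) ** 2


def error_derivative(output, target):
    # dE/d(output): the factor 1/2 cancels the exponent 2
    return output - target


output, target = 0.045, -1.0
E = half_squared_error(output, target)
dE = error_derivative(output, target)
print("error:", E)
print("derivative of the error w.r.t. the output:", dE)
```

The derivative is just the difference between the actual and the desired output, which is the term $(\phi(a_o) - Y)$ that shows up in the gradient formulas.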

Now that we have a function that tells us how much the network is erring, we can move on to optimizing it. What we want to do is figure out how the weights $w_{i,h}$ and $w_{h,o}$ influence the final result, so that we can change their values to minimize the total error.

What we do is propagate the error of the network from right to left, trying to figure out how to change its parameters to optimize its behavior.

In mathematical terms, we want to calculate:

$\frac{\delta E}{\delta w_{i,h}}$ and $\frac{\delta E}{\delta w_{h,o}}$

To calculate these two quantities, the chain rule comes into play:

$\frac{\delta E}{\delta w_{i,h}} = \frac{\delta E}{\delta \phi(a_o)} \cdot \frac{\delta \phi(a_o)}{\delta \phi(a_h)} \cdot \frac{\delta \phi(a_h)}{\delta w_{i,h}} = (\phi(a_o) - Y) \cdot w_{h,o} \cdot X$

$\frac{\delta E}{\delta w_{h,o}} = \frac{\delta E}{\delta \phi(a_o)} \cdot \frac{\delta \phi(a_o)}{\delta w_{h,o}} = (\phi(a_o) - Y) \cdot \phi(a_h)$
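Here is a minimal sketch of these two formulas in Python, evaluated at the initial weights. As a check that is not part of backpropagation itself, the sketch also estimates the same derivatives with finite differences. All function names are hypothetical helpers for this illustration.

```python
def forward(x, w_ih, w_ho):
    # identity activations: each node outputs its weighted input
    out_h = x * w_ih
    out_o = out_h * w_ho
    return out_h, out_o


def error(x, y, w_ih, w_ho):
    _, out_o = forward(x, w_ih, w_ho)
    return 0.5 * (out_o - y) ** 2


def gradients(x, y, w_ih, w_ho):
    # the two chain-rule products from the formulas above
    out_h, out_o = forward(x, w_ih, w_ho)
    delta = out_o - y              # dE / d phi(a_o)
    grad_ih = delta * w_ho * x     # dE / d w_ih
    grad_ho = delta * out_h        # dE / d w_ho
    return grad_ih, grad_ho


X, Y = 1.0, -1.0
w_ih, w_ho = 0.15, 0.3
grad_ih, grad_ho = gradients(X, Y, w_ih, w_ho)

# finite-difference estimates, used only to verify the formulas
h = 1e-6
num_ih = (error(X, Y, w_ih + h, w_ho) - error(X, Y, w_ih - h, w_ho)) / (2 * h)
num_ho = (error(X, Y, w_ih, w_ho + h) - error(X, Y, w_ih, w_ho - h)) / (2 * h)

print("dE/dw_ih:", grad_ih, "finite difference:", num_ih)
print("dE/dw_ho:", grad_ho, "finite difference:", num_ho)
```

Both derivatives are positive here, which tells us that decreasing either weight reduces the error.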

These two quantities, which together form the gradient of the cost function, are what we need to update the weights:

$w_{i,h}^+ = w_{i,h} - \eta \cdot \frac{\delta E}{\delta w_{i,h}}$

$w_{h,o}^+ = w_{h,o} - \eta \cdot \frac{\delta E}{\delta w_{h,o}}$

What we do is subtract from each original weight the derivative of the cost function with respect to that weight, scaled by $\eta$. Since the gradient points in the direction in which the error grows, moving against it gives a new value of the weight that, for a small enough step, reduces the output error of the network.

The parameter $\eta$ is the so-called learning rate, a value that determines how large a step we take along the gradient. The higher it is, the larger the update of the weights will be.
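A single update step can be sketched as follows. The learning rate `eta = 0.1` is an assumption made for this illustration; the text does not prescribe a value.

```python
X, Y = 1.0, -1.0
w_ih, w_ho = 0.15, 0.3
eta = 0.1  # assumed learning rate, chosen only for illustration

# forward pass and error with the current weights
out_h = X * w_ih
out_o = out_h * w_ho
old_error = 0.5 * (out_o - Y) ** 2

# gradient of the error with respect to each weight
delta = out_o - Y
grad_ih = delta * w_ho * X
grad_ho = delta * out_h

# gradient descent update: step against the gradient
new_w_ih = w_ih - eta * grad_ih
new_w_ho = w_ho - eta * grad_ho

# error after the update
new_error = 0.5 * (X * new_w_ih * new_w_ho - Y) ** 2

print("updated weights:", new_w_ih, new_w_ho)
print("error before:", old_error, "error after:", new_error)
```

A single step only reduces the error slightly, which is why the update is repeated many times.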

## Example of Backpropagation and Gradient Descent in Python

Great, now that we have all the formulas we need, let’s move on to doing some numerical calculations.

Network learning is divided into epochs. In each epoch we:

1. calculate the output value $\phi(a_o)$ of the network;
2. determine the new values of the weights $w_{i,h}^+$, $w_{h,o}^+$;
3. start over from step 1.

Let’s do it with Python:
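A minimal sketch of such a training loop could look like the following. The function name `train`, the learning rate of 0.1 and the 100 epochs are assumptions made for this illustration, not values taken from the text.

```python
def train(x, y, w_ih, w_ho, eta=0.1, epochs=100):
    """Train the 1-1-1 linear network with plain gradient descent."""
    errors = []
    for epoch in range(epochs):
        # 1. forward propagation (identity activation)
        out_h = x * w_ih
        out_o = out_h * w_ho
        errors.append(0.5 * (out_o - y) ** 2)

        # 2. backpropagation: gradient of E with respect to each weight
        delta = out_o - y
        grad_ih = delta * w_ho * x
        grad_ho = delta * out_h

        # gradient descent update (both gradients use the old weights)
        w_ih -= eta * grad_ih
        w_ho -= eta * grad_ho

        # report progress every 10 epochs
        if epoch % 10 == 0:
            print(f"epoch {epoch:3d}  output {out_o:+.6f}  error {errors[-1]:.6f}")
    return w_ih, w_ho, errors


w_ih, w_ho, errors = train(x=1.0, y=-1.0, w_ih=0.15, w_ho=0.3)
final_output = 1.0 * w_ih * w_ho
print("final weights:", w_ih, w_ho)
print("final output: ", final_output)
```

Note that there is no unique solution: any pair of weights whose product is $-1$ gives zero error on this training example, so the final weights depend on the initial values.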


Launching the script, we should see the error shrink from one epoch to the next while the output $\phi(a_o)$ moves toward the target value $-1$.