What is Backpropagation?
Backpropagation, short for backward propagation of errors
, is a widely used method for calculating derivatives inside deep feedforward neural networks. Backpropagation forms an important part of a number of supervised learning algorithms for training feedforward neural networks, such as stochastic gradient descent.
When training a neural network by gradient descent, a loss function is calculated, which represents how far the network's predictions are from the true labels. Backpropagation allows us to calculate the gradient of the loss function with respect to each of the weights of the network. This enables every weight to be updated individually to gradually reduce the loss function over many training iterations.
Backpropagation involves the calculation of the gradient proceeding backwards through the feedforward network from the last layer through to the first. To calculate the gradient at a particular layer, the gradients of all following layers are combined via the chain rule of calculus.
The backpropagation algorithm is key to supervised learning of deep neural networks and has enabled the recent surge in popularity of deep learning algorithms since the early 2000s.
Feedforward Neural Network Definition
Let us consider a multilayer feedforward neural network with N layers.
The output of the first hidden layer is given by
Feedforward neural network first layer formula
and the output of the second layer is given by
Feedforward neural network second layer formula
And so on, each layer receives the previous layer’s output as input. The final layer’s output is denoted :
Feedforward neural network last layer formula
Feedforward Neural Network Formula Symbols Explained
|The output of hidden layer i
|The hidden weights matrix in layer i
The input vector to the neural network
|The number of layers in the neural network
|The bias matrix in layer i
Loss function for backpropagation
When the feedforward network accepts an input x and passes it through the layers to produce an output, information flows forward through the network. This is called forward propagation.
During supervised learning, the output is compared to the label vector to give a loss function, also called a cost function, which represents how good the network is at making predictions:
Loss function used for backpropagation
The loss function returns a low value when the network output is close to the label, and a high value when they are different. So at the start of training, the loss function will be very large, and a fully trained model should have a small loss function, when the training dataset is passed through the network. Examples of loss functions include the cross-entropy loss, the cosine similarity function, and the hinge loss.
During training, the objective is to reduce the loss function on the training dataset as much as possible. This means that the network weights must gradually be adjusted in order for C to be reduced.
This means that we must calculate the derivative of C with respect to every weight in the network:
Derivative of cost function needed for backpropagation
Backpropagation loss function formula symbols explained
|The weight of the network going from node j in layer i - 1 to node k in layer i
Calculating gradients with the chain rule
Since a neural network has many layers, the derivative of C at a point in the middle of the network may be very far removed from the loss function, which is calculated after the last layer.
In fact, C depends on the weight values via a chain of many functions. We can use the chain rule of calculus to calculate its derivate. The chain rule tells us that for a function z depending on y, where y depends on x, the derivate of z with respect to x is given by:
The chain rule of calculus
Computing gradients more efficiently with the backpropagation algorithm
Each component of the derivative of C with respect to each weight in the network can be calculated individually using the chain rule. However, it would be extremely inefficient to do this separately for each weight.
The backpropagation algorithm involves first calculating the derivates at layer N, that is the last layer. These derivatives are an ingredient in the chain rule formula for layer N - 1, so they can be saved and re-used for the second-to-last layer. And so in backpropagation we work our way backwards through the network from the last layer to the first layer, each time using the last derivative calculations via the chain rule to obtain the derivatives for the current layer.
In this way, the backpropagation algorithm allows us to efficiently calculate the gradient with respect to each weight by avoiding duplicate calculations.
Example Calculation of Backpropagation: Feedforward network with two hidden layers and sigmoid loss
Defining a feedforward neural network as a computational graph
Let us consider that we are training a simple feedforward neural network with two hidden layers. We can represent the network during training as the following computational graph:
So every training example enters the network as a pair (x, y), where x is the observation, and y is the label. The loss function C is calculated from the output and the label y. C is to be minimized during training.
Expressing the feedforward neural network as a combination of functions
Let us simplify and set the bias values to zero, and treat the vectors as scalars, to make the calculus more concise. This means our network has two parameters to train, and .
Then the output of the first hidden layer is:
The output of the second hidden layer is:
The output of the final layer is:
And finally, let us choose the simple mean squared error function as our loss function:
and let us set the activation functions in both hidden layers to the sigmoid function:
Derivatives in the network
It will be useful to know in advance the derivatives of the loss function and the activation (sigmoid) function:
Using backpropagation, we will first calculate , then, and then , working backwards through the network.
We can express the loss function explicitly as a function of all the weights in the network by substituting in the expression for each layer:
Backpropagation step 1: Calculating the gradient in the third and final layer
First, we want to calculate the gradient of the last weight in the network (layer 3). Applying the chain rule and working backwards in the computational graph, we get:
Backpropagation step 2: Calculating the gradient in the second (penultimate) layer
Next, we will calculate the gradient in layer 2. Since C is now two steps away from layer 2, we have to use the chain rule twice:
Note that the first term in the chain rule expression is the same as the first term in the expression for layer 3.
Backpropagation step 3: Calculating the gradient in the first layer
Finally, we can calculate the gradient with respect to the weight in layer 1, this time using another step of the chain rule.
The first two terms in the chain rule expression for layer 1 are shared with the gradient calculation for layer 2.
This means that computationally, it is not necessary to re-calculate the entire expression. Only the terms that are particular to the current layer must be evaluated. The terms that are common to the previous layers can be recycled.
In this way, the backpropagation algorithm is extremely efficient, compared to a naive approach, which would involve evaluating the chain rule for every weight in the network individually.
Once the gradients are calculated, it would be normal to update all the weights in the network with an aim of reducing C. There are a number of algorithms to achieve this, and the most well-known is stochastic gradient descent.
Standard Backpropagation vs. Backpropagation Through Time
Backpropagation relies on the ability to express an entire neural network as a function of a function of a function... and so on, allowing the chain rule to be applied recursively.
This cannot be applied if the neural network cannot be reduced to a single expression of compounded functions - in other words, if it cannot be expressed as a directed acyclic graph.
A recurrent neural network processes an incoming time series, and the output of a node at one point in time is fed back into the network at the following time point. This means that a recurrent neural network cannot be expressed as a directed acyclic graph, since it contains cycles.
However, it is possible to 'unroll' a recurrent neural network and view it like a feedforward neural network. Each timestep is represented as a single copy of the original neural network.
Unrolling a recurrent neural network in order to represent it as a feedforward neural network for backpropagation through time
Because backpropagation through time involves duplicating the network, it can produce a large feedforward neural network which is hard to train, with many opportunities for the backpropagation algorithm to get stuck in local optima.
Furthermore, interactions between inputs that are far apart in time can be hard for the network to learn, as the gradient contributions from the interaction become diminishingly small in comparison to local effects. This is known as the vanishing gradient problem, and can be addressed by choosing ReLU activation functions, and introducing regularization into the network.
Applications of Backpropagation
Backpropagation and its variants such as backpropagation through time are widely used for training nearly all kinds of neural networks, and have enabled the recent surge in popularity of deep learning. A small selection of example applications of backpropagation are presented below.
Backpropagation in convolutional neural networks for face recognition
Convolutional neural networks are the standard deep learning technique for image processing and image recognition, and are often trained with the backpropagation algorithm.
In 2015, Parkhi, Vidaldi and Zisserman described a technique for building a face recognizer. They used a convolutional neural network with 18 layers, and a database of celebrity faces.
Initially, the network was trained using backpropagation through all the 18 layers. Images were passed into the network in batches, the loss function was calculated, and the gradients were calculated first for layer 18, working back towards layer 1. After each batch of images, the network weights were updated.
To refine the network to be able to distinguish the nuances of human faces, the researchers ran an extra training stage for layer 18 only, once the backpropagation had run for all 18 layers. The researchers chose a loss function called a triplet loss. The neural network receives an input of three celebrity face images at once, for example, two images of Matt Damon and one image of Brad Pitt. The loss function penalizes the network if it decides that two images of the same person are different, and also penalizes the network for classifying images of different people as similar. Over time, triplets of three images are passed through the network and the loss function is calculated, and the weights of the last layer are updated.
Backpropagation for speech recognition
The backpropagation algorithm has been applied for speech recognition. An example implementation of a speech recognition system for English and Japanese, able to run on embedded devices, was developed by the Sony Corporation of Japan. The system is designed to listen for a limited number of commands by a user.
In this implementation, an incoming sound signal is split into windows of time, and a Fast Fourier Transform is applied. The sound intensity at different frequencies is taken as a feature and input into a neural network consisting of five layers. The researchers chose a softmax cross-entropy loss function, and were able to apply backpropagation to train the five layers to understand Japanese commands. They were then able to switch the network to train on English sound recordings, and were able to adapt the system to recognize commands in English. This is an example of transfer learning: a machine learning model can be trained for one task, and then re-trained and adapted for a new task.
Augustin-Louis Cauchy (1789-1857), inventor of gradient descent. Image is in the public domain.
In 1847, the French mathematician Baron Augustin-Louis Cauchy developed a method of gradient descent for solving simultaneous equations. He was interested in solving astronomic calculations in many variables, and had the idea of taking the derivative of a function and taking small steps to minimize an error term.
Over the following century, gradient descent methods were used across disciplines to solve difficult problems numerically, where an exact algebraic solution would have been impossible or computationally intractable.
In 1970, the Finnish master's student Seppo Linnainmaa described an efficient algorithm for error backpropagation in sparsely connected networks in his master's thesis at the University of Helsinki, although he did not refer to neural networks specifically.
In 1986, the American psychologist David Rumelhart and his colleagues published an influential paper applying Linnainmaa's backpropagation algorithm to multi-layer neural networks. The following years saw several breakthroughs building on the new algorithm, such as Yann LeCun's 1989 paper applying backpropagation in convolutional neural networks for handwritten digit recognition.
In the 1980s, various researchers independently derived backpropagation through time, in order to enable training of recurrent neural networks.
In recent years deep neural networks have become ubiquitous and backpropagation is very important for efficient training. Although the algorithm has been modified to be parallelized and run easily on multiple GPUs, Linnainmaa and Rumelhart's original backpropagation algorithm forms the backbone of all deep learning-based AI today.
Murphy, Machine Learning: A Probabilistic Perspective (2012)
Goodfellow et al, Deep Learning (2016)
Cauchy, Méthode générale pour la résolution des systèmes d’équations simultanées (1847)
Lecun, Backpropagation Applied to Handwritten Zip Code Recognition (1989)
Parkhi et al,
Deep Face Recognition(2015)
Tsunoo et al (Sony Corporation, Japan), End-to-end Adaptation with Backpropagation through WFST for On-device Speech Recognition System (2019)