In the last chapter we saw how neural networks can learn their weights and biases on their own using the gradient descent algorithm. However, there was a gap in our explanation: we did not discuss how to compute the gradient of the cost function. And that is quite a gap! In this chapter I will explain a fast algorithm for computing such gradients, known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s, but its importance was not fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton and Ronald Williams. The paper describes several neural networks in which backpropagation works far faster than earlier approaches to learning, making it possible to use neural networks to solve previously intractable problems. Today the backpropagation algorithm is the workhorse of learning in neural networks.

This chapter contains more mathematics than the rest of the book. If you are not particularly fond of mathematics, you may be tempted to skip the chapter and treat backpropagation as a black box whose details you are content to ignore. Why spend time studying them?
The reason, of course, is understanding. At the heart of backpropagation is the partial derivative ∂C/∂w of the cost function C with respect to a weight w (or bias b) of the network. The expression tells us how quickly the cost changes as we change the weights and biases. And although the expression is somewhat involved, it has a beauty of its own: each of its elements has a natural, intuitive interpretation. So backpropagation is not merely a fast learning algorithm. It gives us detailed insight into how changing the weights and biases changes the overall behavior of the network. That is well worth studying in detail.
With all that said, if you want to skim this chapter or jump straight to the next one, that's fine. I wrote the rest of the book to be understandable even if you treat backpropagation as a black box. Of course, later in the book there will be moments where I refer back to results from this chapter. At those points you should still be able to follow the main conclusions, even if you did not work through all the arguments.
Warm-up: a fast matrix-based approach to computing the output of a neural network
Before discussing backpropagation, let's warm up with a fast matrix-based algorithm for computing the output of a neural network. We actually met this algorithm near the end of the previous chapter, but I described it quickly, so it is worth revisiting in detail. In particular, this is a good way to get comfortable with the notation used in backpropagation, but in a familiar context.
Let's start with notation that lets us refer to weights in the network unambiguously. We will use w^l_{jk} to denote the weight of the connection from neuron k in layer (l-1) to neuron j in layer l. So, for example, the diagram below shows the weight on the connection from the fourth neuron in the second layer to the second neuron in the third layer:
At first this notation seems awkward and takes some effort to get used to. However, it will soon feel easy and natural. One quirk is the ordering of the indices j and k. You might think it would make more sense to use j for the input neuron and k for the output neuron, rather than the other way around, as we do. I will explain the reason for this quirk below.
We will use similar notation for the network's biases and activations. In particular, b^l_j will denote the bias of neuron j in layer l, and a^l_j will denote the activation of neuron j in layer l. The following diagram shows examples of this notation in use:
With this notation, the activation a^l_j of neuron j in layer l is related to the activations in layer (l-1) by the following equation (compare with equation (4) and the surrounding discussion in the last chapter):
a^l_j = \sigma\left( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \right) \tag{23}
where the sum is over all neurons k in layer (l-1). To rewrite this expression in matrix form, we define a weight matrix w^l for each layer l. The entries of the weight matrix are simply the weights connecting into layer l: the entry in row j and column k is w^l_{jk}. Similarly, for each layer l we define a bias vector b^l. You can probably guess how this works: the components of the bias vector are just the values b^l_j, one component for each neuron in layer l. And finally, we define an activation vector a^l, whose components are the activations a^l_j.
The final ingredient we need to rewrite (23) in matrix form is the idea of vectorizing a function such as σ. We met vectorization briefly in the last chapter: the idea is that we want to apply a function such as σ to every element of a vector v. We use the obvious notation σ(v) to denote this elementwise application of the function; that is, the components of σ(v) are simply σ(v)_j = σ(v_j). For example, if we have the function f(x) = x², then the vectorized form of f gives

f\left( \begin{bmatrix} 2 \\ 3 \end{bmatrix} \right) = \begin{bmatrix} f(2) \\ f(3) \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \end{bmatrix} \tag{24}

that is, the vectorized f simply squares every element of the vector.
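Since numpy's arithmetic operators already act elementwise on arrays, vectorization needs no special machinery in code; a tiny illustrative sketch (the function name f is just for this example):

```python
import numpy as np

def f(x):
    return x**2  # ** acts elementwise on numpy arrays

print(f(np.array([2, 3])))  # prints [4 9], matching the vectorized form above
```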
With this notation in hand, equation (23) can be rewritten in the beautiful, compact vectorized form
a^l = \sigma\left( w^l a^{l-1} + b^l \right) \tag{25}
This expression gives us a more global way of thinking about how the activations in one layer relate to the activations in the previous one: we just apply the weight matrix to the activations, add the bias vector, and then apply the sigmoid. Incidentally, it is this expression that motivates the quirk in the w^l_{jk} notation. If we used j to index the input neuron and k the output neuron, we would have to replace the weight matrix in equation (25) by its transpose. That is a small change, but an annoying one, and we would lose the easy simplicity of saying (and thinking) "apply the weight matrix to the activations". The global view is simpler and more succinct (and uses fewer indices!) than the neuron-by-neuron view. Think of it as a way of escaping index hell while remaining precise about what is going on. The expression is also useful in practice, because most matrix libraries provide fast ways of multiplying matrices, adding vectors, and vectorizing functions. Indeed, the code in the last chapter made direct use of this expression to compute the behavior of the network.
When using equation (25) to compute a^l, we compute the intermediate quantity z^l ≡ w^l a^{l-1} + b^l along the way. This quantity turns out to be useful enough to be worth naming: we call z^l the weighted input to the neurons in layer l. We will make considerable use of the weighted input later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as a^l = σ(z^l). It is also worth noting that z^l has components z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j; that is, z^l_j is just the weighted input to the activation function of neuron j in layer l.
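To make equations (23) and (25) concrete, here is a minimal numpy sketch of the forward pass in this chapter's notation. The variable names are my own for illustration; this is not the book's network.py code, though it computes the same thing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Repeatedly apply a^l = sigmoid(w^l a^{l-1} + b^l), layer by layer.
    weights is a list of matrices w^l and biases a list of column
    vectors b^l, one pair for each layer after the input layer."""
    for w, b in zip(weights, biases):
        z = np.dot(w, a) + b  # the weighted input z^l
        a = sigmoid(z)        # elementwise sigmoid gives a^l
    return a

# Example: a 2-3-1 network with random parameters
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal((3, 1)), rng.standard_normal((1, 1))]
x = np.array([[0.5], [-0.2]])
print(feedforward(x, weights, biases))
```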
Two necessary assumptions about the cost function
The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to every weight w and bias b in the network. For backpropagation to work, we need to make two main assumptions about the form of the cost function. Before stating them, though, it will help to have an example cost function in mind. We will use the quadratic cost function from the last chapter (equation (6)). In the notation of the previous section it looks like
C = \frac{1}{2n} \sum_x \| y(x) - a^L(x) \|^2 \tag{26}
where n is the total number of training examples; the sum is over the individual training examples x; y = y(x) is the corresponding desired output; L denotes the number of layers in the network; and a^L = a^L(x) is the vector of the network's output activations when x is the input.
Okay, so what assumptions do we need to make about the cost function C for backpropagation to apply? The first is that the cost function can be written as an average C = (1/n) \sum_x C_x over cost functions C_x for individual training examples x. This holds for the quadratic cost function, where the cost for a single training example is C_x = \frac{1}{2} \|y - a^L\|^2. The assumption will also hold for all the other cost functions we meet in this book.
We need this assumption because what backpropagation actually lets us do is compute the partial derivatives ∂C_x/∂w and ∂C_x/∂b for a single training example; we then recover ∂C/∂w and ∂C/∂b by averaging over the training examples. With this assumption in place, we will suppose the training example x is fixed and drop the subscript, writing the cost C_x simply as C. We will eventually put the x back, but for now it is better to leave it implicit.
The second assumption about the cost is that it can be written as a function of the outputs of the neural network:
For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example x can be written as
C = \frac{1}{2} \|y - a^L\|^2 = \frac{1}{2} \sum_j (y_j - a^L_j)^2 \tag{27}
which is indeed a function of the output activations. Of course, this cost function also depends on the desired output y, and you might wonder why we don't regard C as a function of y as well. Remember, though, that the input training example x is fixed, so the output y is also a fixed parameter. In particular, it is not something we can change by modifying the weights and biases; it is not something the neural network learns. So it makes sense to regard C as a function of the output activations a^L alone, with y merely a parameter that helps define that function.
The Hadamard product, s⊙t
The backpropagation algorithm is based on common linear-algebra operations: vector addition, multiplying a vector by a matrix, and so on. But one of the operations is less commonly used. Suppose s and t are two vectors of the same dimension. Then we use s⊙t to denote the elementwise product of the two vectors, so the components of s⊙t are simply (s⊙t)_j = s_j t_j. For example,

\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 3 \\ 2 \cdot 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \end{bmatrix} \tag{28}
This kind of elementwise multiplication is sometimes called the Hadamard product or the Schur product. We will refer to it as the Hadamard product. Good matrix libraries usually provide a fast implementation of the Hadamard product, which comes in handy when implementing backpropagation.
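In numpy, for instance, the * operator on arrays of equal shape is already elementwise, so the Hadamard product needs no special routine; a one-line sketch:

```python
import numpy as np

s = np.array([1.0, 2.0])
t = np.array([3.0, 4.0])
print(s * t)  # [3. 8.]: numpy's * on arrays is exactly the Hadamard product
```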
The four fundamental equations behind backpropagation
Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately this means computing the partial derivatives ∂C/∂w^l_{jk} and ∂C/∂b^l_j. But to compute those, we first compute an intermediate quantity δ^l_j, which we call the error in neuron j in layer l. Backpropagation will give us a procedure for computing the error δ^l_j, and will then relate δ^l_j to ∂C/∂w^l_{jk} and ∂C/∂b^l_j.
To understand how the error is defined, imagine that a demon has taken up residence in our neural network:
The demon sits at neuron j in layer l. As the input comes in, the demon meddles with the neuron's operation. It adds a small change Δz^l_j to the neuron's weighted input, so that instead of outputting σ(z^l_j), the neuron outputs σ(z^l_j + Δz^l_j). This change propagates through the later layers of the network, ultimately changing the overall cost by approximately (∂C/∂z^l_j) Δz^l_j.
But this is a good demon, and it is trying to help you reduce the cost, that is, to find a Δz^l_j that makes the cost smaller. Suppose ∂C/∂z^l_j has a large value (positive or negative). Then the demon can reduce the cost substantially by choosing Δz^l_j with the opposite sign to ∂C/∂z^l_j. By contrast, if ∂C/∂z^l_j is close to zero, the demon can hardly improve the cost at all by perturbing the weighted input z^l_j. As far as the demon can tell, the neuron is already close to optimal (this, of course, is true only for small Δz^l_j; let us suppose the demon is constrained to make such small changes). So in a heuristic sense, ∂C/∂z^l_j is a measure of the error in the neuron.
Motivated by this story, we define the error δ^l_j of neuron j in layer l by
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j} \tag{29}
As usual, we use δ^l to denote the vector of errors associated with layer l. Backpropagation will give us a way of computing δ^l for every layer, and then of relating those errors to the quantities of real interest, ∂C/∂w^l_{jk} and ∂C/∂b^l_j.
You might wonder why the demon changes the weighted input z^l_j. Surely it would be more natural to imagine the demon changing the output activation a^l_j, so that we would use ∂C/∂a^l_j as our measure of error. In fact, if you do that, things work out very similarly to the discussion below. But it makes the presentation of backpropagation a little more algebraically complicated. So we will stick with δ^l_j = ∂C/∂z^l_j as our measure of error.
In classification problems the term "error" is sometimes used to mean the rate of misclassification. For example, if a neural network correctly classifies 96.0% of digits, then the error is 4.0%. Obviously this is quite different from what we mean by the vectors δ. In practice, though, it is usually easy to tell which meaning is intended.
Plan of attack: backpropagation is based on four fundamental equations. Together, they give us a way of computing both the error δ^l and the gradient of the cost function. I state the four equations below. Be warned, though: you should not expect to assimilate them instantly; such an expectation will lead to disappointment. The backpropagation equations are rich enough that understanding them well requires considerable time and patience as you gradually deepen your grasp of them. The good news is that such patience is repaid many times over. So the discussion in this section is merely a beginning, helping you along the way to a thorough understanding of the equations.
Here is a preview of how we will dig into the equations later in the chapter: I will give a short proof that helps explain why they are true; we will restate them in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real Python code; and in the final part of the chapter we will develop an intuitive picture of what the backpropagation equations mean, and how one might discover them from scratch. We will return to the four fundamental equations repeatedly, and the deeper your understanding becomes, the more comfortable, and perhaps even beautiful and natural, they will seem.
An equation for the error in the output layer, δ^L: the components of δ^L are given by
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \tag{BP1}
This is a very natural expression. The first term on the right, ∂C/∂a^L_j, measures how fast the cost changes as a function of the j-th output activation. If, for example, C does not depend much on a particular output neuron j, then δ^L_j will be small, which is what we would expect. The second term on the right, σ'(z^L_j), measures how fast the activation function σ is changing at z^L_j.
Notice that everything in (BP1) is easy to compute. In particular, we compute z^L_j while computing the behavior of the network, and it is only a small additional overhead to compute σ'(z^L_j). The exact form of ∂C/∂a^L_j depends, of course, on the form of the cost function. But provided the cost function is known, there should be little trouble computing ∂C/∂a^L_j. For example, if we use the quadratic cost function, then C = \frac{1}{2} \sum_j (y_j - a^L_j)^2, so ∂C/∂a^L_j = (a^L_j - y_j), which is easy to compute.
Equation (BP1) is a componentwise expression for δ^L. It is a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it is easy to rewrite in matrix form, as
\delta^L = \nabla_a C \odot \sigma'(z^L) \tag{BP1a}
Here ∇_a C is defined to be the vector whose components are the partial derivatives ∂C/∂a^L_j. You can think of it as expressing the rate of change of C with respect to the output activations. It is easy to see that equations (BP1a) and (BP1) are equivalent, so from now on we will use (BP1) to refer to both interchangeably. As an example, in the case of the quadratic cost we have ∇_a C = (a^L - y), so the fully matrix-based form of (BP1) becomes
\delta^L = (a^L - y) \odot \sigma'(z^L) \tag{30}
Everything in this expression has a nice vector form, and it is easy to compute using a library such as Numpy.
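As a sketch of how equation (30) might look in numpy, assuming z_L and y are vectors already available from a forward pass (the names here are illustrative, not from network.py):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z_L = np.array([0.5, -1.2])   # weighted inputs to the output layer
a_L = sigmoid(z_L)            # output activations
y = np.array([1.0, 0.0])      # desired output
delta_L = (a_L - y) * sigmoid_prime(z_L)  # (BP1) for the quadratic cost
print(delta_L)
```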
An equation for the error δ^l in terms of the error in the next layer, δ^{l+1}: in particular,
\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \tag{BP2}
where (w^{l+1})^T is the transpose of the weight matrix w^{l+1} for layer (l+1). The equation looks complicated, but each element has a nice interpretation. Suppose we know the error δ^{l+1} at layer (l+1). Applying the transposed weight matrix (w^{l+1})^T can be thought of as moving the error backward through the network, giving us some sort of measure of the error at the output of layer l. We then take the Hadamard product ⊙σ'(z^l). This moves the error backward through the activation function in layer l, giving us the error δ^l in the weighted input to layer l.
By combining (BP2) with (BP1) we can compute the error δ^l for any layer in the network. We start by using (BP1) to compute δ^L, then apply equation (BP2) to compute δ^{L-1}, then (BP2) again to compute δ^{L-2}, and so on, all the way back through the network.
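A minimal sketch of this backward sweep, assuming lists weights ([w^2, …, w^L]) and zs ([z^2, …, z^L]) saved during a forward pass, and the output error delta_L from (BP1); the names are illustrative, not the book's network.py:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def backward_errors(weights, zs, delta_L):
    """Apply (BP2) repeatedly, starting from the output error delta_L,
    to obtain the error vectors for every layer from 2 up to L."""
    deltas = [delta_L]
    for w_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        # (BP2): delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
        deltas.insert(0, np.dot(w_next.transpose(), deltas[0]) * sigmoid_prime(z))
    return deltas
```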
An equation for the rate of change of the cost with respect to any bias in the network: in particular,
\frac{\partial C}{\partial b^l_j} = \delta^l_j \tag{BP3}
That is, the error δ^l_j is exactly equal to the rate of change ∂C/∂b^l_j. This is great news, since (BP1) and (BP2) have already told us how to compute δ^l_j. We can rewrite (BP3) in shorthand as
\frac{\partial C}{\partial b} = \delta \tag{31}
where it is understood that δ is being evaluated at the same neuron as the bias b.
An equation for the rate of change of the cost with respect to any weight in the network: in particular,
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{BP4}
This tells us how to compute the partial derivative ∂C/∂w^l_{jk} in terms of the quantities δ^l and a^{l-1}, which we already know how to compute. The equation can be rewritten in a less index-heavy form as
\frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out} \tag{32}
where a_in is the activation of the neuron providing input to the weight w, and δ_out is the error of the neuron receiving output from the weight w. Zooming in on the weight w and the two neurons it connects, we can picture this as:
A nice consequence of equation (32) is that when the activation a_in is small, a_in ≈ 0, the gradient term ∂C/∂w also tends to be small. In that case we say the weight learns slowly, meaning it does not change much during gradient descent. In other words, one consequence of (BP4) is that weights output from low-activation neurons learn slowly.
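In code, (BP3) and (BP4) each take a single line. A hedged sketch with illustrative values, where delta is the error vector of layer l and a_prev holds the activations a^{l-1}, both as column vectors:

```python
import numpy as np

delta = np.array([[0.1], [-0.3]])         # error delta^l (2 neurons in layer l)
a_prev = np.array([[0.9], [0.0], [0.4]])  # activations a^{l-1} (3 neurons)
nabla_b = delta                              # (BP3): dC/db^l_j = delta^l_j
nabla_w = np.dot(delta, a_prev.transpose())  # (BP4): dC/dw^l_{jk} = a^{l-1}_k delta^l_j
print(nabla_w)  # the column for the zero activation is all zeros: that weight learns slowly
```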
Other insights along these lines can be gleaned from (BP1)-(BP4). Let's start with the output layer. Consider the term σ'(z^L_j) in (BP1). Recall from the graph of the sigmoid in the last chapter that it becomes flat as σ(z^L_j) approaches 0 or 1. In those cases σ'(z^L_j) ≈ 0. So a weight in the final layer will learn slowly if the output neuron's activation is either low (≈ 0) or high (≈ 1). In this situation the output neuron is usually said to be saturated, and as a result the weight has stopped learning (or is learning slowly). The same remarks apply to the bias of the output neuron.
We can obtain similar insights about earlier layers. In particular, note the term σ'(z^l) in (BP2). This means δ^l_j is likely to be small if the neuron is near saturation. And that, in turn, means any weights input to a saturated neuron will learn slowly (the argument fails, though, if (w^{l+1})^T δ^{l+1} has entries large enough to compensate for the smallness of σ'(z^l_j)).
Summing up: we have learned that a weight will learn slowly if either the input neuron has low activation, or the output neuron is saturated, that is, has either high or low activation.
None of this is especially surprising. Still, these observations sharpen our mental model of what happens as a network learns. Moreover, we can turn this reasoning around. The four fundamental equations hold for any activation function, not just the standard sigmoid (since, as we will see below, the proofs use no special properties of the sigmoid). So the equations can be used to design activation functions with particular desired learning properties. For example, suppose we chose a (non-sigmoid) activation function σ such that σ' is always positive and never gets close to zero. That would prevent the slowdown in learning that occurs when ordinary sigmoid neurons saturate. Later in the book we will see examples where activation functions are modified in this kind of way. Keeping equations (BP1)-(BP4) in mind helps explain why such modifications are tried, and what effect they can have.
Summary: the equations of backpropagation
Problems
Alternative presentation of the backpropagation equations. I have stated the backpropagation equations using the Hadamard product. This can be disconcerting for readers unused to that product. There is an alternative approach, based on ordinary matrix multiplication, which some readers may find instructive. Show that (BP1) can be rewritten as
\delta^L = \Sigma'(z^L) \nabla_a C \tag{33}
where Σ'(z^L) is a square matrix whose diagonal entries are the values σ'(z^L_j) and whose off-diagonal entries are zero. Note that this matrix acts on ∇_a C by ordinary matrix multiplication.
For readers comfortable with matrix multiplication, this equation may be easier to understand than (BP1) and (BP2). I focus on (BP1) and (BP2) because that approach turns out to be faster to implement numerically. [Here Σ is not a summation sign (∑) but a capital sigma. — Translator's note]
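A quick numerical check of the equivalence, with small made-up vectors standing in for ∇_a C and z^L:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

z_L = np.array([0.3, -0.7, 1.1])
grad_a = np.array([0.2, -0.5, 0.1])             # stands in for nabla_a C
hadamard_form = grad_a * sigmoid_prime(z_L)      # (BP1a)
matrix_form = np.dot(np.diag(sigmoid_prime(z_L)), grad_a)  # (33)
print(np.allclose(hadamard_form, matrix_form))   # True: the two forms agree
```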
Proof of the four fundamental equations (optional section)
We now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule from multivariable calculus. If you are comfortable with the chain rule, I strongly encourage you to attempt the derivations yourself before reading on.
We start with equation (BP1), which gives an expression for the output error δ^L. To prove it, recall that by definition
\delta^L_j = \frac{\partial C}{\partial z^L_j} \tag{36}
Applying the chain rule, we can re-express this partial derivative in terms of partial derivatives with respect to the output activations:

\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} \tag{37}
where the sum is over all neurons k in the output layer. Of course, the output activation a^L_k of neuron k depends only on the weighted input z^L_j for neuron j when k = j. So ∂a^L_k/∂z^L_j vanishes when k ≠ j. As a result we can simplify the previous equation to

\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j} \tag{38}
Recalling that a^L_j = σ(z^L_j), we can write the second term on the right as σ'(z^L_j), and the equation becomes
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \tag{39}
which is just (BP1) in component form.
Next we prove (BP2), which gives an equation for the error δ^l in terms of the error in the next layer, δ^{l+1}. To do this we want to rewrite δ^l_j = ∂C/∂z^l_j in terms of δ^{l+1}_k = ∂C/∂z^{l+1}_k. We can do so using the chain rule:
\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k \tag{42}
where in the last step we swapped the two factors on the right and substituted the definition of δ^{l+1}_k. To evaluate the first factor in the last expression, note that
z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j + b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) + b^{l+1}_k
Differentiating, we get
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j)
Substituting this into (42), we get
\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \, \sigma'(z^l_j)
This is just (BP2) written in component form.
It remains to prove (BP3) and (BP4). They also follow from the chain rule, in much the same way as the two previous ones. I leave them to you as an exercise.
Exercise
Prove (BP3) and (BP4).
That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated, but it is really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multivariable calculus. That is really all backpropagation is; the rest is just details.
The backpropagation algorithm
The backpropagation equations give us a method for computing the gradient of the cost function. Let's write this out explicitly as an algorithm:
1. Input x: set the corresponding activation a^1 for the input layer.
2. Feedforward: for each l = 2, 3, …, L compute z^l = w^l a^{l-1} + b^l and a^l = σ(z^l).
3. Output error δ^L: compute the vector δ^L = ∇_a C ⊙ σ'(z^L).
4. Backpropagate the error: for each l = L-1, L-2, …, 2 compute δ^l = ((w^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).
5. Output: the gradient of the cost function is given by ∂C/∂w^l_{jk} = a^{l-1}_k δ^l_j and ∂C/∂b^l_j = δ^l_j.
Examining the algorithm, you can see why it is called backpropagation: we compute the error vectors δ^l backward, starting from the final layer. It may seem strange that we move through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of the network's outputs. To understand how the cost varies with earlier weights and biases, we need to apply the chain rule over and over, working backward through the layers to obtain usable expressions.
Exercises
Backpropagation with a single modified neuron. Suppose we modify a single neuron in a feedforward network so that its output is given by f(\sum_j w_j x_j + b), where f is some function other than the sigmoid. How should we modify the backpropagation algorithm in this case?

Backpropagation with linear neurons. Suppose we replace the usual nonlinear sigmoid with σ(z) = z throughout the network. Rewrite the backpropagation algorithm for this case.
As I explained above, the backpropagation algorithm computes the gradient of the cost function for a single training example, C = C_x. In practice, backpropagation is usually combined with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of m training examples, the following algorithm applies a gradient descent step based on that mini-batch:
1. Input: a set of training examples.
2. For each training example x: set the corresponding input activation a^{x,1} and perform the following steps:
   - Feedforward: for each l = 2, 3, …, L compute z^{x,l} = w^l a^{x,l-1} + b^l and a^{x,l} = σ(z^{x,l}).
   - Output error δ^{x,L}: compute the vector δ^{x,L} = ∇_a C_x ⊙ σ'(z^{x,L}).
   - Backpropagate the error: for each l = L-1, L-2, …, 2 compute δ^{x,l} = ((w^{l+1})^T δ^{x,l+1}) ⊙ σ'(z^{x,l}).
3. Gradient descent: for each l = L, L-1, …, 2 update the weights according to the rule w^l → w^l − (η/m) \sum_x δ^{x,l} (a^{x,l-1})^T, and the biases according to the rule b^l → b^l − (η/m) \sum_x δ^{x,l}.
Of course, to implement stochastic gradient descent in practice you also need an outer loop that generates mini-batches of training examples, and an outer loop stepping through multiple epochs of training. I have omitted those for simplicity.
The code for backpropagation
Having understood backpropagation in the abstract, we can now understand the code used in the previous chapter to implement it. Recall from that chapter that the code lived in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the network's weights and biases by computing the gradient for the current mini_batch of training examples:
```python
class Network(object):
    ...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying gradient
        descent using backpropagation to a single mini-batch. The
        mini_batch is a list of tuples (x, y), and eta is the learning
        rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
```
Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y), which uses the backprop method to compute the partial derivatives ∂C_x/∂b^l_j and ∂C_x/∂w^l_{jk}. The backprop method follows the algorithm of the previous section closely. There is one small difference: we use a slightly different approach to indexing the layers. The change is made to take advantage of a feature of Python, negative list indices, which count backward from the end of a list; l[-3], for instance, is the third-to-last entry of the list l. The code for backprop is below, together with a few helper functions used to compute the sigmoid, its derivative, and the derivative of the cost function. With these, the code should be complete and understandable. If something is unclear, consult the original description (and full listing) of the code in the last chapter.
```python
class Network(object):
    ...
    def backprop(self, x, y):
        """Return a tuple (nabla_b, nabla_w) representing the gradient
        of the cost function C_x. nabla_b and nabla_w are layer-by-layer
        lists of numpy arrays, similar to self.biases and self.weights."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x]  # list to store all the activations, layer by layer
        zs = []  # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently from the notation in the chapter. Here l = 1 means
        # the last layer of neurons, l = 2 the second-to-last layer, and
        # so on. It is a renumbering of the scheme in the chapter, used
        # to take advantage of Python's negative indices.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    ...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives ∂C_x/∂a for the
        output activations."""
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
```
Problem
Fully matrix-based approach to backpropagation over a mini-batch. Our implementation of stochastic gradient descent loops over the training examples in a mini-batch. The backpropagation algorithm can be modified so that it computes the gradients for all the training examples in the mini-batch simultaneously. Instead of beginning with a single input vector x, we begin with a matrix X = [x_1 x_2 … x_m] whose columns are the vectors in the mini-batch. Forward propagation proceeds by multiplying by weight matrices, adding a suitable matrix of biases, and applying the sigmoid everywhere. Backpropagation follows the same pattern. Write out pseudocode for this approach to the backpropagation algorithm, and modify network.py to use it. The advantage of this approach is that it takes full advantage of modern linear-algebra libraries, and as a result can run considerably faster than looping over the mini-batch (on my computer, for example, it speeds the program up by roughly a factor of two on MNIST classification problems). In practice, all serious libraries for backpropagation use this fully matrix-based approach or some variant of it.
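To make the setup concrete without giving the whole solution away, here is a hedged sketch of the fully matrix-based forward pass only; X stacks the mini-batch examples as columns, and numpy broadcasting adds the bias vector to every column (the names are mine, not network.py's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward_batch(X, weights, biases):
    """Forward pass for a whole mini-batch at once.
    X has shape (n_in, m): one column per training example."""
    A = X
    for w, b in zip(weights, biases):
        # np.dot(w, A) handles all m examples in one matrix product;
        # adding b of shape (n, 1) broadcasts across the columns.
        A = sigmoid(np.dot(w, A) + b)
    return A
```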
In what sense is backpropagation a fast algorithm?
In what sense is backpropagation a fast algorithm? To answer the question, consider another approach to computing the gradient. Imagine it is the early days of neural networks research. Perhaps it is the 1950s or 1960s, and you are the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus and decide to see whether you can use the chain rule to compute the gradient. After playing around for a while, the algebra looks complicated, and you get discouraged. So you look for a different approach. You decide to regard the cost as a function of the weights alone, C = C(w) (we will get back to the biases in a moment). You number the weights w_1, w_2, …, and want to compute ∂C/∂w_j for some particular weight w_j. An obvious way of doing that is to use the approximation

\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon} \tag{46}
where ε > 0 is a small positive number and e_j is the unit vector in the j-th direction. In other words, we can estimate ∂C/∂w_j by computing the cost C for two slightly different values of w_j and then applying equation (46). The same idea lets us compute the partial derivatives ∂C/∂b with respect to the biases.
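A hedged sketch of this finite-difference estimate for a generic cost function (the helper name numerical_gradient is hypothetical):

```python
import numpy as np

def numerical_gradient(C, w, eps=1e-5):
    """Estimate dC/dw_j for every weight using equation (46).
    Each weight costs one extra evaluation of C, which is exactly
    why this approach is so slow for large networks."""
    grad = np.zeros_like(w)
    base = C(w)                  # C(w): one forward pass
    for j in range(len(w)):
        w_step = w.copy()
        w_step[j] += eps         # w + eps * e_j
        grad[j] = (C(w_step) - base) / eps
    return grad

# Toy example: cost C(w) = sum(w^2), whose true gradient is 2w
print(numerical_gradient(lambda w: np.sum(w**2), np.array([1.0, -2.0])))
```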
This approach looks very promising. It is conceptually simple, extremely easy to implement, and uses only a few lines of code. It certainly looks far more promising than the idea of using the chain rule to compute the gradient!
Unfortunately, although the approach appears promising, when you implement the code it turns out to be extremely slow. To see why, imagine we have a million weights in our network. Then for each distinct weight w_j we need to compute C(w + εe_j) in order to estimate ∂C/∂w_j. That means that to compute the gradient we need to evaluate the cost function a million separate times, requiring a million forward passes through the network (per training example). We also need to compute C(w), so that makes a million and one passes through the network in total.
What is clever about backpropagation is that it lets us compute all the partial derivatives ∂C/∂w_j simultaneously, using just one forward pass through the network followed by one backward pass. Roughly speaking, the computational cost of the backward pass is about the same as that of the forward pass.
So the total cost of backpropagation is roughly the same as just two forward passes through the network. Compare that with the million and one forward passes required by approach (46)! And so, even though backpropagation superficially looks like the more complicated approach, in reality it is far, far faster.
This speedup was first fully appreciated in 1986, and it dramatically expanded the range of problems that neural networks could solve. That, in turn, drove a surge in the number of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people were already running up against its limits, particularly when trying to use backpropagation to train deep neural networks, that is, networks with many hidden layers. Later in the book we will see how modern computers and some clever new ideas make it possible to use backpropagation to train such deep networks.
Backpropagation: the big picture
As I have explained, backpropagation presents us with two mysteries. First, what is the algorithm really doing? We have developed a picture of the error being propagated backward from the output. But can we go deeper and build a more intuitive sense of what is happening during all those matrix and vector multiplications? The second mystery is how anyone could ever have discovered backpropagation in the first place. It is one thing to follow the steps of an algorithm, or even the proof that it works. But that does not mean you understand the problem so well that you could have invented the algorithm yourself. Is there a plausible line of reasoning that could lead us to discover the backpropagation algorithm? In this section I will address both mysteries.
To improve our intuition about what the algorithm is doing, let's imagine that we make a small change Δw^l_{jk} to some weight w^l_{jk} in the network:
This change in weight will lead to a change in the output activation of the corresponding neuron:
This will change all the activations of the next layer:
Those changes will in turn cause changes in the next layer, and so on all the way through to the final layer, and then to the cost function:
The change ΔC in the cost is related to the change Δw^l_{jk} in the weight by the equation
\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk} \tag{47}
This suggests a possible approach to computing ∂C/∂w^l_{jk}: carefully track how a small change in w^l_{jk} propagates and causes a small change in C. If we can do that, carefully expressing everything along the way in quantities that are easy to compute, then we should be able to compute ∂C/∂w^l_{jk}.
Let's try to carry this out. The change Δw^l_{jk} causes a small change Δa^l_j in the activation of neuron j in layer l. This change is given by

\Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}
The change in activation Δa^l_j then causes changes in all the activations in the next layer, layer (l+1). We will concentrate on just one of those affected activations, say a^{l+1}_q:

\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j
Of course, the change Δa^{l+1}_q will in turn cause changes in the activations of the next layer. In fact, we can imagine a path all the way through the network from w^l_{jk} to C, in which each change in activation causes a change in the next activation, and finally a change in the cost at the output. If the path passes through activations a^l_j, a^{l+1}_q, …, a^{L-1}_n, a^L_m, then the resulting expression is

\Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}
That is, we pick up a factor of the form ∂a/∂a for each subsequent neuron the path passes through, as well as the factor ∂C/∂a^L_m at the end. This represents the change in C due to changes in the activations along this one particular path through the network. Of course, there are many paths by which a change in w^l_{jk} can propagate and affect the cost, and we have considered only one. To compute the total change in C, it is plausible that we should sum over all possible paths from the weight to the final cost:

\Delta C \approx \sum_{mnp \ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}
where the sum is over all possible choices of the intermediate neurons along the path. Comparing with (47), we see that
\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp \ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \tag{53}
Equation (53) looks complicated. Nonetheless, it has a nice intuitive interpretation. We are computing the rate of change of C with respect to a weight in the network. The equation tells us that every edge between two neurons in the network is associated with a rate factor, which is just the partial derivative of one neuron's activation with respect to the other neuron's activation. The edge from the weight itself to the first neuron has rate factor ∂a^l_j/∂w^l_{jk}. The rate factor for a whole path is just the product of the rate factors along the path. And the total rate of change ∂C/∂w^l_{jk} is the sum of the rate factors over all paths from the initial weight to the final cost. This procedure is illustrated below for a single path:
What we have so far is a heuristic argument, a way of picturing what happens when you perturb a weight in the network. Let me sketch a line of thinking you could use to develop the argument further. First, you could derive explicit expressions for all the individual partial derivatives in equation (53); that is easy with a little calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix products. This turns out to be tedious and to require persistence, but not extraordinary insight. After doing all this and simplifying as much as possible, you discover that you end up with exactly the backpropagation algorithm! So you can think of the backpropagation algorithm as a way of computing the sum of the rate factors over all these paths. Or, put slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and affect the cost.
I am not going to work through all this here. It is messy, and requires considerable care with the details. If you are up for a challenge, you may enjoy attempting it. If not, I hope this way of thinking still gives you some insight into what backpropagation is accomplishing.
What about the other mystery, how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched, you will discover a proof of backpropagation. Unfortunately, that proof is quite a bit longer and more complicated than the one I gave earlier in the chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, afterward, several obvious simplifications stare you in the face. You make those simplifications, get a shorter proof, and write that out. Then several more obvious simplifications jump out at you, and you repeat the process. After a few iterations, the result is the proof we saw earlier: short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to take this on trust, but there really is no great mystery about the origin of the earlier proof. It is just a lot of hard work simplifying the proof I sketched in this section.
There is, however, one clever trick in that process. In equation (53) the intermediate variables are activations such as a^{l+1}_q. The trick is to switch to using weighted inputs, such as z^{l+1}_q, as the intermediate variables. If you don't do this, and instead continue using the activations, the proof you obtain turns out to be slightly more complex than the one given earlier in the chapter.