Over time you will become much more efficient at writing the backward pass, even for complicated circuits and all at once. Let's practice backprop a little on a few examples. In what follows, we will simply use variables such as a, b, c, x, and refer to their gradients as da, db, dc, dx respectively. Again, we think of the variables as the “forward flow” and their gradients as the “backward flow” along every wire. Our first example was the * gate:
var x = a * b; var da = b * dx; var db = a * dx;
In the code above, I am assuming that the variable dx is given, coming from somewhere above us in the circuit while we do backpropagation (otherwise it defaults to +1). I write it out explicitly because I want to show clearly how the gradients get chained together. Note from the equations that the * gate acts as a switcher during the backward pass: it remembers what its inputs were, and the gradient on each input equals the value of the other input during the forward pass. And then, of course, we have to multiply by the gradient from above, which is the chain rule. Here is what the + gate looks like in this condensed form:
var x = a + b; var da = 1.0 * dx; var db = 1.0 * dx;
Where 1.0 is the local gradient, and the multiplication is our chain rule. But what about adding up three numbers?
var q = a + b; var x = q + c; // break the sum into two + gates
var dc = 1.0 * dx; var dq = 1.0 * dx; var da = 1.0 * dq; var db = 1.0 * dq;
Do you see what is going on? If you remember the backward flow diagram, the + gate simply takes the gradient from above and routes it equally to all of its inputs (because its local gradient is always simply 1.0 for all its inputs, regardless of their actual values). So we can do it much faster:
var x = a + b + c; var da = 1.0 * dx; var db = 1.0 * dx; var dc = 1.0 * dx;
Okay, how about combining gates?
var x = a * b + c; var da = b * dx; var db = a * dx; var dc = 1.0 * dx;
If you don't see how the above came about, introduce a temporary variable q = a * b and then compute x = q + c to convince yourself. And here is our neuron; let's do it in two steps:
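Here is a minimal sketch of those two steps, assuming the single neuron from earlier in the chapter has the form f = sig(a*x + b*y + c), where x and y here denote the neuron's inputs (not the output variable used in the other examples) and sig(q) = 1/(1 + exp(-q)) is the sigmoid, whose local gradient is sig(q) * (1 - sig(q)):
var sig = function(q) { return 1 / (1 + Math.exp(-q)); }; // sigmoid helper
// step 1: forward pass
var q = a * x + b * y + c; // linear part
var f = sig(q); // non-linearity
// step 2: backward pass
var df = 1.0; // gradient at the output, defaults to +1
var dq = (f * (1 - f)) * df; // local gradient of the sigmoid, chained with df
var da = x * dq; var dx = a * dq; // backprop through the * gates
var db = y * dq; var dy = b * dq;
var dc = 1.0 * dq; // backprop through the + gate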
I hope this is starting to make a little more sense. Now how about this:
var x = a * a; var da = //???
You can think of this as the value a flowing toward the * gate, but the wire splits and becomes both of its inputs. This is actually simple, because the backward flow of gradients always adds up. In other words, nothing changes:
var da = a * dx; // gradient into a from the first branch
da += a * dx; // and add on the gradient from the second branch
// shorter form looks like this:
var da = 2 * a * dx;
I hope you remember that the derivative of f(a) = a^2 is 2a, which is exactly what we get if we think of it as a wire splitting into the two inputs of a * gate.
Let's look at another example:
var x = a*a + b*b + c*c; var da = 2*a*dx; var db = 2*b*dx; var dc = 2*c*dx;
Well, now let's complicate things a bit:
var x = Math.pow(((a * b + c) * d), 2);
When more complex cases like this come up in practice, I like to split the expression into manageable chunks, which are almost always made up of simpler expressions, and then chain them together with the chain rule:
var x1 = a * b + c; var x2 = x1 * d; var x = x2 * x2;
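Given dx from above, the backward pass then goes through these chunks in reverse order, from the last expression back to the first (with dx1 and dx2 as the gradients of the temporaries):
var dx2 = 2 * x2 * dx; // backprop through x2 * x2
var dx1 = d * dx2; // backprop through x1 * d
var dd = x1 * dx2;
var da = b * dx1; // backprop through a * b + c
var db = a * dx1;
var dc = 1.0 * dx1;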
That wasn't too bad. Those are the backprop equations for the entire expression: we worked through them piece by piece and backpropagated to all the variables. Notice again that for every variable in the forward pass we have an equivalent variable in the backward pass that holds its gradient with respect to the circuit's final output. Here are a few more useful functions and their local gradients that come up in practice:
var x = 1.0/a; var da = -1.0/(a*a) * dx;
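For instance, here is a sketch of how this division gate could show up in practice, decomposing an expression such as (a + b)/(c + d) into chunks; the temporaries x1, x2, x3 are just illustrative names:
var x1 = a + b;
var x2 = c + d;
var x3 = 1.0 / x2;
var x = x1 * x3; // same as (a + b)/(c + d)
// backward pass:
var dx1 = x3 * dx; var dx3 = x1 * dx; // backprop through the * gate
var dx2 = (-1.0 / (x2 * x2)) * dx3; // backprop through the 1/x gate
var da = 1.0 * dx1; var db = 1.0 * dx1; // backprop through the + gates
var dc = 1.0 * dx2; var dd = 1.0 * dx2;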
Hopefully you can see the pattern: we decompose expressions, do the forward pass, and then for every variable (such as a) we derive its gradient da as we go backwards, applying the simple local gradients one at a time and chaining them with the gradients from above. Here is another example:
var x = Math.max(a, b); var da = a === x ? 1.0 * dx : 0.0; var db = b === x ? 1.0 * dx : 0.0;
Here a very simple expression suddenly becomes harder to read. The max function passes through the input that was the largest and ignores the rest. In the backward pass, the max gate simply takes the gradient from above and routes it to the input that actually flowed through it during the forward pass. The gate acts as a simple switch based on which input had the highest value in the forward pass; the other inputs get a zero gradient. That is what the === check is for: we test which input was actually the maximum and route the gradient only to it.
Finally, let's look at the Rectified Linear Unit non-linearity (ReLU), which you have probably heard about. It is used in neural networks in place of the sigmoid function. It simply thresholds at zero:
var x = Math.max(a, 0)
In other words, this gate simply passes the value through if it is greater than 0, otherwise it stops the flow and sets it to zero. During the backward pass, the gate passes on the gradient from above if it was activated during the forward pass, and if the original input was below zero, it stops the gradient flow.
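As a sketch, the backward pass for this gate can be written like the other examples, assuming dx is the gradient from above:
var x = Math.max(a, 0); var da = a > 0 ? 1.0 * dx : 0.0; // gradient flows through only if the gate was active in the forward pass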
I will stop here. I hope you have gained some intuition for how you can compute entire expressions (which are made up of many gates) and how you can backpropagate through every one of them.
Everything we have done in this chapter comes down to this: we saw that we can feed some inputs through an arbitrarily complex real-valued circuit, tug at the end of the circuit with some force, and backpropagation distributes that tug through the entire circuit all the way back to the inputs. If the inputs then move slightly along the direction of their tug, the circuit will “give” a little along the original pull. Maybe this is not immediately obvious, but this machinery is a powerful hammer for machine learning.