
Introduction to Neural Networks

image

Artificial neural networks are now at the peak of their popularity. One may wonder whether the big name has done part of the marketing work for this class of models. I know some business managers who happily mention that their products use “artificial neural networks” and “deep learning”. Would they be as pleased if their products used “connected circles models” or “make-a-mistake-and-get-punished machines”? But, without a doubt, artificial neural networks are a worthwhile thing, as their success in many applications shows: image recognition, natural language processing, automated trading and self-driving cars. I work in data processing and analysis, but I did not understand neural networks, so I felt like a craftsman who had not mastered his tools. Eventually I did my “homework” and wrote this article to help others overcome the same obstacles that I encountered during my (still ongoing) training.

The R code for the examples presented in this article can be found here, in the Machine Learning Problem Bible. In addition, after reading this article it is worth exploring part 2, Neural Networks - A Worked Example, which walks through creating and programming a neural network from scratch.

We will start with a motivating problem: we have a collection of grayscale images, each a 2 × 2 grid of pixels in which every pixel has a brightness value from 0 (white) to 255 (black). Our goal is to build a model that finds images containing a “staircase” pattern.


At this stage we only care about finding a model that can fit the data reasonably. The fitting methodology will interest us later.

Pre-processing


In each image we label the pixels $x_1$, $x_2$, $x_3$, $x_4$ and form the input vector $x = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \end{bmatrix}$, which will be the input to our model. We expect our model to predict True (the image contains a staircase pattern) or False (the image does not contain a staircase pattern).



ImageId   x1    x2    x3    x4    IsStairs
1         252   4     155   175   TRUE
2         175   10    186   200   TRUE
3         82    131   230   100   FALSE
...       ...   ...   ...   ...   ...
498       36    187   43    249   FALSE
499       1     160   169   242   TRUE
500       198   134   22    188   FALSE

Single-layer perceptron (model iteration 0)


We can start with a simple model: a single-layer perceptron. A perceptron uses a weighted linear combination of the inputs to return a prediction score. If the prediction score exceeds a chosen threshold, the perceptron predicts True; otherwise it predicts False. More formally,

$$ f(x) = \begin{cases} 1 & \text{if } w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 > \text{threshold} \\ 0 & \text{otherwise} \end{cases} $$



Let's put it another way

$$ \widehat{y} = \mathbf{w} \cdot \mathbf{x} + b $$



$$ f(x) = \begin{cases} 1 & \text{if } \widehat{y} > 0 \\ 0 & \text{otherwise} \end{cases} $$


Here $\widehat{y}$ is our prediction score.

Graphically, we can represent the perceptron as input nodes that transmit data to the output node.

image


For our example, we build the following perceptron:

$$ \widehat{y} = -0.0019x_1 + 0.0016x_2 + 0.0020x_3 + 0.0023x_4 + 0.0003 $$


This is how the perceptron will work on some of the training images.

image
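For concreteness, here is a minimal R sketch (my own illustration, not the article's accompanying code) of the perceptron above applied to one training image; the weights come from the fitted equation and the image is row 1 of the table:

```r
# Single-layer perceptron (model iteration 0) with the weights shown above
w <- c(-0.0019, 0.0016, 0.0020, 0.0023)
b <- 0.0003

perceptron <- function(x) sum(w * x) + b   # prediction score y_hat

x <- c(252, 4, 155, 175)      # ImageId 1 from the training table (IsStairs = TRUE)
y_hat <- perceptron(x)        # roughly 0.24
y_hat > 0                     # TRUE: the score exceeds the threshold of 0
```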


This is certainly better than random guessing, and it makes some sense: all staircase patterns have dark pixels in the bottom row, which explains the larger positive coefficients on $x_3$ and $x_4$. However, there are obvious problems with this model.

  1. The model outputs a real number whose value correlates with the notion of similarity (the larger the value, the more likely the image contains a staircase), but there is no basis for interpreting these values as probabilities, because they can fall outside the interval [0, 1].
  2. The model cannot capture non-linear relationships between the variables and the target value. To see this, consider the following hypothetical scenarios:

Case A

Start with the image x = [100, 0, 0, 125] and increase $x_3$ from 0 to 60.

image


Case B

Start with the previous image, x = [100, 0, 60, 125], and increase $x_3$ from 60 to 120.

image


Intuitively, case A should increase $\widehat{y}$ much more than case B. However, since our perceptron model is a linear equation, the +60 increase in $x_3$ produces the same +0.12 increase in $\widehat{y}$ in both cases.

Our linear perceptron has other problems, but let's solve these two first.

Single-layer perceptron with a sigmoid activation function (model iteration 1)


We can solve problems 1 and 2 by wrapping our perceptron in a sigmoid function (and then re-fitting the weights). Recall that the sigmoid is an S-shaped curve bounded vertically between 0 and 1, which makes it a common choice for modelling the probability of a binary event.

$$ sigmoid(z) = \frac{1}{1 + e^{-z}} $$


image


Following this idea, we can update our model to the picture and equations below.

image


$$ z = \mathbf{w} \cdot \mathbf{x} + b = w_1x_1 + w_2x_2 + w_3x_3 + w_4x_4 + b $$


$$ \widehat{y} = sigmoid(z) = \frac{1}{1 + e^{-z}} $$


Looks familiar? Yes, this is our old friend, logistic regression. However, interpreting the model as a linear perceptron with a sigmoid activation function serves us better, because it sets us up for the more general approach to come. Also, since we can now interpret $\widehat{y}$ as a probability, we need to change the decision rule accordingly.

$$ f(x) = \begin{cases} 1 & \text{if } \widehat{y} > 0.5 \\ 0 & \text{otherwise} \end{cases} $$



Continuing with our example problem, suppose we fit the following model:

$$ \begin{bmatrix} w_1 & w_2 & w_3 & w_4 \end{bmatrix} = \begin{bmatrix} -0.140 & -0.145 & 0.121 & 0.092 \end{bmatrix} $$


$$ b = -0.008 $$


$$ \widehat{y} = \frac{1}{1 + e^{-(-0.140x_1 - 0.145x_2 + 0.121x_3 + 0.092x_4 - 0.008)}} $$


Let's observe how this model behaves on the same examples of images from the previous section.

image


We have definitely solved problem 1. Let's see how this model handles problem 2.

Case A

Start with the image [100, 0, 0, 100] and increase $x_3$ from 0 to 50.

image


Case B

Start with the image [100, 0, 50, 100] and increase $x_3$ from 50 to 100.

image


Notice how the curvature of the sigmoid causes case A to “fire” (increase quickly) as $z = \mathbf{w} \cdot \mathbf{x}$ grows, while the pace slows down as $z$ keeps increasing. This is consistent with our intuition that case A should reflect a larger increase in the probability of a staircase pattern than case B.

image
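To make the effect concrete, here is a small R sketch (mine, not the article's code) that evaluates the fitted sigmoid model on the case A and case B images; the probabilities in the comments are rounded:

```r
# Single-layer perceptron with a sigmoid activation (model iteration 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
w <- c(-0.140, -0.145, 0.121, 0.092)
b <- -0.008

predict_prob <- function(x) sigmoid(sum(w * x) + b)

predict_prob(c(100, 0,   0, 100))   # ~0.008
predict_prob(c(100, 0,  50, 100))   # ~0.78  -> case A gains about +0.77
predict_prob(c(100, 0, 100, 100))   # ~0.999 -> case B gains only about +0.22
```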

Unfortunately, this model still has problems.

  1. $\widehat{y}$ has a monotonic relationship with each variable. But what if we need to recognize stairs of a lighter shade?
  2. The model does not account for interactions between variables. Suppose the bottom row of the image is black. If the top-left pixel is white, then darkening the top-right pixel should increase the probability of a staircase pattern. If the top-left pixel is black, then darkening the top-right pixel should decrease that probability. In other words, an increase in $x_2$ should potentially lead to an increase or a decrease in $\widehat{y}$, depending on the values of the other variables. Our current model cannot achieve this.

Multilayer perceptron with a sigmoid activation function (model iteration 2)


We can solve both of the above problems by adding another layer to our perceptron model. We will build several base models like the one above, but then feed the output of each base model into the input of another perceptron. This model is in fact a “vanilla” neural network. Let's see how it might work on a few examples.

Example 1: recognizing the staircase pattern

  1. Build a model that fires when a “left staircase” is recognized, $\widehat{y}_{left}$
  2. Build a model that fires when a “right staircase” is recognized, $\widehat{y}_{right}$
  3. Score the base models so that the final sigmoid fires only if both values ($\widehat{y}_{left}$, $\widehat{y}_{right}$) are large

image


image


Another variant

  1. Build a model that fires when the bottom row is dark, $\widehat{y}_1$
  2. Build a model that fires when the top-left pixel is dark and the top-right pixel is light, $\widehat{y}_2$
  3. Build a model that fires when the top-left pixel is light and the top-right pixel is dark, $\widehat{y}_3$
  4. Score the base models so that the final sigmoid fires only when $\widehat{y}_1$ and $\widehat{y}_2$ are large, or when $\widehat{y}_1$ and $\widehat{y}_3$ are large. (Note that $\widehat{y}_2$ and $\widehat{y}_3$ cannot both be large at the same time.) A small numeric sketch of this wiring follows the figures below.


image


image
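The weights below are hand-picked for illustration (they are not fitted and not from the article), but they wire up exactly the decomposition just described: one hidden node for “dark bottom row”, one for “dark x1, light x2”, one for “dark x2, light x1”, and an output node that fires only when the first node and one of the other two fire:

```r
# A hand-wired two-layer network for the staircase pattern (hypothetical weights)
sigmoid <- function(z) 1 / (1 + exp(-z))

hidden_W <- rbind(c(0,    0,    0.1, 0.1),   # y1: dark bottom row (x3 and x4)
                  c(0.1, -0.1,  0,   0  ),   # y2: dark x1, light x2
                  c(-0.1, 0.1,  0,   0  ))   # y3: light x1, dark x2
hidden_b <- c(-28, -10, -10)
out_w    <- c(6, 6, 6)                       # fires when y1 and (y2 or y3) are large
out_b    <- -9

predict_stairs <- function(x) {
  h <- sigmoid(hidden_W %*% x + hidden_b)    # hidden-layer activations y1, y2, y3
  sigmoid(sum(out_w * h) + out_b)            # final probability of a staircase
}

predict_stairs(c(250,  10, 240, 245))   # left staircase  -> ~0.95
predict_stairs(c( 10, 250, 240, 245))   # right staircase -> ~0.95
predict_stairs(c(250, 250, 250, 250))   # all dark        -> ~0.05
```

The hidden weights play the role of the three base detectors, and the output weights implement a soft version of the required “$\widehat{y}_1$ and ($\widehat{y}_2$ or $\widehat{y}_3$)” logic.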


Example 2: recognizing staircases in a lighter shade

  1. Build models that fire on a “shaded bottom row”, “shaded x1 and white x2”, and “shaded x2 and white x1”: $\widehat{y}_1$, $\widehat{y}_2$ and $\widehat{y}_3$
  2. Build models that fire on a “dark bottom row”, “dark x1 and white x2”, and “dark x2 and white x1”: $\widehat{y}_4$, $\widehat{y}_5$ and $\widehat{y}_6$
  3. Combine the models so that the “dark” detectors are subtracted from the “shaded” detectors before squashing the result with a sigmoid


image

image

Terminology Note


A single-layer perceptron has one output layer. Accordingly, the models we just built would be called two-layer perceptrons, because they have an output layer that feeds into another output layer. However, we can call the same models neural networks, in which case the networks have three layers: the input layer, the hidden layer, and the output layer.



Alternative activation functions


In our examples we used a sigmoid activation function, but other activation functions can be used; tanh and relu are common choices. The activation function must be non-linear, otherwise the neural network collapses into an equivalent single-layer perceptron.
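As a tiny illustration (using R's built-in tanh and a hand-rolled relu; this is my sketch, not code from the article):

```r
relu <- function(z) pmax(0, z)   # rectified linear unit: 0 for negative inputs, identity otherwise

z <- c(-2, -0.5, 0, 0.5, 2)
tanh(z)   # S-shaped like the sigmoid, but bounded between -1 and 1
relu(z)   # piecewise linear, yet non-linear overall, so layers do not collapse
```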

Multi-class classification


We can easily extend our model to multi-class classification by using multiple nodes in the final output layer. The idea is that each output node corresponds to one of the $C$ classes we are trying to predict. Instead of squashing the output with a sigmoid, which maps an element of $\mathbb{R}$ to an element of the interval [0, 1], we can use the softmax function, which maps a vector in $\mathbb{R}^n$ to a vector in $\mathbb{R}^n$ whose elements sum to 1. In other words, we can design the network to output the vector [$prob(class_1)$, $prob(class_2)$, ..., $prob(class_C)$].
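A minimal sketch of the softmax function in R (my own illustration; the three class scores below are made up):

```r
softmax <- function(z) {
  e <- exp(z - max(z))   # subtracting max(z) is a standard numerical-stability trick
  e / sum(e)
}

softmax(c(2.0, 1.0, 0.1))   # maps three raw class scores to probabilities that sum to 1
```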



Using more than three layers (deep learning)


You may ask yourself: can we extend our “vanilla” neural network so that its output layer feeds into a fourth layer (and then a fifth, a sixth, and so on)? Yes. This is what is usually called “deep learning”, and in practice it can be very effective. However, it is worth noting that any network with more than one hidden layer can be mimicked by a network with a single hidden layer: indeed, by the universal approximation theorem, any continuous function can be approximated by a neural network with one hidden layer. The reason deep architectures are often chosen over single-hidden-layer networks is that during the fitting procedure they usually converge to a solution faster.



Fitting the model to labeled training samples (backpropagation of the training error)


Alas, we have finally arrived at the fitting procedure. Until now we talked about how neural networks can work effectively, but we did not discuss how a neural network is fitted to labeled training samples. An equivalent question would be: “How do we choose the best weights for the network, given some labeled training samples?”. The usual answer is gradient descent (although maximum likelihood estimation may also be appropriate). Continuing with our example problem, the gradient descent procedure might look like this (a compact code sketch follows the list):

  1. Start with some labeled training data.
  2. Choose a differentiable loss function to minimize, $L(\mathbf{\widehat{Y}}, \mathbf{Y})$
  3. Choose a network structure. In particular, decide on the number of layers and the number of nodes in each layer.
  4. Initialize the network with random weights.
  5. Run the training data through the network to generate a prediction for each sample, and measure the overall error according to the loss function $L(\mathbf{\widehat{Y}}, \mathbf{Y})$. (This is called forward propagation.)
  6. Determine how much the current loss changes with respect to a small change in each of the weights. In other words, compute the gradient of $L$ with respect to every weight in the network. (This is called backpropagation.)
  7. Take a small “step” in the direction of the negative gradient. For example, if $w_{23} = 1.5$ and $\frac{\partial L}{\partial w_{23}} = 2.2$, then decreasing $w_{23}$ by a small amount should slightly decrease the current loss. We therefore update $w_{23} := w_{23} - 2.2 \times 0.001$ (where 0.001 is the chosen “step size”).
  8. Repeat this process (from step 5) a fixed number of times or until the loss converges.
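Here is a compact R sketch of steps 1-8 for a network with one hidden layer (entirely my own illustration: the data are a synthetic stand-in for the labeled 2 × 2 images, and the hidden-layer size, step size and iteration count are arbitrary choices that may need tuning):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Step 1: synthetic labeled data standing in for the 2x2 images (pixels scaled to [0, 1])
set.seed(1)
n <- 500
X <- matrix(runif(n * 4), ncol = 4)                                  # columns x1..x4
dark <- X > 0.6
y <- as.numeric(dark[, 3] & dark[, 4] & xor(dark[, 1], dark[, 2]))   # 1 = staircase

# Steps 3 and 4: one hidden layer of 3 sigmoid nodes, random initial weights
H  <- 3
W1 <- matrix(rnorm(4 * H, sd = 0.5), nrow = 4); b1 <- rep(0, H)
w2 <- rnorm(H, sd = 0.5);                       b2 <- 0
step_size <- 0.5

for (i in 1:5000) {                                   # Step 8: repeat until convergence
  # Step 5: forward propagation
  A1    <- sigmoid(sweep(X %*% W1, 2, b1, "+"))       # hidden activations, n x H
  y_hat <- as.vector(sigmoid(A1 %*% w2 + b2))         # output probabilities
  # Step 2's loss is the mean log-loss; step 6: backpropagation of its gradient
  d_out <- (y_hat - y) / n                            # dL/dz at the output node
  dw2   <- as.vector(t(A1) %*% d_out)
  db2   <- sum(d_out)
  d_hid <- (d_out %o% w2) * A1 * (1 - A1)             # chain rule; reuses the stored sigmoid values
  dW1   <- t(X) %*% d_hid
  db1   <- colSums(d_hid)
  # Step 7: a small step in the direction of the negative gradient
  W1 <- W1 - step_size * dW1;  b1 <- b1 - step_size * db1
  w2 <- w2 - step_size * dw2;  b2 <- b2 - step_size * db2
}

mean((y_hat > 0.5) == y)   # approximate training accuracy after fitting
```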

At least that is the basic idea. In practice, implementing it runs into a number of difficulties.

Difficulty 1 - computational complexity


During fitting we need, among other things, to compute the gradient of $L$ with respect to every weight. This is hard because $L$ depends on every node in the output layer, each of those nodes depends on every node in the layer before it, and so on. This means that computing $\frac{\partial L}{\partial w_{ab}}$ turns into a real nightmare of chain-rule expressions. (Keep in mind that many real-world neural networks contain thousands of nodes across dozens of layers.) The key observation is that, when the chain rule is applied, most of the $\frac{\partial L}{\partial w_{ab}}$ reuse the same intermediate derivatives. If you keep careful track of them, you can avoid recomputing the same quantities thousands of times.

Another trick is to use special activation functions whose derivatives can be written as a function of their value. For example, the derivative of $sigmoid(x)$ is $sigmoid(x)(1 - sigmoid(x))$. This is convenient because during the forward pass, when computing $\widehat{y}$ for each training sample, we already have to compute $sigmoid(\mathbf{x})$ element-wise for some vector $\mathbf{x}$. During backpropagation we can reuse these values to compute the gradient of $L$ with respect to the weights, saving both time and memory.
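For instance, a small sketch of that reuse (my illustration, with arbitrary inputs):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- c(-2, 0, 3)
s <- sigmoid(z)            # computed and stored during the forward pass
d_sigmoid <- s * (1 - s)   # derivative obtained from the stored values, with no extra exp() calls
```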

The third trick is to split the training samples into “mini-batches” and to update the weights for each batch, one after another. For example, if we split the training data into {batch1, batch2, batch3}, then the first pass through the training data would be

  1. Change weights based on batch1
  2. Change weights based on batch2
  3. Change weights based on batch3

where the gradient of $L$ is recomputed after each update. A short sketch of this loop follows.
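A minimal mini-batch sketch for the single-layer sigmoid perceptron (synthetic data, toy labels and settings of my own choosing, not the article's code):

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(2)
n <- 500
X <- matrix(runif(n * 4), ncol = 4)             # synthetic 2x2 images, pixels scaled to [0, 1]
y <- as.numeric(X[, 3] > 0.6 & X[, 4] > 0.6)    # toy labels: "dark" bottom row

w <- rep(0, 4); b <- 0; step_size <- 0.5
batches <- split(sample(n), rep(1:5, each = 100))   # batch1..batch5, 100 shuffled samples each

for (epoch in 1:50) {
  for (idx in batches) {
    y_hat  <- as.vector(sigmoid(X[idx, ] %*% w + b))
    grad_w <- as.vector(t(X[idx, ]) %*% (y_hat - y[idx])) / length(idx)  # gradient on this batch only
    grad_b <- mean(y_hat - y[idx])
    w <- w - step_size * grad_w    # the weights change after every batch, not once per pass
    b <- b - step_size * grad_b
  }
}
```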

Finally, it is worth mentioning another technique: using a GPU instead of the CPU, because it is better suited to performing a large number of computations in parallel.

Difficulty 2 - gradient descent can struggle to find the global minimum


This is not so much a problem of neural networks as of gradient descent. During gradient descent the weights can get stuck in a local minimum; they can also overshoot the minimum. One way to deal with this is to use different step sizes. Another is to increase the number of nodes and/or layers in the network (but beware of overfitting). In addition, some heuristic techniques can be effective, for example using momentum.
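A sketch of the momentum idea (a generic update rule with illustrative hyperparameters, not from the article): the velocity term accumulates past gradients, which can carry the weights through small local dips.

```r
momentum_step <- function(w, grad, velocity, step_size = 0.01, beta = 0.9) {
  velocity <- beta * velocity - step_size * grad   # remember a fraction of the previous direction
  w <- w + velocity
  list(w = w, velocity = velocity)
}

# usage inside a training loop (the gradient would come from backpropagation):
state <- list(w = rep(0, 4), velocity = rep(0, 4))
grad  <- c(0.5, -0.2, 0.1, 0.0)                    # a made-up gradient, for illustration only
state <- momentum_step(state$w, grad, state$velocity)
```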

Difficulty 3 - how do we write a general implementation?


How would you write a program that can fit a neural network with any number of nodes and layers? The correct answer is: you don't, you use Tensorflow. But if you want to try, the hardest part is computing the gradient of the loss function. The trick is to notice that the gradient can be expressed as a recursive function: a five-layer neural network is simply a four-layer neural network feeding into some perceptrons, a four-layer neural network is just a three-layer neural network feeding into some perceptrons, and so on. More formally, this is called automatic differentiation.
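A toy sketch of that recursive structure in R (my own illustration, nowhere near a real framework like Tensorflow): an n-layer network is an (n-1)-layer network whose output feeds one more layer of sigmoid perceptrons.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

forward <- function(layers, x) {
  if (length(layers) == 0) return(x)        # base case: no layers left, return the input
  rest <- layers[-length(layers)]
  last <- layers[[length(layers)]]
  h <- forward(rest, x)                     # output of the smaller (n-1)-layer network
  sigmoid(h %*% last$W + last$b)            # one more layer of perceptrons on top
}

# two hidden layers of 3 nodes and an output layer of 1 node, random weights
set.seed(3)
layers <- list(
  list(W = matrix(rnorm(4 * 3), 4, 3), b = rnorm(3)),
  list(W = matrix(rnorm(3 * 3), 3, 3), b = rnorm(3)),
  list(W = matrix(rnorm(3 * 1), 3, 1), b = rnorm(1))
)
forward(layers, matrix(c(252, 4, 155, 175) / 255, nrow = 1))   # prediction for one image
```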

Source: https://habr.com/ru/post/342334/

