
Dropout: a method for solving the overfitting problem in neural networks



Overfitting is one of the problems of deep neural networks (Deep Neural Networks, DNN): the model explains well only the examples from the training set, adapting to them instead of learning to classify examples that did not take part in training (it loses its generalization ability). In recent years many solutions to the overfitting problem have been proposed, but one of them has surpassed all the others thanks to its simplicity and excellent practical results; this solution is Dropout (in Russian sources: "thinning method", "exclusion method" or simply "dropout").

Dropout




A graphical representation of the Dropout method, taken from the article in which it was first introduced. On the left is the neural network before Dropout is applied to it; on the right is the same network after Dropout.
The network shown on the left is the one used at test time, after the parameters have been learned.
The main idea of Dropout is, instead of training a single DNN, to train an ensemble of several DNNs and then average the results.

The networks for training are obtained by dropping neurons with probability $p$; thus the probability that a neuron stays in the network is $q = 1 - p$. "Dropping" a neuron means that it returns 0 for any input data or parameters.
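As a rough illustration (a minimal NumPy sketch, not the authors' code; the variable names are made up), dropping neurons can be emulated by multiplying a layer's activations by a random binary mask, so the dropped neurons output exactly 0:

import numpy as np

rng = np.random.default_rng(0)

p = 0.5                              # probability of dropping a neuron
activations = rng.normal(size=8)     # outputs of some hidden layer

# Bernoulli mask: 1 = keep the neuron (probability q = 1 - p), 0 = drop it
mask = rng.binomial(n=1, p=1 - p, size=activations.shape)

print(mask * activations)            # dropped neurons return 0 for any input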

Dropped neurons do not contribute to the learning process at any stage of the backpropagation algorithm; therefore dropping at least one neuron is equivalent to training a new neural network.

Quoting the authors:

In a standard neural network, the derivative received by each parameter tells it how it should change in order to minimize the final loss function, given what the other units are doing. Therefore, units may change in a way that fixes the mistakes of the other units. This can lead to excessive co-adaptation, which in turn leads to overfitting, because these co-adaptations do not generalize to data that did not take part in training. We hypothesize that Dropout prevents co-adaptation of each hidden unit by making the presence of the other hidden units unreliable. Therefore, a hidden unit cannot rely on other units to correct its own mistakes.

In a nutshell: Dropout works well in practice because it prevents neurons from co-adapting during the training phase.

Now that we have a rough idea of the Dropout method, let's take a closer look at it.

How Dropout Works


As already mentioned, Dropout switches neurons off with probability $p$ and, consequently, leaves them on with probability $q = 1 - p$.

The switch-off probability is the same for every neuron. This means the following.

Provided that

$$h(x) = xW + b$$

is a linear projection of the $d_i$-dimensional input $x$ onto a $d_h$-dimensional output space and $a(h)$ is an activation function, the application of Dropout to this projection at the training stage can be represented as a modified activation function:

$$f(h) = D \odot a(h),$$

where $D = (X_1, \dots, X_{d_h})$ is a $d_h$-dimensional vector of random variables $X_i$ distributed according to the Bernoulli law.

$X_i$ has the following probability distribution:

$$f(k; p) = \begin{cases} 1 - p, & \text{if } k = 1 \\ p, & \text{if } k = 0 \end{cases}$$

where $k$ ranges over the possible outcomes (1: the neuron is kept, 0: the neuron is dropped).

Obviously, this random variable perfectly matches the Dropout applied to a single neuron: the neuron is switched off with probability $p = P(X_i = 0)$ and kept on otherwise.

Let's look at applying Dropout to the $i$-th neuron:

$$O_i = X_i \, a\!\left(\sum_{k=1}^{d_i} w_k x_k + b\right) = \begin{cases} a\!\left(\sum_{k=1}^{d_i} w_k x_k + b\right), & \text{if } X_i = 1 \\ 0, & \text{if } X_i = 0 \end{cases}$$

where $P(X_i = 0) = p$.

Since at the training stage a neuron remains in the network with probability $q$, at the testing stage we need to emulate the behavior of the ensemble of neural networks used during training. To do this, the authors propose multiplying the activation function by the factor $q$ at the testing stage. Thus:

At the training stage: $O_i = X_i \, a\!\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$,

At the testing stage: $O_i = q \, a\!\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$.
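A minimal sketch of this direct formulation (hypothetical NumPy code, not a framework implementation): during training the activations are multiplied by a Bernoulli mask, and at test time they are multiplied by the keep probability $q$ so that the expected output stays the same.

import numpy as np

rng = np.random.default_rng(0)

def direct_dropout(a, p, train):
    """a: layer activations, p: drop probability, q = 1 - p: keep probability."""
    q = 1.0 - p
    if train:
        mask = rng.binomial(n=1, p=q, size=a.shape)  # X_i ~ Bernoulli(q)
        return mask * a                              # O_i = X_i * a_i
    return q * a                                     # test time: scale by q

a = rng.normal(size=5)
print(direct_dropout(a, p=0.5, train=True))   # some activations are zeroed
print(direct_dropout(a, p=0.5, train=False))  # all activations scaled by 0.5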

Inverted Dropout


A slightly different approach can be used: inverted Dropout. In this case we multiply the activation function by a scaling factor not at the testing stage, but during training.

The factor is the reciprocal of the probability that a neuron remains in the network, $\frac{1}{1 - p} = \frac{1}{q}$; thus:

At the training stage: $O_i = \frac{1}{q} X_i \, a\!\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$,
At the testing stage: $O_i = a\!\left(\sum_{k=1}^{d_i} w_k x_k + b\right)$.
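For comparison, a minimal sketch of the inverted variant (again hypothetical NumPy code): the surviving activations are scaled by $1/q$ already during training, so at test time the network is used unchanged.

import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(a, p, train):
    """a: layer activations, p: drop probability, q = 1 - p: keep probability."""
    q = 1.0 - p
    if train:
        mask = rng.binomial(n=1, p=q, size=a.shape)  # X_i ~ Bernoulli(q)
        return mask * a / q                          # O_i = (1/q) * X_i * a_i
    return a                                         # test time: identity

a = rng.normal(size=5)
print(inverted_dropout(a, p=0.5, train=True))   # kept activations are doubled
print(inverted_dropout(a, p=0.5, train=False))  # unchanged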

In many deep-learning frameworks Dropout is implemented exactly in this form, since it allows the model to be described once and then used for both training and testing, changing only one parameter (the Dropout coefficient).

In the case of direct Dropout we are forced to modify the neural network for testing, because without the multiplication by $q$ a neuron will return values larger than the ones the subsequent neurons expect to receive; this is why the inverted Dropout implementation is more common.

Dropout of multiple neurons


It is easy to notice that the layer $h$, consisting of $d_h$ neurons, can be viewed at a single step of the training phase as an ensemble of $d_h$ Bernoulli trials, each with success probability $p$.

Thus, at the output of layer $h$ we get the following number of dropped neurons:

$$Y = \sum_{i=1}^{d_h} (1 - X_i)$$


Since each neuron is represented by a Bernoulli-distributed random variable and all these variables are independent, the total number of dropped neurons is also a random variable, but with a binomial distribution:

$$Y \sim \mathrm{Bi}(d_h, p),$$


where the probability of $k$ successes in $n$ trials is given by the following probability mass function:

$$f(k; n, p) = \binom{n}{k} p^k (1 - p)^{n - k}$$


This formula is easy to explain: $p^k (1 - p)^{n - k}$ is the probability of obtaining one particular sequence of $k$ successes and $n - k$ failures, and the binomial coefficient $\binom{n}{k}$ counts the number of such sequences.

We can now use this distribution to calculate the probability that a given number of neurons is dropped.

When using the Dropout method, we fix the Dropout coefficient $p$ for a given layer and expect a proportional number of neurons of that layer to be dropped.

For example, if the layer we apply Dropout to consists of $n = 1024$ neurons and $p = 0.5$, we expect 512 of them to be dropped. Let's check this statement:

$$Y = \sum_{i=1}^{1024} (1 - X_i) \sim \mathrm{Bi}(1024, 0.5)$$

$$P(Y = 512) = \binom{1024}{512} \, 0.5^{512} \, (1 - 0.5)^{1024 - 512} \approx 0.025$$


As you can see, the probability of dropping exactly $np = 512$ neurons is only about 0.025!
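This value is easy to check with SciPy (a quick sanity check, assuming scipy is installed):

from scipy.stats import binom

# probability of dropping exactly 512 out of 1024 neurons when p = 0.5
print(binom.pmf(512, 1024, 0.5))  # ~0.0249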

The following Python 3 script will help you visualize how many neurons are dropped for different values of $p$ and a fixed $n$.

import matplotlib.pyplot as plt
from scipy.stats import binom
import numpy as np

# number of neurons
n = 1024
# number of tests (input examples)
size = 500
# histogram bin width, for data visualization
binwidth = 5

for p in range(1, 10):
    # per-layer drop probability
    prob = p / 10
    # generate `size` values from Bi(n, prob)
    rnd_values = binom.rvs(n, prob, size=size)
    # draw the histogram of the generated values
    plt.hist(
        rnd_values,
        bins=[x for x in range(0, n, binwidth)],
        # normalize the histogram so it shows probabilities
        density=True,
        # pick a random color
        color=np.random.rand(3),
        # label the histogram with its probability
        label=str(prob))

plt.legend(loc='upper right')
plt.show()


The binomial distribution peaks around $np$.

As we can see from the image above, for any $p$ the average number of dropped neurons is exactly $np$, i.e.

$$\mathbb{E}[\mathrm{Bi}(n, p)] = np$$


Moreover, we can notice that the distribution of values is almost symmetric around $p = 0.5$, and the probability of dropping exactly $np$ neurons increases as the distance from $p = 0.5$ grows.

The scaling factor was introduced by the authors to compensate for the activation values: during training only a fraction $1 - p$ of the neurons remains active, while at the testing stage 100% of the neurons are switched on, so the values obtained must be scaled down by this coefficient.

Dropout and other regularizers


Dropout is often used together with L2 regularization and other methods that constrain the parameters (for example, Max Norm). Regularization methods help keep the model's parameter values low.

In short, L2 regularization is an additional term of the loss function, where $\lambda \in [0, 1]$ is a hyperparameter called the regularization strength, $F(W; x)$ is the model, and $\epsilon$ is the error function between the true value $y$ and the prediction $\hat{y}$:

$$\mathcal{L}(y, \hat{y}) = \epsilon(y, F(W; x)) + \frac{\lambda}{2} \lVert W \rVert^2$$


It is easy to see that this additional term adds $\lambda w$ to the gradient and thus drives the parameter values toward zero during the gradient-descent update. If $\eta$ is the learning-rate coefficient, a parameter $w \in W$ is updated to the following value:

$$w \leftarrow w - \eta \left( \frac{\partial F(W; x)}{\partial w} + \lambda w \right)$$
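As a small illustration (a hypothetical NumPy sketch with made-up numbers; grad stands in for the gradient $\partial F / \partial w$ computed by backpropagation), the regularization term $\lambda w$ is simply added to the gradient before the step:

import numpy as np

eta = 0.1                        # learning rate
lam = 0.01                       # regularization strength (lambda)
w = np.array([0.5, -2.0])        # some parameters
grad = np.array([0.2, 0.3])      # dF/dw, pretending it came from backpropagation

# gradient-descent step with the L2 (weight-decay) term lambda * w
w = w - eta * (grad + lam * w)
print(w)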


Dropout alone cannot prevent the parameter values from growing during the update phase. Moreover, inverted Dropout makes the update steps even larger, as shown below.

Inverted Dropout and other regularizers


Since Dropout by itself does not prevent the parameters from growing and overflowing, L2 regularization (or any other regularization method that constrains the parameter values) can help us.

If we express the Dropout coefficient explicitly, the equation above will turn into the following:

$$w \leftarrow w - \eta \left( \frac{1}{q} \frac{\partial F(W; x)}{\partial w} + \lambda w \right)$$


It is easy to see that in the case of inverted Dropout the learning rate is scaled by the factor $q$. Since $q$ lies in the interval $(0, 1]$, the ratio between $\eta$ and $q$ can take values from the following range:

$$r(q) = \frac{\eta}{q} \in \left[ \eta = \lim_{q \to 1} r(q), \; +\infty = \lim_{q \to 0^+} r(q) \right)$$
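A quick numerical illustration (a sketch with an arbitrary $\eta$): the smaller $q$ is, the larger the effective learning rate becomes.

eta = 0.01  # learning rate chosen by us
for q in (1.0, 0.75, 0.5, 0.25, 0.1):
    r = eta / q  # effective learning rate r(q) = eta / q
    print(f"q = {q:>4}: effective learning rate = {r:.3f}")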


For this reason, from now on we will call $q$ the boosting factor, since it amplifies the learning rate, and $r(q)$ the effective learning rate.

The effective learning rate is higher than the learning rate we actually chose; regularization methods that constrain the parameter values can therefore simplify the process of selecting the learning rate.

Summary


  1. Dropout exists in two forms: direct (rarely used) and inverted.
  2. Dropout on a single neuron can be modeled as a random variable with a Bernoulli distribution.
  3. Dropout on a set of neurons can be modeled as a random variable with a binomial distribution.
  4. Even though the probability of dropping exactly np neurons is low, np is the average number of neurons dropped in a layer of n neurons.
  5. Inverted Dropout boosts the effective learning rate.
  6. Inverted Dropout should be used together with other regularization methods that constrain the parameter values, in order to simplify the selection of the learning rate.
  7. Dropout helps prevent overfitting in deep neural networks.

Oh, and come work with us! :)
wunderfund.io is a young fund that deals with high-frequency algorithmic trading. High-frequency trading is a continuous competition between the best programmers and mathematicians of the whole world. By joining us, you will become part of this fascinating battle.

We offer interesting and challenging data-analysis and low-latency tasks for enthusiastic researchers and programmers. Flexible schedule and no bureaucracy; decisions are made and implemented quickly.

Join our team: wunderfund.io

Source: https://habr.com/ru/post/330814/

