While reading Habr for material on neural networks and artificial intelligence in general, I came across a post about the
single-layer perceptron and decided, out of curiosity, to start studying neural networks with it and then extend the experience to the multilayer perceptron. That is what I will describe here.
Theory
The multilayer perceptron is described well enough on the
Wiki , but only its structure is covered there, so we will try it out in practice together with the learning algorithm. The learning algorithm is
also described on the Wiki , although, comparing it with several other sources (books and
aiportal.ru ), I found a few problem areas here and there.
So, the multilayer perceptron is a neural network consisting of layers, each of which consists of elements, i.e. neurons (more precisely, their models). These elements come in three types: sensory (input, S), associative (the trained "hidden" layers, A) and responsive (output, R). This type of perceptron is called multilayer not simply because it consists of several layers (the input and output layers may not even be spelled out in code at all), but because it contains
several (usually no more than two or three)
trained (A) layers .
A neuron model (we will call it simply a neuron) is a network element that has several inputs, each with its own weight. A neuron, receiving a signal, multiplies the signals by the weights and sums the resulting values, then passes the result through its activation function to another neuron or to the output of the network. Here, too, the multilayer perceptron differs. Its activation function is a
sigmoid , which produces values in the interval from 0 to 1. Several functions belong to the family of
sigmoids ; here we will mean the
logistic function . As we go through the method, you will see why it is such a good choice.
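To make this concrete, here is a minimal sketch of the logistic function and of a single neuron's output in PHP (the names sigma and neuronOutput are mine, not taken from the example at the end of the post):

function sigma($x) {
    // Logistic function: maps any real number into the interval (0, 1).
    return 1.0 / (1.0 + exp(-$x));
}

function neuronOutput(array $inputs, array $weights) {
    // Weighted sum of the inputs, passed through the sigmoid.
    $sum = 0.0;
    foreach ($inputs as $i => $value) {
        $sum += $value * $weights[$i];
    }
    return sigma($sum);
}

// For example, neuronOutput(array(1, 0, 1), array(0.5, -0.2, 0.8)) gives about 0.786.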
Several layers that can be trained (more precisely, adjusted) make it possible to approximate very complex nonlinear functions, so the scope of multilayer perceptrons is wider than that of single-layer ones.
Trying it in practice
We will move from theory to practice right away, so that it sticks better and everyone can try it.
I recommend reading the
above post first, if you are new to neural networks, of course.
So, let us take a simple task: recognizing digits without rotation or distortion. For such a task a multilayer perceptron is sufficient; moreover, it is less sensitive to noise.
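Before getting to the structure, here is one possible way to encode an input example and the expected answer; the grid size and the pixel pattern below are purely illustrative, not the encoding used in the example at the end of the post:

// A hypothetical 4x5 black-and-white image of the digit "1",
// flattened row by row into 20 network inputs (0 = white, 1 = black).
$digitOne = array(
    0, 0, 1, 0,
    0, 1, 1, 0,
    0, 0, 1, 0,
    0, 0, 1, 0,
    0, 1, 1, 1,
);

// The correct answer can be encoded as ten outputs, one per digit:
// 1 for the expected digit, 0 for all the others.
$targetOne = array(0, 1, 0, 0, 0, 0, 0, 0, 0, 0);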
Network structure
The network will have two hidden layers, each 5 times smaller than the input layer. That is, if we have 20 inputs, then each hidden layer will have 4 neurons. For this task I allow myself the liberty of choosing the number of layers and neurons empirically. I take two layers; increasing the number of layers does not improve the result.
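As a sketch (again with my own names, assuming 20 inputs and ten outputs as above), the structure and the random initialization of the weights could look like this:

$inputCount  = 20;
$hiddenCount = (int) round($inputCount / 5);   // 4 neurons per hidden layer
$outputCount = 10;                             // one output per digit (an assumption)

// Layer sizes: input, two hidden layers, output.
$layerSizes = array($inputCount, $hiddenCount, $hiddenCount, $outputCount);

// Every neuron of layer $l gets one weight per neuron of layer $l - 1,
// initialized with a small random value.
$weights = array();
for ($l = 1; $l < count($layerSizes); $l++) {
    for ($n = 0; $n < $layerSizes[$l]; $n++) {
        for ($w = 0; $w < $layerSizes[$l - 1]; $w++) {
            $weights[$l][$n][$w] = mt_rand(-100, 100) / 1000; // roughly -0.1 .. 0.1
        }
    }
}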
Learning algorithm
Neural networks of the chosen type are trained with
the error back-propagation algorithm . That is, while during the answer the layers pass the signal forward towards the output of the network, during training we compare the network's answer with the correct one, calculate the error, and then send it back through the network, from the outputs to the inputs.
We will estimate the network error as half the sum of the squared differences of the signals at the outputs. More simply: we take the sum over i of expressions of the form (t_i - o_i)^2, where t_i is the value of the i-th signal in the correct answer and o_i is the value of the i-th output of the neural network, and divide it in half. That is, we sum the squared errors on the outputs and divide the total by two. If this error ($d in the example code) is large enough (does not fit into the accuracy we need), we correct the weights of the neurons.
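In code, this error (the $d mentioned above) could be computed roughly like this; the function name is mine:

function networkError(array $target, array $output) {
    // Half the sum of squared differences between the correct answer
    // and the actual outputs of the network.
    $d = 0.0;
    foreach ($target as $i => $t) {
        $d += ($t - $output[$i]) * ($t - $output[$i]);
    }
    return $d / 2;
}

// For example, networkError(array(1, 0), array(0.8, 0.3)) gives (0.04 + 0.09) / 2 = 0.065.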
The formulas for correcting the weights are posted in full on the
wiki , so I will not repeat them here. I only want to note that I tried to follow the formulas literally, so that it is clear how this looks in practice. Here the advantage of the chosen activation function shows up: it has a simple derivative, σ'(x) = σ(x) * (1 - σ(x)), and it is used when correcting the weights. The weights of each layer are adjusted separately, layer by layer, from the last to the first. And here I made a mistake: at first I corrected the weights on each example separately, and the neural network learned to solve only that one "example". The correct way in this algorithm is to feed all the examples of the training sample to the inputs in turn; this is called an
epoch . Only at the end of an epoch do we compute the error (the total over all examples of the sample) and adjust the weights.
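To illustrate the role of the derivative, here is a simplified sketch of the correction of a single weight leading into an output neuron (my own notation, not the code from the example at the end of the post; in a real run these corrections are accumulated over the whole epoch before the weights are changed):

function sigmaDerivative($o) {
    // Derivative of the logistic function expressed through its own value:
    // sigma'(x) = sigma(x) * (1 - sigma(x)); $o is an already computed output.
    return $o * (1 - $o);
}

$eta   = 0.001;  // learning rate η (its choice is discussed just below)
$t     = 1.0;    // correct value of this output
$o     = 0.73;   // actual output of the neuron
$input = 0.9;    // signal coming in through this weight

$delta        = ($t - $o) * sigmaDerivative($o);   // error term of the output neuron
$weightChange = $eta * $delta * $input;            // added to the weight at the end of the epoch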
During training, jumps in the error are possible; that is normal. The choice of the coefficients α (which determines the effect of the weights on training) and η (which determines the influence of the correction value δ) is very important: the rate of convergence and the risk of falling into local extrema depend on it. I consider α = 0.7 and η = 0.001 the most universal, although try playing with them: increasing α and η speeds up learning, but we can overshoot the minimum.
Next, I post
an example in PHP . The code is far from ideal, but it does its job.