Chapter 2: Machine Learning
In the last chapter, we looked at real-valued circuits that computed complex expressions of their inputs (the forward pass), and we were also able to compute the gradients of these expressions with respect to the original inputs (the backward pass). In this chapter we will see how useful this rather simple mechanism is for machine learning.
Binary classification
As before, let's start simple. The simplest, most common, and yet very useful machine learning problem is binary classification. Many interesting and important problems can be reduced to it. The setup is as follows: we are given a dataset of N vectors, and each of them is labeled +1 or -1. In two dimensions, our dataset might look like this:
vector -> label
---------------
[1.2, 0.7]  -> +1
[-0.3, 0.5] -> -1
[-3, -1]    -> +1
[0.1, 1.0]  -> -1
[3.0, 1.1]  -> -1
[2.1, -3]   -> +1
Here we have N = 6 datapoints, where each point has two features (D = 2). Three of the datapoints are labeled +1 and the other three are labeled -1. This is a toy example, but in practice a +1/-1 dataset can be very useful: for example, classifying emails as spam or not spam, where the vectors somehow measure various features of an email's content, such as the number of mentions of a certain magic remedy.
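To make this concrete, here is one way the toy dataset above could be written down in code (a Python sketch of my own; the names data and labels are not from the text):

# The six 2D datapoints and their +1/-1 labels from the table above.
data = [[1.2, 0.7], [-0.3, 0.5], [-3, -1], [0.1, 1.0], [3.0, 1.1], [2.1, -3]]
labels = [1, -1, 1, -1, -1, 1]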
Goal. Our goal in binary classification is to learn a function that takes a two-dimensional vector and predicts the label. This function is usually parameterized by a set of parameters, and we want to tune those parameters so that the function's outputs agree with the labels in the given dataset. In the end, we can throw away the dataset and use the learned parameters to predict labels for previously unseen vectors.
Training protocol
We will eventually build up to entire neural networks and complex expressions, but let's start simple and train a linear classifier very similar to the single neuron we looked at at the end of Chapter 1. The only difference is that we will drop the sigmoid, because it complicates things unnecessarily (I used it only as an example in Chapter 1, since sigmoid neurons are historically popular, although modern neural networks rarely use sigmoid non-linearities). In any case, let's use a simple linear function:
f(x, y) = ax + by + c

In this expression, we treat x and y as the inputs (the 2D vectors) and a, b, c as the parameters of the function that we want to learn. For example, if a = 1, b = -2, c = -1, then the function will take the first datapoint ([1.2, 0.7]) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2.
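As a quick sketch (again Python, with names of my own choosing), the function and the worked example above look like this:

def f(x, y, a, b, c):
    # The linear classifier: f(x, y) = a*x + b*y + c
    return a * x + b * y + c

# With a = 1, b = -2, c = -1, the first datapoint [1.2, 0.7] scores:
print(f(1.2, 0.7, 1, -2, -1))  # 1*1.2 + (-2)*0.7 + (-1) = -1.2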
This is how the training will work (a code sketch of one complete step appears after this list):
1. We select a random datapoint and feed it through the circuit.
2. We interpret the output of the circuit as the confidence that the datapoint belongs to class +1 (that is, very high values mean the circuit is very certain the datapoint has class +1, and very low values mean the circuit is very certain the datapoint has class -1).
3. We measure how well the prediction agrees with the provided label. For example, if a positive example scores very low, we will want to tug on the circuit in the positive direction, demanding that it output a higher value for this datapoint. Note that this is the case for the first datapoint: it is labeled +1, but our prediction function assigns it a value of -1.2. We will therefore tug on the circuit in the positive direction; we want the value to be higher.
4. The circuit takes the tug and backpropagates it to compute tugs on the inputs a, b, c, x, y.
5. Since we think of x, y as (fixed) datapoints, we ignore the pull on x, y. If you like my physical analogies, imagine these inputs as pegs driven into the ground.
6. On the other hand, we take the parameters a, b, c and let them respond to their tug (i.e., we perform what is called a parameter update). This, of course, will make the circuit output a slightly higher value for this particular datapoint in the future.
7. Repeat! Go back to step 1.
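Putting steps 1 through 6 together, here is a minimal sketch of one training step for f(x, y) = ax + by + c. The gradients are written out by hand (df/da = x, df/db = y, df/dc = 1), and the particular rule for deciding the tug, including the margin of 1 and the step size, is just one simple choice of mine for illustration:

step_size = 0.01

def train_step(x, y, label, a, b, c):
    # Steps 1-2: forward pass, compute the score for this datapoint.
    score = a * x + b * y + c
    # Step 3: decide the tug. If a positive example scores too low we pull
    # up; if a negative example scores too high we pull down.
    pull = 0.0
    if label == 1 and score < 1:
        pull = 1.0
    if label == -1 and score > -1:
        pull = -1.0
    # Step 4: backward pass. For f = a*x + b*y + c the gradients are
    # df/da = x, df/db = y, df/dc = 1 (tugs on x and y exist too, but
    # per step 5 we simply ignore them).
    grad_a = x * pull
    grad_b = y * pull
    grad_c = 1.0 * pull
    # Step 6: parameter update, nudging a, b, c along their tug.
    a += step_size * grad_a
    b += step_size * grad_b
    c += step_size * grad_c
    return a, b, c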
The training scheme I described above is commonly referred to as Stochastic Gradient Descent. The interesting part I'd like to reiterate is that a, b, c, x, y are all made of the same stuff as far as the circuit is concerned: they are inputs to the circuit, and the circuit will tug on all of them in some direction. It does not know the difference between parameters and datapoints. However, after the backward pass is complete, we ignore all the tugs on the datapoints (x, y) and keep swapping them in and out as we iterate over the examples in the dataset. On the other hand, we keep the parameters (a, b, c) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune them so that the function outputs high values for positive examples and low values for negative examples.
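A rough sketch of the full loop, assuming the data, labels and train_step definitions from the earlier snippets: we repeatedly sample a datapoint at random, ignore the tugs on x and y, and keep updating a, b, c between iterations.

import random

random.seed(0)
a, b, c = 1.0, -2.0, -1.0   # start from some arbitrary parameter values

for iteration in range(400):
    # Steps 1/7: pick a random datapoint; x, y are swapped in and out,
    # while a, b, c persist and accumulate the updates.
    i = random.randint(0, len(data) - 1)
    x, y = data[i]
    a, b, c = train_step(x, y, labels[i], a, b, c)

# After training, positive examples should tend to score higher than
# negative ones (this toy set may not be perfectly separable by a line).
for (x, y), label in zip(data, labels):
    print(label, a * x + b * y + c)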