Contents:
Chapter 1: Real-Valued Circuits
Chapter 2: Machine Learning

As a concrete example, let's look at the SVM. The SVM is a very popular linear classifier. Its functional form is exactly as described in the previous section: f(x, y) = ax + by + c. At this point, if you have seen SVMs explained before, you probably expect me to define the SVM loss function and dive into slack variables, the geometric intuition of large margins, kernels, duality, and so on. But here I would like to take a different approach.
Instead of defining loss functions, I would like to base the explanation on the force specification (by the way, I have just invented this term) of the Support Vector Machine, which I personally find much more intuitive. As we will see later, the force specification and the loss function are identical ways of looking at the same problem. Anyway, here it is:
"
Characteristic of force ":
If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with a force of +1. This is a positive example, so we want its score to be higher.
Conversely, if we feed a negative datapoint through the SVM and the output value is greater than -1, then the circuit is giving this datapoint a dangerously high score: pull the circuit down with a force of -1.
In addition to the pulls above, always add a small pull on the parameters a, b (note: not on c!) that pulls them toward zero. You can imagine that a and b are each attached to a physical spring anchored at zero. Just as with a physical spring, this makes the pull proportional to the value of each of a, b (Hooke's law in physics, anyone?). For example, if a takes on a very high value, it will experience a strong pull of magnitude |a| back toward zero. This pull is what we call regularization, and it ensures that neither of our parameters a or b becomes disproportionately large. This would be undesirable because both a and b are multiplied by the input features x, y (remember, our equation is a*x + b*y + c), so if either of them is too high, our classifier would be overly sensitive to those features. That is not a nice property, because features are often noisy in practice, so we want our classifier to change relatively smoothly if they wiggle around.
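Written out as code, the force specification is just a few lines. Here is a minimal sketch (the helper computePulls and its plain-number parameters are only an illustration, separate from the circuit code we will write below):

// a minimal sketch of the force specification, with plain numbers instead of circuit objects
var computePulls = function(a, b, c, x, y, label) {
  var score = a*x + b*y + c;
  // pull on the circuit output
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;    // positive example scored too low: pull up
  if(label === -1 && score > -1) pull = -1; // negative example scored too high: pull down
  // regularization: springs pulling a and b toward zero (but not c)
  var reg_a = -a;
  var reg_b = -b;
  return {pull: pull, reg_a: reg_a, reg_b: reg_b};
};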
Let's quickly go through a small but concrete example. Suppose we start with some arbitrary parameter settings, say a = 1, b = -2, c = -1. Then, if we feed the point [1.2, 0.7], the SVM will compute the score 1 * 1.2 + (-2) * 0.7 - 1 = -1.2. This point is labeled +1 in the training data, so we want the score to be greater than 1. The gradient at the top of the circuit will therefore be positive: +1, and it will backpropagate to a, b, c. In addition, there will be a regularization pull on a with a force of -1 (to make it smaller) and a regularization pull on b with a force of +2 (to make it larger, toward zero).
Suppose instead that we feed the datapoint [-0.3, 0.5] to the SVM. It computes 1 * (-0.3) + (-2) * 0.5 - 1 = -2.3. The label for this point is -1, and since -2.3 is less than -1, our force specification says the SVM should be happy: the computed score is very negative, consistent with the negative label of this example. There will be no pull at the output of the circuit (i.e. it is zero), since no change is needed. However, there will still be a regularization pull on a with a force of -1 and on b with a force of +2.
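For the curious, here is a tiny check of the arithmetic in the two examples above, using the illustrative computePulls helper from the force specification sketch:

var ex1 = computePulls(1, -2, -1, 1.2, 0.7, 1);
console.log(ex1.pull);              // 1: the score -1.2 is below 1, so the positive example gets pulled up
console.log(ex1.reg_a, ex1.reg_b);  // -1 2: regularization pulls a and b toward zero

var ex2 = computePulls(1, -2, -1, -0.3, 0.5, -1);
console.log(ex2.pull);              // 0: the score -2.3 is already below -1, no pull needed
console.log(ex2.reg_a, ex2.reg_b);  // -1 2: the regularization pulls are still there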
OK, this is getting to be too much text. Let's write the SVM code, reusing the circuit machinery from Chapter 1:
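A minimal sketch of such a circuit, assuming the Unit, multiplyGate and addGate classes from Chapter 1, might look like this:

// a circuit: it takes the five Units x, y, a, b, c and outputs a single Unit
var Circuit = function() {
  // create the gates
  this.mulg0 = new multiplyGate();
  this.mulg1 = new multiplyGate();
  this.addg0 = new addGate();
  this.addg1 = new addGate();
};
Circuit.prototype = {
  forward: function(x, y, a, b, c) {
    this.ax = this.mulg0.forward(a, x);                 // a*x
    this.by = this.mulg1.forward(b, y);                 // b*y
    this.axpby = this.addg0.forward(this.ax, this.by);  // a*x + b*y
    this.axpbypc = this.addg1.forward(this.axpby, c);   // a*x + b*y + c
    return this.axpbypc;
  },
  backward: function(gradient_top) { // takes the pull from above
    this.axpbypc.grad = gradient_top;
    this.addg1.backward(); // sets gradients in axpby and c
    this.addg0.backward(); // sets gradients in ax and by
    this.mulg1.backward(); // sets gradients in b and y
    this.mulg0.backward(); // sets gradients in a and x
  }
};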
This is a circuit that simply computes a * x + b * y + c and can also compute the gradient. It uses the gate code we wrote in Chapter 1. Now let's write the SVM, which does not care about the internals of the circuit. All it cares about are the values that come out of it, and it pulls on the circuit.
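Under the same assumptions (the Unit class from Chapter 1 and the Circuit sketched above), a minimal sketch of the SVM might look like this; the step size of 0.01 is just a choice for illustration:

var SVM = function() {
  // the parameters of the SVM, with the same initial values as in our worked example
  this.a = new Unit(1.0, 0.0);
  this.b = new Unit(-2.0, 0.0);
  this.c = new Unit(-1.0, 0.0);
  this.circuit = new Circuit();
};
SVM.prototype = {
  forward: function(x, y) { // assume x and y are Units
    this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
    return this.unit_out;
  },
  backward: function(label) { // label is +1 or -1
    // reset the pulls on a, b, c
    this.a.grad = 0.0;
    this.b.grad = 0.0;
    this.c.grad = 0.0;
    // compute the pull based on what the circuit output was
    var pull = 0.0;
    if(label === 1 && this.unit_out.value < 1) {
      pull = 1;  // the score was too low for a positive example: pull up
    }
    if(label === -1 && this.unit_out.value > -1) {
      pull = -1; // the score was too high for a negative example: pull down
    }
    this.circuit.backward(pull); // writes gradients into x, y, a, b, c
    // regularization pull on the parameters: toward zero, proportional to their values
    this.a.grad += -this.a.value;
    this.b.grad += -this.b.value;
  },
  parameterUpdate: function() {
    var step_size = 0.01;
    this.a.value += step_size * this.a.grad;
    this.b.value += step_size * this.b.grad;
    this.c.value += step_size * this.c.grad;
  },
  learnFrom: function(x, y, label) {
    this.forward(x, y);      // forward pass (sets .value in all Units)
    this.backward(label);    // backward pass (sets .grad in all Units)
    this.parameterUpdate();  // the parameters respond to the pulls
  }
};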
Now let's train the SVM with Stochastic Gradient Descent:
var data = []; var labels = [];
data.push([1.2, 0.7]); labels.push(1);
data.push([-0.3, -0.5]); labels.push(-1);
data.push([3.0, 0.1]); labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]); labels.push(-1);
data.push([2.1, -3]); labels.push(1);
var svm = new SVM();
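The listing above only sets up the data; a sketch of the accuracy check and the learning loop that would produce output of the form shown below (the helper name evalTrainingAccuracy and the choice of 400 iterations with a report every 25 iterations are inferred from the printed log) might look like this:

// a function that computes the classification accuracy on the training data
var evalTrainingAccuracy = function() {
  var num_correct = 0;
  for(var i = 0; i < data.length; i++) {
    var x = new Unit(data[i][0], 0.0);
    var y = new Unit(data[i][1], 0.0);
    var true_label = labels[i];
    // the prediction is the sign of the score
    var predicted_label = svm.forward(x, y).value > 0 ? 1 : -1;
    if(predicted_label === true_label) { num_correct++; }
  }
  return num_correct / data.length;
};

// the learning loop: stochastic gradient descent
for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = new Unit(data[i][0], 0.0);
  var y = new Unit(data[i][1], 0.0);
  var label = labels[i];
  svm.learnFrom(x, y, label);

  if(iter % 25 === 0) { // report the training accuracy every 25 iterations
    console.log('training accuracy at iteration ' + iter + ': ' + evalTrainingAccuracy());
  }
}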
This code displays the following result:
training accuracy at iteration 0: 0.3333333333333333
training accuracy at iteration 25: 0.3333333333333333
training accuracy at iteration 50: 0.5
training accuracy at iteration 75: 0.5
training accuracy at iteration 100: 0.3333333333333333
training accuracy at iteration 125: 0.5
training accuracy at iteration 150: 0.5
training accuracy at iteration 175: 0.5
training accuracy at iteration 200: 0.5
training accuracy at iteration 225: 0.6666666666666666
training accuracy at iteration 250: 0.6666666666666666
training accuracy at iteration 275: 0.8333333333333334
training accuracy at iteration 300: 1
training accuracy at iteration 325: 1
training accuracy at iteration 350: 1
training accuracy at iteration 375: 1
We see that initially the training accuracy of our classifier was only 33%, but by the end of training all examples are correctly classified, as the parameters a, b, c adjust their values in response to the pulls we applied. We just trained an SVM! But please, never use this code anywhere in practice :) We will see how to do things much more efficiently once we understand what is really going on.
A note on the number of iterations needed. With this example data, this initialization, and the step size we used, it took about 300 iterations to train the SVM. In practice, it may take many more or many fewer, depending on how hard or large the problem is, how you initialize, how you normalize the data, what step size you use, and so on. This is just a toy demonstration, but later we will go over all the best practices for actually training these classifiers in practice. For example, it will turn out that the setting of the step size is very important and tricky. A small step size will make your model slow to train. A large step size will train faster, but if it is too large, it will make your classifier jump around chaotically and never converge to a good final result. In the end, we will use held-out validation data to tune the step size so that it sits right in the sweet spot for your particular data.
One thing I would like you to appreciate is that the circuit can be an arbitrary expression, not just the linear prediction function we used in this example. For example, it could be an entire neural network.
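Just to illustrate the idea (this is only a sketch, not the network we will actually build later), the score could come from a tiny neural network instead, and the exact same pulls would be applied to its output:

// an illustrative score function with two hidden "neurons";
// the pulls from the force specification would act on its output exactly as before
var neuralScore = function(x, y, a1, b1, c1, a2, b2, c2, d1, d2, d3) {
  var n1 = Math.max(0, a1*x + b1*y + c1); // hidden neuron 1
  var n2 = Math.max(0, a2*x + b2*y + c2); // hidden neuron 2
  return d1*n1 + d2*n2 + d3;              // the score
};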
By the way, I intentionally structured the code in a modular way, but we could have trained the SVM with much simpler code. Here is what all of these classes and computations really boil down to:
var a = 1, b = -2, c = -1;
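The line above only sets the initial parameters; a minimal sketch of the rest of the simplified loop, assuming the same data and labels arrays as before and again a step size of 0.01, looks like this:

for(var iter = 0; iter < 400; iter++) {
  // pick a random data point
  var i = Math.floor(Math.random() * data.length);
  var x = data[i][0];
  var y = data[i][1];
  var label = labels[i];

  // compute the pull, exactly as in the force specification
  var score = a*x + b*y + c;
  var pull = 0.0;
  if(label === 1 && score < 1) pull = 1;
  if(label === -1 && score > -1) pull = -1;

  // update the parameters: gradient of the score times the pull, plus regularization on a, b
  var step_size = 0.01;
  a += step_size * (x * pull - a); // -a is the regularization pull
  b += step_size * (y * pull - b); // -b is the regularization pull
  c += step_size * (1 * pull);     // no regularization on c
}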
This code produces exactly the same result. By now you should be able to look at the code and see where these update equations come from.
A small note: you may have noticed that the pull is always 1 or -1. You could imagine other options, for example making the pull proportional to how serious the mistake was. This leads to a variation on the SVM that some people call the squared hinge loss SVM; you will see why a little later. Depending on the characteristics of your dataset, it may work better or worse. For example, if you have very bad outliers in your data, say a negative datapoint that gets a score of +100, its influence on the classifier will be relatively minor, since we will only pull with a force of -1 no matter how serious the mistake was. In practice, we refer to this property of a classifier as robustness to outliers.
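As a sketch of what a proportional pull could look like (the helper proportionalPull and its exact form are only an illustration of the idea):

// the pull grows with the size of the mistake instead of being capped at 1 or -1
var proportionalPull = function(score, label) {
  if(label === 1 && score < 1) return 1 - score;    // e.g. a score of -3 gives a pull of +4
  if(label === -1 && score > -1) return -1 - score; // e.g. a score of +100 gives a pull of -101
  return 0.0;
};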
Let's briefly recap. We introduced the binary classification problem, where we are given N D-dimensional vectors and a label +1/-1 for each. We saw that we can combine these features with a set of parameters inside a real-valued circuit (such as the Support Vector Machine circuit in our example). Then we can repeatedly pass our data through the circuit and each time adjust the parameters so that the circuit's output value becomes consistent with the provided labels. The adjustment relies, crucially, on our ability to backpropagate gradients through the circuit. In the end, the trained circuit can be used to predict labels for unseen examples!