We are publishing a translation of the last of the currently available parts of the "book." Be sure to follow the author's blog; we will continue to publish this material as new parts appear.
Contents:

Chapter 1: Real-Valued Schemes
Chapter 2: Machine Learning

Now that we understand the basics of how these schemes work with data, let's move on to a more traditional approach, which you have probably seen elsewhere on the Internet and in other lessons and books. You are unlikely to find people talking much about force specifications. Instead, machine learning algorithms are usually described in terms of loss functions (or cost functions, or objectives).
As we write out these mathematical formulas, I would like to start being more careful about how we name our variables and parameters. I would like these equations to look the same as you might see them in books or other lessons, so I will start using more standard names.
Example: 2-dimensional support vector machine

Let's start with a two-dimensional SVM example. We are given a data set consisting of N examples (xi0, xi1) and their corresponding labels yi, which take the values +1 / −1 for positive and negative examples, respectively. Most importantly, as you remember, we have three parameters (w0, w1, w2). The SVM loss function is then defined as follows:
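Written out, and judging by the worked examples below, that loss presumably has the standard hinge form with a regularization term (treat this exact rendering as a reconstruction rather than the author's original formula):

L = Σi max(0, −yi * (w0 * xi0 + w1 * xi1 + w2) + 1) + α * (w0² + w1²)

The first term sums over all N examples, and α is the regularization strength; note that the bias w2 is not regularized.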
Note that this expression is always positive, due to the thresholding at zero in the first term and the squaring in the regularization. The idea is that we want this expression to be as small as possible. Before we get into its subtleties, let's first present it in the form of code:
var X = [ [1.2, 0.7], [-0.3, 0.5], [3, 2.5] ] // array of 2-dimensional data points
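The snippet above only lists the data points; the labels, parameters, and the cost computation itself would presumably look something like the sketch below. The weight values 0.1, 0.2, 0.3 come from the walkthrough that follows, while the variable names and the regularization strength alpha = 0.1 are assumptions of this sketch:

var y = [1, -1, 1]       // labels of the three examples above
var w = [0.1, 0.2, 0.3]  // parameters (w0, w1, w2), values taken from the walkthrough below
var alpha = 0.1          // regularization strength (assumed value)

function cost(X, y, w) {
  var total_cost = 0.0
  for (var i = 0; i < X.length; i++) {
    // score = w0 * xi0 + w1 * xi1 + w2
    var score = w[0] * X[i][0] + w[1] * X[i][1] + w[2]
    // hinge cost: zero if the example sits on the correct side of the margin
    total_cost += Math.max(0, -y[i] * score + 1)
  }
  // regularization cost: we prefer small weights (the bias w2 is not regularized)
  total_cost += alpha * (w[0] * w[0] + w[1] * w[1])
  return total_cost
}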
Note how this expression works: it measures how badly our SVM classifier performs. Let's look at this in more detail:
- The first data point xi = [1.2, 0.7] with label yi = 1 gives the result 0.1 * 1.2 + 0.2 * 0.7 + 0.3, which equals 0.56. Note that this is a positive example, so we want the result to be greater than +1; 0.56 is not high enough. Indeed, the cost expression for this data point computes costi = Math.max(0, -1 * 0.56 + 1), which is 0.44. You can think of the cost as a quantitative measure of the SVM's unhappiness.
- The second data point xi = [-0.3, 0.5] with label yi = -1 gives the result 0.1 * (-0.3) + 0.2 * 0.5 + 0.3, which equals 0.37. This does not look good: the result is much too high for a negative example; it should be less than -1. Indeed, when we compute the cost, costi = Math.max(0, 1 * 0.37 + 1), we get 1.37. This is a very high cost for this example, as it is classified incorrectly.
- The last data point xi = [3, 2.5] with label yi = 1 gives the result 0.1 * 3 + 0.2 * 2.5 + 0.3, that is, 1.1. In this case, the SVM computes costi = Math.max(0, -1 * 1.1 + 1), which is zero. This data point is classified correctly, and there is no cost associated with it.
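Adding these up with the cost function sketched above: the data part contributes 0.44 + 1.37 + 0 = 1.81, and with the assumed alpha = 0.1 the regularization adds 0.1 * (0.1 * 0.1 + 0.2 * 0.2) = 0.005, so:

cost(X, y, w)  // ≈ 1.815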
The cost function is an expression that measures how poorly your classifier works. If the training data set is perfectly classified, the cost (ignoring regularization) will be zero.
Note that the last term of the loss is the regularization cost, which says that the parameters of our model should have small values. Because of this term, the cost will never actually reach zero (since that would mean all model parameters, except the bias, are exactly zero), but the closer we get to it, the better our classifier will work.
Most cost functions in Machine Learning consist of two parts:

1. A part that measures how well the model fits the data, and
2. A regularization part, which measures some notion of how complex or likely the model is.
I hope I have convinced you that to get a very good SVM, we really want the cost to be as low as possible. Sound familiar? We know exactly what to do: the cost function written above is our scheme. We will forward all the examples through the scheme, compute the backward pass, and update all the parameters so that the scheme outputs a lower cost in the future. Specifically, we will compute the gradient and then update the parameters in the direction opposite to the gradient (since we want the cost to be low, not high).
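As a minimal sketch of what one such update could look like, using a numerical gradient (finite differences) on the cost function sketched earlier; the values of h and step_size below are assumptions, not the author's:

var h = 0.0001        // small perturbation for the finite-difference gradient estimate
var step_size = 0.01  // how far to move against the gradient

var base_cost = cost(X, y, w)
var gradient = []
for (var j = 0; j < w.length; j++) {
  w[j] += h                                     // nudge one parameter
  gradient[j] = (cost(X, y, w) - base_cost) / h // estimate the partial derivative
  w[j] -= h                                     // undo the nudge
}
for (var k = 0; k < w.length; k++) {
  w[k] -= step_size * gradient[k]               // step in the opposite direction of the gradient
}

After this update the scheme should report a slightly lower cost on the same data.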