
Pattern recognition: the beginnings of the theory

Introduction


In this article, I set out to present some of the fundamental results of machine learning theory in a way that is understandable to readers who are somewhat familiar with classification and regression problems. The idea of writing such an article grew clearer in my mind with every book I read in which the ideas of machine recognition were presented as if from the middle, with no indication of what the authors of a given method relied on when developing it. On the other hand, there are a number of books devoted to the basic concepts of machine learning, but their presentation of the material may seem too complicated for a first reading.

Motivation


Consider the following task. We have apples of two classes, tasty and not tasty: 1 and 0. Apples have features: color and size. Color varies continuously from 0 to 1 (0 is a completely green apple, 1 is completely red). Size varies in the same way (0 is a small apple, 1 is a large one). We would like to develop an algorithm that takes color and size as input and outputs the apple's class, tasty or not, making as few errors as possible. At the same time, we have a finite list with historical data on the color, size, and class of apples. How could we solve this problem?

Logical approach


The first method that may come to mind for solving our problem is this: let's manually create if-else rules and, depending on the values of color and size, assign an apple to one of the classes. That is, we have premises (color and size) and a consequence (the taste of the apple). This is quite reasonable when there are few features and the thresholds can be estimated by inspection. But it may happen that clear conditions cannot be devised, that it is not obvious from the data which thresholds to take, and that the number of features may grow over time. And what do we do if our list of historical data contains two apples with the same color and size, one marked as tasty and the other not? Thus, our first method is not as flexible and scalable as we would like.
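
A minimal sketch of such a hand-written rule (the 0.5 thresholds are arbitrary illustration values, not derived from data):

# A minimal sketch of a hand-written if-else classifier. The 0.5 thresholds
# are arbitrary illustration values, not derived from data.
def rule_based_classifier(color: float, size: float) -> int:
    """Return 1 (tasty) or 0 (not tasty) using hard-coded thresholds."""
    if color > 0.5 and size > 0.5:  # red enough and large enough
        return 1
    return 0

print(rule_based_classifier(0.9, 0.7))  # -> 1
print(rule_based_classifier(0.2, 0.7))  # -> 0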

Notation


We introduce the following notation. The i-th apple is denoted by x_i. Each x_i consists of two numbers, color and size; we denote this fact by a pair of numbers: x_i = (color, size). The class of the i-th apple is denoted by y_i. The list with historical data is denoted by the letter S, and its length is N. The i-th element of this list consists of the values of an apple's features and its class, i.e. S = \{(x_1, y_1), \dots, (x_N, y_N)\} = \{((color, size)_1, y_1), \dots, ((color, size)_N, y_N)\}. We will also call S a sample. By capital letters X and Y we denote variables ranging over the possible feature values and classes. Finally, we introduce a new concept: a decision rule a(x) is a function that takes the color and size x as input and returns a class label: a: X \rightarrow Y.
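
In code, this notation could look as follows (all numbers are made up for illustration):

# The sample S as a list of (features, class) pairs,
# S = [(x_1, y_1), ..., (x_N, y_N)], where each x_i = (color, size).
# All numbers are made up for illustration.
S = [
    ((0.9, 0.8), 1),
    ((0.1, 0.4), 0),
    ((0.7, 0.6), 1),
]
N = len(S)

# A decision rule a: X -> Y is just a function from features to a class label.
def a(x: tuple) -> int:
    color, size = x
    return 1 if color > 0.5 else 0  # a placeholder rule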

Probabilistic approach


Developing the idea of the logical method with premises and consequences, let us ask ourselves the question: what is the probability that the m-th apple, which does not belong to our sample S, will be tasty, given the measured values of color and size? In the notation of probability theory, this question can be written as:

p(Y = 1 | X = x_m) = ?


In this expression X can be interpreted as the premise and Y as the consequence, but the transition from premise to consequence obeys probabilistic rather than logical laws. That is, instead of a truth table with Boolean values 0 and 1 for the class, there are probability values ranging from 0 to 1. Applying the Bayes formula, we get the following expression:

p(Y = 1 | X = x_m) = \frac{p(X = x_m | Y = 1)\, p(Y = 1)}{p(X = x_m)}


Let us consider the right-hand side of this expression in more detail. The factor p(Y = 1) is called the prior probability and means the probability of meeting a tasty apple among all possible apples. The prior probability of meeting a tasteless apple is p(Y = 0) = 1 - p(Y = 1). This probability may reflect our personal knowledge of how tasty and tasteless apples are distributed in nature. For example, we may know from past experience that 80% of all apples are tasty. Or we can estimate this value simply by counting the share of tasty apples in our list of historical data S.

The next factor, p(X = x_m | Y = 1), shows how likely it is to observe the specific color and size values x_m for an apple of class 1. This expression is also called the likelihood function and may take the form of some specific distribution, for example the normal one. The denominator p(X = x_m) serves as a normalization constant, so that the desired probability p(Y = 1 | X = x_m) ranges from 0 to 1.

Our ultimate goal is not to find probabilities but to find a decision rule that immediately gives us a class. The final form of the decision rule depends on which values and parameters are known to us. For example, we may know only the prior probabilities, with no way to estimate the remaining values. Then the decision rule is this: assign every apple the class whose prior probability is greatest. That is, if we know that 80% of apples in nature are tasty, we assign class 1 to every apple; our error will then be 20%. If, in addition, we can estimate the likelihood function p(X = x_m | Y = 1), then we can find the desired probability p(Y = 1 | X = x_m) by the Bayes formula, as written above. The decision rule here is: assign the label of the class for which the probability p(Y = 1 | X = x_m) is maximal:

a(x_m) = 1, \ \text{if} \ p(Y = 1 | X = x_m) > p(Y = 0 | X = x_m),

a(x_m) = 0, \ \text{if} \ p(Y = 1 | X = x_m) < p(Y = 0 | X = x_m)
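
A minimal sketch of this decision rule, under made-up assumptions: a prior p(Y = 1) = 0.8 and Gaussian likelihoods with conditionally independent features (all parameters are invented for illustration):

import math

# A sketch of the max-posterior decision rule. All parameters are invented:
# prior p(Y=1) = 0.8, Gaussian class-conditional densities for color and size,
# and conditional independence of the two features (a naive Bayes assumption).
P_TASTY = 0.8  # assumed prior p(Y = 1)

def gaussian_pdf(v: float, mean: float, std: float) -> float:
    """Density of a 1-D normal distribution at point v."""
    return math.exp(-0.5 * ((v - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def likelihood(x: tuple, y: int) -> float:
    """p(X = x | Y = y) with made-up per-class means and a shared std of 0.2."""
    color, size = x
    mean_color, mean_size = (0.8, 0.7) if y == 1 else (0.3, 0.4)
    return gaussian_pdf(color, mean_color, 0.2) * gaussian_pdf(size, mean_size, 0.2)

def posterior(y: int, x: tuple) -> float:
    """p(Y = y | X = x) via the Bayes formula."""
    prior = P_TASTY if y == 1 else 1.0 - P_TASTY
    evidence = likelihood(x, 1) * P_TASTY + likelihood(x, 0) * (1.0 - P_TASTY)
    return likelihood(x, y) * prior / evidence

def bayes_classifier(x: tuple) -> int:
    return 1 if posterior(1, x) > posterior(0, x) else 0

print(bayes_classifier((0.9, 0.8)))  # a red, large apple -> 1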


This rule is called the Bayes classifier. Since we are dealing with probabilities, even a large probability p(Y = 1 | X = x_m) does not guarantee that the apple x_m does not belong to class 0. Let us estimate the probability of an error on the apple x_m as follows: if the decision rule returns a class value of 1, then the probability of a mistake is p(Y = 0 | X = x_m), and vice versa:

p(error | x_m) = p(Y = 0 | X = x_m), \ \text{if} \ a(x_m) = 1,

p(error | x_m) = p(Y = 1 | X = x_m), \ \text{if} \ a(x_m) = 0


We are interested in the probability of a classifier error not only on this particular example, but over all possible apples:

p(error) = \int p(error | x)\, p(x)\, dx


This expression is the mathematical expectation of the error p(error | x). So, solving the original problem, we have arrived at the Bayes classifier, but what are its disadvantages? The main problem is estimating the conditional probability p(X = x_m | Y = 1) from data. In our case we represent an object as a pair of numbers, color and size, but in more complex tasks the dimensionality of the feature space may be several times higher, and the number of observations in our list of historical data may not be enough to estimate the distribution of a multidimensional random variable. Next, we will try to generalize our notion of a classifier error, and also see whether some other classifier can be chosen to solve the problem.
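
Before moving on, note that the integral above can be approximated numerically. A Monte Carlo sketch, assuming the posterior() function from the previous sketch and, for lack of a known p(x), sampling the features uniformly:

import random

# A Monte Carlo sketch of p(error) = integral of p(error | x) p(x) dx.
# For the Bayes classifier, p(error | x) = min(p(Y=0 | X=x), p(Y=1 | X=x)).
# posterior() is the function from the previous sketch; sampling the features
# uniformly on [0, 1]^2 is an arbitrary stand-in for the unknown p(x).
def estimate_error(posterior, n_samples: int = 100_000) -> float:
    total = 0.0
    for _ in range(n_samples):
        x = (random.random(), random.random())  # x ~ p(x), assumed uniform here
        total += min(posterior(0, x), posterior(1, x))
    return total / n_samples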

Losses from classifier errors


Suppose we already have some decision rule a(x). It can then make two types of errors: assigning an object to class 0 when its real class is 1, and vice versa, assigning an object to class 1 when its real class is 0. In some problems it is important to distinguish these cases. For example, we may suffer more when an apple marked as tasty turns out to be tasteless than the other way around. We formalize the degree of our discomfort from deceived expectations with the concept of loss. More generally, we have a loss function that returns a number for each classifier error. Let \alpha be the real class label. Then the loss function \lambda(\alpha_i | a(x)) returns the loss value for the real class label \alpha_i and the value of our decision rule a(x). An example of using this function: we take an apple with a known class \alpha, pass the apple to our decision rule a(x), and obtain a class estimate from it; if the values of a(x) and \alpha coincide, we consider that the classifier made no mistake and there is no loss, and if they do not coincide, our function reports the magnitude of the loss \lambda(\alpha_i | a(x)).
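
For the apple example, such an asymmetric loss might be sketched as follows (the loss values 5 and 1 are arbitrary assumptions):

# A sketch of an asymmetric loss function lambda(alpha | a(x)). The values 5
# and 1 are arbitrary: here a "false tasty" (predicted 1, real class 0) is
# assumed to hurt five times more than the opposite mistake.
def loss(alpha: int, prediction: int) -> float:
    if alpha == prediction:
        return 0.0       # correct answer, no loss
    if prediction == 1:  # predicted tasty, the apple is actually not tasty
        return 5.0
    return 1.0           # predicted not tasty, the apple is actually tasty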

Conditional and Bayesian Risk


Now that we have a loss function and know how much we lose from incorrectly classifying one object x, it would be good to understand how much we lose on average, over many objects. If we know p(Y = 1 | X = x_m), the probability that the m-th apple is tasty given the measured values of color and size, as well as the real value of the class (for example, take an apple from the sample S, see the beginning of the article), then we can introduce the concept of conditional risk. The conditional risk is the average loss on the object x for the decision rule a(x):

R(a(x) | x) = \lambda(\alpha = 0 | a(x))\, p(Y = 0 | X = x) + \lambda(\alpha = 1 | a(x))\, p(Y = 1 | X = x)


In our case of binary classification, when a(x) = 1 we get:

R(a(x) = 1 | x) = \lambda(\alpha = 0 | a(x) = 1)\, p(Y = 0 | X = x) + \lambda(\alpha = 1 | a(x) = 1)\, p(Y = 1 | X = x)


And when a(x) = 0:

R(a(x) = 0 | x) = \lambda(\alpha = 0 | a(x) = 0)\, p(Y = 0 | X = x) + \lambda(\alpha = 1 | a(x) = 0)\, p(Y = 1 | X = x)


To calculate the average loss over all possible objects, the concept of Bayesian risk is introduced, which is the mathematical expectation of the conditional risk:

R(a(x)) = \int R(a(x) | x)\, p(x)\, dx


Above, we described the decision rule that assigns an object to the class with the highest probability p(Y | X). Such a rule minimizes our average loss (the Bayesian risk), so the Bayes classifier is optimal with respect to the risk functional we have introduced. This means that the Bayes classifier has the smallest possible classification error.
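
If the posterior probabilities and the loss function are available, the risk-minimizing decision can be computed directly. A minimal sketch, assuming the posteriors and a loss(alpha, prediction) function like the one above are given:

# A sketch: pick the decision k that minimizes the conditional risk
# R(a(x) = k | x) = lambda(0 | k) p(Y=0 | X=x) + lambda(1 | k) p(Y=1 | X=x).
# `posteriors` is the pair (p(Y=0 | X=x), p(Y=1 | X=x)) and `loss` is any
# loss function with the signature loss(alpha, prediction), e.g. the one above.
def min_risk_decision(posteriors: tuple, loss) -> int:
    risk = [sum(loss(alpha, k) * posteriors[alpha] for alpha in (0, 1)) for k in (0, 1)]
    return 0 if risk[0] <= risk[1] else 1

# Example: with the asymmetric loss above, a 60% chance of "tasty" is still
# not enough to risk predicting class 1:
# min_risk_decision((0.4, 0.6), loss) -> 0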

Some typical loss functions


One of the most frequently encountered loss functions is the symmetric one, in which the losses from the first and second types of errors are equivalent. For example, the 0-1 loss (zero-one loss) is defined as:

\lambda(\alpha = 1 | a(x) = 0) = 1

\lambda(\alpha = 0 | a(x) = 1) = 1

\lambda(\alpha = 1 | a(x) = 1) = 0

\lambda(\alpha = 0 | a(x) = 0) = 0


Then the conditional risk for a(x) = 1 is simply the probability of getting class 0 on the object x:

R(a(x) = 1 | x) = 1 \cdot p(Y = 0 | X = x) + 0 \cdot p(Y = 1 | X = x) = p(Y = 0 | X = x)


Similarly, for a(x) = 0:

R(a(x) = 0 | x) = p(Y = 1 | X = x)


The 0-1 loss function takes the value 1 if the classifier makes an error on an object and 0 if it does not; under it, minimizing the conditional risk means choosing the class with the larger posterior probability, so we recover exactly the Bayes classifier described above. Now let the cost of an error be given not by 1 but by some function Q that depends on the decision rule and the real class label:

\lambda(\alpha = 0 | a(x) = 1) = Q(\alpha = 0, a(x) = 1)

\lambda(\alpha = 1 | a(x) = 0) = Q(\alpha = 1, a(x) = 0)

\lambda(\alpha = 1 | a(x) = 1) = 0

\lambda(\alpha = 0 | a(x) = 0) = 0


Then the conditional risk can be written as:

R(a(x) = 0 | x) = Q(\alpha = 1, a(x) = 0)\, p(Y = 1 | X = x)


R(a(x) = 1 | x) = Q(\alpha = 0, a(x) = 1)\, p(Y = 0 | X = x)
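
Comparing these two expressions yields the corresponding decision rule: we choose a(x) = 1 exactly when its conditional risk is the smaller of the two, i.e.

a(x) = 1, \ \text{if} \ Q(\alpha = 0, a(x) = 1)\, p(Y = 0 | X = x) < Q(\alpha = 1, a(x) = 0)\, p(Y = 1 | X = x)

so the values of Q simply shift the decision threshold between the two posterior probabilities.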


Notation notes


The previous text was written using the notation adopted in the book by Duda and Hart. In the original book by V. N. Vapnik, the following process is considered: nature selects an object according to a distribution p(x) and then assigns it a class label according to the conditional distribution p(y | x). The risk (the expectation of the loss) is then defined as

R = \int L(y, a(x))\, p(x, y)\, dx\, dy


where a(x) is the function with which we try to approximate the unknown dependence, and L(y, a(x)) is the loss function for the real value y and the value of our function a(x). This notation makes it more natural to introduce the following concept: the empirical risk.

Empirical risk


At this point we have already found out that the logical method does not suit us, because it is not flexible enough, and that we cannot use the Bayes classifier when there are many features and only a limited amount of training data, so that we cannot recover the probability p(X | Y). We also know that the Bayes classifier has the smallest possible classification error. Since we cannot use the Bayes classifier, let us take something simpler: we fix some parametric family of functions H and select a classifier from this family.

Example: let H be the set of all functions of the form

h(x) = \sum_{i=1}^{n} w_i x_i


All functions in this set differ from each other only in their coefficients w. In choosing this family, we assumed that in the color-size coordinates a straight line with coefficients w can be drawn between the points of class 1 and the points of class 0 in such a way that points with different classes lie on opposite sides of the line. It is known that for a line of this kind the vector of coefficients w is normal to the line. Now we do the following: we take our apple, measure its color and size, and plot a point with the obtained coordinates on the graph in the color-size axes. Next, we consider the angle between this point's position vector and the vector w. The point can lie on either side of the line, so the angle between w and the point is either acute or obtuse, and the scalar product is correspondingly either positive or negative. Hence the decision rule:

a(x) = \mathrm{sign}\left(\sum_{i=1}^{n} w_i x_i\right)
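
A minimal sketch of this decision rule (the weights are illustrative, and the sign is mapped to the class labels {0, 1}):

# A sketch of the linear decision rule a(x) = sign(sum_i w_i x_i), with the
# sign mapped to the class labels {0, 1}. The weights are illustrative.
def linear_rule(w: list, x: list) -> int:
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if score >= 0 else 0

# Example with made-up weights: color pushes toward "tasty", size slightly against.
# linear_rule([1.0, -0.2], [0.9, 0.8]) -> 1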


After we have fixed the class of functions H, the question arises: how do we choose from it a function with the right coefficients? The answer: let us choose the function that minimizes our Bayesian risk R(a). Again, the problem is that in order to calculate the Bayesian risk we need to know the distribution p(x, y), but it is not given to us, and it is not always possible to recover it. Another idea is to minimize the risk not over all possible objects but only on the sample, i.e. to minimize the function:

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, a(x_i))


This function is called the empirical risk. The next question is: why did we decide that by minimizing the empirical risk we also minimize the Bayesian risk? Let me remind you that our task is practical: to make as few classification errors as possible. The fewer the errors, the smaller the Bayesian risk. The justification for the convergence of the empirical risk to the Bayesian risk as the amount of data grows was obtained in the 1970s by two scientists, V. N. Vapnik and A. Ya. Chervonenkis.
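
To make this concrete, here is a deliberately naive sketch of empirical risk minimization: random search over the weights of linear_rule() from the previous sketch, scoring each candidate by its average zero-one loss on the sample S (as defined earlier, a list of ((color, size), label) pairs):

import random

# A naive sketch of empirical risk minimization: try random weight vectors
# and keep the one with the smallest average zero-one loss on the sample S.
def zero_one_loss(y: int, prediction: int) -> int:
    return 0 if y == prediction else 1

def fit_by_random_search(S: list, n_tries: int = 1000):
    best_w, best_risk = None, float("inf")
    for _ in range(n_tries):
        w = [random.uniform(-1.0, 1.0) for _ in range(2)]  # candidate coefficients
        r_emp = sum(zero_one_loss(y, linear_rule(w, list(x))) for x, y in S) / len(S)
        if r_emp < best_risk:
            best_w, best_risk = w, r_emp
    return best_w, best_risk

In practice one minimizes a smooth surrogate of the zero-one loss with gradient methods, but the sketch shows the principle: among the candidate functions, keep the one with the smallest empirical risk.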

Guarantees of convergence. The simplest case


So, we have come to the conclusion that the Bayes classifier gives the smallest possible error, but in most cases we can neither train it nor calculate its error (risk). However, we can calculate an approximation to the Bayesian risk, called the empirical risk, and, knowing the empirical risk, choose an approximating function that minimizes it. Let us consider the simplest situation, in which minimizing the empirical risk yields a classifier that also minimizes the Bayesian risk. For this simplest case we have to make an assumption that rarely holds in practice but that can later be weakened. Fix a finite class of functions H = \{h: X \rightarrow Y\} from which we will choose our classifier, and suppose that the real function nature uses to label our apples by taste lies in this finite set of hypotheses: f \in H. We also have a sample S drawn from a distribution D over the objects x. All objects in the sample are assumed to be independent and identically distributed (i.i.d.). Then the following theorem holds.

Theorem


By choosing a function from the class H via minimization of the empirical risk, we are guaranteed to find an h \in H with a small Bayesian risk, provided the sample on which we perform the minimization is of sufficient size.

For the meaning of “small value” and “sufficient size”, see the literature below.

Idea of the proof


By the condition of the theorem, we obtain the sample S from the distribution D, i.e. the process of selecting objects from nature is random. Every time we collect a sample it comes from the same distribution, but the objects in it may differ. The main idea of the proof is that we may draw an unlucky sample S, such that the algorithm we choose by minimizing the empirical risk on it minimizes the empirical risk well but the Bayesian risk poorly; however, the probability of obtaining such a sample is small, and it decreases as the sample size grows. Similar theorems exist under more realistic assumptions, but we will not consider them here.

Practical results


Having a proof that, with a sufficient training sample size, the function found by minimizing the empirical risk will not have a large error on previously unseen data, we can use this principle in practice, for example as follows: we take the expression

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, a(x_i))


and substitute different loss functions depending on the problem being solved. For linear regression:

L(y, a(x)) = (y - a(x))^2


For logistic regression:

L(y, a(x)) = \log(1 + \exp(-y \, a(x)))


Although the support vector machine is mostly motivated geometrically, it too can be presented as an empirical risk minimization problem.
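
For illustration, here are sketches of these loss functions, plus the hinge loss under which the support vector machine fits the empirical risk minimization scheme; note that for the logistic and hinge losses the labels y are conventionally taken in {-1, +1} and a(x) is a raw score rather than a class label:

import math

# Sketches of the loss functions above, plus the hinge loss under which the
# support vector machine fits the empirical risk minimization scheme. For the
# logistic and hinge losses the labels y are conventionally in {-1, +1} and
# a(x) is the raw score w * x rather than a class label.
def squared_loss(y: float, score: float) -> float:      # linear regression
    return (y - score) ** 2

def logistic_loss(y: int, score: float) -> float:       # logistic regression
    return math.log(1.0 + math.exp(-y * score))

def hinge_loss(y: int, score: float) -> float:          # support vector machine
    return max(0.0, 1.0 - y * score)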

Conclusion


Many supervised learning methods can be viewed as special cases of the theory developed by V. N. Vapnik and A. Ya. Chervonenkis. This theory provides guarantees on the test-sample error, provided that the training sample is of sufficient size and that certain requirements are met by the hypothesis space in which we search for our algorithm.

References



P.S. Please message me privately about any inaccuracies and typos.

P.P.S. Many thanks to the authors of this article for the TeX editor.

Source: https://habr.com/ru/post/277241/

