
Why do we need the Ho-Kashyap algorithm?

Recently, a publication about the Ho-Kashyap algorithm (the Ho-Kashyap procedure, also known as the minimum squared error algorithm) appeared on Habré. It did not seem very clear to me, and I decided to work through the topic on my own. It turned out that the Russian-language internet does not cover the subject well, so I decided to write up an article based on what my search turned up.

Despite the boom of neural networks in machine learning, linear classification algorithms remain much simpler to use and to interpret. At the same time, sometimes you would rather not reach for anything advanced such as support vector machines or logistic regression, and the temptation is to pour all the data into one big linear least-squares regression, all the more so because even MS Excel can build one.

The problem with this approach is that even when the input data are linearly separable, the resulting classifier may fail to separate them. For example, for the set of points X = [(6, 9), (5, 7), (5, 9), (0, 10)], y = [1, 1, -1, -1] we get the dividing line 0.15x_1 - 0.43x_2 + 3.21 = 0 (example borrowed from (1)):
[Figure: the sample points and the least-squares dividing line]


The question is: can this behavior somehow be avoided?

Linear classification task


To begin, we formalize the subject of the article.

We are given a matrix X, each row X_i of which is the feature description of object i (including a constant feature equal to one), and a vector of class labels y, where y_i \in \{+1, -1\} is the label of object i. We want to build a linear classifier of the form sign(Xw) = y.

>>> import numpy as np
>>> X = np.array([[6, 9, 1], [5, 7, 1], [5, 9, 1], [0, 10, 1]])
>>> y = np.array([[1], [1], [-1], [-1]])

The easiest way to do this is to fit a least-squares regression for Xw = y, that is, to minimize the sum of squared deviations \min \sum_i (w X_i - y_i)^2. The optimal weights are given by w = X^{+} y, where X^{+} is the pseudoinverse of X. This is exactly how the picture at the beginning of the article was obtained.

>>> w = np.dot(np.linalg.pinv(X), y)
>>> w
array([[ 0.15328467],
       [-0.4379562 ],
       [ 3.2189781 ]])
>>> np.dot(X, w)
array([[ 0.19708029],
       [ 0.91970803],
       [ 0.04379562],
       [-1.16058394]])
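To see the failure explicitly, we can compare the predicted signs with the labels (using the X, y and w computed above):

>>> np.sign(np.dot(X, w)).T
array([[ 1.,  1.,  1., -1.]])
>>> np.all(np.sign(np.dot(X, w)) == y)
False

The third object gets a positive score even though its label is -1, so the least-squares plane does not separate this linearly separable sample.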

Linear separability


For convenience of notation, we multiply each row of the system Xw = y element-wise by y and call the resulting matrix on the left-hand side Y = y \times X (here \times denotes row-wise multiplication). The OLS-regression condition then takes the form Yw = 1, and the minimization problem becomes \min \sum_i (w Y_i - 1)^2.

>>> Y = y * X
>>> Y
array([[  6,   9,   1],
       [  5,   7,   1],
       [ -5,  -9,  -1],
       [  0, -10,  -1]])
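As a quick sanity check (using the w obtained from the pseudoinverse of X above), the reformulated problem gives exactly the same weights, since Y differs from X only by sign flips of whole rows:

>>> np.allclose(np.dot(np.linalg.pinv(Y), np.ones([4, 1])), w)
True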

At this point, recall that the class-separation condition is sign(Xw) = y, or equivalently sign(Yw) = 1 (that is, Yw > 0), and since we want to separate the classes, this is the problem we actually need to solve.
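Indeed, for the least-squares w found above this condition fails:

>>> np.all(np.dot(Y, w) > 0)
False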

We introduce a vector b, where b_i is the margin of element i, i.e. its distance from the dividing line (Yw = b). Since we want all elements to be classified correctly, we impose the condition b > 0. The problem then reduces to \min \sum_i (w Y_i - b_i)^2 and is solved by w = Y^{+} b. We can pick values of b by hand so that the resulting plane separates our sample:

>>> b = np.ones([4, 1])
>>> b[3] = 10
>>> w = np.dot(np.linalg.pinv(Y), b)
>>> np.dot(Y, w)
array([[  0.8540146 ],
       [  0.98540146],
       [  0.81021898],
       [ 10.02919708]])
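All components of Yw are now positive, so this w does separate the sample, which is easy to verify:

>>> np.all(np.dot(Y, w) > 0)
True
>>> np.all(np.sign(np.dot(X, w)) == y)
True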

Ho-Kashyap Algorithm


The Ho-Kashyap algorithm is designed to pick b automatically. The scheme of the algorithm (k is the step number; b^{(0)} is usually taken to be a vector of ones):

  1. Compute the OLS-regression coefficients: w^{(k)} = Y^{+} b^{(k)}.
  2. Compute the new margin vector b^{(k+1)}.
  3. If the procedure has not converged (b^{(k)} \neq b^{(k+1)}), repeat from step 1.

One would like to compute the margin vector with something like b^{(k+1)} = \max(Yw^{(k)}, 0), since that minimizes the loss function \min \sum_i (w Y_i - b_i)^2. Unfortunately, the condition b > 0 does not allow this; instead, it is proposed to take a step along the positive part of the gradient of the loss function, e^{(k)} = b^{(k)} - Yw^{(k)}: b^{(k+1)} = b^{(k)} - \mu (e^{(k)} - |e^{(k)}|), where \mu is the gradient-descent step size, which decreases in the course of solving the problem.

>>> e = -np.inf * np.ones([4, 1])
>>> b = np.ones([4, 1])
>>> while np.any(e < 0):
...     w = np.dot(np.linalg.pinv(Y), b)
...     e = b - np.dot(Y, w)
...     b = b - e * (e < 0)
...
>>> b
array([[  1.],
       [  1.],
       [  1.],
       [ 12.]])
>>> w
array([[ 2.],
       [-1.],
       [-2.]])
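The same procedure can be packaged as a reusable function. Below is a minimal sketch (not code from the original article): it uses a constant step \mu = 0.5 rather than a decreasing one, and the names ho_kashyap, mu, max_iter and tol, as well as the iteration cap that guards against the non-convergence discussed below, are my own additions.

import numpy as np

def ho_kashyap(Y, mu=0.5, max_iter=1000, tol=1e-8):
    """Sketch of the Ho-Kashyap procedure for the matrix Y = y * X."""
    n = Y.shape[0]
    b = np.ones([n, 1])               # b^(0): vector of ones
    Y_pinv = np.linalg.pinv(Y)        # the pseudoinverse does not change, compute it once
    for _ in range(max_iter):
        w = np.dot(Y_pinv, b)         # step 1: OLS coefficients for the current b
        e = b - np.dot(Y, w)          # error vector
        if np.all(e >= -tol):         # step 3: no (significantly) negative errors left
            break
        b = b - mu * (e - np.abs(e))  # step 2: increase b where e < 0
    return w, b

On the Y above it again finds a separating plane:

>>> w, b = ho_kashyap(Y)
>>> np.all(np.dot(Y, w) > 0)
True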



In the case of a linearly separable sample, the algorithm always converges, and converges to a separating plane (if all components of the gradient with respect to b are non-negative, then they are all zero).

In the case of a linearly non-separable sample, the loss function can be made arbitrarily small, since it suffices to multiply b and w by a constant to change its value, and in principle the algorithm may fail to converge. My searches did not turn up any specific recommendations on this point.

The connection between the Ho-Kashyap algorithm and linear SVM


Note that if an object is classified correctly, its error in the stated optimization problem (\min \sum_{i} (w Y_i - b_i)^2) can be driven to zero. If an object is classified incorrectly, the minimal error on it equals the square of its margin from the dividing line, (w Y_i)^2. The loss function can therefore be rewritten in the form:

\min_{w} \sum_{i} \left( \max(0, 1 - w Y_i)^2 - 2 \max(0, 1 - w Y_i) \right)

In turn, the loss function of a linear SVM is:

\min_{w} \lambda \|w\|^2 + \frac{1}{N} \sum_{i} \max(0, 1 - w Y_i)

Thus, the problem solved by the Ho-Kashyap algorithm is an analogue of an SVM with a quadratic loss function (which penalizes outliers far from the dividing plane more heavily) that ignores the width of the separating strip (that is, it looks not for the plane farthest from the nearest correctly classified elements, but for any separating plane).
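For comparison, we can run a linear SVM on the same four points. This is a sketch assuming scikit-learn is available; LinearSVC uses a squared hinge loss by default, the constant column is dropped because it fits its own intercept, and a large C is chosen so that the separable sample is classified without error:

>>> from sklearn.svm import LinearSVC
>>> svm = LinearSVC(C=100.0).fit(X[:, :2], y.ravel())
>>> np.all(svm.predict(X[:, :2]) == y.ravel())
True

Unlike the Ho-Kashyap solution, the SVM also prefers, among all separating planes, the one with the largest margin.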

Multiclass case


Recall that OLS-regression is an analogue of Fisher's two-class linear discriminant (their solutions coincide up to a constant). The Ho-Kashyap algorithm can also be applied to the case of K classes: w and b then become matrices W and B of size D \times K and N \times K respectively, where D is the dimension of the problem and N is the number of objects. In this case, the entries in the columns corresponding to the wrong classes should be negative.
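One way to read this (my own interpretation, not code from the original article) is to run the binary procedure once per class, with +1 in the true class's column of the target and -1 everywhere else; the hypothetical multiclass_ho_kashyap below builds on the ho_kashyap sketch given earlier and assumes labels is an array of class indices 0..K-1.

def multiclass_ho_kashyap(X, labels, K):
    """Hypothetical sketch: columns of W and B come from per-class runs of ho_kashyap."""
    N, D = X.shape
    W = np.zeros([D, K])
    B = np.zeros([N, K])
    for k in range(K):
        y_k = np.where(labels == k, 1, -1).reshape(-1, 1)  # +1 for class k, -1 otherwise
        w_k, b_k = ho_kashyap(y_k * X)
        W[:, k:k+1] = w_k
        B[:, k:k+1] = y_k * b_k  # entries in the "wrong" class columns become negative
    return W, B

Objects are then assigned to the class whose column gives the largest score, e.g. np.argmax(np.dot(X, W), axis=1).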

Thanks


parpalak, for the convenient editor.
rocket3, for the original article.

Links


(1) http://www.csd.uwo.ca/~olga/Courses/CS434a_541a/Lecture10.pdf
(2) http://research.cs.tamu.edu/prism/lectures/pr/pr_l17.pdf
(3) http://web.khu.ac.kr/~tskim/PatternClass Lec Note 07-1.pdf
(4) A. E. Lepsky, A. G. Bronevich. Mathematical Methods of Pattern Recognition: A Lecture Course.
(5) J. Tou, R. Gonzalez. Pattern Recognition Principles.
(6) R. Duda, P. Hart. Pattern Classification and Scene Analysis.

Source: https://habr.com/ru/post/312774/

