
How to make a binary classifier work a little bit better

Disclaimer: this post was written based on this one. I suspect that most readers are well aware of how the Naive Bayes classifier works, so before the cut I give only a brief glimpse of what it does.

Solving problems with machine learning algorithms has long been a firm part of our lives, for perfectly understandable and objective reasons: it is cheaper, simpler, and faster than explicitly coding an algorithm for each individual problem. Usually we get classifiers as “black boxes” (it is unlikely that the same VC would hand you its corpus of labeled names), which does not allow us to control them fully.
Here I would like to talk about how to try to get “better” results out of a binary classifier, what characteristics a binary classifier has, how to measure them, and how to decide that the result has indeed become “better”.

Theory. Short course

1. Binary classification

Let X be a set of objects and Y a finite set of classes. To classify an object x ∈ X means to apply a mapping a: X → Y. When |Y| = 2, such a classification is called binary, because we have only 2 options at the output. For simplicity, we will further assume that Y = {0, 1}, because absolutely any binary classification problem can be reduced to this. Usually the result of mapping an object to a class is a real number, and if it is above a given threshold t, then the object is classified as positive and its class is 1.
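As a tiny illustration of this thresholding step (the function name and threshold value are hypothetical, not from the article's code):

def to_class(score, threshold=0.5):
    # A real-valued classifier score becomes class 1 above the threshold, 0 otherwise.
    return 1 if score > threshold else 0

print(to_class(0.62))                 # 1
print(to_class(0.62, threshold=0.7))  # 0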

2. Contingency table of a binary classifier

Obviously, the predicted class of an object obtained from the mapping either matches its real class or it does not, which gives 4 options in total. This is very clearly demonstrated by the following table:

                        actual class = 1       actual class = 0
  predicted class = 1   true positive (TP)     false positive (FP)
  predicted class = 0   false negative (FN)    true negative (TN)

If we know the quantitative value of each of these outcomes, we know everything there is to know about this classifier and can dig further.
(I deliberately avoid terms like “type I error”, because they seem to me non-obvious and unnecessary.)
We will continue to use the following notation: TP for the number of true positives, FP for false positives, FN for false negatives, and TN for true negatives.
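A minimal sketch of how these four counts can be computed from true and predicted labels (the function name is mine, not from the article's code):

def confusion_counts(y_true, y_pred):
    # Count TP, FP, FN, TN for binary labels 0/1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

print(confusion_counts([1, 1, 0, 0], [1, 0, 1, 0]))  # (1, 1, 1, 1)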

3. Characteristics of a binary classifier

Precision shows what fraction of the objects predicted as positive turned out to be really positive: precision = TP / (TP + FP).

Recall shows what fraction of the total number of really positive objects was predicted as the positive class: recall = TP / (TP + FN).

These two characteristics are fundamental for any binary classifier. Which of them matters more depends entirely on the task. For example, if we want to build a “school search engine”, it is in our interest to keep “adult” content out of the search results, and here recall is much more important than precision. In the case of determining the gender of a name, we are more interested in precision than recall.
The F-measure is a characteristic that allows you to estimate precision and recall at the same time; it is the weighted harmonic mean of the two:

F = 1 / (α / precision + (1 − α) / recall)

The coefficient α sets the balance between precision and recall. When α = 0.5, the F-measure gives equal weight to both characteristics; such an F-measure is called balanced, or F1.

False Positive Rate (FPR) shows what fraction of the total number of really negative objects turned out to be incorrectly predicted as positive: FPR = FP / (FP + TN).
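A minimal Python sketch of these four formulas, using the TP/FP/FN/TN counts from the previous section (the function name and the α-weighted F-measure default are mine):

def precision_recall_f(tp, fp, fn, tn, alpha=0.5):
    # Precision, recall, weighted F-measure and FPR from the contingency counts.
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    fpr = fp / float(fp + tn) if fp + tn else 0.0
    if precision == 0.0 or recall == 0.0:
        f = 0.0
    else:
        # alpha = 0.5 gives the balanced F1; larger alpha weights precision more.
        f = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, f, fpr

print(precision_recall_f(8, 2, 4, 6))  # (0.8, ~0.667, ~0.727, 0.25)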


4. ROC curve and its AUC

The ROC curve is a graph that allows you to assess the quality of a binary classifier. It shows how TPR (recall) depends on FPR as the threshold varies. At the point (0,0) both TPR and FPR are minimal: the threshold is at its maximum and nothing is classified as positive; at (1,1) the threshold is minimal and everything is classified as positive. The ideal case for a classifier is a curve passing through the point (0,1). Obviously, the graph of this function is always monotonically non-decreasing.

AUC (Area Under Curve) is a quantitative characteristic of the ROC curve (the area under the graph): the larger, the better. AUC is equivalent to the probability that the classifier assigns a higher value to a randomly chosen positive object than to a randomly chosen negative one. That is why it was said earlier that, as a rule, the positive class is scored higher than the negative one.
When AUC = 0.5, the classifier is no better than random. If AUC < 0.5, you can simply flip the values produced by the classifier.
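To illustrate the probabilistic interpretation of AUC, here is a naive sketch that estimates it directly as the fraction of positive–negative pairs in which the positive object gets the higher score (quadratic in the sample size; the names are mine, not from the article's tooling):

def auc_by_ranking(scores, labels):
    # Estimate AUC as the probability that a random positive object scores
    # higher than a random negative one; ties count as 0.5.
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_by_ranking([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # 0.75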

Cross validation

There are many methods of cross-validation (estimating the quality of a binary classifier), and that is a topic for a separate article. Here I just want to look at one of the most popular methods in order to understand how this thing works at all and why it is needed.
Of course, it is possible to build a ROC curve for any sample. However, a ROC curve built on the training set will be shifted up and to the left due to overfitting. To avoid this and get the most objective assessment, cross-validation is used.
K-fold cross-validation: the pool is split into k folds, then each fold is used for testing while the remaining k − 1 folds are used for training. The value of the parameter k can be arbitrary; in this case I took it to be 10. For the given gender-of-name classifier, the following AUC results were obtained (get_features_simple uses one significant letter, get_features_complex uses 3 significant letters); a rough sketch of the procedure itself is shown right after the table:
fold    get_features_simple    get_features_complex
0       0.978011               0.962733
1       0.96791                0.944097
2       0.963462               0.966129
3       0.966339               0.948452
4       0.946586               0.945479
5       0.949849               0.989648
6       0.959984               0.943266
7       0.979036               0.958863
8       0.986469               0.951975
9       0.962057               0.980921
avg     0.9659703              0.9591563
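The sketch promised above: a rough outline of k-fold evaluation, where train_fn and auc_fn are placeholders for training the classifier and computing AUC on the held-out fold (none of this is the actual nbc.py code):

def k_fold_auc(pool, k, train_fn, auc_fn):
    # Split the pool into k folds; each fold is used once for testing,
    # the remaining k-1 folds for training; the per-fold AUCs are averaged.
    fold_size = len(pool) // k
    aucs = []
    for i in range(k):
        test = pool[i * fold_size:(i + 1) * fold_size]
        train = pool[:i * fold_size] + pool[(i + 1) * fold_size:]
        model = train_fn(train)
        aucs.append(auc_fn(model, test))
    return aucs, sum(aucs) / float(len(aucs))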

Practice

1. Preparation

The entire repository is here.
I took the same tagged file and replaced “m” with 1 and “f” with 0. We will use this pool for training, just like the author of the previous article. Armed with the first page of results from my favorite search engine and awk, I threw together a list of names that are not in the original one. This pool will be used for testing.
I changed the classification function so that it returns the probabilities of the positive and negative classes rather than log scores.
Classification function
from math import exp, log

def classify2(classifier, feats):
    classes, prob = classifier
    class_res = dict()
    # Negative log-probability (cost) of each class: prior plus feature likelihoods.
    for i, item in enumerate(classes.keys()):
        value = -log(classes[item]) + \
            sum(-log(prob.get((item, feat), 10**(-7))) for feat in feats)
        if (item is not None):
            class_res[item] = value
    eps = 709.0  # largest safe argument for exp() with doubles
    posVal = '1'
    negVal = '0'
    posProb = negProb = 0
    if (abs(class_res[posVal] - class_res[negVal]) < eps):
        # Convert the difference of costs into class probabilities (logistic form).
        posProb = 1.0 / (1.0 + exp(class_res[posVal] - class_res[negVal]))
        negProb = 1.0 / (1.0 + exp(class_res[negVal] - class_res[posVal]))
    else:
        # The difference is too large for exp(); saturate the probabilities.
        if class_res[posVal] > class_res[negVal]:
            posProb = 0.0
            negProb = 1.0
        else:
            posProb = 1.0
            negProb = 0.0
    return str(posProb) + '\t' + str(negProb)
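A hypothetical usage example of classify2, assuming (as the code above suggests) that the classifier is a pair of dictionaries: class priors and per-(class, feature) probabilities; the numbers here are made up:

# classes: prior probability of each class label; prob: P(feature | class).
toy_classifier = (
    {'1': 0.5, '0': 0.5},
    {('1', 'a'): 0.8, ('0', 'a'): 0.2},
)
print(classify2(toy_classifier, ['a']))  # roughly "0.8\t0.2"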

2. Implementation and use

As stated in the title, my task was to make the binary classifier work better than it does by default. In this case, we want to learn to determine the gender of a name better than Naive Bayes does with the default probability threshold of 0.5. For this, the simplest of utilities was written. It is written in C++, because gcc is everywhere. The implementation itself is nothing interesting, it seems to me. With the -? or --help option you can read the help; I tried to describe it in as much detail as possible.
And now, to what we actually set out to do: evaluating the classifier and tuning it. At the output, nbc.py produces a bunch of files with the classification results, and I use them directly below. For our purposes it is useful to look at the plots of precision vs. threshold, recall vs. threshold, and F-measure vs. threshold. They can be built as follows:
 $ ./OptimalThresholdFinder -A 3 -P 1 < names_test_pool.txt_simple -x thr -y prc -p plot_test_thr_prc_simple.txt
 $ ./OptimalThresholdFinder -A 3 -P 1 < names_test_pool.txt_simple -x thr -y tpr -p plot_test_thr_tpr_simple.txt
 $ ./OptimalThresholdFinder -A 3 -P 1 < names_test_pool.txt_simple -x thr -y fms -p plot_test_thr_fms_simple.txt
 $ ./OptimalThresholdFinder -A 3 -P 1 < names_test_pool.txt_simple -x thr -y fms -p plot_test_thr_fms_0.7_simple.txt -a 0.7

For educational purposes I built two F-measure-vs-threshold plots with different weights. The second weight was chosen to be 0.7, because in our problem we are more interested in precision than recall. The default plot is built from 10,000 different points, which is overkill for such simple data, but those are uninteresting optimization details. Building the same plot data for get_features_complex, we obtain the following results:


From the graphs it becomes obvious that the classifier shows its best results by no means at the threshold of 0.5. The F-measure graph clearly demonstrates that the "complex" feature gives the better result over the entire range of the threshold, which is quite logical, given that it is, well, more complex. Let us obtain the threshold values at which the F-measure reaches its maximum:
 $ ./OptimalThresholdFinder -A 3 -P 1 < names_test_pool.txt_simple --target fms --argument thr --argval 0
 Optimal threshold = 0.8
 Target function = 0.911937
 Argument = 0.8
 $ ./OptimalThresholdFinder -A 3 -P 1 < names_test_pool.txt_complex --target fms --argument thr --argval 0
 Optimal threshold = 0.716068
 Target function = 0.908738
 Argument = 0.716068
Agree, these values are considerably better than the ones obtained at the threshold of 0.5.
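The actual utility is written in C++; purely as an illustration of the idea, here is a Python sketch of the same brute-force search for the threshold that maximizes the F-measure, reusing the confusion_counts and precision_recall_f helpers sketched earlier (the grid of 1,000 steps is my choice, not the tool's):

def best_threshold(probs, labels, steps=1000, alpha=0.5):
    # Brute-force search over a uniform grid of thresholds for the one
    # that maximizes the weighted F-measure.
    best_thr, best_f = 0.0, -1.0
    for i in range(steps + 1):
        thr = i / float(steps)
        preds = [1 if p > thr else 0 for p in probs]
        tp, fp, fn, tn = confusion_counts(labels, preds)
        _, _, f, _ = precision_recall_f(tp, fp, fn, tn, alpha)
        if f > best_f:
            best_thr, best_f = thr, f
    return best_thr, best_f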

Conclusion

With such simple manipulations we were able to choose a Naive Bayes threshold for determining the gender of a name that works better than the default one. This raises a reasonable question: how do we decide that it works “better”? I have mentioned more than once that in this task precision is more important to us than recall, but how much more important is a very, very difficult question, so the balanced F-measure was used to evaluate the classifier's quality, which in this case can serve as an objective indicator.
What is much more interesting, the results of the binary classifier based on the “simple” and the “complex” features turned out to be approximately the same, differing only in the value of the optimal threshold.

Source: https://habr.com/ru/post/228963/

