This article is a review compiled from several sources, a full list of which is given at the end. Feature selection is an important component of machine learning, and I wanted to understand its various methods better. I enjoyed searching for information, reading articles, and watching lectures, and I want to share these materials with you. I have tried to write the article so that it requires minimal background knowledge and is accessible to beginners.
Throughout the text we are talking about supervised learning. The data set used to train a model is called the training set; the independent variables are called features, and I call the dependent variable the target variable. I assume the reader is familiar with these terms.
Feature selection
Why is feature selection necessary at all? There are two main reasons. First, with many features the classifier's running time grows. If the goal is to try several classifiers in order to pick the best one, the time required for the computations can become enormous. In addition, the training set may no longer fit into RAM, in which case the classifier algorithms have to be modified. In rare cases even a single row of the set may not fit.
The second reason is the main one: as the number of features grows, prediction accuracy often drops, especially if the data contain many garbage features (features weakly correlated with the target variable). This phenomenon is called overfitting.
Feature selection methods fall into three categories: filter methods, wrapper methods, and embedded methods.

Filter methods
Filter methods are based on statistical measures and, as a rule, consider each feature independently. They allow features to be evaluated and ranked by importance, where importance is taken to be the degree of correlation between the feature and the target variable. Let us consider a few examples.
Information gain
One example of a filter method is information gain. It is closely related to the concept of information entropy, whose formula is quite simple:

H(X) = -\sum_i p(x_i) \log_2 p(x_i)

where p(x_i) is the probability that the variable X takes the value x_i. In our setting this probability is estimated as the number of records (examples) in which X = x_i, divided by the total number of records.
Two simple examples help convey the meaning of this measure. First, tossing a fair coin, where heads and tails are equally likely: here the entropy computed by the formula equals 1. If the coin always lands heads up, the entropy equals 0. In other words, high entropy indicates a uniform distribution, while low entropy indicates a more skewed, and therefore more interesting, one.
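To make the coin example concrete, it can be checked in a few lines of plain Python (the `entropy` helper below is my own sketch for illustration, not a library function):

```python
import math

def entropy(values):
    """Shannon entropy (base 2) of a list of observed values."""
    n = len(values)
    probs = [values.count(v) / n for v in set(values)]
    return -sum(p * math.log2(p) for p in probs)

fair = ["heads", "tails", "heads", "tails"]   # equally likely outcomes
biased = ["heads", "heads", "heads", "heads"] # always heads

print(entropy(fair))    # 1.0
print(entropy(biased))  # -0.0 (i.e. zero)
```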
To measure the correlation between variables we need to define a few more quantities. The first is the specific conditional entropy:

H(Y \mid X = x_i)

that is, the entropy H(Y) computed only over those records for which X = x_i. The conditional entropy is then defined as:

H(Y \mid X) = \sum_i p(x_i) \, H(Y \mid X = x_i)
This quantity is interesting not by itself but through its difference from the ordinary entropy of the feature Y: it measures how much more ordered Y becomes for us once we know the values of X. To put it simply, it tells us whether there is a correlation between the values of X and Y, and how strong it is. This is exactly what information gain captures:

IG(Y \mid X) = H(Y) - H(Y \mid X)

The larger IG is, the stronger the correlation. Thus we can easily compute the information gain for all features and discard those that have little effect on the target variable, thereby, first, reducing the time needed to train the model and, second, reducing the risk of overfitting.
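As a sketch of how IG could be estimated from data (a toy plug-in estimator written for illustration, not a library routine):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of observed values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def information_gain(xs, ys):
    """IG(Y|X) = H(Y) - H(Y|X), estimated from paired observations."""
    n = len(xs)
    cond = 0.0
    for x, count in Counter(xs).items():
        # entropy of Y restricted to records where X == x
        subset = [y for xi, y in zip(xs, ys) if xi == x]
        cond += count / n * entropy(subset)
    return entropy(ys) - cond

# toy data: X perfectly predicts Y, so IG equals H(Y) = 1 bit
xs = [0, 0, 1, 1]
ys = ["a", "a", "b", "b"]
print(information_gain(xs, ys))  # 1.0
```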
Confusion between mutual information and information gain

On Wikipedia, the formula above is called mutual information, while information gain is used as a synonym for the Kullback–Leibler divergence. In most cases, however, information gain and mutual information are used as different names for the same thing, so the formula above may appear under either name. I will call this measure information gain simply because I am used to it.
Chi-square
Let us consider another popular feature filtering method, the chi-square test. To understand it you will need to recall a couple of formulas from probability theory. The first is the formula for the probability of the intersection (product) of events, i.e. the probability that both events A and B occur:

P(A \cap B) = P(A \mid B) \cdot P(B)

where P(A \mid B) is the probability of event A occurring given that B has already occurred. If the events are independent (the occurrence of one does not affect the probability of the other), then:

P(A \cap B) = P(A) \cdot P(B)
Based on this formula, we can compute the expected probability of both events A and B occurring simultaneously under the assumption that they are independent, and then measure how far reality deviates from that expectation. The chi-square statistic is:

\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
Let us see how it is used with an example. Suppose we want to study the effect of some exposure on the occurrence of a particular disease. Our table of statistics looks like this:
| Exposure | Disease | No disease | Total |
|---|---|---|---|
| Exposed | 37 | 13 | 50 |
| Not exposed | 17 | 53 | 70 |
| Total | 54 | 66 | 120 |
Here the cell at the intersection of the first row and first column holds the number of people who were exposed and fell ill; the cell in the first row, second column holds the number of those exposed but not ill, and so on.
Let us compute the expected value for the first cell (exposed and ill). Under independence it is the product of the row total and the column total divided by the grand total:

E = \frac{50 \cdot 54}{120} = 22.5
We do the same for the other cells and then apply the chi-square formula (in this case the statistic comes out to about 29.1).
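The whole table calculation is easy to verify in plain Python (`chi_square` below is my own illustrative helper, not a library function):

```python
def chi_square(observed):
    """Pearson's chi-square statistic for a 2D contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

table = [[37, 13],
         [17, 53]]
print(round(chi_square(table), 1))  # 29.1
```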
Thus, for independent events the chi-square statistic will be zero (or a number close to it). But to determine exactly how likely such a table is for two independent variables, one more concept is needed: the number of degrees of freedom. It is defined as

(\#\text{values of variable 1} - 1) \cdot (\#\text{values of variable 2} - 1)

where #values of variable 1 is the number of values variable 1 can take (in our case the number of degrees of freedom is 1).
Given a chi-square value and the number of degrees of freedom, the probability can be looked up in special tables (such as this one: https://www.easycalculation.com/statistics/chisquare-table.php ).
We now have some insight into how the algorithm works. Of course, in practice you will not need to implement it yourself, much less compute the statistic by hand. The scikit-learn library for Python lets you avoid thinking about the implementation details:
```python
from sklearn.feature_selection import SelectKBest, chi2

# keep the 50 features with the highest chi-square scores
select = SelectKBest(chi2, k=50)
X_new = select.fit_transform(train_data_features, train["sentiment"])
```
In my previous article you can find an example of how effective chi-square statistics can be for solving an NLP problem.
mRmR
Separately, I want to briefly discuss a more complex filtering method that takes into account not only the correlation between features and the target variable but also the redundancy among the features themselves: mRMR (minimum redundancy, maximum relevance). The method tries to maximize the following expression:

\frac{1}{|S|} \sum_{x_i \in S} I(x_i; Y) \;-\; \frac{1}{|S|^2} \sum_{x_i, x_j \in S} I(x_i; x_j)

where the first term is responsible for maximizing the correlation between the selected feature set S and the target variable Y (I denotes mutual information, as in the information gain method), and the second for minimizing the correlations among the features themselves. The resulting feature set is not only relevant; its features also repeat each other as little as possible. Features are added to the set one at a time, choosing the optimal feature at each step.
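A rough sketch of that greedy loop might look as follows (my own toy illustration with a plug-in mutual information estimate; a real mRMR implementation would use more careful estimators):

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Plug-in estimate of I(X;Y) = H(Y) - H(Y|X) from paired observations."""
    def H(vs):
        n = len(vs)
        return -sum(c / n * math.log2(c / n) for c in Counter(vs).values())
    n = len(xs)
    cond = sum(cnt / n * H([y for x2, y in zip(xs, ys) if x2 == x])
               for x, cnt in Counter(xs).items())
    return H(ys) - cond

def mrmr(features, y, k):
    """Greedily pick k feature names, balancing relevance against redundancy.

    features: dict mapping feature name -> list of values (one per record).
    """
    selected = []
    candidates = list(features)
    while len(selected) < k and candidates:
        def score(name):
            relevance = mutual_info(features[name], y)
            if not selected:
                return relevance
            redundancy = sum(mutual_info(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# x2 is a redundant copy of x1; x3 is less relevant but adds new information,
# so mRMR prefers x3 over the copy at the second step
features = {"x1": [0, 0, 1, 1, 1, 1],
            "x2": [0, 0, 1, 1, 1, 1],
            "x3": [0, 0, 0, 0, 1, 1]}
y = [0, 0, 0, 1, 1, 1]
print(mrmr(features, y, 2))  # ['x1', 'x3']
```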
Pros and cons of filtering methods
What is good about this class of methods? They are computationally cheap: the cost grows linearly with the total number of features, and they are much faster than both wrapper and embedded methods. In addition, they work well even when the number of features exceeds the number of examples in the training set, which the methods of the other categories cannot always claim.
What are their disadvantages? They consider each feature in isolation. Because of this, taking the top N most correlated features does not, in general, yield the subset on which prediction accuracy is highest. Consider a simple example.
Suppose we have an array of features that includes X1 and X2, and the target variable depends on them as

Y = X1 \oplus X2

(the logical XOR function).
The truth table looks like this (in case anyone has forgotten):

| X1 | X2 | Y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Looking at this table, let us build a contingency table for the variable X1 and compute its correlation with the variable Y using the chi-square formula (for X2 the result is the same).
| X1 | Y = 1 | Y = 0 | Total |
|---|---|---|---|
| 1 | 1 | 1 | 2 |
| 0 | 1 | 1 | 2 |
| Total | 2 | 2 | 4 |
It is easy to check that the chi-square statistic equals 0; that is, it shows no correlation at all between the feature and the target variable.
The example is exaggerated, but it illustrates well that filter methods cannot capture the joint influence of several features on the target variable.
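The same point can be demonstrated numerically with information gain (the toy estimator below is my own illustration): each of X1 and X2 alone carries zero information about Y, while the pair together determines it completely.

```python
import math
from collections import Counter

def information_gain(xs, ys):
    """IG(Y|X) = H(Y) - H(Y|X), estimated from paired observations."""
    def H(vs):
        n = len(vs)
        return -sum(c / n * math.log2(c / n) for c in Counter(vs).values())
    n = len(xs)
    cond = sum(cnt / n * H([y for x, y in zip(xs, ys) if x == key])
               for key, cnt in Counter(xs).items())
    return H(ys) - cond

x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]
y  = [0, 1, 1, 0]  # y = x1 XOR x2

print(information_gain(x1, y))                 # 0.0: x1 alone looks useless
print(information_gain(list(zip(x1, x2)), y))  # 1.0: together they determine y
```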
Wrapper methods
The essence of this category of methods is that the classifier is run on different subsets of features of the original training set, and the subset with the best performance on the training set is selected. That subset is then evaluated on a test set (the test set does not participate in choosing the optimal subset).
There are two approaches within this class: forward selection and backward elimination. Forward selection starts from an empty subset to which features are gradually added (choosing the optimal addition at each step). Backward elimination starts from a subset equal to the original feature set and gradually removes features, retraining the classifier each time.
One example of such methods is recursive feature elimination. As the name implies, it belongs to the algorithms that gradually eliminate features from the overall pool. In Python it is implemented in the scikit-learn library. The method requires choosing an estimator with which features will be evaluated, for example linear regression:
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

data = load_data()
X = data["data"]
Y = data["target"]

lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=5)  # keep the 5 best features
rfe.fit(X, Y)
```
Although elimination methods track the relationships between features better, they are much more expensive computationally. In general, all wrapper methods require far more computation than filter methods, and when the number of features is large and the training set is small, they run the risk of overfitting.
Embedded methods
Finally, embedded methods do not separate feature selection from classifier training: selection happens inside the model-fitting process itself. They also require less computation than wrapper methods (though more than filter methods).
The main method in this category is regularization. It comes in various flavors, but the basic principle is common to all of them. A classifier without regularization builds the model that best predicts all points of the training set.
For example, if the algorithm is linear regression, coefficients of a polynomial approximating the relationship between the features and the target variable are selected. The quality of the chosen coefficients is assessed by the mean squared error (MSE): the parameters are chosen so that the total deviation (more precisely, the total squared deviation) of the points predicted by the classifier from the real points is minimal.
The idea of regularization is to build an algorithm that minimizes not only the error but also the number of variables used.
Tikhonov's regularization method (ridge regression)
Let us again use linear regression as an example. If the training set consists of a feature matrix A and a target vector b, we look for a solution of the form Ax = b. The algorithm minimizes the expression:

\|Ax - b\|^2 + \alpha \|x\|^2

where the first term is just the squared error, and the second is the regularizing operator (the sum of the squares of all coefficients, multiplied by alpha). As the algorithm runs, coefficient sizes become proportional to the importance of the corresponding variables, and the variables that contribute least to reducing the error end up with near-zero coefficients.
A few words about the alpha parameter: it adjusts the contribution of the regularizing operator to the total sum. Through it we can set the priority, model accuracy versus the minimal number of variables used.
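The shrinking effect of alpha is easy to see in the simplest possible case, a single feature and no intercept, where ridge regression has a one-line closed-form solution (a toy sketch for intuition, not how scikit-learn computes it):

```python
def ridge_1d(xs, ys, alpha):
    """Closed-form ridge coefficient for one feature with no intercept:
    minimizes sum((y - w*x)^2) + alpha * w^2, giving w = sum(x*y) / (sum(x^2) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # y = 2x exactly

print(ridge_1d(xs, ys, 0.0))   # 2.0 (ordinary least squares)
print(ridge_1d(xs, ys, 14.0))  # 1.0 (a large alpha shrinks the coefficient)
```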
If you want to use linear regression with Tikhonov regularization, there is no need to reinvent the wheel: the scikit-learn library has a Ridge regression model that includes this type of regularization.
```python
from sklearn.linear_model import Ridge

data = load_data()
X = data["data"]
y = data["target"]

clf = Ridge(alpha=1.0)
clf.fit(X, y)
```
Note that the alpha parameter can be set manually.
Lasso
LASSO is similar to the previous method except for the regularizing operator: it is the sum of the absolute values of the coefficients rather than the sum of their squares. Despite the small difference, the properties differ substantially. In ridge regression, as alpha grows, all coefficients move closer and closer to zero but usually do not vanish entirely. In LASSO, as alpha grows, more and more coefficients become exactly zero and stop contributing to the model at all. Thus we get genuine feature selection: more significant features keep non-zero coefficients, less significant ones are zeroed out. You can hear more about these properties, see the plots, and learn about Elastic Net along the way (which I will not dwell on) in this lecture, for example.
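For the special case of an orthonormal feature matrix, the LASSO solution has a well-known closed form: each ordinary least squares coefficient is soft-thresholded. A toy sketch (illustrative only, under that orthonormality assumption) that shows coefficients becoming exactly zero as alpha grows:

```python
def soft_threshold(w_ols, alpha):
    """LASSO coefficient under an orthonormal design:
    the OLS coefficient is shrunk toward zero by alpha and clipped at zero."""
    sign = 1.0 if w_ols >= 0 else -1.0
    return sign * max(abs(w_ols) - alpha, 0.0)

ols = [3.0, 0.5, -2.0, 0.1]  # hypothetical OLS coefficients
for alpha in (0.0, 1.0, 2.5):
    print(alpha, [soft_threshold(w, alpha) for w in ols])
# 0.0 [3.0, 0.5, -2.0, 0.1]    (no shrinkage)
# 1.0 [2.0, 0.0, -1.0, 0.0]    (small coefficients are already zero)
# 2.5 [0.5, 0.0, -0.0, 0.0]    (only the strongest feature survives)
```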
Using this method with scikit-learn is identical to the previous one: simply replace Ridge with Lasso.
Thus, regularization is a kind of penalty on excessive model complexity, which protects against overfitting when there are garbage features among the data. Do not think that regularization exists only for linear models: boosting and neural networks have their own regularization methods.
Among the drawbacks of regularization we can note that the model is built over the entire feature array, which means it does not speed up the classifier. Overall, though, this class of methods captures interdependencies among variables better than filter methods do.
Finally
I will not write out criteria here for choosing one method over another in a particular situation. In most cases it will be easier and more convenient to turn to the embedded methods. If interpretability is needed, filter methods can provide it. The rest, I think, is a matter of practice.
I would be glad to hear comments. If, in your opinion, there is an inaccuracy in the text, something is missing, something is incomprehensible, you want to share your practical observations - write.
Links
stats.stackexchange.com/questions/13389/information-gain-mutual-information-and-related-measures
www.coursera.org/course/mmds
www.cs.binghamton.edu/~lyu/publications/Gulgezen-etal09ECML.pdf
habrahabr.ru/company/mailru/blog/254897
machinelearningmastery.com/an-introduction-to-feature-selection
ocw.jhsph.edu/courses/fundepiii/pdfs/lecture17.pdf
blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side
ai.stanford.edu/~ronnyk/wrappersPrint.pdf
www.math.kent.edu/~reichel/publications/modtikh.pdf
scikit-learn.org/stable/modules/linear_model.html