There are many posts and resources online that teach how to deal with class imbalance in classification tasks. They usually suggest sampling methods: artificially duplicating observations from the rare class, or throwing away some observations from the common class. With this post, I want to show that the "curse" of class imbalance is a myth that matters only for certain kinds of tasks.
To begin with, not all machine learning models handle imbalanced classes poorly. Most probabilistic models depend only weakly on the class balance. Problems usually arise when we move to non-probabilistic methods or to multi-class classification.
In logistic regression (and its generalizations, such as neural networks), the class balance strongly influences the intercept but has very little effect on the slope coefficients. Indeed, when the class balance changes, the odds ratio predicted by binary logistic regression changes by a constant factor, and this effect is absorbed by the intercept.
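This is easy to check empirically. Below is a minimal sketch (assuming NumPy and scikit-learn are available, with hypothetical parameter values): data is simulated from a true logistic model, and then the negative class is undersampled. Keeping only a fraction r of the negatives multiplies the odds by 1/r, so the intercept should shift by about log(1/r) while the slopes stay nearly unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulate data from a true logistic model: P(y=1|x) = sigmoid(x @ w + b)
n, w_true, b_true = 50_000, np.array([1.0, -2.0]), -1.0
X = rng.normal(size=(n, 2))
p = 1.0 / (1.0 + np.exp(-(X @ w_true + b_true)))
y = rng.binomial(1, p)

# Fit on the full data (weak regularization, so estimates are near the MLE)
full = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

# Undersample the negative class: keep only 20% of the negatives
keep = (y == 1) | (rng.random(n) < 0.2)
sub = LogisticRegression(C=1e6, max_iter=1000).fit(X[keep], y[keep])

# Slopes barely move; the intercept shifts by roughly log(1/0.2) = log(5)
print(full.coef_[0], sub.coef_[0])
print(sub.intercept_[0] - full.intercept_[0], np.log(5))
```

The two coefficient vectors come out almost identical, while the intercepts differ by approximately log(5).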
In decision trees (and their generalizations, random forests and gradient boosting), class imbalance affects the leaf impurity measures, but this effect is roughly proportional across all candidate splits, and therefore usually has little influence on which split is chosen.
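A quick way to see this is to fit a depth-one tree (a single split) before and after upsampling the rare class and compare the chosen thresholds. The sketch below (a toy setup of my own, assuming NumPy and scikit-learn) puts the positive class mostly at x > 0.7 with a little label noise; the root split lands near 0.7 in both cases.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# One informative feature: the positive class lives mostly at x > 0.7,
# with a few flipped labels as noise
x = rng.uniform(0, 1, size=(400, 1))
y = (x[:, 0] > 0.7).astype(int)
flip = rng.random(400) < 0.05
y[flip] = 1 - y[flip]

# Upsample the minority (positive) class 5x
X_up = np.vstack([x, np.repeat(x[y == 1], 4, axis=0)])
y_up = np.concatenate([y, np.repeat(y[y == 1], 4)])

t_orig = DecisionTreeClassifier(max_depth=1, random_state=0).fit(x, y)
t_up = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_up, y_up)

# Both trees pick essentially the same root split, near 0.7
print(t_orig.tree_.threshold[0], t_up.tree_.threshold[0])
```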
Non-probabilistic models such as SVM, on the other hand, can be seriously affected by class imbalance. An SVM builds its separating hyperplane so that roughly equal numbers of positive and negative examples lie on the margin or on the wrong side of it. A change in the class balance can therefore change this count, and with it the position of the boundary.
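A small illustration (again a toy setup of my own, assuming scikit-learn): on two overlapping Gaussian classes with a 20:1 imbalance, an unweighted linear SVM pushes the boundary toward the rare class and predicts very few positives, while `class_weight="balanced"` restores parity.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(2)

# Two overlapping Gaussian classes with a 20:1 imbalance
X_neg = rng.normal(loc=-1.0, size=(1000, 2))
X_pos = rng.normal(loc=+1.0, size=(50, 2))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 1000 + [1] * 50)

plain = LinearSVC(C=1.0, max_iter=10_000).fit(X, y)
balanced = LinearSVC(C=1.0, class_weight="balanced", max_iter=10_000).fit(X, y)

# The unweighted SVM predicts far fewer positives than the weighted one,
# because the imbalance has shifted its decision boundary
print(plain.predict(X).sum(), balanced.predict(X).sum())
```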
When we use probabilistic models for binary classification, everything is fine: during training the models depend little on the class balance, and during testing we can use metrics that are insensitive to it. Such metrics (for example, ROC AUC) depend on the predicted class probabilities rather than on a "hard" discrete classification.
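The balance-insensitivity of ROC AUC is easy to demonstrate: it estimates the probability that a random positive outranks a random negative, which does not depend on how many of each there are. A sketch (synthetic scores of my own, assuming scikit-learn):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# A synthetic score that separates the classes imperfectly
n = 20_000
y = rng.binomial(1, 0.3, size=n)
score = y * 1.0 + rng.normal(size=n)  # positives score higher, with noise

auc_full = roc_auc_score(y, score)

# Undersample negatives to distort the class balance of the test set
keep = (y == 1) | (rng.random(n) < 0.1)
auc_sub = roc_auc_score(y[keep], score[keep])

# Nearly identical AUC despite the heavily distorted balance
print(auc_full, auc_sub)
```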
However, metrics like ROC AUC do not generalize well to multi-class classification, and we usually fall back on plain accuracy to evaluate multi-class models. Accuracy has well-known problems with class imbalance: it is based on "hard" classification and can completely ignore rare classes. This is where many practitioners turn to sampling. But if you stay true to probabilistic predictions and use the likelihood (also known as cross-entropy) to assess model quality, class imbalance can be survived without any sampling.
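Here is a toy illustration of why cross-entropy is the better yardstick (hypothetical numbers, assuming scikit-learn). Two models both predict the majority class for every example, so accuracy cannot distinguish them, but log-loss rewards the one whose probabilities respect the rare classes:

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Toy multi-class labels with a severe imbalance: 90% / 9% / 1%
y = np.array([0] * 90 + [1] * 9 + [2] * 1)

# Model A: honest probabilities matching the class frequencies
proba_a = np.tile([0.90, 0.09, 0.01], (100, 1))

# Model B: near-certain about the majority class, dismissing the rare ones
proba_b = np.tile([0.998, 0.001, 0.001], (100, 1))

# Both argmax to class 0, so accuracy cannot tell them apart...
acc_a = accuracy_score(y, proba_a.argmax(axis=1))
acc_b = accuracy_score(y, proba_b.argmax(axis=1))

# ...but cross-entropy (log-loss) punishes B for ignoring the rare classes
ll_a = log_loss(y, proba_a, labels=[0, 1, 2])
ll_b = log_loss(y, proba_b, labels=[0, 1, 2])
print(acc_a, acc_b)  # equal
print(ll_a, ll_b)    # B is clearly worse
```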
Sampling makes sense if you do not need probabilistic classification. In that case the true class distribution is simply irrelevant to you, and you may distort it as you please. Imagine a task where you do not need to know the probability that the picture in front of you contains a cat, only whether it looks more like a picture of a cat than a picture of a dog. In such a setting you may well want cats and dogs to have the same number of "votes", even if cats made up the overwhelming majority of the training set.
In other tasks (such as fraud detection, click prediction, or my favourite, credit scoring), what you actually need is not a "hard" classification but a ranking: which customers are more prone to fraud, clicking, or default than others? In this case the class balance hardly matters, since decision thresholds are usually tuned by hand anyway, based on economic considerations such as expected losses.
In such tasks, however, it is often useful to predict the "true" probability of fraud (or click, or default), and then sampling, which distorts these probabilities, is undesirable. This is how credit default scoring models are built, for example: a gradient boosting model or a neural network is trained on the imbalanced data, and then one painstakingly checks that the predicted default probabilities match the observed ones across different segments of the portfolio.
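Such a calibration check can be sketched as a bin-wise comparison of predicted and observed default rates. The example below (simulated data standing in for a real scoring model, NumPy only) generates perfectly calibrated predictions, so within each decile the mean prediction closely matches the actual default rate:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a well-calibrated scoring model: predicted default probabilities
# are uniform, and actual defaults happen with exactly that probability
n = 20_000
p_pred = rng.uniform(0, 1, size=n)
defaults = rng.binomial(1, p_pred)

# Bin-wise calibration check: within each decile of predicted probability,
# the mean prediction should match the observed default rate
bins = np.digitize(p_pred, np.linspace(0.1, 0.9, 9))
for b in range(10):
    mask = bins == b
    print(f"bin {b}: predicted={p_pred[mask].mean():.3f} "
          f"actual={defaults[mask].mean():.3f}")
```

If the model were trained on artificially rebalanced data, every bin would show a systematic gap between the two columns.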
So think three times before worrying about class imbalance and trying to "fix" it: perhaps that precious time is better spent on feature engineering, parameter tuning, and the other equally important steps of your data analysis.
Source: https://habr.com/ru/post/349078/