
In machine learning, model quality depends heavily on the data.
But in real-world problems the data is rarely perfect: there is usually not much of it, the number of parameters available for analysis is limited, and the data is noisy and has gaps. Yet the problem still has to be solved somehow.
I want to share my practical experience of successfully solving machine learning problems and give a simple set of steps for squeezing the maximum out of the data.
Solving a data analysis task consists of two major steps:
- Data preparation.
- Building models on the prepared data.
In practice, the quality of the resulting models depends far more on the quality of the prepared data than on the choice of the model itself and its optimization.
For example, XGBoost can improve model quality by around 5% compared to a random forest, and a neural network by up to 3% compared to XGBoost. Optimization, regularization and hyperparameter selection can add another 1-5%.
But simply by adding features extracted from the data you already have, you can immediately gain up to 15% in model quality.
Building a feature space

Feature extraction expands the feature space with new data that may help improve the quality of the model but that the model cannot extract on its own.
Modern machine learning algorithms, such as neural networks, are able to find nonlinear patterns in the data on their own. But for that to happen, there must be a lot of data. Sometimes a great deal. That is not always the case, and then we can help our model.
In my work, I adhere to the following basic principles:
1. find all possible characteristics of the objects described by the model;
2. do not make assumptions about the importance of the parameters derived from those characteristics;
3. understand the parameters you extract.
Let me go through each item in more detail.
The data we train a model on describes real-world objects. Initially we have no vectors or tensors, only a complex description of each object in the sample: a phone number, the color of a package, a height, or even a smell.
Everything matters to us, and from each of these complex attributes we can extract numerical information.
We extract all the numerical information that can in any way characterize every aspect of our object.
This approach was once considered bad practice. Linear models could not handle correlated parameters, because they led to ill-conditioned matrices and unbounded growth of the weights. Today the problem of multicollinearity is largely solved by modern algorithms and regularization methods. If you have a person's height and weight, take both parameters. Yes, they are correlated, but multicollinearity is a thing of the past: just use modern algorithms and regularization.
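As a minimal sketch of this point (synthetic data, feature names made up for the example), L2 regularization keeps the weights well-behaved even when two columns are strongly correlated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 0.9 * height + rng.normal(0, 2, size=500)   # strongly correlated with height
X = np.column_stack([height, weight])
y = 0.5 * height + 0.3 * weight + rng.normal(0, 5, size=500)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("OLS coefficients:  ", ols.coef_)    # can be unstable for near-collinear columns
print("Ridge coefficients:", ridge.coef_)  # regularization keeps them bounded
```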
So go through every aspect of your object and find all of its numerical characteristics. When you are done, look again and think: have you missed anything?
I will give a couple of examples.
Suppose you have phone numbers. Seemingly useless information, yet a phone number can tell you a lot: the region of its owner, which operator the number belongs to, how common that operator is in the region, the operator's relative market share and much more. Knowing the region, you can add many parameters that characterize it, depending on the task at hand.
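A sketch of the idea: the prefix-to-operator/region table below is invented purely for illustration; in a real project the lookup would come from a numbering-plan database or an external service.

```python
import pandas as pd

# Hypothetical lookup table: prefix -> (operator, region).
PREFIX_INFO = {
    "915": ("OperatorA", "RegionX"),
    "926": ("OperatorB", "RegionY"),
}

def phone_features(number: str) -> dict:
    prefix = number.lstrip("+")[1:4]          # assumes a one-digit country code
    operator, region = PREFIX_INFO.get(prefix, ("unknown", "unknown"))
    return {"prefix": prefix, "operator": operator, "region": region}

df = pd.DataFrame({"phone": ["+79151234567", "+79261234567"]})
features = df["phone"].apply(phone_features).apply(pd.Series)
df = pd.concat([df, features], axis=1)
print(df)
```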
If you have information about the packaging, you know its geometric dimensions. The geometric characteristics include not only height, width and depth, but also their ratios, which describe the shape as well. Add the packaging material, its colors, their brightness and much, much more.
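For example, a minimal sketch of deriving volume and ratio features from package dimensions (column names are assumed for the example):

```python
import pandas as pd

packages = pd.DataFrame({
    "height_cm": [30.0, 12.0, 50.0],
    "width_cm":  [20.0, 12.0, 10.0],
    "depth_cm":  [10.0,  6.0, 10.0],
})

# Derived geometric features: volume and shape ratios.
packages["volume_cm3"] = packages["height_cm"] * packages["width_cm"] * packages["depth_cm"]
packages["h_to_w"] = packages["height_cm"] / packages["width_cm"]
packages["w_to_d"] = packages["width_cm"] / packages["depth_cm"]
print(packages)
```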
Examine the range of values of each extracted feature. In some cases the logarithm of a parameter will work much better than the parameter itself, since the logarithm characterizes the order of magnitude. If a feature spans several orders of magnitude, be sure to try its logarithm.
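A tiny sketch of this with numpy (`log1p` is used here so that zero values do not blow up; the data is made up):

```python
import numpy as np

revenue = np.array([120.0, 3_500.0, 48_000.0, 2_700_000.0])  # spans several orders of magnitude
log_revenue = np.log1p(revenue)   # log(1 + x): characterizes the order of magnitude
print(log_revenue)                # values now live on a comparable scale
```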
If your parameters are periodic, use trigonometric functions. They can give a very rich set of additional features. For example, when one of the characteristics of your object is a closed curve, trigonometric functions are a must.
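A common form of this is sine/cosine encoding of a periodic variable, sketched here for the hour of the day:

```python
import numpy as np

hours = np.arange(24)                       # periodic feature with period 24
hour_sin = np.sin(2 * np.pi * hours / 24)   # the (sin, cos) pair maps the period onto a circle,
hour_cos = np.cos(2 * np.pi * hours / 24)   # so hour 23 ends up close to hour 0
print(hour_sin[[0, 23]], hour_cos[[0, 23]])
```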
Use external data sources. The only limitation here should be the cost of obtaining the data relative to the budget of the problem being solved.
Do not make assumptions about the importance of the extracted parameters.
No matter how much domain expertise we have, we do not know all the statistical regularities. I never cease to be amazed at how things that seem unimportant at first glance improve model quality and end up at the top of the feature importance chart. In the end you will have many features that do not work at all, but you cannot know in advance which tricky combinations of seemingly unimportant parameters will work well.
The features you extract will, as a rule, not work on their own, and you will not find a correlation between each individual parameter and the target variable. But together they will work.
Finally, do not litter the feature space with meaningless features. This may seem to contradict what was written above, but there is a nuance: common sense.
If a piece of information describes the object in some way, it is useful. If you simply multiplied all the features pairwise, you most likely added no meaning, but you did square the dimension of the feature space.
You will sometimes come across the advice to multiply features pairwise. This can indeed work for a linear model: it adds non-linearity and improves the separability of the feature space. But modern algorithms, especially neural networks, do not need such an artificial and meaningless injection of nonlinearity.
That said, if you have enough computing power, you can try it and see for yourself.
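If you do want to try it, scikit-learn can generate pairwise products directly; the sketch below (random data) also shows how quickly the dimension grows:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(100, 20)                                  # 20 original features
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_pairs = poly.fit_transform(X)
print(X.shape, "->", X_pairs.shape)                          # (100, 20) -> (100, 210): 20 + C(20, 2)
```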
If you have a large set of unlabeled data and only a small labeled one, you can add features using unsupervised learning. Autoencoders work well here.
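A minimal sketch of the idea in PyTorch (the random tensors stand in for your real unlabeled and labeled data; all sizes are arbitrary): train an autoencoder to reconstruct the unlabeled data, then use the bottleneck activations as extra features for the labeled subset.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_features, n_hidden=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, n_hidden))
        self.decoder = nn.Sequential(nn.Linear(n_hidden, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        return self.decoder(self.encoder(x))

X_unlabeled = torch.randn(10_000, 20)    # stand-in for the large unlabeled set
X_labeled = torch.randn(500, 20)         # stand-in for the small labeled set

model = AutoEncoder(n_features=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(20):                  # train to reconstruct the unlabeled data
    opt.zero_grad()
    loss = loss_fn(model(X_unlabeled), X_unlabeled)
    loss.backward()
    opt.step()

with torch.no_grad():                    # bottleneck activations become new features
    extra_features = model.encoder(X_labeled)
print(extra_features.shape)              # torch.Size([500, 8])
```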
Putting the data in order

Once the data has been collected, you need to put it in order.
Some components of your feature space may turn out to be constant or to have so little variability that it has no statistical significance. Throw them away without regret.
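scikit-learn's `VarianceThreshold` does exactly this; a minimal sketch with made-up data:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[1.0, 0.0, 3.2],
              [1.0, 0.0, 1.1],
              [1.0, 0.0, 7.4]])               # first two columns are constant

selector = VarianceThreshold(threshold=1e-6)  # drop (near-)constant columns
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)                        # (3, 1)
```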
Check cross-correlations. We set the threshold on the absolute value of mutual correlation at 0.999; it may be different in your tasks. Some features can simply be expressed linearly through one another, and in that case you should keep only one of them. With a direct linear dependence there is no point in keeping both parameters of a correlated pair. Note that this applies to direct linear dependence, not to any feature that is merely some function of another.
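A sketch of dropping one feature from each highly correlated pair with pandas, using the 0.999 threshold from the text (data and helper name are made up):

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.999) -> pd.DataFrame:
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=1000)})
df["b"] = df["a"] * 2.0                       # exact linear dependence on "a"
df["c"] = rng.normal(size=1000)
print(drop_correlated(df).columns.tolist())   # ['a', 'c']
```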
And finally, compute feature importances. This should be done for two reasons.
First, frankly weak features can needlessly consume your computational resources without adding useful information.
Second, you need to find the most important features and analyze them.
It is not strictly necessary to remove the weak features: modern training methods handle large feature sets well enough. The cost is computation time.
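A sketch of computing importances with a random forest and scikit-learn's permutation importance (synthetic data, all parameters arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=15, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Permutation importance on held-out data: how much the score drops when a feature is shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.4f}")
```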
The most important features, however, need to be examined carefully. Dragging the target variable into the feature space (target leakage) is much easier than it seems at first glance, especially if the origin of the data is not fully under your control.
If, on your feature importance chart, one feature towers far above all the others, that may be a reason not for joy, but for removing it from the feature space entirely.
TL;DR
Extract all the data that can be extracted, but use common sense.
Do not try to play the expert by removing features prematurely.
Use functional transformations of your features where they are justified.
Delete statistically insignificant variables and variables strongly correlated with others.
Plot the feature importances. Consider removing the least important ones.
Study the most important ones.
If the most important features stand far apart from the rest, examine them very closely. Plot their distributions. Try to understand why they have so much influence. Consider removing them.
If you have the opportunity to test your model not only on the test set but also on real data, check it first with the suspiciously important parameters removed and then with them included, and compare the results, as in the sketch below.
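A sketch of such a check with cross-validation; the function, the column names `label` and `suspect_feature`, and the choice of gradient boosting are all hypothetical, for illustration only:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def compare_with_and_without(df: pd.DataFrame, target: str, suspect: str) -> None:
    y = df[target]
    X_full = df.drop(columns=[target])
    X_reduced = X_full.drop(columns=[suspect])
    model = GradientBoostingClassifier(random_state=0)
    score_full = cross_val_score(model, X_full, y, cv=5).mean()
    score_reduced = cross_val_score(model, X_reduced, y, cv=5).mean()
    print(f"with {suspect}:    {score_full:.4f}")
    print(f"without {suspect}: {score_reduced:.4f}")
    # A large gap that does not hold up on real data hints at target leakage.

# Usage (hypothetical column names):
# compare_with_and_without(df, target="label", suspect="suspect_feature")
```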
The recommendations given here depend on the algorithm used to build the model. I usually use neural networks; if you use logistic or linear regression, these recommendations will not apply to you exactly as stated.
The article does not address the extensive topic of data collection. Try to understand how the data for analysis was collected, and in particular pay attention to how the target variable was formed.
The length of the article does not allow me to cover every aspect, but I have tried to outline the main points.
Most machine learning publications focus on describing algorithms, but collecting and preparing data is 95% of the work of building a model. I hope this note helps you step on fewer rakes.
What methods do you use to improve the quality of your models?
Author: Valery Dmitriev (rotor). Thanks to MikeKosulin for editing :)