
An Example of Feature Engineering in Machine Learning

Hi, Habr!



In one of the previous articles, we got acquainted with the concept of Feature Engineering and its application in practice. In the comments, a wish was voiced to show with a concrete example how the art of generating features can significantly improve the quality of machine learning algorithms. I looked for a task in which this could be clearly demonstrated and found a good one: the Kaggle competition Forest Cover Type Prediction. Let us show how simple ideas that involve no machine learning at all can immediately put you in the top 10% of the leaderboard!

We will not describe the problem statement in detail; those interested can read it on the competition page. Abstracting away from the subject domain, let us just say that this is a classical multi-class classification problem. Nor will we dwell on the simple solutions that immediately come to mind, such as running standard algorithms on the full set of features, with or without preliminary feature selection: all of this yields an Accuracy of no more than 0.6-0.7.
Let's forget about complex machine learning algorithms for the time being and just look at the data in this problem. To begin with, take a look at the distributions of several features (the remaining features are binary).
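To make this step reproducible, here is a minimal sketch of how such histograms could be produced, assuming the competition's train.csv has been downloaded from Kaggle; the column names come from the competition's data description.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumes the Kaggle train.csv is in the working directory.
df = pd.read_csv("train.csv")

numeric_cols = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways", "Horizontal_Distance_To_Fire_Points",
]

# One histogram per non-binary feature.
fig, axes = plt.subplots(len(numeric_cols), 1, figsize=(6, 18))
for ax, col in zip(axes, numeric_cols):
    ax.hist(df[col], bins=50)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```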



As you can see, the features have quite different distributions (the reader is invited to ponder their physical meaning). One interesting point stands out, however: the Vertical_Distance_To_Hydrology feature has a tail extending below the value 0. So, recalling the review on Feature Engineering, it makes sense to introduce a new binary feature equal to the value of the logical expression [Vertical_Distance_To_Hydrology < 0].
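A one-line sketch of this feature, continuing the snippet above; the column name Below_Hydrology is my own choice, not from the original article:

```python
# Indicator feature: 1 where the vertical distance to water is negative.
df["Below_Hydrology"] = (df["Vertical_Distance_To_Hydrology"] < 0).astype(int)
```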

In addition, in practice it is often useful to simply apply some common sense and think about what could really affect the target variable in a particular task (as was noted in the housing-prices example in one of the previous posts), namely:

It is clear that if there are features such as Horizontal_Distance_To_Hydrology and Vertical_Distance_To_Hydrology, then the first thing that suggests itself is to compute the straight-line distance, the square root of the sum of their squares, and record it as a new feature Distance_To_Hydrology.

It is also apparent that if we have the distances from one object to two others, then (in this task) we can also treat "relative distances" as features, namely add the pairwise sums and differences of the following features (see the sketch after this list):

Horizontal_Distance_To_Hydrology
Horizontal_Distance_To_Roadways
Horizontal_Distance_To_Fire_Points

This yields 6 more new features. The reader is invited to pause here and suggest a few more ways to generate new features from the available ones.
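A hedged sketch of both ideas, continuing the snippet above; the generated column names are my own choices:

```python
import numpy as np

# Euclidean distance to water from its horizontal and vertical components.
df["Distance_To_Hydrology"] = np.sqrt(
    df["Horizontal_Distance_To_Hydrology"] ** 2
    + df["Vertical_Distance_To_Hydrology"] ** 2
)

# Pairwise sums and differences of the three horizontal distances:
# 3 pairs x 2 operations = 6 new features.
pairs = [
    ("Horizontal_Distance_To_Hydrology", "Horizontal_Distance_To_Roadways"),
    ("Horizontal_Distance_To_Hydrology", "Horizontal_Distance_To_Fire_Points"),
    ("Horizontal_Distance_To_Roadways", "Horizontal_Distance_To_Fire_Points"),
]
for a, b in pairs:
    df[a + "_plus_" + b] = df[a] + df[b]
    df[a + "_minus_" + b] = df[a] - df[b]
```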

And now for the most interesting and most visual part, the reason I chose this task. Patterns like this occur quite rarely and can be found only by studying the data for a long time; but once a non-obvious pattern is noticed, the quality of the algorithm can be improved significantly. Let's look carefully at the dependence of Elevation on Vertical_Distance_To_Hydrology:



The reader has probably gotten the feeling that the data are, in a certain sense, ordered and fairly well separated. This is one of the most beautiful patterns in this problem, and it is not that hard to notice. It is now clear that the difference between Elevation and Vertical_Distance_To_Hydrology should be treated as one of the key features. It is findings like this that often make it possible to extract the maximum from the available data. In fact, a classifier of sufficiently high quality can already be built, and so far we have not applied any machine learning techniques at all.

I will try to explain why this works so well. Machine learning algorithms as a whole (roughly speaking) solve an optimization problem in one way or another; in the case of classification, this comes down to finding the best separating surface. That is exactly what we have just done in the feature space of Elevation and Vertical_Distance_To_Hydrology: since there are 7 cover types, we can draw 6 separating lines and obtain a first approximation of our classifier.
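To illustrate, here is a toy sketch of such a first approximation. The cut points below are placeholders that would be read off the scatter plot, and the band-to-class mapping is hypothetical; the sketch is meant only to show the idea of slicing the 1-D projection into 7 bands with 6 thresholds.

```python
# The key engineered feature suggested by the scatter plot.
df["Elevation_Minus_VDH"] = df["Elevation"] - df["Vertical_Distance_To_Hydrology"]

# Toy first-approximation classifier: 6 thresholds split the projection
# into 7 bands. The values and the band-to-class mapping are placeholders.
thresholds = [2000, 2300, 2600, 2900, 3200, 3500]
df["naive_prediction"] = np.digitize(df["Elevation_Minus_VDH"], thresholds) + 1
```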

So as not to deprive the reader of the pleasure of this task, we will not dwell on the other non-obvious patterns it contains. Using only the features described above and building a random forest ensemble on them, you can reach an Accuracy close to 0.8. With this example, I wanted to show how useful it often is in practice to simply look at the data before applying complex algorithms, optimizing parameters, and writing code!
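For completeness, a minimal sketch of that final step with scikit-learn; the hyperparameters are illustrative, not tuned values from the article:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Feature matrix: the original columns plus everything engineered above
# (the toy prediction column is dropped so it does not leak into training).
X = df.drop(columns=["Id", "Cover_Type", "naive_prediction"])
y = df["Cover_Type"]

clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean())
```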

Source: https://habr.com/ru/post/249759/

