
The Art of Feature Engineering in Machine Learning

Hi, Habr!







In the previous article ("Introduction to machine learning using Python and Scikit-Learn"), we went through the basic steps of solving machine learning problems. Today we will look at techniques that can significantly improve the quality of the resulting algorithms. One of them is feature engineering. I should say right away that this is something of an art, one that can only be learned by solving a great many problems. Still, some common approaches emerge with experience, and I would like to share them in this article.



As we already know, almost any task begins with creating (engineering) and selecting (selection) features. Feature selection methods have been studied quite well, and many algorithms exist for it (we will talk about them in more detail next time). Creating features, however, is a kind of art and falls entirely on the shoulders of the data scientist. It is worth noting that this is often the hardest part in practice, and that well-chosen and well-constructed features are what make high-quality algorithms possible. Simple algorithms with well-chosen features often win on kaggle.com (excellent examples are the Heritage Provider Network Health Prize and "Feature Engineering and Classifier Ensemble for KDD Cup 2010").


Probably the best-known and clearest example of feature engineering is one many of you have already seen in Andrew Ng's course. A linear model predicts the price of a house from a number of features, among them the length and width of the house. Linear regression then predicts the price as a linear combination of width and length. But any sane person understands that the price of a house depends primarily on its area, which cannot be expressed as a linear combination of length and width. The quality of the algorithm therefore improves significantly if length and width are replaced by their product. We get a new feature that affects the price of the house most strongly, and we also reduce the dimension of the feature space. In my opinion, this is the simplest and most vivid example of creating features.

Note that it is very difficult to come up with a general method for constructing features for any given task. That is why the post has "art" in its title. Nevertheless, there are a number of simple methods and techniques from my own experience that I would like to share.
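Before moving on, the house example can be made concrete with a minimal sketch. The data here is a hypothetical toy set whose prices were made up to equal 1000 times the area, and the helper names `add_area` and `fit_through_origin` are my own, introduced only for illustration.

```python
# Hypothetical toy data: (length_m, width_m, price). The prices are made up
# so that price = 1000 * area, purely for illustration.
houses = [
    (10.0, 5.0, 50_000.0),
    (8.0, 8.0, 64_000.0),
    (12.0, 6.0, 72_000.0),
    (20.0, 4.0, 80_000.0),
]


def add_area(rows):
    """Replace (length, width) with the single feature area = length * width."""
    return [(length * width, price) for length, width, price in rows]


def fit_through_origin(pairs):
    """Least-squares slope for the model price ~ slope * x (no intercept)."""
    numerator = sum(x * y for x, y in pairs)
    denominator = sum(x * x for x, _ in pairs)
    return numerator / denominator


area_rows = add_area(houses)
slope = fit_through_origin(area_rows)

# With the engineered feature, a one-parameter linear model fits this toy data.
residuals = [price - slope * area for area, price in area_rows]
```

On this toy data a single coefficient on the area feature reproduces the prices, while no weighting of raw length and width could, since the fourth house is the longest but not the widest.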



Categorical features



Suppose our objects have attributes that take values from a finite set. For example, a color, which may be blue, red or green, or may be unknown. In this case it is useful to add features such as is_red, is_blue, is_green, is_red_or_blue and other possible combinations.
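A minimal sketch of such indicator features might look like this. The function name `color_indicators` and the extra `is_unknown` flag are my own assumptions, and an unknown color is represented here as `None`.

```python
def color_indicators(color):
    """Turn one categorical value into binary indicator features.

    `color` is a string such as "red", or None when the value is unknown
    (that representation is an assumption of this sketch).
    """
    return {
        "is_red": int(color == "red"),
        "is_blue": int(color == "blue"),
        "is_green": int(color == "green"),
        # A combined indicator, as mentioned in the text.
        "is_red_or_blue": int(color in ("red", "blue")),
        # Hypothetical extra flag so that "unknown" is not silently lost.
        "is_unknown": int(color is None),
    }


features = [color_indicators(c) for c in ["red", "green", None]]
```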



Dates and times



If the features include a date or time, it usually helps to add features for the time of day, the time elapsed since some reference moment, the season, or the quarter. It also helps to split the time into hours, minutes and seconds (if the time is given as a Unix timestamp or in ISO format). There are many options here, and the right ones depend on the specific task.
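As a rough sketch, assuming the time arrives as a Unix timestamp and is interpreted in UTC, a few of these features can be derived with the standard library. The function name, the feature names, the night cutoff and the northern-hemisphere season mapping are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def time_features(unix_ts, reference_ts=0):
    """Derive simple calendar features from a Unix timestamp (UTC assumed)."""
    moment = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    # Northern-hemisphere seasons: Dec-Feb winter, Mar-May spring, and so on.
    seasons = ("winter", "spring", "summer", "autumn")
    return {
        "hour": moment.hour,
        "minute": moment.minute,
        "second": moment.second,
        "weekday": moment.weekday(),  # Monday is 0
        "quarter": (moment.month - 1) // 3 + 1,
        "season": seasons[(moment.month % 12) // 3],
        # Hypothetical cutoff: treat 00:00-05:59 as night.
        "is_night": int(moment.hour < 6),
        "seconds_since_reference": unix_ts - reference_ts,
    }
```

For an ISO-formatted string, `datetime.fromisoformat` can produce the same `datetime` object before the features are extracted.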



Numeric variables



If a variable is real-valued, rounding it or splitting it into integer and fractional parts (with subsequent normalization) often helps. Alternatively, it often helps to convert a numeric feature into a categorical one. For example, if there is a feature such as mass, you can introduce features of the form "mass is greater than X" or "mass is between X and Y".
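A small sketch of these transformations for a mass feature follows. The thresholds `low` and `high` stand in for X and Y and are arbitrary placeholder values, and the function and feature names are hypothetical.

```python
import math


def numeric_features(mass, low=50.0, high=100.0):
    """Derive rounded, split and thresholded features from one real value.

    `low` and `high` play the role of X and Y; the defaults are arbitrary.
    """
    fractional, whole = math.modf(mass)
    return {
        "mass_rounded": round(mass),
        "mass_whole": int(whole),
        "mass_fraction": fractional,
        # Numeric-to-categorical: threshold and interval indicators.
        "mass_gt_high": int(mass > high),
        "mass_between": int(low <= mass <= high),
    }
```

In practice the thresholds would come from the data or from domain knowledge rather than being fixed in advance.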



Processing string features



If there is a feature whose value is a string from a finite set, do not forget that the strings themselves often contain information. A good example is the Titanic: Machine Learning from Disaster task, where the passengers' names include the titles "Mr.", "Mrs." and "Miss.", from which it is easy to extract a gender feature.
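A sketch of the title extraction could look like the following, assuming names are written as "Surname, Title. Given names". The regular expression, the mapping and the function names are my own and cover only the three titles mentioned above.

```python
import re

# Assumes names look like "Surname, Title. Given names".
TITLE_PATTERN = re.compile(r",\s*([A-Za-z]+)\.")

# Only the three titles mentioned in the text; other titles map to None.
TITLE_TO_GENDER = {"Mr": "male", "Mrs": "female", "Miss": "female"}


def extract_title(name):
    """Return the title found after the surname, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


def gender_from_name(name):
    """Map the extracted title to a gender label, or None when unknown."""
    return TITLE_TO_GENDER.get(extract_title(name))
```

The title itself can also be kept as a separate categorical feature, since it carries more than gender alone.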



Results of other algorithms



You can often also add the output of other algorithms as a feature. For example, when solving a classification problem, you can first solve an auxiliary clustering problem and use each object's cluster as a feature in the original problem. This idea usually comes out of exploratory data analysis, when the objects turn out to cluster well.
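To illustrate the idea, here is a deliberately tiny k-means written with the standard library only. It is a sketch under simplifying assumptions (initial centroids are just the first k points, the iteration count is fixed, and the points are made up); in real work a library implementation such as scikit-learn's KMeans would be the natural choice.

```python
def kmeans_labels(points, k, iterations=10):
    """Tiny k-means for illustration; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels


# Hypothetical objects with two numeric features each.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1)]
labels = kmeans_labels(points, k=2)

# Append the cluster label to each object as an extra feature.
augmented = [p + (label,) for p, label in zip(points, labels)]
```

The cluster label is categorical, so it would normally be turned into indicator features before being fed to a linear model.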



Aggregated features



It also makes sense to add features that aggregate several attributes of an object, which also reduces the dimension of the feature description. As a rule, this is useful in tasks where one object contains several parameters of the same type. For example, a person who owns several cars of different value. In this case you can use features corresponding to the maximum, minimum and average cost of that person's cars.
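The car example could be sketched as follows. The function and feature names are hypothetical, and both the extra `car_count` feature and the zero defaults for a person without cars are assumptions of this sketch rather than part of the technique.

```python
from statistics import mean


def car_cost_features(car_costs):
    """Aggregate a variable-length list of car costs into fixed-size features."""
    if not car_costs:
        # Assumption: a person without cars gets zeros for every aggregate.
        return {
            "car_count": 0,
            "car_cost_max": 0.0,
            "car_cost_min": 0.0,
            "car_cost_mean": 0.0,
        }
    return {
        "car_count": len(car_costs),
        "car_cost_max": max(car_costs),
        "car_cost_min": min(car_costs),
        "car_cost_mean": mean(car_costs),
    }
```

Whatever the number of cars, each person ends up with the same four features, which is what makes the description usable by a standard model.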



Adding new features



This point applies more to practical, real-life tasks than to machine learning competitions. A separate article will cover it in more detail; for now we only note that to solve a problem effectively you need to be an expert in the specific field and understand what influences the target variable. Returning to the housing price example, everyone knows that the price depends primarily on the area, but in a more complex subject area such conclusions are much harder to draw.



So, we have looked at several techniques for creating (engineering) features in machine learning tasks, which can help significantly improve the quality of existing algorithms. Next time we will talk in more detail about feature selection. Fortunately, things are simpler there, because well-developed selection techniques already exist, whereas creating features, as the reader has probably noticed, is an art!

Source: https://habr.com/ru/post/248129/


