
Differences Between LabelEncoder and OneHotEncoder in SciKit Learn

If you have recently begun your machine learning journey, you may be confused by LabelEncoder and OneHotEncoder. Both encoders are part of the SciKit Learn library in Python, and both convert categorical or textual data into numbers that our predictive models understand better. Let's look at the differences between the two encoders with a simple example.




Feature encoding


First of all, the SciKit Learn documentation for LabelEncoder can be found here. Now consider the following data:


Data from SuperDataScience

In this example, the first column (country) is entirely text. As you may already know, we cannot use raw text to train a model. Therefore, before we can begin the process, we need to prepare this data.


To convert such categories into numerical data the model can understand, we use the LabelEncoder class. All we need to do to encode the first column is import the class from the sklearn library, process the column with the fit_transform function, and replace the existing text data with the encoded values. Let's look at the code.


from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
x[:, 0] = labelencoder.fit_transform(x[:, 0])

It is assumed that the data is stored in the variable x. After running the code above, if you check the value of x, you will see that the three countries in the first column have been replaced by the numbers 0, 1 and 2.
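A minimal, runnable sketch of this step; the sample country values below are illustrative, not the article's actual dataset:

```python
# Label-encode a column of country names. The sample values are
# illustrative stand-ins for the dataset shown above.
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain", "France"]
labelencoder = LabelEncoder()
encoded = labelencoder.fit_transform(countries)

# LabelEncoder assigns integers to classes in sorted (alphabetical) order:
# France -> 0, Germany -> 1, Spain -> 2
print(list(encoded))                # [0, 2, 1, 2, 0]
print(list(labelencoder.classes_))  # ['France', 'Germany', 'Spain']
```

Note that the integer assigned to each country follows the alphabetical order of the class names, not the order in which they appear in the data.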



In general, that is all there is to feature encoding. But depending on the data, this conversion creates a new problem: we translated a set of countries into a set of numbers, yet the data is purely categorical, and there is no real relationship between the numbers.


The problem is that, since different numbers now sit in the same column, the model will incorrectly assume the data has some meaningful order: 0 < 1 < 2. Of course, that is not the case at all. To solve this problem, we use OneHotEncoder.
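A tiny sketch of why integer label codes mislead: the codes support comparisons and arithmetic even though the categories have no order. The mapping below mirrors LabelEncoder's alphabetical assignment.

```python
# Integer codes mirror LabelEncoder's alphabetical assignment.
codes = {"France": 0, "Germany": 1, "Spain": 2}

# To a distance-based or linear model, these statements look meaningful,
# even though no such relationship exists between the countries:
print(codes["France"] < codes["Spain"])  # True
print(codes["Spain"] - codes["France"])  # 2 -- a fictitious "distance"
```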




OneHotEncoder


If you are interested in the documentation, you can find it here. As we have already discussed, depending on the data we have, feature encoding can confuse our model into falsely assuming an order or hierarchy in the data that is not really there. To avoid this, we use OneHotEncoder.


This encoder takes a column of categorical data that has already been label-encoded and creates several new columns from it. The numbers are replaced by ones and zeros, depending on which value each row has. In our example, we get three new columns, one for each country: France, Germany and Spain.


For rows whose first column is France, the "France" column will be set to 1 and the other two columns to 0. Similarly, for rows whose first column is Germany, the "Germany" column will be 1 and the other two columns will be 0.
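The row-to-columns mapping described above can be sketched by applying OneHotEncoder directly to a single country column; the sample rows are illustrative:

```python
# One-hot encode a single column of country names.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

countries = np.array([["France"], ["Germany"], ["Spain"], ["France"]])
encoder = OneHotEncoder()  # categories are inferred and sorted alphabetically
onehot = encoder.fit_transform(countries).toarray()  # .toarray() densifies the sparse result

# Resulting columns are France, Germany, Spain:
print(onehot[0])  # France  -> [1. 0. 0.]
print(onehot[1])  # Germany -> [0. 1. 0.]
print(onehot[2])  # Spain   -> [0. 0. 1.]
```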


This is done quite simply:


from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(categorical_features=[0])
x = onehotencoder.fit_transform(x).toarray()

In the constructor, we specify which column OneHotEncoder should process, in our case [0]. Then we transform the array x using the fit_transform function of the encoder object we just created. That's it: we now have three new columns in the data set:
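Note that the categorical_features parameter used above comes from older scikit-learn versions; it was deprecated in 0.20 and later removed. In current releases, the equivalent of "encode only column 0, keep the rest" is ColumnTransformer. A hedged sketch, with illustrative sample rows standing in for the article's dataset:

```python
# Modern equivalent of OneHotEncoder(categorical_features=[0]):
# encode only the country column, pass the other columns through.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

x = np.array([
    ["France",  "44", "72000"],
    ["Spain",   "27", "48000"],
    ["Germany", "30", "54000"],
])

ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder="passthrough",              # keep the remaining columns as-is
)
x_new = ct.fit_transform(x)

# Three one-hot columns (France, Germany, Spain) plus the two passthrough columns
print(x_new.shape)  # (3, 5)
```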



As you can see, instead of one country column, we get three new columns that encode that country.


This is the difference between LabelEncoder and OneHotEncoder.



Source: https://habr.com/ru/post/456294/

