
Basics of analyzing data in Python using pandas + sklearn

Good afternoon, dear readers. In today's post I will continue my series of articles on analyzing data in Python with the Pandas module and show one way of using it together with the machine learning module scikit-learn. How this pair works together will be demonstrated on the Kaggle "Titanic" survival task. This task is very popular among people who are just starting out in data analysis and machine learning.


Formulation of the problem


So, the essence of the task is to build, using machine learning methods, a model that predicts whether a person will be saved or not. Two files come with the task:

  * train.csv: the training sample, in which the Survived flag is known for each passenger;
  * test.csv: the test sample, for which survival has to be predicted.

In addition, explanations are given for some of the fields:

  * Survived: 1 if the passenger was saved, 0 otherwise;
  * Pclass: passenger class (1 = upper, 2 = middle, 3 = lower);
  * SibSp: number of siblings/spouses aboard;
  * Parch: number of parents/children aboard;
  * Fare: ticket price;
  * Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

As mentioned above, the analysis will require the Pandas and scikit-learn modules: with Pandas we will do the initial analysis of the data, and sklearn will help in building the predictive model.


Input analysis


So, the task is formulated and we can begin to solve it.
First, let's load the required modules and read the training sample to see what it looks like:

    from pandas import read_csv, DataFrame, Series

    data = read_csv('Kaggle_Titanic/Data/train.csv')

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th...) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S

It can be assumed that the higher the social status, the greater the likelihood of being saved. Let's check this by looking at the number of survivors and casualties in each passenger class. To do so, we build the following pivot table:

 data.pivot_table('PassengerId', 'Pclass', 'Survived', 'count').plot(kind='bar', stacked=True) 
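For reference, the same counts can be obtained without plotting through a groupby; this is just an equivalent view of the data, not code from the article:

    print(data.groupby(['Pclass', 'Survived'])['PassengerId'].count().unstack())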

[image: stacked bar chart of survived vs. drowned passengers by Pclass]
The plot confirms our assumption above: the higher a passenger's social status, the higher the likelihood of being saved. Now let's take a look at how the number of relatives aboard affects the fact of being saved:

    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(ncols=2)
    data.pivot_table('PassengerId', ['SibSp'], 'Survived', 'count').plot(ax=axes[0], title='SibSp')
    data.pivot_table('PassengerId', ['Parch'], 'Survived', 'count').plot(ax=axes[1], title='Parch')

[image: survival counts by SibSp (left) and Parch (right)]
As can be seen from the graphs, our assumption is confirmed again: few of the people with more than one relative aboard were saved.
Now let's think about the cabin numbers. In theory, cabin data may simply be missing for many passengers, so let's look at how well this field is filled in:

 data.PassengerId[data.Cabin.notnull()].count() 


As a result, only 204 of the 891 records are filled in, so we can conclude that this field can be omitted from the analysis.
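The same picture can be seen as a share of missing values; a small optional check that is not in the original text:

    print(data.Cabin.isnull().mean())   # about 0.77 of the rows have no cabin recorded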
The next field to analyze is the age field ( Age ). Let's see how complete it is:

 data.PassengerId[data.Age.notnull()].count() 


This field is almost completely filled in (714 non-empty records), but there are still missing values. Let's set them to the median age of the entire sample; this step is needed for building a more accurate model:

    data.Age = data.Age.fillna(data.Age.median())   # fill only the missing ages with the median age

We are left with the fields Ticket , Embarked , Fare and Name . Let's start with the Embarked field, which holds the port of embarkation, and check whether there are any passengers whose port is not listed:

 data[data.Embarked.isnull()] 

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
62 | 1 | 1 | Icard, Miss. Amelie | female | 38 | 0 | 0 | 113572 | 80 | B28 | NaN
830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62 | 0 | 0 | 113572 | 80 | B28 | NaN


So we found two such passengers. Let's assign them the port at which the largest number of passengers embarked:

    MaxPassEmbarked = data.groupby('Embarked').count()['PassengerId']
    data.Embarked[data.Embarked.isnull()] = MaxPassEmbarked[MaxPassEmbarked == MaxPassEmbarked.max()].index[0]
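The same assignment can be written a bit more compactly through the mode of the column; an equivalent variant rather than the article's code:

    data.Embarked = data.Embarked.fillna(data.Embarked.mode()[0])   # the most frequent port of embarkation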


Well, that is one more field dealt with; what remains are the fields with the passenger's name, the ticket number and the ticket price.
In fact, of these three we only need the price ( Fare ), since to some extent it reflects the ranking within the classes of the Pclass field. For example, people inside the middle class can be split into those who are closer to the first (upper) class and those who are closer to the third (lower) class. Let's check this field for empty values and, if there are any, replace them with the median fare over the whole sample:

 data.PassengerId[data.Fare.isnull()] 

In our case there are no empty entries.
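If the field did contain gaps, they could be filled in the same spirit as the Age field; a sketch of what that would look like (not needed for this particular training file):

    data.Fare = data.Fare.fillna(data.Fare.median())   # replace only the missing fares with the median fare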
In turn, the ticket number and the passenger's name will not help us, since they are just reference information. The only thing they could be useful for is identifying which passengers are potentially relatives, but since people travelling with relatives mostly did not survive (as shown above), we can ignore this data.
Now, after removing all the unnecessary fields, our set looks like this:

 data = data.drop(['PassengerId','Name','Ticket','Cabin'],axis=1) 

Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
0 | 3 | male | 22 | 1 | 0 | 7.2500 | S
1 | 1 | female | 38 | 1 | 0 | 71.2833 | C
1 | 3 | female | 26 | 0 | 0 | 7.9250 | S
1 | 1 | female | 35 | 1 | 0 | 53.1000 | S
0 | 3 | male | 35 | 0 | 0 | 8.0500 | S


Input Preprocessing


The preliminary analysis of the data is complete, and as a result we have obtained a sample with a handful of fields. It would seem we could start building the model, if not for one "but": our data contain not only numerical but also text values.
Therefore, before building a model, all the text values need to be encoded.
You can do this manually or by using the sklearn.preprocessing module. Let's use the second option.
A list with a fixed set of values can be encoded with the LabelEncoder() object. It works like this: it is fitted on the list of values that need to be encoded, and the result is a list of classes whose indices serve as the codes of the elements of the input list.
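A tiny standalone illustration of this behaviour (the toy values here are mine, just to show the idea):

    from sklearn.preprocessing import LabelEncoder

    enc = LabelEncoder()
    enc.fit(['male', 'female', 'female', 'male'])
    print(list(enc.classes_))                 # ['female', 'male'] - the classes found
    print(enc.transform(['male', 'female']))  # [1 0] - the indices of those classes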

    from sklearn.preprocessing import LabelEncoder

    label = LabelEncoder()
    dicts = {}

    label.fit(data.Sex.drop_duplicates())      # fit the encoder on the list of values to encode
    dicts['Sex'] = list(label.classes_)
    data.Sex = label.transform(data.Sex)       # replace the values with the codes of the encoded classes

    label.fit(data.Embarked.drop_duplicates())
    dicts['Embarked'] = list(label.classes_)
    data.Embarked = label.transform(data.Embarked)

As a result, our initial data will look like this:
Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
0 | 3 | 1 | 22 | 1 | 0 | 7.2500 | 2
1 | 1 | 0 | 38 | 1 | 0 | 71.2833 | 0
1 | 3 | 0 | 26 | 0 | 0 | 7.9250 | 2
1 | 1 | 0 | 35 | 1 | 0 | 53.1000 | 2
0 | 3 | 1 | 35 | 0 | 0 | 8.0500 | 2


Now we need to write code that brings the test file into the required form. To do this we can simply copy the pieces of code used above (or simply write a function to process the input file):

    test = read_csv('Kaggle_Titanic/Data/test.csv')
    test.Age[test.Age.isnull()] = test.Age.mean()
    test.Fare[test.Fare.isnull()] = test.Fare.median()   # fill the missing fare with the median
    MaxPassEmbarked = test.groupby('Embarked').count()['PassengerId']
    test.Embarked[test.Embarked.isnull()] = MaxPassEmbarked[MaxPassEmbarked == MaxPassEmbarked.max()].index[0]
    result = DataFrame(test.PassengerId)
    test = test.drop(['Name','Ticket','Cabin','PassengerId'], axis=1)
    label.fit(dicts['Sex'])
    test.Sex = label.transform(test.Sex)
    label.fit(dicts['Embarked'])
    test.Embarked = label.transform(test.Embarked)


The code above performs almost the same operations as we did for the training sample; the difference is the added line that handles the Fare field when it is not filled in.
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
3 | 1 | 34.5 | 0 | 0 | 7.8292 | 1
3 | 0 | 47.0 | 1 | 0 | 7.0000 | 2
2 | 1 | 62.0 | 0 | 0 | 9.6875 | 1
3 | 1 | 27.0 | 0 | 0 | 8.6625 | 2
3 | 0 | 22.0 | 1 | 1 | 12.2875 | 2
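As noted above, instead of copying code the same preprocessing could be wrapped in a single helper function and applied to any input file. A minimal sketch of that idea (the function name prepare_data and its exact choices, such as always using the median for Age, are my own illustration rather than code from the article):

    from pandas import read_csv
    from sklearn.preprocessing import LabelEncoder

    def prepare_data(df, dicts):
        # fill the gaps: median age and fare, most frequent port
        df.Age = df.Age.fillna(df.Age.median())
        df.Fare = df.Fare.fillna(df.Fare.median())
        df.Embarked = df.Embarked.fillna(df.Embarked.value_counts().idxmax())
        # drop the reference fields that the model does not use
        df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
        # encode the text fields with the dictionaries built on the training sample
        encoder = LabelEncoder()
        for column in ('Sex', 'Embarked'):
            encoder.fit(dicts[column])
            df[column] = encoder.transform(df[column])
        return df

    # usage sketch: prepare_data(read_csv('Kaggle_Titanic/Data/test.csv'), dicts)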


Construction of classification models and their analysis


Well, the data is processed and we can start building models, but first we need to decide how to check the accuracy of the resulting model. For this we will use cross-validation and ROC curves. We will perform the check on the training sample, and then apply the model to the test sample.
So, let's consider a few machine learning algorithms:

  * RandomForestClassifier (random forest)
  * KNeighborsClassifier (nearest neighbours)
  * LogisticRegression (logistic regression)
  * SVC (support vector machine)

Let's load the libraries we need:

    from sklearn import cross_validation, svm
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc
    import pylab as pl

To begin with, we need to split our training sample into the target indicator that we are predicting and the features that describe it:

    target = data.Survived
    train = data.drop(['Survived'], axis=1)  # the features: everything except the survival flag
    kfold = 5                                # number of folds for cross-validation
    itog_val = {}                            # dictionary for the cross-validation results

Now our learning sample looks like this:
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked
3 | 1 | 22 | 1 | 0 | 7.2500 | 2
1 | 0 | 38 | 1 | 0 | 71.2833 | 0
3 | 0 | 26 | 0 | 0 | 7.9250 | 2
1 | 0 | 35 | 1 | 0 | 53.1000 | 2
3 | 1 | 35 | 0 | 0 | 8.0500 | 2

Now we split the data obtained earlier into two subsamples (training and test) for computing the ROC curves (for cross-validation this is not necessary, since the verification function does it itself). The train_test_split function of the cross_validation module will help us with this:

 ROCtrainTRN, ROCtestTRN, ROCtrainTRG, ROCtestTRG = cross_validation.train_test_split(train, target, test_size=0.25) 

The following parameters are passed to it:

  * the array of features;
  * the array of target values;
  * test_size, the share of the data that goes into the test subsample (here 25%).

At the output, the function returns 4 arrays:
  1. the training array of features
  2. the test array of features
  3. the training array of target values
  4. the test array of target values
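To make sure the 75/25 split behaved as expected, one can simply print the shapes of the four arrays; an optional check that is not in the original article:

    print(ROCtrainTRN.shape, ROCtestTRN.shape)   # feature subsamples: roughly 75% and 25% of the rows
    print(ROCtrainTRG.shape, ROCtestTRG.shape)   # the matching target subsamples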


Below are the listed methods with the best parameters found by experimentation:

    model_rfc = RandomForestClassifier(n_estimators = 70)    # number of trees in the forest
    model_knc = KNeighborsClassifier(n_neighbors = 18)       # number of neighbours
    model_lr = LogisticRegression(penalty='l1', tol=0.01)
    model_svc = svm.SVC()                                     # the default kernel='rbf'

Now let's check the resulting models with cross-validation. To do this we use the cross_val_score function.
    scores = cross_validation.cross_val_score(model_rfc, train, target, cv = kfold)
    itog_val['RandomForestClassifier'] = scores.mean()
    scores = cross_validation.cross_val_score(model_knc, train, target, cv = kfold)
    itog_val['KNeighborsClassifier'] = scores.mean()
    scores = cross_validation.cross_val_score(model_lr, train, target, cv = kfold)
    itog_val['LogisticRegression'] = scores.mean()
    scores = cross_validation.cross_val_score(model_svc, train, target, cv = kfold)
    itog_val['SVC'] = scores.mean()
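The four nearly identical calls above can also be collapsed into a loop over a dictionary of models; this is just a compact rewrite of the same computation, not the author's code:

    models = {'RandomForestClassifier': model_rfc, 'KNeighborsClassifier': model_knc,
              'LogisticRegression': model_lr, 'SVC': model_svc}
    for name, model in models.items():
        itog_val[name] = cross_validation.cross_val_score(model, train, target, cv=kfold).mean()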

Let's look at a plot of the average cross-validation score for each model:

 DataFrame.from_dict(data = itog_val, orient='index').plot(kind='bar', legend=False) 

[image: bar chart of the mean cross-validation score for each model]

As you can see from the plot, the RandomForest algorithm performed best of all. Now let's look at the ROC curves to assess the accuracy of the classifiers. We will draw the plots with the matplotlib library:

    pl.clf()
    pl.figure(figsize=(8,6))

    #SVC
    model_svc.probability = True
    probas = model_svc.fit(ROCtrainTRN, ROCtrainTRG).predict_proba(ROCtestTRN)
    fpr, tpr, thresholds = roc_curve(ROCtestTRG, probas[:, 1])
    roc_auc = auc(fpr, tpr)
    pl.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % ('SVC', roc_auc))

    #RandomForestClassifier
    probas = model_rfc.fit(ROCtrainTRN, ROCtrainTRG).predict_proba(ROCtestTRN)
    fpr, tpr, thresholds = roc_curve(ROCtestTRG, probas[:, 1])
    roc_auc = auc(fpr, tpr)
    pl.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % ('RandomForest', roc_auc))

    #KNeighborsClassifier
    probas = model_knc.fit(ROCtrainTRN, ROCtrainTRG).predict_proba(ROCtestTRN)
    fpr, tpr, thresholds = roc_curve(ROCtestTRG, probas[:, 1])
    roc_auc = auc(fpr, tpr)
    pl.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % ('KNeighborsClassifier', roc_auc))

    #LogisticRegression
    probas = model_lr.fit(ROCtrainTRN, ROCtrainTRG).predict_proba(ROCtestTRN)
    fpr, tpr, thresholds = roc_curve(ROCtestTRG, probas[:, 1])
    roc_auc = auc(fpr, tpr)
    pl.plot(fpr, tpr, label='%s ROC (area = %0.2f)' % ('LogisticRegression', roc_auc))

    pl.plot([0, 1], [0, 1], 'k--')
    pl.xlim([0.0, 1.0])
    pl.ylim([0.0, 1.0])
    pl.xlabel('False Positive Rate')
    pl.ylabel('True Positive Rate')
    pl.legend(loc=0, fontsize='small')
    pl.show()

[image: ROC curves for SVC, RandomForest, KNeighborsClassifier and LogisticRegression]
As can be seen from the ROC analysis, RandomForest again showed the best result. Now it only remains to apply our model to the test sample:

    model_rfc.fit(train, target)
    result.insert(1, 'Survived', model_rfc.predict(test))
    result.to_csv('Kaggle_Titanic/Result/test.csv', index=False)
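Before writing out the submission it can be useful to glance at which features the trained forest relies on most; a small optional check that is not part of the original article:

    importances = Series(model_rfc.feature_importances_, index=train.columns)
    print(importances)   # the per-feature contributions sum to 1.0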


Conclusion


In this article I tried to show how the pandas package can be used together with the sklearn machine learning package. The resulting model, submitted to Kaggle, showed an accuracy of 0.77033. In this article I wanted to show how to work with the tools and the course of the research rather than to build a detailed algorithm, as, for example, in this series of articles.

Source: https://habr.com/ru/post/202090/

