Kaggle and Titanic - another solution using Python
I want to share my experience with the famous Titanic competition on Kaggle. It is positioned as a competition for beginners, and I had almost no practical experience in this area: I knew a bit of theory, but had barely touched real data and had not worked closely with Python. In the end, after spending a couple of New Year's Eve evenings on it, I scored 0.80383 (the top quarter of the leaderboard).
Titanic
Put on some music suitable for work, and let's begin the study.
I would also like to point to an article about another competition. From it you can see exactly how a researcher's mind should work, and that most of the time should be devoted to preliminary analysis and data processing.
To solve the problem, I use the Python technology stack. This approach is not the only possible one: there are R, Matlab, Mathematica, Azure Machine Learning, Apache Weka, Java-ML, and the list could go on for a long time. Using Python has a number of advantages: there are really a lot of libraries and they are of excellent quality, and since most of them are wrappers over C code, they are also quite fast. In addition, the resulting model can easily be put into production.
I must admit that I am not a big fan of loosely typed scripting languages, but the wealth of libraries for Python simply does not allow it to be ignored.
We will run everything under Linux (Ubuntu 14.04). You need: python 2.7, seaborn, matplotlib, sklearn, xgboost, pandas. In general, only pandas and sklearn are required, and the rest are needed for illustration.
Under Linux, libraries for Python can be installed in two ways: via the system package manager (deb packages) or via the Python utility pip.
Installing deb packages is easier and faster, but often the libraries are outdated there (stability is above all).
So which way is better? I use a compromise: I install the heavyweight packages with many build dependencies (NumPy and SciPy) from deb packages, and the rest via pip.
If I have forgotten something, then all the necessary packages are usually easily calculated and installed in a similar way.
Users of other platforms need to take similar steps to install packages. But there is a much simpler option: there are already precompiled distributions with python and almost all the necessary libraries. I have not tried them myself, but at first glance they look promising.
Frankly speaking, we don't have much data: only 891 passengers in the train sample and 418 in the test sample (one more line in each file goes to the header with the list of fields).
Open train.csv in any tabular processor (I use LibreOffice Calc) to visually see the data.
$ libreoffice --calc train.csv
We see the following:
Not every passenger's age is filled in
Tickets have a strange and inconsistent format.
Names have a title (miss, mr, mrs, etc.)
There are very few people with cabin numbers (there is a chilling story about why)
The cabin numbers that do exist apparently encode the deck (as it turned out)
Also, according to the article, the side of the ship is encoded in the cabin number
Sort by name. It is evident that many traveled in families, and the scale of the tragedy is visible: families were often separated, with only some members surviving.
Sort by ticket. It can be seen that several people often traveled under the same ticket number, often with different surnames. A quick glance suggests that people with the same ticket number often shared the same fate.
Some passengers do not have a landing port
That seems to cover everything, so let's get to work with the data.
Data loading
To avoid clutter later, here are all the imports used in the code at once:
Script header
# coding=utf8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import xgboost as xgb
import re
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.cross_validation import StratifiedKFold
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

pd.set_option('display.width', 256)
Probably most of my motivation for writing this article is caused by the enthusiasm from working with the pandas package. I knew about the existence of this technology, but I could not even imagine how pleasant it was to work with it. Pandas is Excel on the command line with convenient I / O functionality and tabular data processing.
We collect both samples (train-sample and test-sample) into one total all-sample.
all_data = pd.concat([train_data, test_data])
Why do this, given that the test sample has no survival flag at all? The complete sample is useful for computing statistics over all the other fields (means, medians, quantiles, minima and maxima), as well as the relationships between those fields. That is, by computing statistics only on the train sample, we would be ignoring some information that is quite useful to us.
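A minimal, self-contained sketch of the loading step; the inline rows below are hypothetical stand-ins for Kaggle's train.csv and test.csv, which share these columns:

```python
import io
import pandas as pd

# Hypothetical stand-ins for Kaggle's train.csv / test.csv files.
train_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,Fare\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,7.25\n'
    '2,1,1,"Cumings, Mrs. John Bradley",female,38,71.2833\n'
)
test_csv = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age,Fare\n"
    '892,3,"Kelly, Mr. James",male,34.5,7.8292\n'
)

train_data = pd.read_csv(train_csv)
test_data = pd.read_csv(test_csv)

# Combined sample: the test rows simply get NaN in the Survived column.
all_data = pd.concat([train_data, test_data], ignore_index=True)
print(len(all_data))                       # 3
print(all_data["Survived"].isna().sum())   # 1
```

With the real files you would simply pass the file names to pd.read_csv instead of the StringIO buffers.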
Data analysis
Data analysis in Python can be done in several ways at once: plain text summaries, matplotlib plots, seaborn plots.
We will try all three, but first let's run the simplest, text-based version and print survival statistics by class and sex.
print("===== survived by class and sex")
print(train_data.groupby(["Pclass", "Sex"])["Survived"].value_counts(normalize=True))
Result
===== survived by class and sex
Pclass  Sex     Survived
1       female  1           0.968085
                0           0.031915
        male    0           0.631148
                1           0.368852
2       female  1           0.921053
                0           0.078947
        male    0           0.842593
                1           0.157407
3       female  0           0.500000
                1           0.500000
        male    0           0.864553
                1           0.135447
dtype: float64
We see that women were put into the boats first: female survival rates are 96.8%, 92.1% and 50%, depending on ticket class. A man's chance of survival is much lower: 36.9%, 15.7% and 13.5% respectively.
With the help of pandas, we quickly calculate a summary of all the numerical fields of both samples - separately for men and for women.
We can see that the means and percentiles are nearly identical between the samples. But for men, the samples differ in the maxima of age and ticket price; women in the two samples also differ in maximum age.
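That summary is a one-liner in pandas; a sketch on a toy frame (column names as in the Kaggle data):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22, 38, 54, 2],
    "Fare": [7.25, 71.28, 51.86, 21.07],
})

# Per-sex summary of every numeric column: count, mean, std, min,
# quartiles and max -- enough to compare the samples at a glance.
summary = df.groupby("Sex").describe()
print(summary)
```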
Data Digest Build
Let's collect a small digest of the full sample - it will be needed for the further transformation of the samples. In particular, we need the values that will be substituted for missing ones, as well as various lookup tables for translating text values into numeric ones. The point is that many classifiers can only work with numbers, so one way or another we have to translate categorical features into numeric ones, and whatever conversion we choose, we will need lookup tables of these values.
fares - lookup table of median ticket prices by ticket class;
titles - lookup table of titles;
families - lookup table of family identifiers (surname + number of family members);
cabins - lookup table of cabin identifiers;
tickets - lookup table of ticket identifiers.
We build the lookup tables for recovering missing data (the medians) from the combined sample, but the lookup tables for translating categorical features - only from the test data. The idea is the following: suppose the surname "Ivanov" occurs in the train set but not in the test set. The classifier's knowledge that "Ivanov" survived (or did not survive) does not help in evaluating the test set, since that surname never appears there. Therefore we add to the lookup table only the surnames that occur in the test set. An even more correct way would be to add only the intersection of values (those present in both sets) - I tried it, but the verification score dropped by 3 percent.
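A sketch of how two of these lookup tables might be built; the title-extracting regex is my assumption, and the real script may differ:

```python
import pandas as pd

# Toy stand-in for the combined sample.
all_data = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Fare": [71.28, 52.00, 7.25, None],
    "Name": ["Cumings, Mrs. John Bradley", "Bonnell, Miss. Elizabeth",
             "Braund, Mr. Owen Harris", "Kelly, Mr. James"],
})

# fares: median ticket price per class, used to fill in missing fares
# (the median skips NaN automatically).
fares = all_data.groupby("Pclass")["Fare"].median().to_dict()

# titles: every title found in the names, mapped to a numeric code.
# Assumed pattern: the title sits between the comma and the first period.
title_series = all_data["Name"].str.extract(r",\s*([^.]+)\.", expand=False)
titles = {t: i for i, t in enumerate(sorted(title_series.unique()))}

print(fares)   # {1: 61.64, 3: 7.25}
print(titles)  # {'Miss': 0, 'Mr': 1, 'Mrs': 2}
```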
Extracting the features
Now we need to extract the features. As already mentioned, many classifiers can only work with numbers, so we need to:
Convert categories to a numeric representation
Extract implicit features, that is, those not given explicitly (title, deck)
Do something about missing values
There are two ways to convert a categorical feature into a numeric one. Let's look at the problem using the passenger's sex as an example.
In the first option, we simply replace the sex with a number: for example, female with 0 and male with 1 (a circle and a stick - very easy to remember). This option does not increase the number of features, but a "greater than"/"less than" relation now appears among the feature's values. When there are many values, such an unexpected property of the feature is not always desirable and can cause problems for geometric classifiers.
The second option is to create two columns, "sex_male" and "sex_female". For a male we set sex_male = 1, sex_female = 0; for a female, vice versa: sex_male = 0, sex_female = 1. We now avoid the "greater"/"less" relation, but we have more features, and the more features, the more data we need to train the classifier - a problem known as the "curse of dimensionality". The situation is especially difficult when a feature has many values, for example ticket identifiers; in such cases rarely occurring values can be folded into a special placeholder value, reducing the total number of features after expansion.
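Both options in pandas, as a sketch:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Option 1: map each category to a single numeric column
# (introduces an implicit ordering between values).
df["SexN"] = df["Sex"].map({"female": 0, "male": 1})

# Option 2: expand into one indicator column per value (one-hot encoding);
# no ordering, but more columns.
dummies = pd.get_dummies(df["Sex"], prefix="sex")
df = pd.concat([df, dummies], axis=1)

print(df.columns.tolist())  # ['Sex', 'SexN', 'sex_female', 'sex_male']
```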
A small spoiler: we bet first on the Random Forest classifier. Firstly, everyone does this, and secondly, it does not require feature expansion, is robust to the scale of feature values, and trains quickly. Nevertheless, we prepare the features in a general, universal form, since our main goal is to explore the principles and possibilities of working with sklearn.
Thus, we replace some categorical features with numbers, expand others, and do both to the rest. We do not skimp on the number of features, because later we can always choose which of them take part in training.
In most manuals and examples on the net, the original data sets are modified very freely: original columns are replaced with new values, unneeded columns are deleted, and so on. There is no need for this as long as we have enough RAM: it is always better to add new features to the set without touching the existing data; later pandas will always let us select just the columns we need.
A small explanation of the new features added:
add our own cabin index
add our own deck index (cut out of the cabin number)
add our own ticket index
add our own title index (cut out of the name)
add our own family identifier index (formed from the surname and the family size)
In general, we add to the features everything that comes to mind. It is clear that some features duplicate each other (for example, the expansion and the replacement of sex), some clearly correlate with each other (ticket class and ticket price), and some are clearly meaningless (the port of embarkation is unlikely to affect survival). We will deal with all this later, when we select the features for training.
Let's transform both available sets and also create a combined set again.
Although we are aiming to use Random Forest, I want to try other classifiers too. And with them there is the following problem: many classifiers are sensitive to feature scale. In other words, if one feature takes values in [-1, 1] and a second takes values in [0, 10000], then the same relative error on both features leads to a large difference in absolute value, and the classifier will treat the second feature as more important.
To avoid this, we bring all numeric features (and we no longer have any other kind) to the same scale, [-1, 1], with zero mean. In sklearn this is very easy to do.
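With StandardScaler, which the script header already imports, this takes a couple of lines; strictly speaking, it standardizes each column to zero mean and unit variance rather than to exactly [-1, 1]:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: columns Age and Fare on very different scales.
X = np.array([[22.0, 7.25],
              [38.0, 71.28],
              [54.0, 51.86]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # per column: subtract mean, divide by std

print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # ~[1, 1]
```

The same fitted scaler must then be applied to the test sample with transform, not fit_transform, so both samples share one scale.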
Just comment out the unnecessary features and start training. Which ones are unnecessary is for you to decide.
Once again the analysis
Since we now have a column recording the age range a passenger falls into, let's estimate survival depending on age (range).
print("===== survived by age")
print(train_data.groupby(["AgeR"])["Survived"].value_counts(normalize=True))
print("===== survived by gender and age")
print(train_data.groupby(["Sex", "AgeR"])["Survived"].value_counts(normalize=True))
print("===== survived by class and age")
print(train_data.groupby(["Pclass", "AgeR"])["Survived"].value_counts(normalize=True))
We see that children under 5 have a great chance of survival, while in old age the chance decreases with age. But this does not apply to women - a woman has a good chance of survival at any age.
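An AgeR-style column can be produced with pd.cut; the bin edges below are my guess, not necessarily the ones used in the article:

```python
import pandas as pd

ages = pd.Series([2, 17, 25, 40, 50, 80])

# Assign each age to a range; labels=False makes the labels the bin indices,
# so the column stays numeric. Bin edges are illustrative.
age_r = pd.cut(ages, bins=[0, 5, 16, 30, 45, 60, 100], labels=False)
print(age_r.tolist())  # [0, 2, 2, 3, 4, 5]
```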
Let's try the visualization from seaborn - it produces very beautiful pictures, although I am more used to text.
Here is an article describing exactly how it works. Other scoring strategies can be specified via the SelectKBest parameters.
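A sketch of scoring features with SelectKBest and the ANOVA F-test (f_classif), both already imported in the script header; the synthetic data here is my own illustration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=200)             # binary survival-like target
X = np.column_stack([
    y + 0.1 * rng.randn(200),               # informative feature
    rng.randn(200),                         # pure noise
])

# Score every feature against the target and keep the single best one.
selector = SelectKBest(f_classif, k=1).fit(X, y)
print(selector.scores_)                     # first score is far larger
print(selector.get_support().tolist())      # [True, False]
```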
In principle, we already knew all of this: sex is very important; titles are important, but strongly correlated with sex; ticket class is important, and, in some way, deck F.
Evaluating quality
Before starting any classification, we need to understand how it will be evaluated. In the case of Kaggle contests everything is simple: we just read the rules. For the Titanic, the score is the ratio of correct predictions to the total number of passengers - in other words, this metric is called accuracy.
But before sending the classification result for the test sample to Kaggle for evaluation, it would be good to first get at least an approximate idea of our classifier's quality ourselves. For that we can only use the train sample, since only it contains labeled data. But the question remains: how exactly?
Often in examples you can see something like this:
That is, we train the classifier on the train set and then check it on the same set. Undoubtedly this gives some estimate of the classifier's quality, but in general the approach is wrong. The classifier should describe not the data it was trained on, but the model that generated that data. Otherwise the classifier adapts perfectly to the train sample, shows excellent results when checked on it, and then fails badly on any other data set. This is called overfitting.
The correct approach is to split the available train set into several pieces: take some of them, train the classifier on them, and then check its work on the rest. This process can be repeated several times by shuffling the pieces. In sklearn it is called cross-validation.
You can already picture the loops that would split the data, train and evaluate - but the trick is that all you need to implement this in sklearn is to define a strategy.
Here we define a rather elaborate procedure: the training data is split into three pieces, records are assigned to the pieces at random (to cancel any possible dependence on order), and the strategy also keeps the class ratio in each piece approximately equal. Thus we perform three measurements - pieces 1+2 vs 3, 1+3 vs 2, 2+3 vs 1 - and obtain the mean accuracy of the classifier (characterizing the quality of its work) as well as the variance of that estimate (characterizing the stability of its work).
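A sketch of that strategy with today's sklearn module paths (the article's python 2.7-era code imports the same classes from sklearn.cross_validation); the synthetic data is my own stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Three shuffled, class-balanced folds: 1+2 vs 3, 1+3 vs 2, 2+3 vs 1.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)

print(len(scores))                   # 3: one accuracy per fold
print(scores.mean(), scores.std())   # quality and stability
</br>```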
The linear_scorer function is needed because LinearRegression is a regression and returns arbitrary real numbers. We therefore split the scale at the 0.5 boundary and map any number to one of the two classes, 0 or 1.
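A minimal version of such a scorer (the name linear_scorer is from the article; the body is my reconstruction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def linear_scorer(estimator, X, y):
    # A regression outputs arbitrary real numbers; threshold at 0.5
    # to turn them into the two classes 0 and 1.
    predicted = (estimator.predict(X) > 0.5).astype(int)
    return metrics.accuracy_score(y, predicted)

# Toy check: a perfectly separable one-feature problem.
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
model = LinearRegression().fit(X, y)
print(linear_scorer(model, X, y))  # 1.0
```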
Random Forest won, and its variance is decent - it seems stable.
Even better
Everything seems fine and we could submit the result, but one murky point remains: every classifier has its own parameters - how do we know we have chosen the best combination? Without a doubt, you could sit and try parameters by hand for a long time - but why not entrust this work to the computer?
The search can be made even finer, given time and desire - either by varying the parameters, or by using a different search strategy, for example RandomizedSearchCV.
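A sketch of automating the parameter search with GridSearchCV (imported from sklearn.grid_search in the article's era, sklearn.model_selection today); the grid and data below are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Every combination of these values is tried with cross-validation.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, None],
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)            # the winning combination
print(round(search.best_score_, 3))   # its mean cross-validated accuracy
```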
For some reason training hung when using all the cores, so I limited myself to one thread (n_jobs=1); but even in single-threaded mode, training and classification in xgboost work very quickly.
In general, a few things about such competitions seemed interesting to me.
Looking through the top of the leaderboard, one cannot help noticing people who scored 1 (all answers correct) - and some of them got it on their very first attempt.
The following scenario comes to mind: someone registers an account and starts probing for the correct answers (no more than 10 attempts per day are allowed). If I understand correctly, this is a variant of the classic weighing puzzle.
However, after a moment's more thought, one cannot help smiling at this guess: we are dealing with a task whose answers have long been known! The sinking of the Titanic was a shock to its contemporaries; films, books and documentaries have been devoted to the event, and most likely somewhere there is a complete list of the Titanic's passengers with a description of their fates. But that no longer has anything to do with machine learning.
Still, there is a conclusion to draw here, one I intend to apply in future competitions: you need not (unless the rules forbid it) limit yourself to the data issued by the organizer. For example, given a time and place you can recover weather conditions, the state of securities markets, exchange rates, or whether the day was a holiday - in other words, you can join the organizer's data with any available public data sets that help describe the model.
Code
The full script code is here. Do not forget to choose the features for training.