Good afternoon, reader! This article is about how I took second place in the MLBootCamp III competition. For those not up to date: it is a machine learning and data analysis competition from Mail.Ru Group, held from February 15 to March 15.
The article briefly describes how the solution was built, shares a few tips along with the bumps collected on the way, and ends with thanks.
So let's go.
Spoiler: this narration was originally written in lively Russian, is not particularly scientific and rather prosaic, although in the author's opinion it contains some quite good advice.
The challenge: leaving online games
The task was to learn to predict whether a participant stays in an online game or leaves it. The training and test data are computed over the last two weeks, and a player is considered to have left if they are absent from the game for a week.
A total of 12 features, calculated for the previous 2 weeks, are used:
- maxPlayerLevel - the maximum level of the game that the player has passed
- numberOfAttemptedLevels - the number of levels that the player tried to pass
- attemptsOnTheHighestLevel - the number of attempts made at the highest level
- totalNumOfAttempts - the total number of attempts
- averageNumOfTurnsPerCompletedLevel - the average number of moves performed on successfully completed levels
- doReturnOnLowerLevels - whether the player made returns to the game at levels already completed
- numberOfBoostersUsed - the number of boosters used
- fractionOfUsefullBoosters - the number of boosters used during successful attempts (the player has passed the level)
- totalScore - total points scored
- totalBonusScore - total bonus points earned.
- totalStarsCount - the total number of stars scored
- numberOfDaysActuallyPlayed - the number of days when the user played the game
All the data provided for the task are split into two parts: training (x_train.csv and y_train.csv) and test (x_test.csv). Each line of x_train.csv and x_test.csv corresponds to one user; values within a line are separated by semicolons, and the first line contains the feature names. The y_train.csv file contains 1 or 0 depending on whether the user stays in the game or leaves it, respectively.
Both training (x_train.csv and y_train.csv) and test (x_test.csv) samples contain information about 25289 users.
The answer for this problem is a text file, each line of which corresponds to a line in the x_test.csv file and contains a value from 0 to 1 (the probability of the user remaining in the game). The logarithmic loss function is used as the quality criterion for solving the problem.
The number of submissions is limited to five per day.
The test sample is randomly split into two parts in a 40/60 ratio. The result on the first 40% determines the participants' positions on the leaderboard throughout the competition. The result on the remaining 60% becomes known only after the competition ends, and it is this result that determines the final standings.
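Reading the data is a one-liner-per-file affair; a minimal sketch, assuming pandas and that y_train.csv is a single unlabeled column:

    import pandas as pd

    # Semicolon-separated files; the first line of the feature files holds the column names.
    X_train = pd.read_csv('x_train.csv', sep=';')
    X_test = pd.read_csv('x_test.csv', sep=';')
    # Assumption: y_train.csv is one column of 0/1 labels (1 = stays in the game), no header.
    y_train = pd.read_csv('y_train.csv', header=None).values.ravel()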
Logloss was used as a metric, so it was worth remembering that:
- Many classifiers already know how to minimize logloss "out of the box";
- Logloss "does not forgive" answers rounded up to 1.0 or down to 0.0 (instead of, say, 0.97 or 0.03) when you turn out to be wrong (see the chart below).
[Logloss graph, the first image from Google]
Therefore, you should not fiddle with the predicted probabilities by hand, and you should certainly not round them.
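To feel just how unforgiving it is, here is a tiny sketch with made-up numbers (a hand-rolled logloss, clipped so that log(0) never occurs):

    import numpy as np

    def logloss(y_true, p, eps=1e-15):
        """Plain logloss, with clipping so that log(0) never happens."""
        p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
        y = np.asarray(y_true, dtype=float)
        return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

    y_true = [1, 1, 1, 1, 0]                 # the last example is the unpleasant surprise
    print(logloss(y_true, [0.97] * 5))       # ~0.73: one confident mistake, moderate penalty
    print(logloss(y_true, [1.0] * 5))        # ~6.9: the same mistake rounded to 1.0 ruins everything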
About solving the problem
The first submission was made almost immediately after the competition opened: a random forest on the 10 features it considered most important. After that came a step that took about a week but turned out not to be very effective: quadratic features were generated and the simplest greedy feature selection algorithm was run, guided by the cross-validation result.
The algorithm is as follows: train the forest, take its feature importances and, starting from the top two, add features to the current set one at a time. If adding a feature improves the result, it stays in the set; if not, it is thrown out and never revisited.
Once this simple algorithm finishes, another heuristic is applied: throw out one feature at a time and look at the result on all the remaining ones. If the result improves, the feature goes into the furnace. A sketch of the whole procedure is shown below.
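A rough sketch of this procedure, assuming a pandas DataFrame X and labels y; the helper names and parameters are illustrative, not the actual competition code:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def cv_logloss(X, y, features):
        """Cross-validated logloss of a random forest on the given feature subset."""
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        scores = cross_val_score(clf, X[features], y, cv=5, scoring='neg_log_loss')
        return -scores.mean()

    def greedy_select(X, y, ranked_features):
        # Forward pass: walk the features in importance order, keep only those that help.
        selected = list(ranked_features[:2])
        best = cv_logloss(X, y, selected)
        for feat in ranked_features[2:]:
            score = cv_logloss(X, y, selected + [feat])
            if score < best:                 # logloss: lower is better
                selected.append(feat)
                best = score
        # Backward heuristic: try dropping each kept feature, keep the drop if it helps.
        for feat in list(selected):
            if len(selected) <= 2:
                break
            rest = [f for f in selected if f != feat]
            score = cv_logloss(X, y, rest)
            if score < best:
                selected = rest
                best = score
        return selected, best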
While the feature search was growing forests and heating up the apartment, I wanted to train a logistic regression and play around with xgboost, since I had not used it before (yes, really).
At first I tried the logistic regression on the available features (not forgetting to normalize them, of course), then decided to convert all the features as follows:
- split all values of a feature into N percentile bins, each containing a 1/N fraction of the values
- one-hot encode the bin index: if the feature value of an example falls into a given interval, set 1, otherwise 0, so a single value turns into a vector [0 0 ... 0 1 0 ... 0] of length N
The idea was the following: player level (as well as the number of days played, the number of stars, etc.) affects the target variable nonlinearly, so this nonlinearity can be handed to a linear model in this transformed, albeit coarse, "binned" form.
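A minimal sketch of such a transform for a single feature column (my own reconstruction, not the original code), assuming numpy arrays:

    import numpy as np

    def percentile_onehot(train_col, test_col, n_bins=20):
        """Split one feature into n_bins percentile intervals and one-hot encode the bin index."""
        # Bin edges at the 0, 1/N, 2/N, ..., 1 quantiles of the training values.
        edges = np.unique(np.quantile(train_col, np.linspace(0, 1, n_bins + 1)))
        n = len(edges) - 1                                  # duplicate quantiles may merge bins
        train_bins = np.digitize(train_col, edges[1:-1])    # bin index for every example
        test_bins = np.digitize(test_col, edges[1:-1])      # out-of-range test values go to edge bins
        # One value -> a vector [0 0 ... 0 1 0 ... 0] of length n.
        return np.eye(n)[train_bins], np.eye(n)[test_bins]

The per-feature one-hot blocks are then stacked side by side to form the input of the linear model.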
This transformation gave about 0.3836 on cross-validation, not bad for a logistic regression (as it turned out later, it could be even better).
Then I launched xgboost and quickly got a good cross-validation result, about the same as the random forest had at that time (something around 0.3812).
After that, I summed the results of three classifiers that all looked quite good on cross-validation, divided by three, submitted, and... was disappointed: the public leaderboard turned out to be a set of examples on which my ensemble performed very badly. By the way, this situation was observed not only by me, but by all my friends and acquaintances who were solving the problem. It got ridiculous: the better a classifier behaved on cross-validation, the lower it dragged me on the public leaderboard. Such a situation is rather demotivating, so interest in the task was lost for about a week.
<Week has passed>
One day I opened the leaderboard and saw my friend climbing up the table; within a day (or two) he was already in the top twenty. "Wow," I thought, rolled up my sleeves and took up the task again.
The method of close scrutiny of the graphs revealed that there were people with only one day of participation in the online game who nevertheless had a 100+ level, lots of bonuses and points. This suggested that such people had actually been playing for a long time; the two-week window over which the features are computed simply caught them on a single day. Having racked my brain for a couple of evenings over how to use this, I decided to build, from each of the features maxPlayerLevel, totalStarsCount and totalNumOfAttempts, a new feature: the "predicted number of days" the player has been playing. The idea is the following: take the lower values (I took everything below the 4th percentile) of the chosen feature (for example, maxPlayerLevel) for each number of days played; this gives a rather nice curve that is approximated quite well by a polynomial (I took degree four, although three would have been enough). With the polynomial built, we can find its roots for a player's feature value and thereby obtain a prediction of the number of days the player has played. It looks like this:

The x axis shows the number of days, the y axis the value of the selected feature. The jagged curve is the feature values lying below the 4th percentile; the nice smooth curve is the graph of the approximating polynomial.
Approximating polynomial code:

    from numpy.polynomial import Polynomial as P

    class nlSolver():
        def __init__(self):
            self.p = None

        def fit(self, f_x, f_y):
            # Fit a degree-4 polynomial y = p(x).
            self.p = P.fit(f_x, f_y, 4)
            return self

        def get_y_by_x(self, f_x):
            return self.p(f_x)

        def get_x_by_y(self, f_y, initial_x):
            # Invert the polynomial: take the largest real root of p(x) - f_y = 0.
            res = max([r.real for r in (self.p - f_y).roots()])
            return res
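Usage went roughly like this (my reconstruction, assuming X is the pandas DataFrame of features; the new column name is made up):

    import numpy as np

    days = X['numberOfDaysActuallyPlayed'].values
    values = X['maxPlayerLevel'].values          # likewise for totalStarsCount, totalNumOfAttempts

    # For every observed day count take the 4th percentile of the feature: the lower curve.
    day_grid = np.unique(days)
    low_curve = np.array([np.percentile(values[days == d], 4) for d in day_grid])

    solver = nlSolver().fit(day_grid, low_curve)             # degree-4 polynomial fit
    # Invert the polynomial to turn a feature value into a "predicted number of days".
    X['predictedDaysByLevel'] = [solver.get_x_by_y(v, d) for v, d in zip(values, days)]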
Having thus obtained three more features, I normalized the original features (except the number of days the player actually spent in the game) by each of the new ones, so the total number of features roughly quadrupled. I also invented and added another two dozen features, among them, for example, these:
    ...
    X['positiviness'] = X.totalStarsCount * X.totalBonusScore * X.usefulBoosters
    X['winDesire'] = X.attemptsPerDay * X.numberOfBoostersUsed * X.attemptsOnTheHighestLevel
    X['positivinessPerWinDesire'] = X.positiviness / (X.winDesire + 1)
    ...
Further, since the features had multiplied, I did not generate quadratic ones; I selected the best features from those available, trained the ensemble and ended up in 60th place, rising from roughly 120th.
Sklearn's logistic regression was replaced with the stochastic implementation from vowpal wabbit (via a Python wrapper of my own writing; more about it another time); along the way I abandoned the idea of interval-based feature encoding and slightly improved the logreg result on cross-validation (a few home-made crutches were needed here to obtain the cross-prediction). And since vw was already involved, why not use its perceptron as well (the --nn flag), so another classifier was added to the ensemble.
Well, and since the vw perceptron was involved, why not also use sklearn's MLPClassifier.
After that, two more random-forest-based classifiers were added to the ensemble. One of them was trained on a sample with outliers removed per day according to the feature values (examples whose feature values fall below the 1st percentile or above the 99th within their day did not participate in training), the second simply on the newly selected features.
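A possible sketch of that per-day trimming (my own reconstruction; column names as in the task, y assumed to be a Series aligned with X):

    import pandas as pd

    def trim_outliers_per_day(X, y, feature_cols, day_col='numberOfDaysActuallyPlayed'):
        """Drop examples whose feature values fall outside the [1st, 99th] percentile
        range within their own day group."""
        keep = pd.Series(True, index=X.index)
        for col in feature_cols:
            grp = X.groupby(day_col)[col]
            low = grp.transform(lambda s: s.quantile(0.01))
            high = grp.transform(lambda s: s.quantile(0.99))
            keep &= (X[col] >= low) & (X[col] <= high)
        return X[keep], y[keep]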
In total, by the end of the competition the ensemble consisted of 7 classifiers. How do you set up a sane cross-validation scheme for all of this without too much suffering? The scheme is as follows (a small code sketch follows the finishing touches below):
- Each classifier lives in a separate folder in the project.
- Instead of plain cross-validation, each classifier makes an out-of-fold cross-prediction on the training set; the result goes into a file in its folder with a fixed suffix (to make these files easier to find).
- In each folder we also train the classifier on the entire training set and make a prediction on the test sample, which likewise goes into a file with a fixed suffix.
- In the directory above the classifier folders there is a common ensemble script that, for validation, loads the cross-predictions from all the required classifier folders, combines them according to the chosen rules into the ensemble result and computes the metric we care about (logloss).
- Once the ensemble has been validated, it only remains to load the test predictions from each folder and combine them by the same rules to obtain the ensemble forecast on the test sample.
- Add a few finishing touches.
Finishing touches:
- The data contained examples with identical feature sets but different values of the target variable. For such groups the predicted probability was set to the fraction of positive examples in the group (this is what minimizes the logloss for the group; you can write it out on a piece of paper and check).
- Instead of "add everything up and divide by the number of classifiers," each classifier's result is added with its own hand-picked weight and the sum is divided by the sum of the weights. The coefficient of one classifier is kept fixed (this pins down the scale); the other coefficients are tweaked as long as the cross-validation result keeps decreasing. An improvement of 0.0005-0.0007 is quickly found by hand (instead of doing it by hand you could also fit the weights with a regression).
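Put together, the scheme fits in a few lines per classifier; a minimal sketch assuming sklearn-style classifiers (folder and file names are illustrative):

    import numpy as np
    from sklearn.metrics import log_loss
    from sklearn.model_selection import cross_val_predict

    # Per-classifier step (done inside each "folder"): an out-of-fold cross-prediction on the
    # training set plus an ordinary prediction on the test set, each saved with a fixed suffix.
    def make_cross_prediction(clf, X_train, y_train, X_test, folder):
        oof = cross_val_predict(clf, X_train, y_train, cv=5, method='predict_proba')[:, 1]
        np.savetxt(folder + '/train_crosspred.csv', oof)
        clf.fit(X_train, y_train)
        np.savetxt(folder + '/test_pred.csv', clf.predict_proba(X_test)[:, 1])

    # Ensemble step (the script one directory above): load the saved predictions and blend
    # them with hand-picked weights, the first of which stays fixed to pin down the scale.
    def blend(pred_list, weights):
        w = np.asarray(weights, dtype=float)
        return np.column_stack(pred_list).dot(w) / w.sum()

    # Validate on the out-of-fold files, for example:
    #   ens = blend([oof_rf, oof_xgb, oof_vw], [1.0, 1.2, 0.7])
    #   print(log_loss(y_train, ens))
    # then apply blend() with the same weights to the test prediction files.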
The last day
On the last day, when the solution had already been assembled and submitted, my best result stood at 37th place on the public leaderboard, and my cross-validation logloss was 0.37906093. I learned that I had taken second place from a friend's message on Skype. I thought it was a joke, but when I opened the site I saw that it really was true.

"Useful" Tips
The word "useful" is deliberately put in quotes, because most of these tips appear more than once in every story about winning a machine learning competition:
- At the start, do not spend days drawing beautiful plots, computing statistics and doing other such things; make the first submission instead. Spend an hour or two on it and do not worry about your place in the table yet. This lets you beat laziness from the very start, check that you understood the task correctly, that the submission system on the organizers' side works, that the answer file is formed correctly, and so on.
- After the first submission, look at the data, statistics and plots and try to understand what the task is really about. For example, in this task I somehow first assumed that all the people the examples were built from had started playing on the very first day and then some of them dropped out along the way, and only after applying the careful-scrutiny method did I realize that this was not the case.
- Every change, every experiment and every idea should be checked separately, making copies of everything you can along the way. Ideally you should, of course, use a version control system, but sometimes you can get by without it. For example, if you work in an ipython notebook, make a copy of the notebook before running the next experiment. Disk space is much cheaper than the nerves spent on finding "exactly that" version which gave the desired result.
- Get yourself a board, a notebook or a program where you write down the ideas that come to mind. I, for example, use trello with three lists of cards: "New ideas", "Working ideas" and "Non-working ideas". A new idea goes into the first list, and after the experiment is run it moves either to the second or to the third.
- Cross-validate, and do not forget to save a prediction file on the test set for the current experiment. An equally good (if not better) habit is to also save files with out-of-fold cross-predictions on the training set; the benefits become obvious when you combine models into an ensemble.
- Trust the result of your cross-validation.
Thanks
Many thanks to Mail.Ru, as well as to Ilya Stytsenko and his team, for a wonderful contest and the presents. Many thanks to my friends and rivals in the competition (and beyond) for the support, the chance to talk things over, consult and exchange opinions; I think it was useful for all of us. And thanks to my girlfriend for her patience, and once again for her patience.