The Mail.Ru machine learning championship has recently ended.
Being new to machine learning, I managed to take 3rd place, and in this article I will try to share my experience of participating.
I have been taking part in competitive programming contests for a long time, including other Mail.Ru championships, which is how I learned about this one.
I was familiar with machine learning only at the level of university lab work. I had heard of a resource called Kaggle, but had never taken part in anything like it before. So this contest became a kind of challenge for me: recall what I was taught at university and try to put something together from that knowledge.
I didn't really count on a high place, but the prizes added good motivation.
The task was known shortly before the start: a binary classification problem. The choice of programming language was not large; I knew it was customary to use Python or R. I am only minimally familiar with both, but I chose R: everything you need comes out of the box, and there is no need to fuss with building or installing packages. Even though it is a strange language with a quirky garbage collector, and it crashes from time to time, I did not regret my choice.
Task
We are given data on players' activity in an online game over the last 2 weeks, in the form of 12 numerical features. The task is to determine whether a player will leave the game for at least a week or stay, or more precisely, to give the probability of this event.
→ Link to the full problem statement.
At first glance, the task is absolutely classic: there are probably plenty of these on Kaggle, so just take ready-made code and use it. A similar credit scoring task is even offered as a training exercise. As it turned out, it was not that simple.
What I found useful:
- Two computers: an i5-3210M 2.5GHz × 4 with 12GB RAM and an i3-4170 3.7GHz × 4 with 16GB RAM (there were 8GB, but I had to buy more)
- RStudio installed on each
- A bit of luck
Data overview
The training and test samples have no missing values, 25,000 rows each. That is about all there is to say about the data: I did almost no plots or other visualizations.
First attempts
I had decided in advance that I would start with logical classification algorithms: decision trees and decision forests, as well as random forest, bagging, and boosting on top of them.
Decision trees and random forests are not hard to program yourself, following Vorontsov's lectures, where the ID3 algorithm is described.
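For illustration, a toy version of such a hand-written tree might look like the sketch below. This is my own reconstruction in the spirit of ID3, adapted to numeric features with threshold splits, not the code actually used in the contest:

```r
# Toy decision tree: greedy recursive splitting by entropy gain (illustration only)
entropy <- function(y) {
  p <- mean(y)
  if (p == 0 || p == 1) return(0)
  -p * log2(p) - (1 - p) * log2(1 - p)
}

build_tree <- function(X, y, depth = 0, max_depth = 4, min_size = 20) {
  # Stop and return a leaf with the class probability
  if (depth >= max_depth || length(y) < min_size || entropy(y) == 0)
    return(list(leaf = TRUE, prob = mean(y)))

  # Exhaustive search for the best (feature, threshold) split
  best <- list(gain = 0)
  for (j in seq_len(ncol(X))) {
    for (t in unique(X[, j])) {
      left <- X[, j] <= t
      if (all(left) || !any(left)) next
      gain <- entropy(y) -
        (sum(left) * entropy(y[left]) + sum(!left) * entropy(y[!left])) / length(y)
      if (gain > best$gain) best <- list(gain = gain, j = j, t = t, left = left)
    }
  }
  if (best$gain == 0) return(list(leaf = TRUE, prob = mean(y)))

  list(leaf = FALSE, j = best$j, t = best$t,
       left  = build_tree(X[best$left, , drop = FALSE],  y[best$left],  depth + 1, max_depth, min_size),
       right = build_tree(X[!best$left, , drop = FALSE], y[!best$left], depth + 1, max_depth, min_size))
}

# Predict the leave probability for a single feature vector x
predict_tree <- function(node, x) {
  if (node$leaf) return(node$prob)
  if (x[node$j] <= node$t) predict_tree(node$left, x) else predict_tree(node$right, x)
}
```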
It soon became clear that you will not get far with hand-written algorithms, although writing them helped a lot in understanding how they work. Something ready-made was needed.
Xgboost
Xgboost is a library that implements gradient boosting. It is used by many Kaggle competition winners, so it was worth trying.
One of the training parameters is the number of trees (nrounds), but it is not immediately clear how many to specify. An alternative is to split the sample into two parts, training and validation, and to stop training as soon as adding more trees starts to worsen the validation error. I combined this with the bootstrap aggregating technique: split the sample randomly into training and validation parts 200 times (in my experiments the optimal ratio was 0.95 / 0.05) and run xgboost on each split. The final classifier is the vote (average) of all 200 base classifiers.
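A minimal sketch of that scheme, assuming X / y are the training matrix and 0/1 labels and X_test the test matrix (the learning rate, depth, and round limits below are placeholders, not the contest settings):

```r
library(xgboost)

n_models <- 200
preds <- matrix(0, nrow = nrow(X_test), ncol = n_models)

for (i in 1:n_models) {
  idx <- sample(nrow(X), size = floor(0.95 * nrow(X)))      # 0.95 / 0.05 split
  dtrain <- xgb.DMatrix(X[idx, ], label = y[idx])
  dvalid <- xgb.DMatrix(X[-idx, ], label = y[-idx])

  model <- xgb.train(
    params = list(objective = "binary:logistic", eval_metric = "logloss",
                  max_depth = 4, eta = 0.05),
    data = dtrain,
    nrounds = 2000,                       # upper bound; early stopping decides
    watchlist = list(valid = dvalid),
    early_stopping_rounds = 50,
    verbose = 0
  )
  preds[, i] <- predict(model, X_test)
}

final_pred <- rowMeans(preds)             # the "vote" of all 200 base classifiers
```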
This gave me a noticeably better result than random forest or AdaBoost.
Feature engineering
The next step was generating new features.
The simplest thing I could think of was to generate many non-linear combinations of the existing features and then remove the unnecessary ones, leaving only an optimal subset.
The new features are simply pairwise products and pairwise ratios of the features. Together with the original ones, this gave 144 features.
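Something along these lines (my reconstruction: with 12 original features, products and ratios over unordered pairs plus the originals give exactly 12 + 66 + 66 = 144; the small epsilon guarding against division by zero is my addition):

```r
# X is a data.frame with the 12 original numeric features
new_feats <- list()
for (i in 1:(ncol(X) - 1)) {
  for (j in (i + 1):ncol(X)) {
    new_feats[[paste0("mul_", i, "_", j)]] <- X[[i]] * X[[j]]            # pairwise product
    new_feats[[paste0("div_", i, "_", j)]] <- X[[i]] / (X[[j]] + 1e-9)   # pairwise ratio
  }
}
X_ext <- cbind(X, as.data.frame(new_feats))   # 12 + 66 + 66 = 144 columns
```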
In the same lectures, Vorontsov suggests the greedy Add-Del algorithm, which alternately removes and adds features. But because of the instability of the model (with different random seeds on the same data, the quality estimate varies a lot), this approach did not work.
Instead, I used a genetic algorithm. Generate an initial population of binary vectors indicating which features to keep and which to drop; new individuals appear through crossover and mutation of previous ones. Here I had to tune the various probabilities, the penalty for the number of features, and so on. After 4 to 6 generations and about 12 hours of computation, everything usually converged to a local minimum, but that local minimum already gave good scores. Xgboost is not very sensitive to uninformative features (unlike the neural network discussed later), which is one of the reasons why crossing two good feature sets also produces a good set.
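A simplified sketch of such a genetic search over feature masks. The population size, mutation rate, and penalty are placeholders, and cv_logloss is an assumed helper that scores a feature subset by cross-validated log loss:

```r
# Genetic search over binary masks of the 144 candidate features (sketch)
n_feat <- 144; pop_size <- 30; n_gen <- 6; mut_rate <- 0.02

population <- matrix(runif(pop_size * n_feat) < 0.5, pop_size, n_feat)

for (gen in 1:n_gen) {
  # Fitness: CV log loss plus a small penalty for the number of features
  fitness <- apply(population, 1, function(mask)
    cv_logloss(X[, mask, drop = FALSE], y) + 1e-4 * sum(mask))

  # Selection: keep the better half as parents
  parents <- population[order(fitness)[1:(pop_size / 2)], ]

  # Crossover: each child takes every gene from one of two random parents
  children <- t(apply(parents, 1, function(p) {
    mate <- parents[sample(nrow(parents), 1), ]
    ifelse(runif(n_feat) < 0.5, p, mate)
  }))

  # Mutation: flip a small fraction of bits
  flip <- matrix(runif(length(children)) < mut_rate, nrow(children))
  children <- xor(children, flip)

  population <- rbind(parents, children)
}

best_mask <- population[which.min(apply(population, 1, function(mask)
  cv_logloss(X[, mask, drop = FALSE], y))), ]
```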
As a result, 63 of the 144 features were selected.
LightGBM
Later, I switched to Microsoft's LightGBM library. It gave almost the same results as Xgboost but worked many times faster, and it also had additional training options. For example, the ability to limit not only the maximum depth of a tree (max_depth) but also the number of leaves (num_leaves) turned out to be useful. For me, num_leaves = 9 and max_depth = 4 were optimal.
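In R, a fit with those parameters might look roughly like this, assuming X_train / y_train / X_test hold the data (the learning rate and number of rounds are placeholders):

```r
library(lightgbm)

dtrain <- lgb.Dataset(as.matrix(X_train), label = y_train)
params <- list(objective = "binary",
               metric = "binary_logloss",
               num_leaves = 9,          # limit on the number of leaves
               max_depth = 4,           # limit on the tree depth
               learning_rate = 0.05)
model <- lgb.train(params, dtrain, nrounds = 500, verbose = -1)
pred <- predict(model, as.matrix(X_test))
```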
Neural network
After unsuccessful attempts with SVM, KNN, and random forest, I settled on a neural network, or rather a perceptron with one hidden layer, using the nnet package.
One of the first things I did was run something like this:
```r
library(caret)   # train(), trainControl() and mnLogLoss() come from caret; method='nnet' uses nnet

set.seed(2707)
trControl <- trainControl(method = 'cv', number = 5, classProbs = TRUE,
                          summaryFunction = mnLogLoss)
model <- train(X, Y, method = 'nnet', metric = 'logLoss', maxit = 1000,
               maximize = FALSE, trControl = trControl)
```
This is practically an example from the manual. Next, I took the arithmetic mean of this and what I got from LightGBM, and submitted the result to the server. I was very surprised when it threw me into first place, where I stayed for about a week.
Handling of special cases
Both the training and test samples contained identical feature vectors with different answers. For example, there were vectors that occurred 1423, 278, 357, and 110 times. There is probably nothing better than computing the probabilities for them separately, which is what I did, processing only those vectors that occurred more than 15 times.
The only question was whether or not to exclude these repeated vectors from training. I tried both options and concluded that excluding them makes the result slightly worse.
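A sketch of how such duplicates could be handled, assuming the training frame has a target column and the test frame already holds a model prediction pred (both column names are my placeholders):

```r
library(dplyr)

key_cols <- setdiff(names(train), "target")        # the feature columns

# Empirical leave probability for every frequently repeated vector
stats <- train %>%
  group_by(across(all_of(key_cols))) %>%
  summarise(n = n(), p = mean(target), .groups = "drop") %>%
  filter(n > 15)

# Override the model prediction for test rows that match a frequent vector
test_fixed <- test %>%
  left_join(stats, by = key_cols) %>%
  mutate(pred = ifelse(is.na(p), pred, p))
```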
In fact, I could have stopped there: that would have secured first place by a small margin. But it is easy to be smart in hindsight.
Ensemble of two models
It was worth coming up with a better aggregation function than the plain arithmetic or geometric mean. The idea is as follows (a sketch follows the list):
- Create a new sample based on the old one; the only difference is the answer column. The new answer is 0 or 1, depending on which of the two models gave the better result relative to the correct answer.
- Train logistic regression, SVM, boosting, or whatever on this sample. As it turned out, SVM was the one worth taking.
- From the output of this aggregating model we obtain the probabilities with which to trust each of the two original models (the boosting and the neural network). In effect, if we use a linear model, we get optimal coefficients instead of a plain average.
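A rough sketch of this blending, under assumed object names: p_gbm / p_nn are the two base models' predictions on the training set, p_gbm_test / p_nn_test their predictions on the test set, and X / y / X_test the usual data:

```r
library(e1071)

# Meta-target: TRUE if the neural net was closer to the truth than the boosting model
better_nn <- as.factor(abs(p_nn - y) < abs(p_gbm - y))

# Train the aggregating SVM on the original features
meta <- svm(X, better_nn, probability = TRUE)

# Per-object probability of trusting the neural net
sv_pred <- predict(meta, X_test, probability = TRUE)
w_nn <- attr(sv_pred, "probabilities")[, "TRUE"]

# Weighted combination instead of a plain average
final <- w_nn * p_nn_test + (1 - w_nn) * p_gbm_test
```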
Neural network + bootstrap
I could not leave things depending on a lucky choice of seed: I had simply gotten lucky, and it looked like obvious overfitting. So I spent the rest of the time trying to reproduce that result.
Being lucky with the seed means being lucky with the initial weights of the neurons. So training had to be run many times and the best run chosen. But how do you tell which run turned out better? I came up with the following (a sketch follows the list):
- Split the sample 200 times in a 0.9 / 0.1 ratio (training / validation).
- For each split, run training on the training subsample 20 times and keep the model that gave the best result on the validation part.
- The final model is the vote (average) of 200 models (importantly, not of all 200 × 20).
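A sketch of this procedure with nnet, assuming X / y / X_test as before; the hidden-layer size and decay are placeholders:

```r
library(nnet)

logloss <- function(y, p) {
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)                 # guard against log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

n_splits <- 200; n_restarts <- 20
test_preds <- matrix(0, nrow(X_test), n_splits)

for (s in 1:n_splits) {
  idx <- sample(nrow(X), floor(0.9 * nrow(X)))          # 0.9 / 0.1 split
  best_loss <- Inf; best_model <- NULL
  for (r in 1:n_restarts) {                             # 20 random restarts
    m <- nnet(X[idx, ], y[idx], size = 10, decay = 0.1,
              maxit = 1000, trace = FALSE)
    loss <- logloss(y[-idx], predict(m, X[-idx, ]))
    if (loss < best_loss) { best_loss <- loss; best_model <- m }
  }
  test_preds[, s] <- predict(best_model, X_test)        # keep only the best of 20
}
final_pred <- rowMeans(test_preds)                      # average of the 200 selected models
```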
This way I got very close to the desired result, but unfortunately it fell just a little short of winning.
Unrealized ideas
- Use a graphics card to speed up training. The official version of LightGBM does not support this, but there are forks that do.
- Use the two computers as a computing cluster; the doParallel package supports this (a sketch follows the list). In practice I just connected to the second computer over RDP and started everything manually.
- Theoretically, by spending a few submissions, it was possible to compute more accurate probabilities for the repeated vectors with different answers (in other words, to extract a little more data from the hidden test subsample).
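For the cluster idea, a doParallel setup might have looked roughly like this (the host name and the fit_one_model helper are placeholders; the remote machine needs R installed and passwordless SSH access):

```r
library(doParallel)

# Two local workers plus two workers on a second machine
cl <- makePSOCKcluster(c("localhost", "localhost", "192.168.0.2", "192.168.0.2"))
registerDoParallel(cl)

# e.g. run the 200 bootstrap fits in parallel across both computers
results <- foreach(i = 1:200, .packages = "lightgbm") %dopar% fit_one_model(i)

stopCluster(cl)
```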
Thank you for your attention.
→ Code on GitHub