
How to get to the top on Kaggle, or Matrixnet at home

I want to share my experience of taking part in a Kaggle competition and the machine learning algorithms that helped me reach 18th place out of 1604 in the Avazu mobile advertising CTR (click-through rate) prediction competition. Along the way I tried to recreate the original Matrixnet algorithm, tested several variants of logistic regression and worked on the features. All of this is described below, plus I attach the full code so that you can see how everything works.

The story is divided into the following sections:
1. Competition conditions;
2. Creating new features;
3. Logistic regression - the charm of an adaptive gradient;
4. Matrixnet - recreating the full algorithm;
5. Acceleration of machine learning in Python.

1. Competition conditions


Data provided:

The data itself can be downloaded here.

The evaluation criterion was the negative log-likelihood loss (NLL):
NLL = -\frac{1}{N}\sum_{i=1}^{N}\left(y_i \log p_i + (1 - y_i)\log(1 - p_i)\right)


Where N is the number of records, y is the value of the click variable, p is the probability that the event was a click (“1”).

An important property of this error function is that if p is computed with the sigmoid function, then the partial derivative (hereafter, the gradient) is simply (p - y).
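For reference, a short sketch of this derivation, assuming the prediction is the sigmoid of the linear score w^T x:

p_i = \sigma(w^T x_i) = \frac{1}{1 + e^{-w^T x_i}}

\frac{\partial NLL_i}{\partial w_j} = -\left(\frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i}\right) p_i (1 - p_i)\, x_{ij} = (p_i - y_i)\, x_{ij}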



2. Creating new features


To start, let's look at the source data we have to work with:

Sample data


There is not much to work with here: we have no historical data on users and, most importantly, we know nothing about the ad that is shown each time. And that is an important factor (you don't click on every ad you see, do you?).

What we create new:
  1. Polynomial features of the 2nd degree (the 3rd slows down the learning process too much);
  2. user_id. I tested several options; the best one turned out to be device_id + device_ip + device_model + C14 (presumably geography at the city / region level). And yes, device_id is not the same as user_id;
  3. Frequency of contact with advertising. Users who see an ad for the 100th time in a day usually react to it differently than those who see it for the 1st time. So we count the number of the impression for each user_id.

Example of forming user id
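Below is a minimal sketch of this idea (not the repository code), assuming a pandas DataFrame df with the competition columns device_id, device_ip, device_model and C14:

```python
import pandas as pd

def add_user_features(df: pd.DataFrame) -> pd.DataFrame:
    # Surrogate user_id glued together from device-level fields.
    df["user_id"] = (df["device_id"] + "_" + df["device_ip"] + "_"
                     + df["device_model"] + "_" + df["C14"].astype(str))
    # Frequency of contact: the ordinal number of this impression for the user.
    df["impression_number"] = df.groupby("user_id").cumcount() + 1
    return df
```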


I tried various ideas; the ones above gave the best result. They were based mainly on my experience in advertising and on what could be pulled out of the Avazu data.

We also make a few small data transformations, primarily aimed at getting rid of redundant information:

All changes were tested with logistic regression: they either improved the score or sped up the algorithm without worsening the result.

3. Logistic Regression - the charm of an adaptive gradient


Logistic regression is a popular classification algorithm. There are 2 main reasons for this popularity:
1. It is easy to understand and to implement the algorithm;



2. Speed and adaptability of prediction on large data thanks to stochastic gradient descent (SGD).
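A minimal sketch of the per-record SGD update for logistic regression with the NLL loss (illustrative, not the exact competition code; w, x, y and alpha are hypothetical names):

```python
import math

def sgd_update(w, x, y, alpha=0.1):
    """One stochastic gradient step of logistic regression.

    w: dict feature index -> weight; x: list of active (one-hot) feature
    indices of the current record; y: 0/1 click label; alpha: learning rate.
    """
    # p = sigmoid(w^T x); with one-hot features the dot product is just
    # the sum of the weights of the active features.
    wtx = sum(w.get(j, 0.0) for j in x)
    p = 1.0 / (1.0 + math.exp(-max(min(wtx, 35.0), -35.0)))
    # For the NLL loss the gradient over each active weight is (p - y).
    for j in x:
        w[j] = w.get(j, 0.0) - alpha * (p - y)
    return p
```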



Using the Avazu data as an example, let's see how, due to the stochastic gradient, we do not always “go” to exactly the minimum:



3.1. Adaptive gradient

However, if we reduce the learning rate of the algorithm over time, we will arrive at a more accurate solution, since the gradient will no longer react so strongly to atypical data. The authors of Adaptive Gradient (AdaGrad) suggest using the sum of squares of all previous gradients to successively reduce the per-feature learning rate:
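A sketch of the AdaGrad version of the previous update (again illustrative); g2 accumulates the squared gradients of each feature:

```python
import math

def adagrad_update(w, g2, x, y, alpha=0.1, eps=1.0):
    """One AdaGrad step of logistic regression.

    g2: dict feature index -> running sum of squared gradients.
    The effective learning rate alpha / sqrt(eps + g2[j]) shrinks as a
    feature accumulates gradient, so frequent features get more cautious
    updates over time.
    """
    wtx = sum(w.get(j, 0.0) for j in x)
    p = 1.0 / (1.0 + math.exp(-max(min(wtx, 35.0), -35.0)))
    g = p - y  # gradient for every active (one-hot) feature
    for j in x:
        g2[j] = g2.get(j, 0.0) + g * g
        w[j] = w.get(j, 0.0) - alpha * g / math.sqrt(eps + g2[j])
    return p
```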



Thus, we obtain the useful properties of the algorithm:

The adaptive gradient begins to learn in the same way as the usual logistic regression, but then comes to a lower minimum:



In fact, if the usual stochastic gradient descent is run over the same data many times, it can approach AdaGrad, but it takes more iterations. In my model I used a combination of these techniques: 5 iterations of the usual algorithm and then one with AdaGrad. Here are the results:



3.2. Transformation of data for logistic regression

In order for the logistic regression algorithm to work with the data (which come as text values), they need to be converted into numeric values. I used one-hot encoding: a text feature is converted into an N x M matrix of "0" and "1" values, where N is the number of records and M is the number of unique values of that feature. The main reasons are that this preserves the maximum of information, while feature hashing makes it possible to work quickly with spaces of millions of features within a logistic regression.

Example one-hot encoding
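A minimal sketch of the hashed version of one-hot encoding (illustrative; the space size D and the way columns are stringified are my assumptions, not the competition settings):

```python
D = 2 ** 24  # size of the hashed feature space

def hash_features(row: dict) -> list:
    """Map 'column=value' strings to indices in a fixed-size space.

    Each record becomes a short list of active indices; the implicit
    one-hot matrix has D columns but is never materialised. Note that
    Python's built-in hash() is salted per process, so for reproducible
    runs a stable hash (e.g. from hashlib) is preferable.
    """
    return [hash(f"{col}={val}") % D for col, val in row.items()]

# Example: active indices for one ad impression
indices = hash_features({"device_model": "8a4875bd", "banner_pos": "1", "C14": "15706"})
```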



4. Matrixnet - building it at home


After I got reasonably good results with logistic regression, further improvement was needed. I was interested in understanding how Yandex's MatrixNet works; besides, I expected good results from it (after all, predicting CTR for ads in Yandex search results is one of its tasks). If you collect all the publicly available information about MatrixNet (see the full list of links at the end of the article), you can recreate its algorithm. I do not claim that this is exactly how the algorithm works inside Yandex, I do not know that, but in my code and in this article I used all the tricks and hints of them that I could find.

Let's go through, in order, what Matrixnet consists of:
  1. The base element is the Classification and Regression Tree (CART);
  2. The main algorithm is the Gradient Boosting Machine (GBM);
  3. The refinement of the main algorithm is the Stochastic Gradient Boosting Machine (SGBM).

4.1. Classification and Regression Tree (CART)

This is one of the classic decision tree algorithms, which has already been covered on Habr (for example, here and here). Therefore, I will not go into details and will just recall the principle of operation and the basic terms.



Thus, the decision tree has the following parameters that define the algorithm:

4.1.1. How to choose a feature for the split condition

At each level, we need to choose a feature for the split condition that will divide the plane in such a way that we make more accurate predictions.



Thus, we go through all the features and possible splits and choose the points where the NLL value after the split is lower (in the example above it is, of course, x2). To choose a split, other functions are usually used, such as information gain and Gini impurity, but in our case we take NLL, since that is what the task asks us to minimize (see the task description in section 1).
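A sketch of that search, assuming binary 0/1 features and leaf predictions equal to the mean click rate in each half (illustrative, not the repository code):

```python
import numpy as np

def nll(y, p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def best_split(X, y):
    """Return (feature index, NLL) of the split with the lowest total NLL.

    X: (N, M) binary feature matrix, y: (N,) 0/1 click labels.
    Each candidate split predicts the mean of y in its left/right leaf.
    """
    best_j, best_score = None, np.inf
    for j in range(X.shape[1]):
        mask = X[:, j] == 1
        if mask.all() or not mask.any():
            continue  # the split does not separate anything
        p = np.where(mask, y[mask].mean(), y[~mask].mean())
        score = nll(y, p)
        if score < best_score:
            best_j, best_score = j, score
    return best_j, best_score
```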

4.1.2. CART Regularization

In the case of CART, regularization is usually needed so that we do not make overly confident predictions where we are not actually sure. Yandex suggests adjusting the leaf prediction formula as follows:



Where N is the number of values in the leaf and lambda is the regularization parameter (the Matrixnet authors recommend 100, but you should test it for each task separately). Thus, the fewer values there are in a leaf, the less confident our algorithm will be in the predicted value.
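Since the formula above was an image, here is a sketch of one common shrinkage form that matches this description; the exact Yandex formula may differ, so treat the denominator as my assumption:

```python
def leaf_value(y_in_leaf, lam=100.0):
    """Regularized leaf prediction: sum(y) / (N + lambda) (assumed form).

    With few observations the denominator is dominated by lambda, so the
    prediction is pulled towards 0 and the tree is less confident.
    """
    return sum(y_in_leaf) / (len(y_in_leaf) + lam)
```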

4.1.3. Oblivious Trees

In Matrixnet, instead of the classic approach, where after a split on x1 the next level of the tree cannot split the data on that feature again, so-called oblivious trees are used, which can split values within the same level on the same feature (as if "forgetting" that it has already been split on).



The use of this type of tree is, in my opinion, justified primarily in cases where the data has 1-2 features for which narrower splits give better results than splits on features that have not been used yet.
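A sketch of how an oblivious tree can be evaluated: since every level shares a single (feature, threshold) condition, a leaf is addressed by the bit mask of the answers to those conditions (illustrative, not the repository code):

```python
def oblivious_predict(x, level_splits, leaf_values):
    """Evaluate an oblivious tree of depth d.

    level_splits: list of d (feature index, threshold) pairs, one per level.
    leaf_values: list of 2**d leaf predictions.
    The answers to the d conditions form the binary index of a leaf.
    """
    idx = 0
    for feature, threshold in level_splits:
        idx = (idx << 1) | int(x[feature] > threshold)
    return leaf_values[idx]
```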

4.2. Gradient Boosting Machine

The Gradient Boosting Machine (GBM) uses a forest of short trees, where each subsequent tree does not try to guess the target value itself but tries to improve the forecast of the previous trees. Consider a simple example with regression and trees of height 1 (only 1 split is allowed within each tree).



In essence, each tree helps to optimize the quadratic error function:



The main advantage of GBM over CART is a lower probability of overfitting, since the forecasts are based on leaves with a larger number of values. In MatrixNet's GBM the "height" of a tree depends on the current iteration: it starts at 1, increases by 1 every 10 iterations, but never exceeds 6. This approach makes it possible to speed up the algorithm considerably without greatly worsening the result in the first iterations. I tested this option, but settled on the variant where the transition to the next level happens only after the possibilities of the previous one have been exhausted.
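A minimal sketch of the regression GBM with height-1 trees described above (illustrative; the stump predicts the mean residual on each side of its single split):

```python
import numpy as np

def fit_stump(X, r):
    """Height-1 tree: best (feature, threshold) split by squared error on residuals r."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            pred = np.where(left, r[left].mean(), r[~left].mean())
            err = ((r - pred) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, r[left].mean(), r[~left].mean())
    return best[1:]  # (feature, threshold, left value, right value)

def gbm_fit_predict(X, y, n_trees=50, lr=0.1):
    """Each stump is fit to the residuals of the current ensemble forecast."""
    pred = np.full(len(y), y.mean())
    for _ in range(n_trees):
        j, t, left_val, right_val = fit_stump(X, y - pred)
        pred += lr * np.where(X[:, j] <= t, left_val, right_val)
    return pred
```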

4.2.1. GBM for classification

When working with classification, each subsequent tree should improve the prediction of the previous ones, but this should be done so that the trees work on a single task: classification with the sigmoid function. In this case it is convenient to state the optimization problem in the same way as in logistic regression, namely:



An interesting solution of Yandex is that the prediction of a logistic regression is used as the initial prediction p0, and the product of the weights and features (wTx) is used as one of the features.
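A sketch of how this can be wired together under my reading of that description: the ensemble works in log-odds space, starts from the logistic regression score, and each tree is fit to the NLL gradient (y - p). fit_stump is the regression stump from the previous sketch; all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -35, 35)))

def gbm_classify(X, y, wtx, n_trees=30, lr=0.1):
    """Boosting for classification on top of a logistic regression.

    wtx: logistic regression scores w^T x, used both as the initial
    prediction F0 and (per the article) appended as an extra feature.
    """
    X = np.column_stack([X, wtx])   # wTx as one more feature
    F = wtx.copy()                  # F0: start from the logistic regression
    for _ in range(n_trees):
        p = sigmoid(F)
        j, t, left_val, right_val = fit_stump(X, y - p)  # fit the gradient
        F += lr * np.where(X[:, j] <= t, left_val, right_val)
    return sigmoid(F)
```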

4.3. Stochastic GBM

Stochastic GBM differs from the ordinary GBM in that each tree is built on a sample of the data (10-50%). This kills two birds with one stone: it speeds up the algorithm (otherwise we would have to compute the result for all 40.4 million training records) and, to a large extent, it also relieves the problem of overfitting.
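Relative to the previous sketches the change is one step per iteration: each tree only sees a random subset of the rows (illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_rows(n_rows, fraction=0.3):
    """Random subset of row indices on which the next tree is trained."""
    return rng.choice(n_rows, size=int(n_rows * fraction), replace=False)

# inside the boosting loop:
#   idx = sample_rows(len(y))
#   j, t, left_val, right_val = fit_stump(X[idx], (y - p)[idx])
```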
Final result:



As you can see, the biggest gains nevertheless come from working with the data rather than from the algorithms themselves. Although with plain logistic regression it would have been hard to climb above roughly the 30th place, where the fight is over every ten-thousandth of the score.

5. Attempts to accelerate machine learning in Python


This was my first project in which I implemented machine learning algorithms myself: the code I used to make the predictions does not rely on ready-made third-party solutions, and all algorithms and data manipulations are done in the open, which allowed me to better understand the essence of the tasks and the principles behind these tools. However, I did draw on existing work: the logistic regression is largely based on Abhishek's code on Kaggle, and Matrixnet borrows a small part of CART from Stephen Marsland's code for the book Machine Learning: An Algorithmic Perspective.

In terms of implementation, I started experimenting with the task in R, but quickly abandoned it, since it is almost impossible to work with big data in it. Python was chosen for its simple syntax and the large number of machine learning libraries.

The main problem with CPython is that it is VERY slow (although much faster than R). However, there are many options for speeding it up, and in the end I used the following:

I did not get as far as Cython, since I was trying to get the speed-up with minimal effort, but I think next time it will be easier to switch to the GPGPU with Theano / Numba.

The complete code of all data transformations and of the learning algorithms themselves is here. Disclaimer: the code is not optimized; it is intended only for studying the algorithms themselves.

If you find any inaccuracies or errors in the article or the code, write to me in a private message.

Links to sources used for the article and when working on algorithms:

Source: https://habr.com/ru/post/254151/

