I want to share my experience of participating in a Kaggle competition and the machine learning algorithms that took me to 18th place out of 1604 teams in the Avazu mobile CTR (click-through rate) prediction competition. Along the way I tried to recreate the original MatrixNet algorithm, tested several variants of logistic regression, and worked on the features. All of this is described below, and I attach the full code so you can see how everything works.
The story is divided into the following sections:
1. Competition conditions;
2. Creating new features;
3. Logistic regression - the charm of an adaptive gradient;
4. MatrixNet - recreating the full algorithm;
5. Attempts to accelerate machine learning in Python.
1. Competition conditions
Data provided:
- 40.4 million records for training (10 days of Avazu advertising);
- 4.5 million records for testing (1 day).
The data itself can be downloaded here.
The stated evaluation criterion was the negative log-likelihood (Negative Likelihood Loss, NLL):
$$NLL = -\frac{1}{N}\sum_{i=1}^{N}\Big( y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big)$$
where N is the number of records, y is the value of the click variable, and p is the predicted probability that the event was a click ("1").
An important property of this error function: if p is produced by a sigmoid function, then its partial derivative (hereafter, the gradient) reduces to (p - y).
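To make this concrete, here is a small sketch (not the competition code) of the loss and of the (p - y) gradient property, assuming p = sigmoid(w·x):

```python
import numpy as np

def nll(y, p, eps=1e-15):
    """Average negative log-likelihood over N records."""
    p = np.clip(p, eps, 1 - eps)            # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, x, y):
    """For p = sigmoid(w.x), the partial derivative dNLL/dw_j is (p - y) * x_j."""
    p = sigmoid(np.dot(w, x))
    return (p - y) * x
```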

2. Creating new features
To start, let's look at the source data we have to work with:
- click - "0" means there was no click, "1" means there was a click; this is the prediction target;
- hour - the time and date of the ad impression;
- banner_pos - the position of the banner (presumably the first screen, the second, etc.);
- site_* and app_* characteristics - information about where the ad was shown;
- device_* characteristics - information about the device the ad was shown on;
- C1, C14-C21 - anonymized characteristics (presumably including the geolocation of the impression, the time zone, and possibly other information).
This is not much to work with: we have no historical data on users and, most importantly, we know nothing about the ad that is shown each time. And that is an important component (after all, you don't click on every ad you see, do you?).
The new features we create:
- Polynomial features of the 2nd degree (the 3rd degree slows training down too much);
- user_id. I tested several options; the best-working combination is device_id + device_ip + device_model + C14 (presumably geography at the city/region level). And yes, device_id is not the same thing as user_id;
- Frequency of contact with the ad. Users who see an ad for the 100th time in a day react to it differently than those seeing it for the 1st time, so we count the impression number for each user_id (see the sketch below).
I tried different ideas for forming the user_id; the one described above gave the best result. When building these features I relied mainly on my experience in advertising and on what could be pulled out of the Avazu data.
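A minimal sketch of how these two features could be built while streaming over the CSV (the column names are those of the Avazu file; the concatenation is exactly the one described above):

```python
import csv
from collections import defaultdict

def make_user_id(row):
    # user_id = device_id + device_ip + device_model + C14
    return row['device_id'] + row['device_ip'] + row['device_model'] + row['C14']

def add_frequency(path):
    """Yield rows with an extra 'user_count' column: how many impressions
    this user_id has accumulated so far (1 for the first one seen)."""
    seen = defaultdict(int)
    with open(path) as f:
        for row in csv.DictReader(f):
            uid = make_user_id(row)
            seen[uid] += 1
            row['user_count'] = str(seen[uid])
            yield row
```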
We also make small data transformations, aimed primarily at getting rid of redundant information:
- hour - keep only the hour, throw away the date;
- C1 - I assume the time zone was hidden behind it, so after its first 2 digits I merge it with the hour column;
- C15 and C16 - merge them, since the banner width and height are easily guessed behind them, and there is no point in keeping extra features;
- site_* and app_* - transform into placement_*, since a banner is clearly shown either on a site or in an app, and the remaining values are simply an encoded NULL that carries no additional information;
- Remove all values that never occur in the test data; this helps reduce overfitting.
All changes were tested with logistic regression: each either improved the result or sped up the algorithm without making the result worse.
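A rough sketch of some of these transformations on a single row (the "null" hash for site_id is a placeholder here, not the actual value from the dataset):

```python
NULL_SITE_ID = 'null_site_hash'   # placeholder for the hash Avazu uses when there is no site

def transform(row):
    # hour comes as YYMMDDHH: keep only the hour, throw away the date
    row['hour'] = row['hour'][-2:]
    # site_* / app_* -> placement_*: a banner is shown either on a site or in an app
    prefix = 'app' if row['site_id'] == NULL_SITE_ID else 'site'
    row['placement_id'] = row[prefix + '_id']
    row['placement_domain'] = row[prefix + '_domain']
    row['placement_category'] = row[prefix + '_category']
    # C15 and C16 (banner width and height) merged into a single feature
    row['banner_size'] = row['C15'] + 'x' + row['C16']
    return row
```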
3. Logistic Regression - the charm of an adaptive gradient
Logistic regression is a popular classification algorithm. There are 2 main reasons for this popularity:
1. The algorithm is easy to understand and implement;
2. It predicts quickly and scales to large data thanks to stochastic gradient descent (SGD).

Using the Avazu data as an example, let's see how, because of the stochastic gradient, we do not always move exactly toward the minimum:

3.1. Adaptive gradient
However, if we reduce the learning rate of the algorithm over time, we will arrive at a more accurate solution, since the gradient will no longer react so strongly to atypical data. The authors of Adaptive Gradient (AdaGrad) suggest using the sum of all previous gradients to progressively reduce the learning rate:

$$w_{t+1,i} = w_{t,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\, g_{t,i}$$

where g_{t,i} is the gradient for feature i at step t.
This gives the algorithm some useful properties:
- A smoother descent to the minimum (the learning rate decreases over time);
- The learning rate alpha is computed individually for each feature (which is very important for sparse data, where most features are very rare);
- The computation of alpha takes into account how much the parameter (w) of the feature has already changed: the more it changed before, the less it will change in the future.
The adaptive gradient starts learning the same way as ordinary logistic regression, but then reaches a lower minimum:

In fact, if ordinary stochastic gradient descent is run over the same data many times, it can approach AdaGrad, but it takes more iterations. In my model I used a combination of the two techniques: 5 iterations of the ordinary algorithm and then one with AdaGrad. Here are the results:
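A compact sketch of the logistic regression update with and without AdaGrad, on hashed binary features (in the spirit of the approach above, not the competition code itself):

```python
import math
from collections import defaultdict

class LogisticSGD(object):
    def __init__(self, alpha=0.1, adagrad=False):
        self.alpha = alpha             # base learning rate
        self.adagrad = adagrad         # plain SGD vs. AdaGrad update
        self.w = defaultdict(float)    # one weight per hashed feature
        self.g2 = defaultdict(float)   # accumulated squared gradients per feature

    def predict(self, features):
        z = sum(self.w[f] for f in features)
        z = max(min(z, 35.0), -35.0)   # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, y):
        p = self.predict(features)
        g = p - y                      # gradient for every active binary feature
        for f in features:
            if self.adagrad:
                self.g2[f] += g * g
                self.w[f] -= self.alpha / math.sqrt(self.g2[f]) * g
            else:
                self.w[f] -= self.alpha * g
        return p
```

To mimic the combination described above, run 5 passes over the data with adagrad=False and a final pass with adagrad=True.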

3.2. Transformation of data for logistic regression
For the logistic regression algorithm to work with the data (which come as text values), they have to be converted to numeric values. I used one-hot encoding: textual features are converted into an N x M matrix of "0"s and "1"s, where N is the number of records and M is the number of unique values of the feature. The main reasons: it preserves maximum information, and feature hashing makes it possible to work with spaces of millions of features quickly within logistic regression.
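In practice the N x M matrix is never built explicitly; a sketch of the hashing-trick version, which produces the feature indices consumed by the model sketched above:

```python
D = 2 ** 24    # size of the hashed feature space

def hash_row(row):
    """Turn every 'column=value' pair into the index of a single '1'
    in an implicit sparse vector of length D."""
    # note: the built-in hash() is randomized between runs; a stable hash is better in practice
    return [hash(col + '=' + val) % D for col, val in row.items() if col != 'click']

# usage with the sketch above: model.update(hash_row(row), int(row['click']))
```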
4. MatrixNet - building it at home
After I got reasonably good logistic regression results, further improvement had to come from somewhere else. I was interested in understanding how Yandex's MatrixNet works; besides, I expected good results from it (after all, one of its tasks inside Yandex is predicting the CTR of ads in search results). If you collect all the publicly available information about MatrixNet (see the full list of links at the end of the article), you can recreate its algorithm. I do not claim that this is exactly how the algorithm works inside Yandex, I do not know that, but in my code and in this article I used all the "tricks" and hints of them that I could find.
Let's go through, in order, what MatrixNet consists of:
- The basic element is the Classification and Regression Tree (CART);
- The main algorithm is the Gradient Boosting Machine (GBM);
- The refinement of the main algorithm is the Stochastic Gradient Boosting Machine (SGBM).
4.1. Classification and Regression Tree (CART)
This is one of the classic decision tree algorithms, which has already been covered on Habr (for example, here and here). Therefore, I will not go into details; I will just recall the principle of operation and the basic terms.

So a decision tree is defined by the following parameters:
- The split conditions at the nodes (x_1 ≥ 0.5);
- The "height" of the tree (the number of levels with conditions; in the example above there are 2);
- The prediction rule p (in the example above, the expected value is used).
4.1.1. How to choose a feature for the split condition
At each level, we need to choose a feature for the split condition that divides the plane in such a way that we make more accurate predictions.

So we go through all the features and all possible splits and choose the points where the NLL after the split is lowest (in the example above it is, of course, x2). Other functions are commonly used to choose the split, such as information gain and Gini impurity, but in our case we use NLL, since that is what the task asks us to minimize (see the task description in section 1).
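A naive sketch of this search (quadratic in the data size; a real implementation bins the candidate thresholds and vectorizes the loop):

```python
import numpy as np

def nll_of_leaf(y_leaf, eps=1e-15):
    """NLL of a leaf that predicts its own mean click rate for every record in it."""
    y_leaf = np.asarray(y_leaf, dtype=float)
    p = np.clip(y_leaf.mean(), eps, 1 - eps)
    return -np.sum(y_leaf * np.log(p) + (1 - y_leaf) * np.log(1 - p))

def best_split(X, y, candidates):
    """Scan every (feature, threshold) pair and keep the one with the lowest total
    NLL after the split. X is a list of dicts; candidates maps a feature name to
    the thresholds to try."""
    best = (None, None, float('inf'))
    for feature, thresholds in candidates.items():
        for t in thresholds:
            left = [yi for xi, yi in zip(X, y) if xi[feature] < t]
            right = [yi for xi, yi in zip(X, y) if xi[feature] >= t]
            if not left or not right:
                continue
            loss = nll_of_leaf(left) + nll_of_leaf(right)
            if loss < best[2]:
                best = (feature, t, loss)
    return best
```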
4.1.2. CART Regularization
For CART, regularization is usually needed so as not to make overly confident predictions where we are not really sure. Yandex suggests adjusting the prediction formula as follows:

Where
N is the number of values ​​in the sheet, and lambda is the regularization parameter (the McTricksnet experts recommend 100, but you need to test for each task separately). Thus, the smaller the values ​​in the sheet, the less our algorithm will be confident in the predicted value.
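The exact formula was in the original illustration; one plausible reading that matches the description (N and lambda in the denominator, lambda around 100) is to shrink the leaf value toward zero when the leaf is small:

```python
def regularized_leaf_value(y_in_leaf, lam=100.0):
    # A plausible reconstruction, not the verified MatrixNet formula:
    # the fewer records in the leaf, the closer the prediction is pulled to 0.
    return sum(y_in_leaf) / (len(y_in_leaf) + lam)
```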
4.1.3. Oblivious Trees
In MatrixNet, instead of the classical approach (where, after a split on x1, the next level of the tree can no longer split the data on that feature), so-called oblivious trees are used: they may split values on the same feature within a single level (as if "forgetting" that it has already been split on).

In my opinion, using this type of tree is justified primarily when the data contain 1-2 features for which narrower splits give better results than splits on features that have not been used yet.
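A sketch of why such trees are compact and cheap to evaluate: a tree of depth d is fully described by d (feature, threshold) pairs, and a record is routed to a leaf by the d-bit code of its split outcomes:

```python
def oblivious_predict(x, splits, leaf_values):
    """splits is a list of (feature, threshold) pairs, one per level;
    leaf_values holds 2 ** len(splits) entries."""
    index = 0
    for feature, threshold in splits:
        index = (index << 1) | (1 if x[feature] >= threshold else 0)
    return leaf_values[index]
```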
4.2. Gradient Boosting Machine
The Gradient Boosting Machine (GBM) uses a forest of short trees in which each subsequent tree does not try to guess the target value directly but tries to improve the forecast of the previous trees. Consider a simple example with regression and trees of height 1 (within each tree we can make only 1 split).

In essence, each tree helps to optimize the quadratic error function:

The main advantage of GBM over CART is a lower probability of overfitting, since predictions are based on leaves with a larger number of values. In MatrixNet's GBM, the "height" of a tree depends on the current iteration: it starts at 1 and increases by 1 every 10 iterations, but never exceeds 6. This approach speeds up the algorithm considerably without greatly worsening the result on the first iterations. I tested this variant, but settled on the one where the transition to the next level happens only after the possibilities of the previous one have been exhausted.
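A minimal sketch of the regression case with stumps; fit_stump and predict_stump are assumed helpers around a single CART split (for example, built on best_split above):

```python
def fit_gbm(X, y, n_trees=50, learning_rate=0.1):
    """Each new stump is fitted to the residuals of the current forest,
    which is exactly the quadratic-loss view described above."""
    prediction = [0.0] * len(y)
    forest = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        stump = fit_stump(X, residuals)    # assumed helper: one split, mean value per leaf
        forest.append(stump)
        prediction = [pi + learning_rate * predict_stump(stump, xi)
                      for pi, xi in zip(prediction, X)]
    return forest
```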
4.2.1. GBM for classification
When working on classification, each subsequent tree should improve the prediction of the previous ones, but it should do so in such a way that all the trees work on a single task: classification with the sigmoid function. In that case it is convenient to state the optimization problem the same way as in logistic regression, namely:

An interesting solution from Yandex is that the logistic regression prediction is used as the initial prediction p0, and the product of the weights and the features (wTx) is used as one of the features.
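A sketch of that scheme under the same assumptions as above: the boosting runs in log-odds space, starts from the logistic regression score wTx, and each tree fits the sigmoid-loss gradient (y - p); fit_stump and predict_stump are the same assumed helpers:

```python
import math

def boost_from_logreg(X, y, wTx, n_trees=30, learning_rate=0.1):
    score = list(wTx)                                   # initial prediction: p0 = sigmoid(wTx)
    forest = []
    for _ in range(n_trees):
        p = [1.0 / (1.0 + math.exp(-s)) for s in score]
        residuals = [yi - pi for yi, pi in zip(y, p)]   # gradient of the NLL w.r.t. the score
        tree = fit_stump(X, residuals)                  # assumed helper (see the regression sketch above)
        forest.append(tree)
        score = [s + learning_rate * predict_stump(tree, xi)
                 for s, xi in zip(score, X)]
    return forest
```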
4.3. Stochastic GBM
Stochastic GBM differs from ordinary GBM in that each tree is fitted on a sample of the data (10-50%). This kills two birds with one stone: it speeds up the algorithm (otherwise we would have to compute the result for all 40.4 million training records), and it also largely removes the problem of overfitting.
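The sampling itself is one line per tree; a sketch with a 30% sample:

```python
import random

def sample_rows(X, y, fraction=0.3):
    """Pick a random subset (10-50% in the article, 30% here) to fit the next tree on."""
    idx = random.sample(range(len(y)), int(len(y) * fraction))
    return [X[i] for i in idx], [y[i] for i in idx]
```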
Final result:

As you can see, the biggest gain still comes from working with the data rather than from the algorithms themselves, although with plain logistic regression it would have been difficult to climb above 30th place, where the score is decided at the fourth decimal place.
5. Attempts to accelerate machine learning in Python
This was my first project in which I implemented machine learning algorithms myself: the code I used to make predictions relies on no ready-made third-party solutions, and all algorithms and data manipulations happen out in the open, which let me better understand the essence of the tasks and the principles behind these tools. I did, however, lean on existing work: the logistic regression draws heavily on Abhishek's code on Kaggle, and MatrixNet borrows a small part of CART from Stephen Marsland's code for the book Machine Learning: An Algorithmic Perspective.
From the implementation point of view, I started experimenting with the task in R, but quickly abandoned it, since working with big data in it is nearly impossible. Python was chosen for its simple syntax and the large number of machine learning libraries.
The main problem with CPython is that it is VERY slow (albeit much faster than R). However, there are many ways to speed it up, and in the end I used the following:
- PyPy is a JIT compiler that speeds standard CPython up by roughly 20x; the main problem is that there are practically no computation libraries for it (NumPyPy is still in development), so everything has to be written without them. It is a perfect fit for implementing logistic regression with stochastic gradient descent, as in our case;
- Numba - its jit decorators let you pre-compile individual functions (as in PyPy), while the rest of the code can still take full advantage of CPython libraries. A big plus: for precompiled functions you can release the GIL (Global Interpreter Lock) and use several CPU cores, which is what I used to speed up MatrixNet. Numba has the same problem as PyPy (no support for most libraries); the main difference is that Numba does implement some NumPy functions. A small sketch follows this list.
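A sketch of the pattern I mean, with a hypothetical scoring loop rather than the project code: nopython=True compiles the function, and nogil=True lets several threads run it on different chunks of data at the same time:

```python
import numpy as np
from numba import jit

@jit(nopython=True, nogil=True)
def score_chunk(weights, rows):
    """rows holds hashed feature indices (one row of indices per record);
    the loop sums the corresponding weights and applies the sigmoid."""
    out = np.empty(rows.shape[0])
    for i in range(rows.shape[0]):
        z = 0.0
        for j in range(rows.shape[1]):
            z += weights[rows[i, j]]
        out[i] = 1.0 / (1.0 + np.exp(-z))
    return out
```

Several threading.Thread workers can then call score_chunk on different slices of the data in parallel, since the compiled function does not hold the GIL.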
I did not get as far as Cython, since I was trying to speed things up with minimal effort, but I think next time it will be easier to go straight to GPGPU using Theano / Numba.
The complete code of all transformations with data and the learning algorithms themselves is here. Disclaimer: the code is not optimized, it is intended only for studying the algorithms themselves.
If you find any inaccuracies or errors in the article or in the code, please send me a private message.
Links to sources used for the article and when working on algorithms: