In mid-July, the ML Boot Camp V machine learning contest from Mail.Ru ended. The task was to predict the presence of cardiovascular disease from the results of a classical medical examination; the metric was the logarithmic loss function. A complete description of the task is available here.
My acquaintance with machine learning began with ML Boot Camp III sometime in February 2017, and only now is some idea of what to do with such tasks starting to take shape. Much of what was done in contest 5 is primarily the result of studying the collection of articles on kaggle and the discussions and code examples from there. Below is a slightly revised report on what it took to take 3rd place.
Task data
The dataset was formed from 100,000 real clinical examinations. For each patient, the age, height, weight, sex, upper and lower blood pressure, cholesterol and blood glucose are given.
In addition, there are "subjective" data - what patients reported about themselves when answering questions about smoking, alcohol consumption and physical activity. This part of the data was also spoiled by the organizers, so I did not pin any special hopes on it.
The initial data contained obviously unrealistic values - people aged 30+ with a height of 50 cm, pressures like 16020, negative pressures. This was caused by errors during manual entry of the examination results.
Tools
The problem was solved in Python using the libraries that are standard for this kind of task:
- pandas - reading, writing and processing of tabular data (in fact a lot more, but the rest was not needed here);
- NumPy - operations on arrays of numbers;
- scikit-learn - a set of machine learning tools, including basic ML algorithms, data splitting and validation;
- XGBoost - one of the most popular implementations of gradient boosting;
- LightGBM - an alternative to XGBoost;
- TensorFlow + Keras - a library for training and using neural networks and a wrapper for it;
- Hyperopt - a library for optimizing functions over a given argument space.
CSV vs pickle
For storing data between long calculations I first used csv, until I needed to save more complex structures than separate tables. The pickle module proved very convenient - all the necessary data is saved or read in 2 lines of code. Later I switched to saving compressed files:
with gzip.open('../run/local/pred_1.pickle.gz', 'wb') as f:
    pickle.dump((x, y), f)
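Reading the data back is just as short (a minimal sketch; the path and the (x, y) tuple follow the example above):

import gzip
import pickle

# load the tuple (x, y) saved by the code above
with gzip.open('../run/local/pred_1.pickle.gz', 'rb') as f:
    x, y = pickle.load(f)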
Repository
All the code related to the competition is on github. Old scripts that brought no real benefit are hidden in the old/ directory of the repository; they were kept only because the results of their runs were also submitted. Due to errors in the code, their intermediate results later turned out to be unusable, so this part of the code had no effect on the final solutions.
First 2 weeks
For the first 2 weeks I cleaned the data and fed it into the models left over from past competitions, all without much success. For each new submission, all the code from one of the existing scripts was copied entirely into a new one and edited there. The result: by the end of the second week I could no longer say right away what any of the latest scripts did, which functions it actually used and which simply took up space. The code was cumbersome, hard to read, and could run for several hours and crash without saving anything useful at all.
Second 2 weeks
When, 2 weeks into copying the old scripts, it became too hard to change anything in them, I had to start a complete rework of the code. It was split into common parts - base classes and their specific implementations.
The general idea of the new code organization is a pipeline: data → features → level 1 models → level 2 models. Each stage is implemented by a separate script file which, when launched, performs all the required calculations and saves its intermediate results and output. The script for each next stage imports the code of the previous stages and receives data for processing from their methods. The point of all this is that running a script for one of the final models launches the scripts for the lower-level models, they call the feature generators they need, which in turn launch the required data cleaning variant. The task of each script is to check whether the file where it should save its results already exists, and if not, to perform the necessary calculations and save the data.
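A minimal sketch of this "compute only if the saved file is missing" pattern (the class and method names here are illustrative, not the actual ones from the repository):

import gzip
import os
import pickle


class Stage:
    """Base class for a pipeline stage: data cleaning, features, level 1 or level 2 model."""

    def __init__(self, name, cache_dir='../run/local'):
        self.path = os.path.join(cache_dir, name + '.pickle.gz')

    def compute(self):
        raise NotImplementedError  # implemented by concrete stages

    def get(self):
        # if the result is already on disk, just load and return it
        if os.path.exists(self.path):
            with gzip.open(self.path, 'rb') as f:
                return pickle.load(f)
        # otherwise compute, save and return
        result = self.compute()
        with gzip.open(self.path, 'wb') as f:
            pickle.dump(result, f)
        return result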
Behind all this was a plan to build a way of organizing models and data for future use, while debugging it on a suitable task. This setup actually turned out to be the most important result of the contest, and it is now gradually growing into a small library to make life easier in subsequent competitions of this kind.
Overall plan
Initially, 2 levels of models were planned, and the goal was to prepare as many different models as possible at level 1. The way to achieve this is to prepare as many differently processed datasets as possible, on which similar models are then trained. But preparing data is a long business. Although working with the data is the key to success (a sufficient number of meaningful added features makes it possible to get by with the simplest models), it takes more time than one would like. The alternative is a brute-force solution: relatively moderate data processing and maximum computing time.
The simplest thing with this approach is to process the data in several ways, come up with several sets of additional features and use their combinations. The result is a variant of the Random subspace method, which differs from the full version in that it is not random at all and the features are selected in whole groups. With a small number of additional feature groups you can get hundreds of variants of processed data (number of cleaning methods * 2 ^ number of feature groups). The assumption was that such an approach would make simple models trained on different feature subsets produce sufficiently different solutions, so that each of them would improve the quality of the level 2 models.
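For illustration, the combinations can be enumerated like this (the names of the cleaning variants and feature groups are made up for the example):

from itertools import combinations

cleanings = ['raw', 'restored', 'clean_bp', 'clean_bp_hw']      # data cleaning variants
feature_groups = ['simple', 'chars', 'target_mean', 'kmeans']   # groups of extra features

variants = []
for cleaning in cleanings:
    # every subset of feature groups, including the empty one
    for k in range(len(feature_groups) + 1):
        for groups in combinations(feature_groups, k):
            variants.append((cleaning, groups))

# number of cleaning methods * 2 ** number of feature groups
print(len(variants))  # 4 * 2 ** 4 = 64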
Data preparation
The fact that the original data was dirty had to be taken into account somehow. The main approaches are either to throw out all obviously impossible values or to try to somehow restore the original data. Since the source of the distortions remained unknown almost to the end, the data had to be prepared in several ways and different models trained on each variant.
Each of the data processing options is implemented by a class which, when called, returns the full dataset with the corresponding changes. Since data processing at this stage is fairly fast, intermediate results were saved only for the relatively slow option 2 - restoring the subjective features with xgboost. The rest of the data was generated on request.
Processing options:
- The initial data, in which the spoiled values of the subjective part of the examination are replaced by 0.0001 to bring them to numerical form while keeping them distinguishable from the intact ones.
- The spoiled subjective features were replaced: alcohol consumption - 0, activity - 1. Then smoking was "restored" from the remaining data columns.
- In the data with restored subjective features, extreme pressure values were cleaned.
- In the data with restored subjective features (from item 2), extreme values of pressure, weight and height were cleaned.
- In the data with only the pressures cleaned (from item 3), weight, height and pressure were additionally cleaned.
- The data with cleaned pressures is additionally converted - any individual implausible height, weight or pressure value is replaced with NaN (a sketch of this kind of cleaning is given after the list).
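A minimal sketch of what the last option might look like; the plausibility thresholds are invented for the example, the column names follow the competition data:

import numpy as np
import pandas as pd

def mark_implausible(df):
    """Replace individually implausible values with NaN (thresholds are illustrative)."""
    df = df.copy()
    limits = {
        'height': (100, 220),   # cm
        'weight': (30, 200),    # kg
        'ap_hi': (60, 250),     # systolic pressure
        'ap_lo': (30, 180),     # diastolic pressure
    }
    for col, (lo, hi) in limits.items():
        bad = (df[col] < lo) | (df[col] > hi)
        df.loc[bad, col] = np.nan
    return df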
Features
Additional features were generated from the processed data. There were very few meaningful ones - body mass index, expected pressure values depending on sex, weight and age by some old formula, and so on. Many more additional data columns were obtained automatically by fairly simple methods.
Additional features were generated from different variants of the processed data, but often in the same way. Since some features could take too long to recompute, the feature columns were stored separately. The calculation of features in the scripts was implemented in the same way as data cleaning - each script defines a method that returns the additional feature columns.
Groups of additional features:
- The simplest meaningful features - BMI, pulse pressure, averaged pressure values $\frac{ap\_hi + x \cdot ap\_lo}{x + 1}$ for different values of x. Approximate formulas for pressure as a function of age and weight were also taken and the expected pressures were calculated for each patient (formulas of the form $ap\_X = a + b \cdot age + c \cdot weight$). Calculated from the raw values.
- The same as in item 1, but additionally an attempt was made to restore the patient's weight from the available pressures. For each feature predicted this way, the difference from the "real" value is added. Calculated from the raw values.
- The textual representation of the raw data columns, split character by character - first aligned to the left, then to the right. Characters are replaced by their numeric values (ord()). Where the string was too short to fill all columns, -1 was used.
- The same as in item 3, but the resulting columns are one-hot encoded.
- The data from item 4, but passed through PCA - a heavy legacy of the recent Mercedes competition on kaggle, where models with this shamanism looked pretty good on public and sad on private.
- For all columns of the raw data except age, the mean values of the target column are calculated. To do this, the values of pressure, height and weight are first divided by 10 and rounded, turning them into categorical features. The data is then split into 10 folds, and for each fold the weighted mean of the target column (sick / not sick) is calculated per category on the other 9 folds. Where there was nothing to compute the mean from, the global mean was used (a sketch is given after the list).
- The same as in item 6, but the means were also calculated for the features from item 2.
- The same as in item 7, but the data cleaned by option 5 is taken as the source.
- The same as in item 7, but the data cleaned by option 3 is taken as the source.
- The raw data is clustered by k-means; the number of clusters is chosen arbitrarily - 2, 5, 10, 15, 25. The cluster number for each of these cases is one-hot encoded.
- The same as in item 10, but with the data cleaned by option 3.
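A simplified sketch of the out-of-fold target mean encoding from item 6 (a plain mean instead of a weighted one; column names and the fold count are assumptions for the example):

import pandas as pd
from sklearn.model_selection import KFold

def target_mean_feature(df, col, target='cardio', n_splits=10, seed=0):
    """Out-of-fold mean of the target for one categorical column."""
    result = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # categories unseen in the other folds get the global mean
        result.iloc[val_idx] = df.iloc[val_idx][col].map(means).fillna(global_mean).values
    return result

# pressure, height and weight are first coarsened into categories, for example:
# df['ap_hi_cat'] = (df['ap_hi'] / 10).round()
# df['ap_hi_mean_target'] = target_mean_feature(df, 'ap_hi_cat')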
Models
Since models can run for a very long time (tens of hours), crash with errors or be interrupted deliberately, it is necessary to save not only the final results but also intermediate data. For this, each model is given a base name. From the model name and the name assigned to the data, the name of the file where this data will be stored is derived. All saving and loading goes through the base methods of the model, which ensures uniform storage of intermediate data. The plan for the future is to keep the data not in files but in some database. The drawback of the current implementation is that you can forget to update the name when copying a model and end up with an undefined state for the data of the original model and its copy.
If a model has already saved the results of its calculations, it only has to load them and return them to the caller. If there are only intermediate results, they also do not have to be recomputed. This saves a lot of time, especially when a computation takes hours.
The data stored by the models is separated mainly by its lifetime. Each such data group has its own base path for storage. There are 3 such groups:
- temporary data that will not be used in the following runs, for example the best neural network weights for individual folds;
- model data that will be needed in the next runs - almost everything else;
- globally useful data that may be needed by several models, for example additional features.
The interface of all models is common and allows not only running them separately as regular scripts, but also loading them as Python modules. If a model needs the results of other models, it loads and uses them. As a result, the description of each level 2 model was reduced to a list of model names whose results should be combined and a flag indicating whether to select features with a greedy algorithm.
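A sketch of greedy forward selection of level 1 predictions by log loss; in the real code a level 2 model is fitted on the selected columns, here a plain mean is used just to keep the sketch short:

import numpy as np
from sklearn.metrics import log_loss

def greedy_select(preds, y):
    """preds: dict {model_name: out-of-fold prediction vector}, y: target vector."""
    selected, best_score = [], float('inf')
    while True:
        scores = []
        for name in preds:
            if name in selected:
                continue
            cols = [preds[m] for m in selected] + [preds[name]]
            scores.append((log_loss(y, np.mean(cols, axis=0)), name))
        if not scores:
            break
        score, name = min(scores)
        if score >= best_score:
            break  # adding any remaining model no longer helps
        selected.append(name)
        best_score = score
    return selected, best_score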
Among the models there were some based on neural networks, which at the output could give a very confident 0 or 1 or values very close to these extremes. Since, in the event of an error, such overconfidence is penalized by log loss very heavily, the predictions of all models were clipped when saved so that at least 1e-5 remained to 0 or 1. The easiest way was to add np.clip(z, 1e-5, 1 - 1e-5) and forget about it. As a result, the predictions of all models were clipped, although most of them gave results roughly in the range 0.1-0.93.
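A small illustration of why the clipping matters - a single almost fully confident wrong answer dominates log loss, while the clipped version stays bounded (the numbers are made up for the example):

import numpy as np

def logloss(y_true, y_pred):
    y_pred = np.asarray(y_pred, dtype=float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
raw = np.array([0.9, 0.1, 1e-9, 0.2])        # one very confident wrong prediction
clipped = np.clip(raw, 1e-5, 1 - 1e-5)

print(logloss(y_true, raw))      # ~5.3, dominated by the single mistake
print(logloss(y_true, clipped))  # ~3.0, still bad but bounded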
hyperopt
To fit the parameters of the models, hyperopt was used (details). The results improved, but I had set the number of attempts to around 20 for particularly slow models. And 2 days before the end of the contest I found an article mentioning hyperopt bootstrapping - by default, the first 20 runs are made with random parameters, which can be seen in the source. I had to urgently recompute some of the models.
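A minimal hyperopt sketch; the objective and the parameter space are invented for the example, but n_startup_jobs is the actual tpe.suggest parameter that controls how many of the first trials are purely random:

from functools import partial

from hyperopt import fmin, hp, tpe

def objective(params):
    # in the real code this would train a model with cross-validation and return its logloss
    return (params['max_depth'] - 5) ** 2 * 0.01 + params['eta']

space = {
    'max_depth': hp.quniform('max_depth', 2, 10, 1),
    'eta': hp.uniform('eta', 0.01, 0.3),
}

best = fmin(
    fn=objective,
    space=space,
    # by default the first 20 evaluations are random; reduce that when max_evals is small
    algo=partial(tpe.suggest, n_startup_jobs=5),
    max_evals=20,
)
print(best)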
Level 1 models
The selection of input data for each model is part of the common level 1 model code - one of the data cleaning variants is always chosen, plus 0 or more feature groups. Assembling the data and features into a common dataset is also implemented in the common model code. This reduced the code of individual models to specifying the particular source data and additional features.
There was not enough time to move more of this into the common code, so the individual level 1 base models still copy each other heavily. In total there were 2 varieties:
- neural networks (keras)
- trees (XGBoost, LightGBM, rf, et)
The main peculiarity of the neural network models used is that no hyperparameter tuning was done for them. For the rest of the models, hyperopt was used.
Neural networks
I didn't do any serious selection of parameters for the neural networks, so their results were worse than boosting. Later in the chat I saw a mention of a 64-64 network with leaky ReLU activation and a dropout of 1-5 neurons in each layer, which gave a fairly decent result.
I used neural networks like the following (a Keras sketch is given after the list):
- input;
- several hundred neurons (usually 256);
- some non-linearity and dropout (where present, it took values of the order of 0.7, because I believed there were too many parameters and the network was overfitting); if the model diverged into NaNs during training, batch normalization was added - details here or here;
- a hundred or two neurons (64-128);
- nonlinearity;
- a dozen or two neurons (16);
- nonlinearity;
- 1 output neuron with classic sigmoid output.
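A minimal Keras sketch of such a network; the layer sizes follow the list above, while the optimizer and the exact placement of dropout and batch normalization are illustrative:

from keras.layers import BatchNormalization, Dense, Dropout, PReLU
from keras.models import Sequential

def build_model(n_features):
    model = Sequential()
    model.add(Dense(256, input_dim=n_features))
    model.add(BatchNormalization())
    model.add(PReLU())
    model.add(Dropout(0.7))
    model.add(Dense(128))
    model.add(PReLU())
    model.add(Dense(16))
    model.add(PReLU())
    model.add(Dense(1, activation='sigmoid'))  # classic sigmoid output
    model.compile(loss='binary_crossentropy', optimizer='adam')
    return model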
This architecture migrated from previous competitions almost unchanged. On their own, the neural networks did not stand out, but they were kept so that their results could be used in the level 2 models.
The choice of activation functions for the inner layers was very simple - I excluded from the available set all sigmoid variants (because of gradients close to 0 at the boundaries of their range) and "pure" ReLU (because a neuron that starts outputting 0 drops out of training) and took something from the rest. Initially it was Parametric ReLU; in the latest models I switched to Scaled Exponential Linear Units. No significant difference from this replacement was noticed.
As for the other models, the data for the neural networks was split into folds using KFold from sklearn. At each split the model had to be built from scratch, since I learned too late how to re-initialize the layer weights without re-creating the network.
The networks were trained as long as the quality of the predictions on the validation data kept improving, and the network weights were saved each time the validation result improved. For this, the standard Keras callbacks were used: one for saving the state of the network with the best validation result, one for early stopping if the results did not improve for a given number of passes over the training data, and one for reducing the learning rate if the results did not improve for several passes.
So if training got stuck (a local minimum) and the results did not improve for several passes over the data, the learning rate was reduced, and if that did not help, training stopped after a few more passes. After training, the best network weights obtained over the whole run were loaded.
At the same time, I noticed quite late the problem that arises when the same set of callback instances is reused for training several networks: the callback state is not automatically reset when training of a new network starts. As a result, the learning rate for each new network decreased more and more, and the best weights were not saved unless they were better than everything previously obtained on all networks sharing the same callbacks.
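A sketch of how to avoid the problem - the callbacks are created anew for each fold instead of being shared; build_model is the sketch above, kfold is a sklearn KFold object, x and y are numpy arrays, and the file names and patience values are illustrative:

from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

def make_callbacks(fold):
    weights_path = 'weights_fold_%d.h5' % fold
    return weights_path, [
        ModelCheckpoint(weights_path, monitor='val_loss', save_best_only=True),
        EarlyStopping(monitor='val_loss', patience=20),
        ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5),
    ]

for fold, (train_idx, val_idx) in enumerate(kfold.split(x)):
    model = build_model(x.shape[1])                  # a new model for every fold
    weights_path, callbacks = make_callbacks(fold)   # and fresh callbacks too
    model.fit(x[train_idx], y[train_idx],
              validation_data=(x[val_idx], y[val_idx]),
              epochs=1000, callbacks=callbacks, verbose=0)
    model.load_weights(weights_path)                 # best validation weights for this fold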
Tree-based models
Two variants of "tree" models were used: bagging-based random forest and extra trees, and 2 implementations of gradient boosting - XGBoost and LightGBM. Both bagging variants performed unimpressively both on cross-validation and on the public leaderboard, and were kept only because a lot of machine time had already been spent on them and there was hope that they would be useful when combining the results of the models. LightGBM and XGBoost performed much better, and most of the level 1 predictions came from them.
Each of the "tree" models, after parameter tuning, was trained for several (usually 3) initial states of the random number generator. All these results were stored separately for use by the level 2 models. The predictions of the level 1 models themselves were taken from the result for the last RNG state used.
LightGBM and XGBoost can stop training if the quality on the validation set does not improve for a given number of iterations. Thanks to this, it was possible to simply let them run for 10,000 rounds and stop when the validation results stopped improving, so the number of trees did not have to be tuned for these models. Random forest and extra trees in sklearn have no such option, so the number of trees had to be left to hyperopt, which, apparently due to the insufficient number of attempts, did not fully cope with the task. It would have been possible to grow the trees incrementally, checking the validation quality along the way, but laziness won.
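A sketch with the native XGBoost interface; the parameters are illustrative, x_train/y_train/x_val/y_val are assumed arrays, and LightGBM's train() has the same early_stopping_rounds mechanism:

import xgboost as xgb

dtrain = xgb.DMatrix(x_train, label=y_train)
dval = xgb.DMatrix(x_val, label=y_val)

params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
          'eta': 0.02, 'max_depth': 5}

booster = xgb.train(
    params, dtrain,
    num_boost_round=10000,        # just let it run long enough
    evals=[(dval, 'val')],
    early_stopping_rounds=100,    # stop if validation logloss has not improved
    verbose_eval=False,
)
# booster.best_iteration holds the round where validation logloss was best
pred = booster.predict(dval)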
A few seeds
The results of individual models depend strongly on the state of the random number generator. To reduce this dependence, the level 1 models were trained with several seeds, and the results for each seed were stored separately. After the end of the competition it turned out that the prediction a level 1 model saved as its own result was the one for the last seed; the results for the remaining seeds were nevertheless saved and used by the level 2 models.
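A sketch of such multi-seed training (dtrain and dtest are assumed DMatrix objects as in the previous sketch; in the contest the per-seed columns went to level 2 separately, while averaging them is shown only as an alternative):

import numpy as np
import xgboost as xgb

seed_preds = []
for seed in (1, 2, 3):
    params = {'objective': 'binary:logistic', 'eval_metric': 'logloss',
              'eta': 0.02, 'max_depth': 5, 'seed': seed}
    booster = xgb.train(params, dtrain, num_boost_round=1000)
    seed_preds.append(booster.predict(dtest))

# each column can go to level 2 on its own, or be averaged into a single prediction
seed_preds = np.column_stack(seed_preds)
mean_pred = seed_preds.mean(axis=1)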
Level 2 models
Since each of the level 1 models gave from 1 to 4 predictions, at level 2 the data contained up to 190 columns. The original data and features did not get there - only the predicted probabilities of the level 1 models.
As level 2 models, BayesianRidge and Ridge from sklearn were used; their parameters were also tuned with hyperopt.
Validation
Validation was done on 10 folds. A local CV of about 0.534-0.535 corresponded to roughly 0.543-0.544 on the public leaderboard, and the results also depended noticeably on random_state. In the end, a public score of about 0.543 turned into roughly 0.538 on the private part, which was enough for 3rd place.