
ML Boot Camp III opening soon



On February 15, Machine Learning Boot Camp III begins: the third machine learning and data analysis contest from Mail.Ru Group. Today we look back at the previous contest and reveal what the new one has in store. In the upcoming contest you will need to predict whether a player stays in an online game or leaves it. The training and test samples are built from twelve in-game attributes for 25,000 users. Naturally, all data is anonymized.

The features themselves:


We split the test sample randomly in a 40/60 ratio. The score on the first 40% determines each participant's position on the leaderboard throughout the competition. The score on the remaining 60% is revealed only after the competition ends, and it is this score that determines the final standings.
The author of the best solution will receive a MacBook Air; second and third place will get an Apple iPad, and fourth through sixth an Apple iPod Nano. By tradition, the top 50 participants will receive commemorative T-shirts with the championship logo. In addition, we will invite the best of the best to Mail.Ru Group for interviews for data analysis positions. Register for the championship here.

Machine Learning Boot Camp II


To help participants better understand what awaits them, we describe the task of the previous championship and the winner's solution.

Task. Participants of the second contest tackled the "Performance Evaluation" problem. We asked them to teach a computer to predict how long it takes to multiply two matrices of sizes m x k and k x n on a given computing system. Participants knew how long this problem took on other systems, the matrix sizes, and the system parameters.

As the quality criterion we used the mean absolute percentage error (MAPE, referred to in some sources as MRE), computed over runs that take longer than one second:

MAPE = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - g_i|}{y_i}, \quad y_i > 1

where N is the number of objects in the sample, y_i is the true running time in the i-th experiment, and g_i is the predicted time. First place in the competition went to Mikhail Karachun; the account of his solution below is in his own words.
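For reference, a minimal sketch of this metric in NumPy (the function name is ours, not part of the contest code):

import numpy as np

def mape_over_one_second(y_true, y_pred):
    # only measurements longer than one second are scored
    mask = y_true > 1
    return np.mean(np.abs(y_true[mask] - y_pred[mask]) / y_true[mask])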

The winner's story


The general methodology for solving such problems is described in the paper attached to the competition. The basic idea is to predict not the time itself but the value time / (m * n * k). Within a single system this value fluctuates only slightly, and when training a model on several systems it is better to use nonlinear methods right away, for example a random forest.
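As a minimal, self-contained illustration of this target transformation (the data here is synthetic and the model choice is only an example, not the winning setup):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy data: matrix sizes and a "measured" time roughly proportional to m*n*k
rng = np.random.RandomState(1)
m, n, k = (rng.randint(100, 2000, size=500) for _ in range(3))
time = 1e-9 * m * n * k * (1 + 0.1 * rng.rand(500))
X = np.column_stack([m, n, k])

model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X, time / (m * n * k))             # train on the normalized target
pred_time = model.predict(X) * (m * n * k)   # scale back to seconds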

Step zero - general setup

Python was used for the solution, and results were checked locally with k-fold cross-validation using 5 folds. During data preparation, all categorical features were one-hot encoded ("one value of the feature - one column").
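pandas.get_dummies is one way to do such an encoding (the winning code may have done it differently); a minimal sketch:

import pandas as pd

df = pd.DataFrame({'os': ['linux', 'windows', 'linux'], 'cpuCount': [4, 8, 2]})
# one column per distinct value of the categorical feature
df_encoded = pd.get_dummies(df, columns=['os'])
print(df_encoded.columns.tolist())   # ['cpuCount', 'os_linux', 'os_windows']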

Step one - model selection

The hyperopt module was used to select model hyperparameters. It works better than a simple grid search, since it does not merely enumerate parameter combinations but tries to optimize them.

Example
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.cross_validation import KFold
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
import random

random.seed(1)

def mean_absolute_percentage_error(y_true, y_pred):
    ind = y_true > -1
    return np.mean(np.abs((y_true[ind] - y_pred[ind]) / y_true[ind]))

def loss_func(y_true, y_pred):
    return mean_absolute_percentage_error(y_true, y_pred)

all_train = pd.read_csv('~/Projects/DataMining/Bimbo/data/train1.csv')
all_target = pd.read_csv('~/Projects/DataMining/Bimbo/data/y_train.csv')
all_train['TARGET'] = all_target['time']

cols_to_drop = ['ID', 'TARGET']
cols = list(set(all_train.columns) - set(cols_to_drop))
print(len(cols))

def hyperopt_train_test(hpparams):
    # 5-fold cross-validation for a single set of hyperparameters
    all_results = []
    kf = KFold(len(all_train['TARGET'].values), n_folds=5, random_state=1, shuffle=True)
    for train_index, test_index in kf:
        train = all_train.ix[train_index, :]
        test = all_train.ix[test_index, :]
        X_train = train[cols].values
        y_train_c = train['n'].values * train['m'].values * train['k'].values
        y_train = train['TARGET'].values
        X_test = test[cols].values
        y_test_c = test['n'].values * test['m'].values * test['k'].values
        y_test = test['TARGET'].values
        params_est = {'n_estimators': int(hpparams['n_estimators']),
                      'learning_rate': hpparams['eta'],
                      'max_depth': hpparams['max_depth'],
                      'min_samples_split': hpparams['min_samples_split'],
                      'min_samples_leaf': hpparams['min_samples_leaf'],
                      'loss': hpparams['loss'],
                      'alpha': hpparams['alpha'],
                      'subsample': hpparams['subsample'],
                      'random_state': 1}
        bst = GradientBoostingRegressor(**params_est)
        # the model is trained on log(time / (m*n*k)) and the prediction is scaled back
        bst.fit(X_train, np.log(y_train / y_train_c))
        y_test_pred = np.exp(bst.predict(X_test)) * y_test_c
        current_res = loss_func(y_test, y_test_pred)
        all_results.append(current_res)
    return np.mean(all_results)

# search space for hyperopt
space4dt = {
    'min_samples_split': hp.quniform('min_samples_split', 3, 14, 1),
    'min_samples_leaf': hp.quniform('min_samples_leaf', 1, 7, 1),
    'subsample': hp.quniform('subsample', 0.6, 0.99, 0.001),
    'eta': hp.quniform('eta', 0.07, 0.2, 0.001),
    'n_estimators': hp.quniform('n_estimators', 10, 1000, 10),
    'max_depth': hp.choice('max_depth', (4, 5, 6, 7, 8, 9, 10)),
    'alpha': hp.quniform('alpha', 0.01, 0.99, 0.01),
    'loss': hp.choice('loss', ('ls', 'lad', 'huber', 'quantile')),
}

def f(params):
    acc = hyperopt_train_test(params)
    print(acc)
    print(params)
    return {'loss': acc, 'status': STATUS_OK}

trials = Trials()
best = fmin(f, space4dt, algo=tpe.suggest, max_evals=2000, trials=trials)
print('best:')
print(best)

After the first run with a small number of trees, GradientBoostingRegressor with loss='lad' performed best.

Step two - feature engineering

The goal of the second stage was to eliminate unnecessary features, since there were roughly 1,100 of them in total. Recursive feature elimination was used for this: it gradually removes features, N at a time, scoring each candidate set with cross-validation. At the end, the ranking_ attribute of the algorithm stores the stage at which each feature was dropped: 1 means the feature survived to the end, and higher values are worse. The support_ attribute stores the mask of the selected features, that is, of those whose ranking_ equals one. Note that the final set is not always the best; sometimes a solution benefits from features eliminated near the end of the selection. The procedure takes a long time: on my mid-range laptop, for example, it ran for more than 12 hours.

Example
# -*- coding: utf-8 -*-
from sklearn.feature_selection.rfe import RFECV
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np

bst = GradientBoostingRegressor(**params_est)
# recursive feature elimination, dropping 50 features per step, scored by 5-fold CV
selector = RFECV(bst, step=50, cv=5)
selector.fit(all_train[cols], target)
print(list(selector.ranking_))
print(np.asarray(cols)[selector.support_])

As a result, the number of features was reduced to about 10. After that, a few more useful features were found by trial and error.

Example
df.ix[:, 'cpuExtra1'] = 0
df.ix[df['cpuFull'] == 'Intel(R) Core(TM) i3-2310M CPU @ 2.10GHz', 'cpuExtra1'] = 1
df.ix[:, 'cpuExtra2'] = 0
df.ix[df['cpuFull'] == 'Intel(R) Atom(TM) CPU N550 @ 1.50GHz', 'cpuExtra2'] = 1

# generated features
df.ix[:, 'm_div_n'] = df['m'] / df['n']
df.ix[:, 'magic'] = df['k'] * df['m'] * df['n'] / (df['cpuCount'] * df['cpuCount'])

cols = [
    'n',
    'Sequential_read_128B_by128',
    'k',
    'Random_write_3MB_by128',
    'cpuCount',
    'Sequential_write_32kB_by128',
    'Random_read_9MB_by128',
    'm',
    'SeqRead_20kB_by256',
    'cpuCores',
    'Sequential_read_48MB_by128',
    'Random_read_4MB_by32',
    'Random_write_32MB_by128',
    'Random_read_2MB_by32',
    'SeqCopy10MB_by128',
    'BMI',
    'm_div_n',
    'magic',
    'cpuExtra1',
    'cpuExtra2',
    'Random_write_bypassing_cache_6kB_by128',
    'Sequential_read_192kB_by32',
]

In total, 20 features were taken from the sample and two were generated. The first was found by enumerating various functions of the matrix dimensions. The second deserves special attention: despite its apparent lack of logic, it improves the result.

Step three - ensembling

Ensembles. If we combine decision trees trained on different subsets of features, the result is more effective than a single tree: this is how random forest works. Combining the predictions of several different ensembles can help as well. As general practice and my own experience show, if you have several models with roughly the same score, their average is almost always better, and the more these models differ in how they are built, the better. For example, taking two random forests with the same parameters but different numbers of trees is unlikely to help. Taking a random forest and a gradient boosting regressor, on the other hand, almost always works out better, and sometimes that is exactly what you need when two or three decimal places are at stake.

When solving this problem, I took the top models from the hyperparameter optimization and tried their combinations. Models that gave an equally good average cross-validation result while differing in their best and worst folds worked especially well together. In the end three models remained, and their average was used as the prediction.

Training different models on subsets of rows and/or columns did not give a quick improvement.

Example
# the three best models from the hyperparameter search
params_est = {'n_estimators': 370, 'subsample': 0.961, 'learning_rate': 0.076,
              'min_samples_split': 18.0, 'max_depth': 6, 'min_samples_leaf': 8.0,
              'random_state': 1, 'loss': 'lad'}
bst1 = GradientBoostingRegressor(**params_est)
bst1.fit(X_train, y_train / y_train_c1)

params_est = {'n_estimators': 680, 'subsample': 0.902, 'learning_rate': 0.076,
              'min_samples_split': 14.0, 'alpha': 0.29, 'max_depth': 9,
              'min_samples_leaf': 5.0, 'loss': 'quantile', 'random_state': 1}
bst2 = GradientBoostingRegressor(**params_est)
bst2.fit(X_train, y_train / y_train_c1)

params_est = {'n_estimators': 430, 'subsample': 0.978, 'learning_rate': 0.086,
              'min_samples_split': 19.0, 'max_depth': 6, 'min_samples_leaf': 10.0,
              'loss': 'lad', 'random_state': 1}
bst3 = GradientBoostingRegressor(**params_est)
bst3.fit(X_train, y_train / y_train_c1)
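The averaging itself is not shown above. A minimal sketch of how the three predictions could be combined, assuming the fitted models bst1-bst3 from the block above and that y_test_c1 holds m * n * k for the test objects (matching the normalization used in training):

# average the three models; each prediction is scaled back by m*n*k
p1 = bst1.predict(X_test) * y_test_c1
p2 = bst2.predict(X_test) * y_test_c1
p3 = bst3.predict(X_test) * y_test_c1
y_pred = (p1 + p2 + p3) / 3.0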

Step four - we need to go deeper

Then I decided to check how much the prediction error varies with the parameters of the computing system. A simple check showed that while the average cross-validation error was ~0.05, on one of the operating systems it was ~0.30.
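A sketch of this kind of per-OS breakdown, assuming out-of-fold predictions y_pred_oof (an illustrative name) have already been collected for all rows of all_train:

# relative error per object, then averaged within each OS
err = np.abs(all_train['TARGET'] - y_pred_oof) / all_train['TARGET']
print(err.groupby(all_train['os']).mean().sort_values(ascending=False))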

The first idea was to adjust the object weights during training, increasing them for this OS, but the result only got worse. Since there is very little data for this OS (fewer than 100 examples), training a separate model is not an option either: it would overfit. An intermediate solution helped. I trained a separate model on the entire sample, but with weights that gave priority to this OS. When computing the final result, this model was used only for that one OS. That is, the main models were trained on the whole sample without weights and predicted the result for every OS except one, while the weighted model, also trained on the entire sample, predicted the result only for that one OS. Here, for the first time, difficulties with cross-validation appeared: a local improvement was not always confirmed by the public score. This is because the number of examples for this OS is quite small, and if they are further split into several folds for validation, a stable estimate is not to be expected.

Example
# object weights: priority for the problematic OS
all_train['w'] = 1
all_train['w'][all_train['os'] == 15] = 4

# model used only for os = 15
params_est = {'n_estimators': 480, 'subsample': 0.881, 'learning_rate': 0.197,
              'min_samples_split': 3.0, 'max_depth': 7, 'min_samples_leaf': 2.0,
              'loss': 'lad', 'random_state': 1}
bst4 = GradientBoostingRegressor(**params_est)
bst4.fit(X_train, np.log(y_train / y_train_c), sample_weight=train['w'])
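The routing of predictions by OS is not shown above. A sketch of how it could look, assuming y_pred_main (an illustrative name) holds the averaged prediction of the main models for the test set and y_test_c holds m * n * k for the test objects:

# use the weighted model only for the problematic OS
y_pred = y_pred_main.copy()
mask = (test['os'] == 15).values
# bst4 was trained on log(time / (m*n*k)), so invert the transform
y_pred[mask] = np.exp(bst4.predict(X_test[mask])) * y_test_c[mask]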

Step five - last step

With a fairly good set of models in hand, I made one more assumption that brought a noticeable improvement. If we need to minimize the relative deviation, why not predict it directly? Let me go into more detail here. There are several options for what to give the model as the target:

  1. Train the model with the raw computation time as the target. This is what we rejected right away, following the advice of the authors of the paper mentioned at the beginning.
  2. Train the model with time / (m * n * k) as the target, that is, the slope computed for each computing system. This is what we have been doing up to this point.
  3. Train the model with time / (reg_k * (m * n * k)) as the target. That is, we assume that for each computing system the time depends linearly on m * n * k with slope reg_k, and we train the model to predict the relative deviation from this linear model.

The median of time / (m * n * k) was taken as reg_k, and the combination of the os and cpuFull features was used as the identifier of the computing system, since it was on this combination that the linear model with the median gave the best result.

Example
all_train.ix[:, 'c1'] = all_train['TARGET'] / (all_train['m'] * all_train['n'] * all_train['k'])
all_train_median = all_train[['c1', 'os', 'cpuFull']].groupby(['os', 'cpuFull'], as_index=False).median()

def preprocess_data(df):
    # attach the per-system median slope and build the linear baseline prediction
    df = pd.merge(df, all_train_median, on=['os', 'cpuFull'], how='left', suffixes=('', '_med'))
    df.ix[:, 'test_mdeian'] = df['c1_med'] * df['m'] * df['n'] * df['k']
    return df
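A sketch of how the third variant of the target could then be assembled from this baseline (the model bst5 and the frame all_test are illustrative names, not taken from the winning code):

train = preprocess_data(all_train)
# relative target: actual time divided by the linear baseline reg_k * m * n * k
y_rel = train['TARGET'] / train['test_mdeian']
bst5 = GradientBoostingRegressor(loss='lad', random_state=1)
bst5.fit(train[cols].values, y_rel)

# scale the predicted relative deviation back by the baseline of the test objects
test = preprocess_data(all_test)
y_pred = bst5.predict(test[cols].values) * test['test_mdeian']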

Step six - rules rule

I will also mention a small hack that, by the way, had a tangible impact on the final result. If you look closely at the evaluation formula, you can see that measurements shorter than one second are excluded from it. This means that all predicted values below one can safely be rounded up: if the true time really is below one second, the prediction is not counted, and if it is not, you reduce the error.
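In code this hack amounts to a single line applied to the final predictions (y_pred here stands for the final vector of predictions), for example clipping them at one second:

y_pred = np.maximum(y_pred, 1.0)   # predictions below one second cannot hurt the score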

General impressions

It should be noted that over the course of all the improvements described in this article, the local validation score, the public leaderboard score, and, as it turned out, the private leaderboard score always moved in the direction of improvement. The final working script can be seen here.

Try it yourself!


As always, the portal offers a tutorial article for newcomers and a serious task for the experts. By the way, on the portal you can practice by solving the previous contests: all of them are open in sandbox mode. Join us and register!

Source: https://habr.com/ru/post/321016/

