
Solving the "Performance Evaluation" problem from mlbootcamp.ru

Less than three days remain before the end of the "Performance Evaluation" competition, so perhaps this article will help someone improve their solution. The task is to predict the running time of the multiplication of two matrices on different computing systems. Prediction quality is scored by the mean absolute percentage error (MAPE): the lower, the better.
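For reference, here is a minimal sketch of the metric in Python, assuming the organizers use the standard MAPE formula (the exact variant is not spelled out in the post):

import numpy as np

def mape(y_true, y_pred):
    # Mean absolute percentage error, in percent
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))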

At the moment, first place stands at 4.68%. Below I describe my path to 6.69% (which already lands in the low 70s of the standings).

So, we have training data in the form of a table with 951 columns. With such a huge number of features it makes no sense to even start analyzing them "manually". Instead, let's try to apply some standard algorithm "without looking", but with a little data preparation:

Attempt # 1
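A minimal sketch of this attempt follows; the choice of model (a random forest with near-default parameters) and the encoding details are my assumptions, since the post does not name them.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = pd.read_csv('x_train.csv')
y = pd.read_csv('y_train.csv')
X_check = pd.read_csv('x_test.csv')

# Integer-encode every non-numeric column, using categories shared
# between train and test so both get identical codes
for c in X.columns:
    if X[c].dtype == object:
        cats = pd.Categorical(pd.concat([X[c], X_check[c]])).categories
        X[c] = pd.Categorical(X[c], categories=cats).codes
        X_check[c] = pd.Categorical(X_check[c], categories=cats).codes
X = X.fillna(0)
X_check = X_check.fillna(0)

# An off-the-shelf model applied "without looking" (model choice is an assumption)
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X, y.values.ravel())
y_pred = model.predict(X_check)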



These manipulations give MAPE = 11.22%, which is 154th place out of 362, i.e. better than half of the participants.

Attempt # 2


To apply linear algorithms, the features must be scaled. In addition, it sometimes helps to add new features built from the existing ones, for example with PolynomialFeatures. Since computing polynomial features over all 951 columns is extremely resource-intensive, we split the features into two parts: the matrix dimensions (m, k, n) and everything else (the system characteristics).

We then compute the polynomial features only on the matrix dimensions. In addition, we take the logarithm of the response vector (y) before training, and when computing the answers we exponentiate back to the original scale (see the sketch below).
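A minimal sketch of this attempt, reusing the cleaned X, y, X_check from attempt #1 and the matrix-dimension column names m, k, n used in the full script below; the polynomial degree and the use of StandardScaler are assumptions:

import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

# Split the features: matrix dimensions vs. system characteristics
matrix_cols = ['m', 'k', 'n']
system_cols = [c for c in X.columns if c not in matrix_cols]

# Polynomial features only on the three matrix dimensions
poly = PolynomialFeatures(degree=2, include_bias=False)
X_matrix = poly.fit_transform(X[matrix_cols])
X_check_matrix = poly.transform(X_check[matrix_cols])

# Scale everything for the linear model
scaler = StandardScaler()
X_all = scaler.fit_transform(np.hstack([X_matrix, X[system_cols]]))
X_check_all = scaler.transform(np.hstack([X_check_matrix, X_check[system_cols]]))

# Fit on log-scaled targets; exponentiate predictions back to the original scale
model = linear_model.RidgeCV()
model.fit(X_all, np.log(y))
y_pred = np.exp(model.predict(X_check_all))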

A couple of simple manipulations already give MAPE = 6.91% (80th place out of 362). Note that the RidgeCV() model is called with its default parameters; in theory it can still be tuned.

Attempt # 3


The best result, MAPE = 6.69% (72/362), came from adding features "manually": the three products of matrix dimensions m*n, m*k and k*n (log-transformed in the code below), plus, for each of the two matrices, the ratio of its largest dimension to its smallest one.

Code to reproduce the result
import numpy as np
import pandas as pd
from sklearn import linear_model

def write_answer(data, str_add=''):
    with open("answer" + str(str_add) + ".txt", "w") as fout:
        fout.write('\n'.join(map(str, data)))

def convert_cat(inf, inf_data):
    return inf_data[inf_data == inf].index[0]

X = pd.read_csv('x_train.csv')
y = pd.read_csv('y_train.csv')
X_check = pd.read_csv('x_test.csv')

# memFreq contains non-numeric entries: coerce them to NaN and fill with the mean
X.memFreq = pd.to_numeric(X.memFreq, errors='coerce')
mean_memFreq = 525.576
X.fillna(value=mean_memFreq, inplace=True)
X_check.memFreq = pd.to_numeric(X_check.memFreq, errors='coerce')
X_check.fillna(value=mean_memFreq, inplace=True)

# Drop columns that are constant in the test set
for c in X.columns:
    if len(np.unique(X_check[c])) == 1:
        X.drop(c, axis=1, inplace=True)
        X_check.drop(c, axis=1, inplace=True)

# Encode categorical features as integer indices
cpuArch_ = pd.Series(np.unique(X.cpuArch))
X.cpuArch = X.cpuArch.apply(lambda x: convert_cat(x, cpuArch_))
X_check.cpuArch = X_check.cpuArch.apply(lambda x: convert_cat(x, cpuArch_))
memType_ = pd.Series(np.unique(X.memType))
X.memType = X.memType.apply(lambda x: convert_cat(x, memType_))
X_check.memType = X_check.memType.apply(lambda x: convert_cat(x, memType_))
memtRFC_ = pd.Series(np.unique(X.memtRFC))
X.memtRFC = X.memtRFC.apply(lambda x: convert_cat(x, memtRFC_))
X_check.memtRFC = X_check.memtRFC.apply(lambda x: convert_cat(x, memtRFC_))
os_ = pd.Series(np.unique(X.os))
X.os = X.os.apply(lambda x: convert_cat(x, os_))
X_check.os = X_check.os.apply(lambda x: convert_cat(x, os_))
cpuFull_ = pd.Series(np.unique(X.cpuFull))
X.cpuFull = X.cpuFull.apply(lambda x: convert_cat(x, cpuFull_))
X_check.cpuFull = X_check.cpuFull.apply(lambda x: convert_cat(x, cpuFull_))

# System (non-matrix) features start after the first three columns (m, k, n)
perf_features = X.columns[3:]

# Hand-crafted features: log-products of matrix dimensions...
X['log_mn'] = np.log(X.m * X.n)
X['log_mk'] = np.log(np.int64(X.m * X.k))
X['log_kn'] = np.log(np.int64(X.k * X.n))
# ...and the largest-to-smallest dimension ratio for each matrix
X['min_max_a'] = np.float64(X.loc[:, ['m', 'k']].max(axis=1)) / X.loc[:, ['m', 'k']].min(axis=1)
X['min_max_b'] = np.float64(X.loc[:, ['n', 'k']].max(axis=1)) / X.loc[:, ['n', 'k']].min(axis=1)
X_check['log_mn'] = np.log(X_check.m * X_check.n)
X_check['log_mk'] = np.log(np.int64(X_check.m * X_check.k))
X_check['log_kn'] = np.log(np.int64(X_check.k * X_check.n))
X_check['min_max_a'] = np.float64(X_check.loc[:, ['m', 'k']].max(axis=1)) / X_check.loc[:, ['m', 'k']].min(axis=1)
X_check['min_max_b'] = np.float64(X_check.loc[:, ['n', 'k']].max(axis=1)) / X_check.loc[:, ['n', 'k']].min(axis=1)

# Train on the log of the target and exponentiate the predictions back
model = linear_model.RidgeCV(cv=5)
model.fit(X, np.log(y))
y_answer = np.exp(model.predict(X_check))
write_answer(y_answer.reshape(4947), '_habr_RidgeCV')



Afterword


I admit that I can multiply matrices only with Wikipedia open, and Strassen's method and Winograd's algorithms, mentioned in the task description, are something beyond my reach to master. This is my first participation in a machine learning competition, and my sense of personal pride is enhanced by the fact that the result holds up well against the work referenced by the competition's authors: A. A. Sidneva, V. P. Gergel, "Automatic selection of the most efficient implementations of algorithms".

Source: https://habr.com/ru/post/305872/

