
MLBootCamp “Performance Evaluation”: a very simple and fast solution.

In this post I want to share my approach to solving the MLBootCamp “Performance Evaluation” task from Mail.ru. The main advantage of this method is its simplicity and the speed of the script. Although it cannot quite compete with the winners of the competition (my congratulations!), it may be useful in practice when a few tenths of a percent are not critical, or as a starting point for further development. The script is written in R.


1. Download and prepare data


Briefly about the competition; here is how the task was stated:
"We suggest you teach a computer to predict how many seconds it will take to multiply two matrices of sizes m×k and k×n on a given computing system, knowing how long this problem took to solve on other computing systems with different matrix sizes."

It is worth noting right away that, strictly speaking, this is not quite the case: there is not a single computing system in the test set that does not also appear in the training set. The mean absolute percentage error (MAPE) is used as the metric.
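
For reference, for n observations with true times y_i and predicted times ŷ_i this metric is:

MAPE = (100% / n) * Σ |y_i − ŷ_i| / y_i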

And now in a little more detail. In both samples there are 951 features for each observation. The first three of them are the dimensions of the multiplied matrices; the rest are the parameters of the systems on which the measurements were made. It turns out that if we discard the matrix dimensions, the training sample contains only 92 unique configurations of the computing systems used, i.e. the remaining 948 parameters take only 92 distinct combinations. Replacing these 948 parameters with a single one, the configuration identifier, we are left with only 4 features per observation. We also prepare the data for further work by adding the product of all matrix sizes. The data now contain the features m, k, n, m*k*n and conf (the computer identifier). The training sample additionally contains the target variable time, the time it took to multiply the matrices.
library(quantreg)
library(dplyr)

# Read the data
y <- read.csv('y_train.csv')[,1]
X.train <- read.csv('x_train.csv', stringsAsFactors = FALSE)
X.test <- read.csv('x_test.csv', stringsAsFactors = FALSE)

# Collect the unique system configurations from both samples
X.all <- rbind(X.train, X.test)
X.uni <- unique(X.all[,4:951])

# Assign an identifier to each configuration
X.uni <- cbind(conf=paste0('conf', formatC(1:nrow(X.uni), 2, flag = '0')), X.uni)
confs <- unique(X.uni$conf)

# Attach the configuration identifier to the training sample
# and keep only the matrix sizes and the identifier
cols <- c('m', 'k', 'n', 'conf')
X.train <- dplyr::inner_join(X.train, X.uni, by=colnames(X.uni)[2:ncol(X.uni)])[cols]
X.train[,1:3] <- lapply(X.train[,1:3], as.numeric) # int -> numeric
X.train$mkn <- X.train$m * X.train$k * X.train$n
X.train$time <- y

# The same for the test sample
X.test <- dplyr::inner_join(X.test, X.uni, by=colnames(X.uni)[2:ncol(X.uni)])[cols]
X.test[,1:3] <- lapply(X.test[,1:3], as.numeric) # int -> numeric
X.test$mkn <- X.test$m * X.test$k * X.test$n


2. Preliminary data analysis and selection of a working model


Now we can look at the data and see the expected dependence of the target variable time on m*k*n. For example, configuration 39 (left) is well approximated by linear regression. However, a linear model does not give a good result for every configuration, for example configuration 11 (right):

[Figure: time versus m*k*n for configuration 39 (left) and configuration 11 (right)]


Noise and outliers make a significant negative contribution to the model, so one way of dealing with them is to remove such observations. The drawback of this approach is having to decide how "wrong" an observation must be before it can be removed. Moreover, for some systems there is very little data, and it is wasteful to throw out even noisy observations.

An alternative is to use models that are robust to outliers and noise. Examples of such models are quantile and robust regression.
In R they can be found in the quantreg and MASS packages, respectively. Each of them has its own settings and/or more advanced variants; here these methods were used with their default parameters. This is how the plot with all the regressions looks for the noisy configuration 11:

[Figure: linear, quantile and robust regression fits for configuration 11]
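
For illustration, here is a minimal sketch of how such a comparison could be reproduced for a single configuration (the identifier 'conf11' and the plotting details are assumptions, not taken from the original script):

library(quantreg)
library(MASS)

# Data for one configuration (the identifier is illustrative)
X.conf <- subset(X.train, conf == 'conf11')

fit.lm  <- lm(time ~ mkn, data = X.conf)   # ordinary least squares
fit.rq  <- rq(time ~ mkn, data = X.conf)   # median (tau = 0.5) quantile regression
fit.rlm <- rlm(time ~ mkn, data = X.conf)  # robust regression (M-estimation)

# Compare the fitted lines visually
plot(X.conf$mkn, X.conf$time, xlab = 'm*k*n', ylab = 'time')
abline(fit.lm,  col = 'red')
abline(fit.rq,  col = 'blue')
abline(fit.rlm, col = 'green')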


3. Training and testing the models


The spread of the target variable differs widely between systems, so a solution suggests itself: train a stack of models, one for each configuration. Let the dependence in each model be a little more complex, namely, let the time spent multiplying the matrices depend not only on the product of their sizes but also on each of them separately, i.e. time ~ m + k + n + mkn:

models <- list()
for (i in 1:length(confs)){
  # Data for one configuration, without the identifier column
  X.conf <- subset(X.train, conf==confs[i])
  X.conf$conf <- NULL
  # Quantile regression time ~ m + k + n + mkn
  fit <- rq(time ~ ., data=X.conf)
  models[[confs[i]]] <- fit
}

Now it is easy to compute the predictions by iterating over the models and selecting from each dataset the rows that correspond to each individual configuration and its model.

y.test <- rep(1, nrow(X.test))
y.train <- rep(1, nrow(X.train))

# Predictions on the training sample
for (i in 1:length(confs)){
  X.conf <- subset(X.train, conf==confs[i])
  y.train[as.numeric(rownames(X.conf))] <- predict(models[[confs[i]]])
}

# Predictions on the test sample
for (i in 1:length(confs)){
  X.conf <- subset(X.test, conf==confs[i])
  y.test[as.numeric(rownames(X.conf))] <- predict(models[[confs[i]]], newdata = X.conf)
}

Because the source data, and therefore the resulting models, are not ideal, some predicted values come out below 1, which by the problem statement should not happen. Here it proves useful to use the predictions on the training sample and replace all unsuitable values. A simple and effective method is to replace every number below a certain constant cutoff with the cutoff itself. The plot shows the metric (MAPE) as a function of the cutoff value. The horizontal line is the metric without any replacement; the vertical line marks the cutoff at which the metric takes its smallest value.

[Figure: MAPE as a function of the cutoff value]


Using such a replacement slightly improves the value of the metric. As practice showed, the optimal cutoff level is approximately the same for the training and test samples.
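
A minimal sketch of how such a cutoff could be chosen from the training predictions (the mape helper and the grid of candidate values are assumptions, not part of the original script):

# MAPE metric (mean absolute percentage error)
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual)) * 100

# Try a grid of cutoff values and keep the one with the smallest training MAPE
cutoffs <- seq(0.1, 2, by = 0.05)
scores  <- sapply(cutoffs, function(cut) mape(X.train$time, pmax(y.train, cut)))
best.cut <- cutoffs[which.min(scores)]

# Apply the selected cutoff to the test predictions
y.test <- pmax(y.test, best.cut)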

The obvious drawback of the code above is that the models were built without cross-validation, so the metric value obtained on the training sample cannot be trusted, nor can the score on the public leaderboard. This shortcoming can be fixed with a corresponding amount of extra code.
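
As an illustration, here is a possible sketch of per-configuration k-fold cross-validation (the cv.mape helper and the choice of k are assumptions, not taken from the original script):

# Cross-validated MAPE for one configuration (an illustrative helper)
cv.mape <- function(X.conf, k = 5){
  X.conf$conf <- NULL                        # drop the identifier, as in the training loop
  folds <- sample(rep(1:k, length.out = nrow(X.conf)))
  errs <- sapply(1:k, function(f){
    fit  <- rq(time ~ ., data = X.conf[folds != f, ])
    pred <- predict(fit, newdata = X.conf[folds == f, ])
    mean(abs((X.conf$time[folds == f] - pred) / X.conf$time[folds == f])) * 100
  })
  mean(errs)
}

# Average cross-validated MAPE over all configurations
mean(sapply(as.character(confs), function(cf) cv.mape(subset(X.train, conf == cf))))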

However, there is also a clear advantage: speed (the author's script finished in about half a second), which makes it possible to evaluate many different combinations of models in a short time. For comparison, models based on gradient boosting, random forests or neural networks took at least several tens of minutes to build.

Nevertheless, the algorithm, with a choice between quantile and robust regression for each configuration, achieved MAPE = 5.23% on the public leaderboard. The final competition metric is not available, but based on cross-validation a MAPE of about 5.4% is expected.
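
The per-configuration choice between the two regression types could look roughly like this (selection by training MAPE is an assumption; the original script may have used a different criterion):

library(MASS)

models <- list()
for (i in 1:length(confs)){
  X.conf <- subset(X.train, conf == confs[i])
  X.conf$conf <- NULL
  fit.rq  <- rq(time ~ ., data = X.conf)    # quantile regression
  fit.rlm <- rlm(time ~ ., data = X.conf)   # robust regression
  # Keep whichever model has the smaller error on this configuration
  err.rq  <- mean(abs((X.conf$time - predict(fit.rq))  / X.conf$time))
  err.rlm <- mean(abs((X.conf$time - predict(fit.rlm)) / X.conf$time))
  models[[confs[i]]] <- if (err.rq <= err.rlm) fit.rq else fit.rlm
}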

4. Conclusion


Thus, training the stack of models and predicting the target variable is performed very quickly, in under a second, while good accuracy is achieved. The described method can be improved by paying closer attention to individual configurations, for example by selecting the quantile for each regression or the parameters of the robust methods, removing obvious outliers, choosing a separate cutoff, and so on, which as a rule does not greatly affect the script's runtime.

In the author's modest opinion, using quantile (or another outlier-resistant) regression to predict the execution time of computational experiments is quite practical, accurate, and far less computationally expensive than the approach described in the article the competition was based on, which used linear regression (a reinvented wheel?) and random forests.

Source: https://habr.com/ru/post/306198/

