The ML Boot Camp competition is currently running: the task is to predict how long it takes to multiply two matrices of sizes m×k and k×n on a given computing system, knowing how long the same problem took on other computing systems with different matrix sizes (exact rules). Let's try to solve this regression problem not with the standard tools and libraries (R, Python, and pandas), but with a cloud product from Microsoft:
Azure ML. Free access is sufficient for our purposes; even a trial Azure account will do. Anyone who wants a quick guide to setting up and using Azure ML in general, and ML Studio in particular, on the example of a real live task, read on.
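Before opening ML Studio, a bit of intuition about the target we are predicting: for the naive algorithm, the amount of work grows with the product of the three matrix dimensions, which hints at which derived features might matter later. A quick back-of-the-envelope sketch (the function name is my own, not part of the competition):

```python
def matmul_flops(m, k, n):
    """Naive multiplication of an m x k matrix by a k x n matrix:
    m*n output cells, each a dot product of length k."""
    return 2 * m * k * n  # one multiply and one add per inner-product step

# Runtime should therefore grow roughly with the product of the three sizes:
print(matmul_flops(1000, 1000, 1000))  # 2000000000 floating-point operations
```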
Creating data sources
Open ML Studio:

We will create one new experiment (in Azure ML terms, an experiment is a complete solution of a problem, from reading the input data to producing the answer; it can later be converted into a Web Service) and two new data sources (datasets) to represent the input data: one for the features, another for the target values. Download the training-sample CSV files from the ML Boot Camp website (x_train.csv and y_train.csv). To add a data source, select “Datasets” in the menu on the left and click “New” in the lower-left corner; this window appears:

Specify the path to the x_train.csv file and name this data source x_train. Create the y_train data source the same way. Both data sources now appear on the “Datasets” tab:

Creating an experiment and choosing features
It's time to create an experiment: in the menu on the left select “Experiments”, click “New” in the lower left, and choose “Blank Experiment”. In the line at the top you can give it a suitable name. We end up with the following blank canvas for our Data Science work:

As you can see, the menu on the left lists all the operations that can be added to an experiment: data input and output, column selection, various regression and classification methods, and so on. All of them are added to the experiment simply by dragging them with the mouse and connecting operations together.
Now we need to specify the input data for the task. In the menu on the left select the topmost item, “Saved Datasets”, then “My Datasets”, pick the x_train and y_train data sources we created, and drag them onto the experiment area, resulting in:

Next we need to combine the columns of these two data sources, because all Azure ML methods work with a single table (data frame) in which you then indicate which column is the target. For this we use the “Add Columns” module. Hint: the module search box helps you find a module by keyword, or confirm that no such module exists. Drag the “Add Columns” operation onto the workspace and connect its two upper input points to our x_train and y_train data sources, respectively. This operation has no parameters, so nothing else needs to be configured. We get:
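What “Add Columns” does can be sketched in pandas, the very library we set aside; a toy example with illustrative column names (the real x_train has 952 feature columns):

```python
import pandas as pd

# Toy stand-ins for x_train.csv and y_train.csv;
# the column names here are my own illustration
x_train = pd.DataFrame({"m": [100, 200], "k": [50, 60], "n": [300, 400]})
y_train = pd.DataFrame({"time": [0.12, 0.48]})

# "Add Columns" joins the two tables side by side, row by row,
# producing a single data frame with both features and target
combined = pd.concat([x_train, y_train], axis=1)
```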

To see what our data now looks like, run the experiment by clicking the “Run” button at the bottom. After the experiment completes successfully, click the output of the “Add Columns” operation and select the “Visualize” action:


The properties window shows the columns, the first rows, and, for each feature, its mean, median, a histogram, and so on. We see that our table has 952 columns (features), and from them we need to choose meaningful ones that will help solve our problem. Feature selection is one of the most complex and non-deterministic tasks in Data Science, so for now, for simplicity, we will pick a few features that look significant at first glance. The module that helps us do this is called “Select Columns in Dataset”. Add it to the workspace and connect it to “Add Columns”. Now, in the parameters of “Select Columns in Dataset”, we specify which features to keep: select the module and, in the properties pane on the right, click “Launch column selector”:

Now we add the names of the columns we want to keep (this is by no means an optimal choice of columns); don't forget to add the “time” column:

Run the experiment again and make sure that only the selected columns remain in the resulting table. Now the last step of data preparation: splitting the data into training and test samples in a 70:30 proportion. Find and place the “Split Data” module in the workspace and, in its settings, set “Fraction of rows in the first output dataset” to 0.7. We get:
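Under the hood, a 70:30 split amounts to shuffling the rows and cutting at the 70% mark; a minimal NumPy sketch on a toy table:

```python
import numpy as np

rng = np.random.default_rng(42)
table = np.arange(20).reshape(10, 2)  # toy table with 10 rows

# "Split Data" with fraction 0.7: shuffle the row order, cut at 70%
idx = rng.permutation(len(table))
cut = int(0.7 * len(table))
train, test = table[idx[:cut]], table[idx[cut:]]
```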

Use of algorithms
Now we are finally ready to apply a regression method. The methods are listed in the left menu under “Machine Learning”, “Initialize Model”, “Regression”:

First, let's try a decision-forest method: “Decision Forest Regression”. Add it to the workspace along with the “Train Model” module. “Train Model” has two inputs: one connects to the algorithm (in our case “Decision Forest Regression”), the other to the training data (the left output of the “Split Data” module). The experiment now looks like this:

The red circle on the “Train Model” module tells us that it has a required parameter we have not set: we must indicate which feature we are trying to predict (in our case, time). Click “Launch column selector” and add the single “time” column. Note that the method itself has default settings that allow it to run without manual configuration; of course, to get good results you should try different parameter combinations, which are specific to each method. Now the experiment can be run; the forest of trees will be built, and the trees can even be inspected through the already familiar “Visualize” window. After training the model, it is worth testing it on the test (validation) sample, the 30% of the initial data we held out. For this we use the “Score Model” module, connecting its first input to the output of “Train Model” (the trained model) and its second input to the second output of “Split Data”. The sequence of operations now looks like this:
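“Decision Forest Regression” is an ensemble of decision trees; the same train-then-score step can be sketched with scikit-learn's RandomForestRegressor on synthetic data (not the competition set, and a different implementation of the same idea):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Toy data: time grows with m*k*n, the dominant cost of matrix multiplication
X = rng.integers(10, 500, size=(200, 3)).astype(float)  # columns: m, k, n
y = X[:, 0] * X[:, 1] * X[:, 2] * 1e-7

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[:140], y[:140])    # "Train Model": fit on 70% of the rows
pred = model.predict(X[140:])  # "Score Model": predict on the held-out 30%
```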

Run the experiment again and look at the output of “Score Model”:

Two new columns have been added: “Scored Label Mean” (the mean of the predicted value) and “Scored Label Standard Deviation” (the spread of the individual trees' predictions). You can also build a scatter plot of predicted versus actual values (visible in the figure). Now let's measure the model's accuracy with the “Evaluate Model” module, which we connect to “Score Model”.

The output of the “Evaluate Model” module contains accuracy metrics for the method on our validation data, including absolute and relative errors:
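The metrics Evaluate Model reports are simple aggregates of the prediction errors; here is how the mean absolute error and MAPE (the contest's metric, mentioned at the end of this article) are computed, on hypothetical numbers:

```python
import numpy as np

# Hypothetical actual vs. predicted times on the validation rows
actual    = np.array([1.0, 2.0, 4.0, 5.0])
predicted = np.array([1.1, 1.8, 4.4, 5.0])

mae  = np.mean(np.abs(actual - predicted))             # Mean Absolute Error
mape = np.mean(np.abs((actual - predicted) / actual))  # Mean Absolute Percentage Error
```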

Of course, the method is far from perfect, but then we have not tuned it at all.
Adding a new method and comparing methods
Let's try another method based on decision trees: “Boosted Decision Tree Regression”. Just as for the first method, add “Train Model” and “Score Model” modules, run the experiment, and look at the output of the new method's “Score Model”. Note that only one new column appears, the predicted value “Scored Labels”; for it, too, you can build a scatter plot:
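The boosted variant can likewise be sketched with scikit-learn's GradientBoostingRegressor (again a stand-in implementation on toy data, not Azure's exact algorithm):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(100, 3))
y = X.sum(axis=1)

# Boosting builds shallow trees sequentially, each one correcting
# the residual errors of the ensemble built so far
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X[:70], y[:70])
scored_labels = model.predict(X[70:])  # the single "Scored Labels" column
```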

Now let's compare the accuracy of the two methods using the “Evaluate Model” module we already added: connect its right input to the output of the second method's “Score Model”. We obtain the following sequence of operations:

Let's look at the output of the “Evaluate Model” module:

Now we can compare the methods with each other and choose the one whose accuracy (in the sense relevant to our task) is higher.
Solving the problem with real data
We have trained the methods and we know their accuracy; it's time to test them in battle. Download the file x_test.csv, which contains the data for which we must predict the matrix-multiplication time. To use the trained method, we need to:
- Add a new data source named x_test with the data from the x_test.csv file.
- Drag the new x_test data source onto the experiment workspace.
- Keep only the columns that took part in training: copy the “Select Columns in Dataset” module and remove the “time” column from its list of columns (it is absent from the test data).
- Run the trained method on the prepared data: add a “Score Model” operation, connect its first input to the output of the “Train Model” module of the “Boosted Decision Tree Regression” method, and its second input to the output of the newly added “Select Columns in Dataset”.
- Convert the data to a format that can be uploaded as a solution to the ML Boot Camp website: add another “Select Columns in Dataset” module that keeps just one column, our predicted “Scored Labels” values, and attach a “Convert to CSV” module to its output.
As a result, we obtain the following experiment:

You can download the resulting CSV file by clicking the output of the “Convert to CSV” module and selecting “Download”. Delete the first line (the header) from the received CSV and upload it to the ML Boot Camp website. It works! But the accuracy leaves much to be desired.
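The manual header-deleting step can also be sketched in a few lines of Python: write the “Scored Labels” column as one value per line with no header row (the prediction values below are hypothetical):

```python
import io
import numpy as np

scored_labels = np.array([0.012, 0.34, 1.25])  # hypothetical predictions

# ML Boot Camp expects one predicted value per line with no header row,
# so write the column directly instead of editing the downloaded file by hand
buf = io.StringIO()
np.savetxt(buf, scored_labels, fmt="%.6f")
lines = buf.getvalue().splitlines()
```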
Further optimization
Here are several modules that can help improve the accuracy of the regression.
- Try the different methods that can be found in the menu on the left.
- The “Filter Based Feature Selection” module helps with feature selection: it tries to pick the features with the most predictive power (by one of several methods specified in its properties). It is added in place of the “Select Columns in Dataset” module.
- The “Permutation Feature Importance” module, which takes a trained model and a test dataset as inputs, helps assess which features matter most in an already trained model.
- The “Tune Model Hyperparameters” module helps choose the method's parameters: it runs the method a specified number of times with different parameter sets and shows the accuracy of each run.
- As heavy artillery, you can use arbitrary R and Python scripts via the “Execute R Script” and “Execute Python Script” modules, respectively.
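The idea behind “Tune Model Hyperparameters” can be sketched with scikit-learn's GridSearchCV, which likewise runs the method over a grid of parameter sets and reports the score of each run (toy data; the grid values here are arbitrary):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(120, 3))
y = X[:, 0] * X[:, 1] + X[:, 2]

# Try every combination of the parameter grid with 3-fold cross-validation
# and keep the combination with the best validation score
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [20, 50], "max_depth": [3, None]},
    cv=3,
    scoring="neg_mean_absolute_error",
)
grid.fit(X, y)
best_params = grid.best_params_
```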
Conclusion
I like Azure ML: it lets you quickly prototype a solution to a problem and then dig into tuning and optimizing it.
The experiment is published in the gallery and is open to everyone at:
gallery.cortanaintelligence.com/Experiment/ML-Boot-Camp-from-Mail-ru-1
Take part in the contest! If you manage to get a MAPE error below 0.1, write to the author; he will be pleased.