
Azure Machine Learning for Data Scientists

This article was created by our community friend Dmitry Petukhov, a Microsoft Certified Professional and developer at Quantum Art.
The article is part of a series on fraud detection; the rest of the articles can be found in Dmitry's profile.




Azure Machine Learning is a cloud service for predictive analytics. It consists of two components: Azure ML Studio, a development environment accessible through a web interface, and Azure ML web services.
A data scientist's typical sequence of actions when searching for patterns in a dataset with supervised learning algorithms is depicted and described in detail below.


Projects in Azure ML Studio are called experiments. Let's create an experiment and look at the tools that Azure ML offers the data scientist for each step of the sequence illustrated above.

Data acquisition


The Reader control loads both structured and semi-structured datasets. It supports relational sources (Azure SQL Database) as well as non-relational ones: NoSQL (Azure Table, Hive queries), OData services, and documents in various text formats from Azure Blob Storage or via URL (HTTP).

Manual data entry is also possible (the Enter Data control). To convert data between formats, use the elements of the Data Format Conversions section. The following output formats are available: CSV, TSV, ARFF, and SVMLight.

Data preparation


Data incompleteness / duplicate data

In the general case, the researcher deals with incomplete data: the training sample contains null values. The Clean Missing Data control lets you either remove the row/column containing the missing data or replace the missing value with a constant, the mean, the median, or the mode.
It is also common for a dataset to contain duplicate rows, which can significantly reduce the accuracy of the future model. To remove them, use the Remove Duplicate Rows control.
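Outside Azure ML Studio, the same cleanup can be sketched in a few lines of pandas; the column names and values here are purely illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative dataset with a missing value and a duplicate row
df = pd.DataFrame({
    "amount": [120.0, np.nan, 75.5, 75.5],
    "country": ["US", "DE", "FR", "FR"],
})

# Clean Missing Data analogue: replace the missing value with the column mean
df["amount"] = df["amount"].fillna(df["amount"].mean())

# Remove Duplicate Rows analogue
df = df.drop_duplicates().reset_index(drop=True)
```

The same `fillna` call accepts a constant, `median()`, or `mode()[0]`, mirroring the replacement strategies the control offers.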

Data exploration


Transformation and data cleansing

Data transformation is one of the stages that requires a large amount of manual work, especially if the data for the training sample comes from heterogeneous sources: local CSV files, a distributed file system (HDFS), Hive. The lack of tools for querying heterogeneous sources in a uniform way can significantly complicate the data analyst's work.

After uploading data to Azure ML, the researcher no longer faces the problem of unified access to heterogeneous sources and works uniformly with data of any origin. The Manipulation section provides controls for inner/left/full join operations, projection, adding and deleting columns, grouping data by predictors, and even arbitrary SQL transformations on loaded datasets (the Apply SQL Transformation control).
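For comparison, a hedged sketch of the same join and grouping operations in pandas; the tables and column names are invented for illustration:

```python
import pandas as pd

transactions = pd.DataFrame({"user_id": [1, 2, 3], "amount": [10.0, 25.0, 40.0]})
users = pd.DataFrame({"user_id": [1, 2], "country": ["US", "DE"]})

# Join Data analogue: left join on the user_id key column
joined = transactions.merge(users, on="user_id", how="left")

# Grouping by a predictor, as the Manipulation controls allow
# (dropna=True skips the rows that found no match in the join)
totals = joined.groupby("country", dropna=True)["amount"].sum()
```

An inner or full join is the same call with `how="inner"` or `how="outer"`.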

Determining the structure (metadata) of a dataset

The Metadata Editor control lets you explicitly specify the type of data (string, integer, timestamp, etc.) contained in particular columns, mark a column's contents as predictors (features) or answers (labels), and specify the type of the predictor's scale: nominal (categorical) or absolute.
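A rough pandas equivalent of this metadata step, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({
    "amount": ["10.5", "20.0"],     # values arrived as strings
    "channel": ["web", "mobile"],
    "is_fraud": [0, 1],
})

# Metadata Editor analogue: fix column types explicitly
df["amount"] = df["amount"].astype(float)         # numeric predictor
df["channel"] = df["channel"].astype("category")  # nominal (categorical) predictor

# Mark the answer (label) column; the remaining columns are predictors
label = df.pop("is_fraud")
features = df
```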

Detecting patterns and anomalies

Azure ML Studio offers numerous statistical analysis tools (the Statistical Functions section of the toolbar). One I use most often is the Descriptive Statistics control. With it, you can get the minimum (Min) and maximum (Max) values stored in a column, the median (Median), the arithmetic mean (Mean), the first (1st Quartile) and third (3rd Quartile) quartiles, the standard deviation (Sample Standard Deviation), etc.
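The same descriptive statistics can be reproduced in a few lines of pandas (the values are illustrative):

```python
import pandas as pd

values = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

stats = {
    "Min": values.min(),
    "Max": values.max(),
    "Mean": values.mean(),
    "Median": values.median(),
    "1st Quartile": values.quantile(0.25),
    "3rd Quartile": values.quantile(0.75),
    # ddof=1 gives the sample, not population, standard deviation
    "Sample Standard Deviation": values.std(ddof=1),
}
```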

Split dataset

When using supervised learning algorithms, at least once per experiment you will have to divide the dataset into two subsets: a Training Dataset and a Test Dataset.

For a positive end result, i.e. an accurate model, it is very important that the training sample cover the widest possible range of values the cases can take (in other words, the training dataset should cover the widest possible range of conditions the modeled system can encounter). The most widely used strategy for obtaining a high-quality training sample is shuffling the initial data.

For dataset-splitting tasks, Azure ML Studio provides the Split control, which implements several data separation strategies and lets you specify the proportion of data that falls into each subset.
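A minimal sketch of the shuffle-then-split strategy, assuming NumPy and a 70/30 proportion:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

data = np.arange(100)                 # stand-in for 100 rows of a dataset
indices = rng.permutation(len(data))  # shuffle before splitting

split = int(0.7 * len(data))          # 70% to training, 30% to test
train_idx, test_idx = indices[:split], indices[split:]
train, test = data[train_idx], data[test_idx]
```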

Model building


Feature selection

Predictor selection (Feature Selection) is a stage with a huge impact on the accuracy of the resulting model. To identify all the predictors that matter for the model while avoiding adding too many of them, the researcher needs knowledge both of mathematical statistics and of the subject area under study.

The Filter Based Feature Selection control identifies predictors in the loaded dataset based on Pearson, Spearman, or Kendall correlations, or other statistical methods. Identifying predictors with mathematical methods helps create an acceptable model as quickly as possible in the early stages. At the final stage of model refinement, predictors are often chosen based on expert opinion in the studied area. For explicit (manual) selection of predictors, Azure ML offers the Metadata Editor tool, which lets you mark a dataset column as a predictor.
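A hedged sketch of correlation-based predictor ranking in pandas, on synthetic data where x1 is constructed to drive the label and x2 is pure noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
})
# The label depends (almost) only on x1
df["label"] = 2 * df["x1"] + rng.normal(scale=0.1, size=n)

# Rank predictors by absolute correlation with the label,
# as Filter Based Feature Selection does (method can also be
# "spearman" or "kendall")
scores = df.drop(columns="label").corrwith(df["label"], method="pearson").abs()
best = scores.sort_values(ascending=False)
```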

Feature Scaling / Dimension reduction

Some machine learning algorithms work incorrectly without normalization of predictor values (Feature Scaling). In addition, reducing the number of variables/predictors in the model (Dimension Reduction) improves resource utilization during training and helps avoid overfitting. Both techniques reduce the time needed to fit the objective function that describes the model.
Elements of this functionality group are located in the Scale and Reduce section of the Azure ML Studio toolbar.
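A minimal NumPy sketch of feature scaling, standardizing each column to zero mean and unit variance so that a large-valued predictor does not dominate a small-valued one:

```python
import numpy as np

# Two predictors on very different scales (illustrative values)
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standardize each column: subtract its mean, divide by its std
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std
```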

Applying a machine learning algorithm

Applying a machine learning algorithm in Azure ML goes through the following steps:
initializing a model with a specific machine learning algorithm (Machine Learning -> Initialize Model),
training the model (Machine Learning -> Train),
scoring the resulting model on the training and test samples (Machine Learning -> Score),
evaluating the resulting algorithm (Machine Learning -> Evaluate).
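These four steps map naturally onto other toolkits as well; a sketch with scikit-learn on a synthetic dataset (this illustrates the workflow, not the Azure ML API itself):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)       # Initialize Model
model.fit(X_train, y_train)                     # Train
predictions = model.predict(X_test)             # Score
accuracy = accuracy_score(y_test, predictions)  # Evaluate
```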

Regression, classification, and clustering algorithms are available in Azure ML. It is possible to configure the key parameters of the selected algorithm: for the Multiclass Neural Network algorithm, you can set the number of hidden nodes, the number of learning iterations, initial weights, the type of normalization, etc. ( list of all configurable parameters ).

The complete list of algorithms as of March 2015 is shown in the illustration below.



Model evaluation


As mentioned above, the Azure ML Studio toolbar has a Machine Learning -> Score subsection for assessing the model. The assessment result is available both as histograms and as statistical indicators (minimum and maximum values, median, mean (expected value), etc.).

The Evaluate Model control displays a confusion matrix, which contains the correctly recognized positive examples (True Positive, TP), the correctly recognized negative examples (True Negative, TN), and the recognition errors (False Positive, False Negative).

Model performance evaluation is available both as a graph and as a table of metrics: Accuracy, Precision, Recall, F1 Score .

Of greatest (but not sole) interest is the Accuracy indicator, which is calculated as the ratio of all successful predictions to the total number of elements in the set: (TP + TN) / Total.
The following figures illustrate the meaning of the remaining indicators:
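With hypothetical confusion-matrix counts, all four metrics can be computed directly from their definitions:

```python
# Toy confusion-matrix counts (hypothetical values)
tp, tn, fp, fn = 80, 90, 10, 20
total = tp + tn + fp + fn

accuracy = (tp + tn) / total                       # share of correct predictions
precision = tp / (tp + fp)                         # how many flagged items were truly positive
recall = tp / (tp + fn)                            # how many true positives were found
f1 = 2 * precision * recall / (precision + recall) # harmonic mean of the two
```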

The next most popular indicator after Accuracy is AUC (Area Under Curve). AUC ranges from 0 to 1; values close to 0.5 indicate that the model performs no better than flipping a coin and assigning the class based on which side came up. The closer AUC is to 1, the more accurate the model. The ROC curve from which AUC is computed is traced out by varying the Threshold level.
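A small illustration of both extremes using scikit-learn's roc_auc_score, with toy labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Scores from an informative model: AUC is well above chance
informative = np.array([0.1, 0.4, 0.35, 0.8])
auc_model = roc_auc_score(y_true, informative)

# A constant score carries no information: AUC collapses to 0.5,
# the coin-flip baseline described above
auc_coin = roc_auc_score(y_true, np.full(4, 0.5))
```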
More information about measuring the performance of algorithms in Azure ML can be found here.

Model publication


Models built and calculated in Azure ML Studio can be deployed as a scalable, fault-tolerant web service.

The service works in two modes: batch mode (asynchronous responses, 99.9% SLA) and a low-latency Request/Response mode (synchronous responses, 99.95% SLA).
The service receives and sends messages in application/json format over HTTPS. To access the service, an API Key is issued: an access key included in the request header.
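A hedged sketch of building such a request in Python; the URL is a placeholder, and the body schema here is simplified relative to what the service's API documentation page actually specifies:

```python
import json
import urllib.request

# Hypothetical endpoint URL and API Key: substitute your own values
URL = "https://example.services.azureml.example/execute"
API_KEY = "your-api-key"

def build_request(features: dict) -> urllib.request.Request:
    """Build an HTTPS request in the shape the text describes:
    application/json body, API Key carried in the request header."""
    body = json.dumps({"Inputs": {"input1": [features]}}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
    )

req = build_request({"amount": 120.0, "country": "US"})
# urllib.request.urlopen(req) would send it; omitted to keep the sketch offline
```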
You can add an arbitrary number of endpoints through which the service can be accessed. For each endpoint you can configure the Throttle Level, which is certainly a virtue. The disadvantage is that there are only two levels, High and Low, with no way to set the level manually at, say, 10240 requests/sec. Another oddity is that all endpoints share a single API Key.

After the service is created, an API documentation page becomes available which, in addition to a general description of the service and the formats of the expected input and output messages, contains examples of calling the service in C#, Python, and R.
A successful model can also be shared with the community in the Azure ML Gallery, which already holds many interesting experiments. If your model is of broad social value, you can publish a service providing access to it in the Microsoft Azure Marketplace SaaS application store. Azure Marketplace already contains a large number of data services, available both for free and by subscription (for example, per 10K requests).

Disadvantages


Azure ML, like many services of the Azure cloud platform, offers several tiers of service. In Azure ML these are the Free and Standard tiers. Free costs a minimal (almost zero) amount and is perfect for a first acquaintance with the service. The Standard tier is enterprise-grade and free of the many artificial restrictions the Free tier has. From here on I will talk only about the Standard tier.

I will not claim that what I list below are restrictions; rather, these are things that remained unclear to me.

A fly in the ointment for the Azure ML Experiment


I did not find in the Azure ML documentation any indication of the maximum input data size (in GB), or whether (and which) limitations exist on the number of columns (predictors) and rows (training examples) for the learning algorithms available in Azure ML. If such limitations exist, their importance when designing an analytical system can hardly be overestimated.

A fly in the ointment for Azure ML Web Services


Unknown: the maximum number of simultaneous requests to one endpoint and the maximum number of endpoints. Altogether, I found the following numbers in one place (I do not presume to judge their current relevance): a maximum of 20 parallel requests per endpoint and a maximum of 80 endpoints. I measured the call duration for one of my Azure ML web services located in the South Central US region (the client sending the requests was in the same datacenter). The response wait time in Request/Response mode was about 0.4 seconds.
From this you can calculate that, in my particular case, throughput above roughly 4K (20 * 80 / 0.4 = 4000) requests per second should not be expected. This scalability limit must also be considered at design time.
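Plugging those figures into the estimate:

```python
# Back-of-the-envelope capacity estimate from the figures above
parallel_per_endpoint = 20  # max parallel requests per endpoint
endpoints = 80              # max number of endpoints
latency_s = 0.4             # measured Request/Response round trip

# Each endpoint sustains parallel_per_endpoint / latency_s requests per
# second; multiply by the number of endpoints for the overall ceiling
max_requests_per_s = parallel_per_endpoint * endpoints / latency_s
```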

Finally, there is no way to configure access rights for each endpoint separately. For such per-endpoint rights to be issued, each endpoint would need its own API Key (or other means of authentication), and that is not yet possible in Azure ML.

Killer Feature (instead of the conclusion)


It is worth noting that if for some reason the functionality of Azure ML Studio's built-in tools is not enough, researchers can write and execute scripts in R (quickstart) and Python (quickstart), the most popular programming languages in scientific research.
And all of this can be tried for free. If that is not enough, prices for the Free and Standard tiers are here.

Additional sources

Source: https://habr.com/ru/post/254637/

