Machine learning in RapidMiner

Dmitry Sobolev, Igor Masternoy, Raphael Zubairov

It is impossible not to notice how quickly the total volume of collected metrics is growing. Automated systems collect data more frequently, storage bandwidth is increasing, and the set of metrics available to us keeps expanding. This trend is most pronounced in IoT, but other industries can also boast a huge set of data sources, whether public or available by subscription.

The growth in data creates new challenges for analysts and professionals working on business optimization. The pace of the world economy is accelerating, but it is precisely a rapid response to changes at the micro level that allows individual companies to grow. This is where data analysis and machine learning tools come to the rescue.

In the 2000s, machine learning and in-depth data analysis were the domain of university groups and specialized start-ups. Today, any company has access to a virtually unlimited range of algorithms, approaches, and ready-made solutions for building automated systems, as well as a whole set of products for data analysis.

Machine learning is no longer used only by corporations like Microsoft and Google: even small companies can benefit from high-quality data analysis or a recommendation system. Until recently, such methods required hiring programmers, analysts, and data scientists; now services and applications are appearing on the market that let us process data and build predictive models through a graphical interface. Even a person with minimal knowledge in this area can use them.

Today the top three in automated and simplified machine learning are DataRobot, RapidMiner, and BigML. In this article we look at RapidMiner in detail: what it can do and how it can make your life easier.

RapidMiner

It is critically important for any business to estimate its available workforce over separate periods of time. This makes it possible to plan projects, which always depend heavily on human resources. Seasonal surges in colds are an additional risk factor: every winter a sizable share of employees fall ill. As a result, project deadlines slip, and any company would, of course, like to avoid that. Machine learning can help here.

Using RapidMiner, we will analyze data on colds and build a model that can predict outbreaks. Based on the forecast, a company will be able to act in advance and avoid losses.

Let's get acquainted with the program:

Fig. 1 RapidMiner main screen.

On the left side of the screen are the data import panel and the operator panel. RapidMiner can load data from a database or cloud storage (Amazon S3, Azure Blob, Dropbox). For convenience, the operators are divided into categories:

data access (working with files, databases, cloud storage, Twitter streams);
operators for working with dataset attributes: type conversion, dates, set operations, etc.;
mathematical modeling operators: predictive models, cluster analysis models, optimization models;
auxiliary operators: running Java and Groovy routines, anonymizing data, sending email messages, event schedulers.

We have described only some of the main categories; each has its own subcategories and operator variants. It is also worth noting that operators can be added from the ever-growing RapidMiner Marketplace. For example, among the available extensions there is an operator that converts data sets into time series.

The central part of the screen is the workspace for building data transformation processes. Using drag and drop, we add to the process the data we will work with and the operators for data transformation, modeling, and so on. By specifying the connections between the data and the operators, we define the process flow. Below the workspace is a panel with tips: based on processes built by other users, it suggests which operation to perform next. On the right is a panel with the parameters of the selected operator and detailed documentation on those parameters and how the operator works.

First, let's load the data (see Fig. 2) on the number of Google search queries in Ukraine related to the common cold. A sample of the data is given in Table 1 of the Appendix.

Fig. 2 View of the data for Ukraine

The data give the number of queries as of the end of each week from 2005 to 2015. When importing the data, you must set the date format so that time plots are built correctly. Connect the output of the data block to the process output port (res). When you press the "start" button, the program shows general statistics. The results are shown in Fig. 4.
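Why the date format matters can be illustrated outside RapidMiner. The following is a minimal Python sketch, not part of the RapidMiner process; the helper name `parse_week` and the two format strings are our assumptions for illustration. It shows that the same string is read as two different days depending on the declared format.

```python
from datetime import datetime, date

def parse_week(text: str, fmt: str) -> date:
    """Parse a date string using an explicitly stated format."""
    return datetime.strptime(text, fmt).date()

# The same string means different days under different formats.
as_month_first = parse_week("6/11/2005", "%m/%d/%Y")  # read as June 11, 2005
as_day_first = parse_week("6/11/2005", "%d/%m/%Y")    # read as November 6, 2005

print(as_month_first.isoformat(), as_day_first.isoformat())
```

With a wrongly declared format, weekly points would be placed in the wrong months, which distorts any plot over time.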

Fig. 3 The process of forming general statistics.

Fig. 4 General statistics for Ukraine data.

Using the Charts tab, we plot the data (Fig. 5). The plot shows a clear periodicity in the incidence of colds: the first wave begins in the autumn, and the peak is reached by February. Now let's take the data for Russia and check whether the same periodicity holds there and whether the outbreaks coincide with the periods we identified for Ukraine. To do this, we load the new data and combine it with the previously loaded data, merging on the Date field with the “Join” operator.

The graphs in Figs. 5 and 6 show that the cyclical pattern is preserved and the incidence peaks practically coincide.
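For readers who want to see what merging on the Date field amounts to, here is a small Python sketch. It assumes an inner join (only dates present in both series are kept); the article does not state which join type was configured. The function name `inner_join` and the ISO date keys are our own, and the values are a few rows from the appendix table.

```python
# Toy weekly series keyed by ISO date; values taken from the appendix table.
ukraine = {"2005-10-09": 359, "2005-10-16": 534, "2005-10-23": 672}
russia = {"2005-10-16": 307, "2005-10-23": 329, "2005-10-30": 411}

def inner_join(left: dict, right: dict) -> list:
    """Keep only dates present in both series, in chronological order."""
    shared = sorted(left.keys() & right.keys())  # ISO strings sort by date
    return [(d, left[d], right[d]) for d in shared]

joined = inner_join(ukraine, russia)
for row in joined:
    print(row)
```

Dates that appear in only one of the two series (here the first Ukrainian week and the last Russian week) are dropped under this assumption.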

Fig. 5 The number of requests for colds since 2005.

Fig. 6 Data on colds for Ukraine and Russia.

Model building

Let us proceed to building a model that will predict the number of cases in Ukraine. We will forecast the value of the series for the next week from the values of the four previous weeks (approximately one month). In this article, we use a feedforward neural network to predict the time series. We chose neural networks because their parameters are simple to select and the resulting model is simple to use. Unlike autoregressive and moving-average models, neural networks do not require a correlation analysis of the time series.

Fig. 7 shows a diagram of the process that predicts the values of the time series:

Fig. 7 The process of building a forecast in RapidMiner.

For the neural network operator to work correctly, the original time series must be converted into a training sample format. For this, we used the Windowing operator from the Series Extension package. From the value column, we thus get a table of the form:

Table 1. Presentation of the training sample for the neural network
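The idea behind windowing can be sketched in a few lines of Python. This is an illustration of the transformation only, not the Windowing operator itself; the function name `make_windows` is hypothetical, and the sample values are the first seven Ukrainian weeks from the appendix.

```python
def make_windows(series, window=4):
    """Turn a series into rows of `window` past values plus the next value."""
    rows = []
    for i in range(len(series) - window):
        inputs = series[i:i + window]   # the four previous weeks
        label = series[i + window]      # the week to forecast
        rows.append((inputs, label))
    return rows

weekly = [359, 534, 672, 660, 596, 540, 503]
for inputs, label in make_windows(weekly):
    print(inputs, "->", label)
```

Each row holds four consecutive weekly values as inputs and the following week as the value to predict, so a series of length N yields N - 4 training rows.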

Then, using the “Select Attributes” operator, we removed the extra fields from the sample (the dates for values 1-4). Supervised training of a neural network requires a training sample and a test sample, so we used the “Split Data” operator to divide the time series in a ratio of 80 to 20. According to the documentation of the “Neural Net” operator, the column of predicted values in the training sample must have the role “Label”, which we assigned with the “Set Role” operator. Since the column “Date of forecast” does not take part in forecasting, it must be assigned the role “Id”.

The second output of the “Split Data” operator and the “mod” output of the “Neural Net” operator are connected to the corresponding inputs of the “Apply Model” operator. “Apply Model” applies the trained model to the test sample, so that the predicted values can be compared with the real ones. The predicted value obtained from “Apply Model” was assigned the role “Prediction” using “Set Role (2)”. The final stage of our process is the “Performance” operator, which measures the accuracy of the results.
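The 80/20 division described above can be sketched as follows. We assume a simple chronological cut with no shuffling; the sampling mode actually used in the “Split Data” operator is not stated in the article, and the function name `split_80_20` is hypothetical.

```python
def split_80_20(rows, train_share=0.8):
    """Cut the rows into a leading training part and a trailing test part."""
    cut = int(len(rows) * train_share)
    return rows[:cut], rows[cut:]

rows = list(range(10))          # stand-in for ten windowed rows
train, test = split_80_20(rows)
print(len(train), len(test))
```

Keeping the test rows at the end of the series means the model is evaluated on weeks it has not seen, which is one common choice for time series.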

Now consider the parameters of the neural network operator and the prediction errors. Experimentally, we arrived at the neural network architecture shown in Fig. 8. The network has 2 hidden layers: 4 neurons in the first and 12 in the second. The sigmoid was used as the activation function. Training was performed on normalized input data with a learning rate of 0.5 and 1500 training cycles.
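To make the architecture concrete, here is a pure-Python sketch of a forward pass through a 4-4-12-1 network. It is not the RapidMiner implementation: the weights are random rather than trained, the linear output neuron is our assumption, and all names (`LAYER_SIZES`, `init_network`, `forward`, `count_parameters`) are hypothetical.

```python
import math
import random

LAYER_SIZES = [4, 4, 12, 1]  # inputs, hidden layer 1, hidden layer 2, output

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_network(sizes, seed=0):
    """Create one (weights, biases) pair per layer with random values."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1, 1) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def forward(layers, inputs):
    """Sigmoid on hidden layers; linear output (an assumption for regression)."""
    values = inputs
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * v for w, v in zip(row, values)) + b
                for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        values = sums if is_output else [sigmoid(s) for s in sums]
    return values[0]

def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

network = init_network(LAYER_SIZES)
normalized_window = [0.10, 0.55, 0.98, 0.94]  # hypothetical normalized inputs
print(count_parameters(network))              # 93 weights and biases
print(forward(network, normalized_window))
```

Even this small network has 93 adjustable parameters (20 + 60 + 13), which training would fit over the 1500 cycles mentioned above.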

Forecast Results

RapidMiner produces three artifacts as a result of running our model:
the model: its graphical representation, parameters, and weight vectors;
the estimated errors;
the test sample, supplemented by a column of predicted values.

Fig. 8 Neural network architecture

Fig. 9 Graph of predicted and real values

Fig. 9 shows the result of the prediction. As you can see, the curve of predicted values is very close to the real data. We evaluate the constructed model by calculating the prediction error using formulas (1) and (2).
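For reference, the standard definitions of the mean absolute percentage error (MAPE) and the mean absolute error (MAE) are written out here in LaTeX. We assume these are the quantities that formulas (1) and (2) denote, since they are the two metrics reported in the results; N, the number of test observations, is our notation.

```latex
\[
\mathrm{MAPE} = \frac{100\%}{N} \sum_{n=1}^{N} \left| \frac{A_n - F_n}{A_n} \right| \qquad (1)
\]
\[
\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \left| A_n - F_n \right| \qquad (2)
\]
```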

where A_n is the real value and F_n is the predicted value.

As a result of calculations, we obtained:

MAPE = 5.47% (3)

MAE = 21.748 (4)
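The two metrics are easy to compute by hand. Below is a short Python sketch using the standard definitions of MAPE and MAE; the sample values are hypothetical and are not the article's test data, so the numbers it prints are unrelated to results (3) and (4).

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    total = sum(abs((a - f) / a) for a, f in zip(actual, predicted))
    return 100.0 * total / len(actual)

def mae(actual, predicted):
    """Mean absolute error, in the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, predicted)) / len(actual)

actual = [400, 500, 250]       # hypothetical real values
predicted = [380, 525, 250]    # hypothetical forecasts

print(round(mape(actual, predicted), 2), mae(actual, predicted))
```

MAPE is relative, so it is comparable across series of different scale, while MAE is expressed in the same units as the data, here the number of search queries.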

Conclusions

The widespread adoption of machine learning has led to the creation of tools of varying complexity for the end user. RapidMiner, the program presented in this article, lowers the barrier to entry into machine learning.

If you use this program, you do not need to be able to write code in Python or R. RapidMiner actively suggests the next step in the chain of data preparation, model training, validation, and accuracy assessment. It can automatically correct some errors in the process and can help explain points that are not entirely clear to you.

While writing this article, we studied the functionality of RapidMiner. It is quite extensive and supports complex neural network architectures and finer tuning of their parameters: selection of the activation function, configuration of the connections between hidden layers, etc. The license allows you to run calculations in the RapidMiner cloud, which should reduce the training time and speed up the process. In addition, the license allows you to load more data and does not limit the user to ten thousand rows.

The mathematical model constructed in the article reached an error of about 6% on test data and, with some changes, can be used to predict the rise in colds. However, our main goal was to show how simple and concise the program is to use.

Using RapidMiner and a similar approach, any company can predict situations such as outbreaks of colds. Preventive measures taken on the basis of the forecast reduce risks and ultimately increase profits.

Appendix

Table 1. Sample data for Ukraine and Russia

Date (M/D/YYYY)	Ukraine	Russia
10/9/2005	359	296
10/16/2005	534	307
10/23/2005	672	329
10/30/2005	660	411
11/6/2005	596	417
11/13/2005	540	371
11/20/2005	503	316
11/27/2005	461	341
12/4/2005	453	362
12/12/2005	432	357
12/18/2005	422	415
12/25/2005	411	409
1/1/2006	404	436
1/8/2006	385	362
1/15/2006	366	327
1/22/2006	359	313
1/29/2006	358	304
2/5/2006	337	329
2/12/2006	329	344
2/19/2006	340	413

Source: https://habr.com/ru/post/337418/