Introduction to RapidMiner

At the moment there are many companies needing analytics systems, but the high cost and excessive complexity of this software in most cases forces us to abandon the idea of building our own analytical system in favor of the well-known Excel. Also, additional costs for training employees, maintaining expensive storage systems, etc. And here Open Source solutions can come to the rescue - there are not so many of them, but there are very worthy software, one of which is RapidMiner. RapidMiner (hereinafter simply “the miner”) is a tool created for the date of mining, with the basic idea that the miner (analyst) should not program when performing his work. At the same time, as you know, mining requires data, so it was provided with a fairly good set of operators solving a wide range of tasks for receiving and processing information from various sources (databases, files, etc.), and we can say with confidence that it is also full tool for ETL.

In addition to the miner itself, there is also a RapidMiner Server (formerly called RapidAnalytics, up to version 6) which can be used as a repository for storing and executing miner processes (including scheduled), “fumble” connections to data sources between users, send data from miner processes as a web service.

Unfortunately for you and me, with version 6, the miners decided to start making money on the sales of this software and changed the license from AGPL to Business Source. Nevertheless, version 5 of AGPL and we can use it freely and without restrictions. Therefore, it is she who will be considered in the article. We also note that in the sixth version there are not so many new operators and functions (perhaps the most interesting thing is cloud support), and for most tasks the RapidMiner 5 Community will suffice.
')

Installation

Not so long ago, from the official site, the download links for RapidMiner 5 have been removed, so we will assemble the RM from the source code which we take in the official github project .

To build RapidMiner from the repository we need

Go to the console, go to the directory where the miner would like to put, clone the repository

git clone https://github.com/rapidminer/rapidminer-5.git

the next step is to build a project

 ant build ant release.makePlatformIndependent

now let's run the miner

 .\scripts\RapidMinerGUI.bat

for Linux respectively

 ./scripts/RapidMinerGUI.sh

You will see a window like the image on the right. Click on the New Process and go on.

Basic concepts

Before looking at the basic principles of working with RapidMiner on an example, we will make a small introduction to its basic concepts.

Process

A set of operators interconnected in a predetermined order to perform the required data analysis / processing task.

Operator

The logical unit of the process. The operator performs some actions on the data, it has an input-output (the so-called "ports"), data comes to the input, the operator-processed data comes to the output. In this way, we can make data processing chains, for example, count client transactions from a database, find the largest ones, convert them into dollars, and return a result. At the same time, you can parallel chains - for example, in one we read transactions from different databases, and in the other we look for customer data, then we merge and get the result (it is also possible their parallel execution in time!).

In the program's interface, the Operators tab corresponds to the operators - where in the hierarchy they are grouped by function. To use the operator you need to click on it and move it to the process workspace.

Repository

RM storage space. It can be local as well as remote (RapidMiner Server), for which it is possible to execute server-side processes, multi-user access to database processes / connections, launch processes on a schedule, or upload data as a web service.

In the Repositories contribution to RM here you can see only Samples, DB and Local Repository. The first, as the name implies, is a set of processes - examples, DB - current connections to databases available in the miner (defined via Tools -> Manage Database Connections) and Local Repository, a place to store your own processes on a computer.

Process context

The context corresponds to the Context tab where we can see three sections:

Process input - data transmitted to the input process. Here you can specify the path to the data inside the repository.
Process output - here you can see the path in the repository, where the result of the process will be saved.
Macros is a global variable available in the process from anywhere. It can only accept strings or numbers as values.

Note that Process input and Process output are marked in the process by circles on the process boundary with the inscriptions inp and res . To use data from the input or to save it, you need to connect the corresponding circle with the input / output of operators.

The best training is practice. Let's make a small process on the basis of which we will see the basic principles of working with the miner.

Small task

You are the director of a small company that is creating websites, industrial design, etc. Quite often, due to the large number of orders and lack of staff, you hire freelancers from different countries (because customers from all over the world) and regularly enter information about the work performed in the Excel table indicating the name of the artist, type of work, date of payment, amount and currency of payment . At some point you wanted to get the amount of expenses in rubles (for the Central Bank rate), which you incurred by type of work on a specific date (more interesting cases are broken down by months, employees are left to their own experiments).

The first thing we will do is save our Excel file in CSV format and open it for reading in RapidMiner. To do this, take the Read CSV operator (Import -> Data -> Read CSV) and drag it to the work area of the process. Next, click on it and see the operator settings on the right. Click on the open folder icon

In the dialog box, select the file we need (the CSV used in the example can be downloaded by reference )

Pay attention to the pressed button.

- expert mode. In it, additional parameters are available for operators, as a rule they are almost always needed and marked with italics.

We set the parameters as in the picture on the right and click on the Edit list to the right of the data set meta data data below. We expose everything as in the picture below.

As you can guess here we expose the names of the columns, a check mark is set to exclude or include the column from the result of parsing, type and role. Roles other than attribute may be needed in mining, in the usual case, they are usually not required.

Click Apply and go to the next step. Add the Filter examples (Data Transformation-> Filtering) operator, connect its input with the Read CSV output, and exit with the process output indicated by a circle and the ins res . You get such a picture

With the help of the added operator, we will select records only for the specified date which we will declare as a process macro. Go to the Context tab of the process, there we find the Macros section and click on

. In the Macro column, we write the date, and in the Value the desired date, let it be 06/30/2012.

So the Context tab at this step will look like the one on the right. We defined the macro (remember, a global variable) and now we will use it to filter records by date from our CSV shnichka. Click on the Filter operator. Select examples in the condition class attribute_value_filter and write in the parameter string : date =% {date}. On the left we indicated the name of the column on which the filtering takes place, in the center the operation of checking for equality and on the right the taking of the value from the macro.

Let's see what happened. Click on the start button of the process

and miner switching to the Result perspective (if this does not happen click on

) will display the filtered data on July 30, 2012.

The first result was obtained, but we would like to see the costs in rubles at the rate of the Central Bank of the Russian Federation. Switch to Design Perspective by clicking on

and add the Open file statement (Utility -> Files -> Open file). Click on it and set the following settings

Where url: http://www.cbr.ru/scripts/XML_daily.asp?date_req=%{date}
Note that we substituted the macro in the operator parameter.

We will receive the data, but something must convert them into ExampleSet - i.e. table with data. In the first case, this role was performed by Read CSV, and now , as it is not difficult to guess, we will use Read XML (Import -> Data -> Read XML). We pull the operator, connect its input with the output of the operator Open file and make the following settings (if you experience difficulties with xpath, use the import wizard by clicking on the Import configuration wizard).

Pay attention that the ticked parse numbers is set and the comma is set with the integer and fractional separator.

You need to determine what attributes RapidMiner will take for ExampleSet . Click on Edit enumeration to the right of xpath for attributes, add two entries

Value [1] / text () - the value in rubles of a unit of currency
CharCode [1] / text () - alphabetic currency code

Now you need to set the value types for the attributes. To do this, click on the Edit list to the right of the data set meta datainformation and set it as in the picture below.

At this stage, we have a process that you should look like.

It's time to do the conversion of currencies in the data filtered by date. To do this, as you can guess, we will need to somehow combine the quotes and data. The Join operator (Data Transformation -> Set Operations -> Join) will help us in this. Now we do the following. We take the output of the Filter examples operator, which is currently connected to the output of the process and are connected to the Join operator, we do the same with the Read XML operator.

Now we click on the Join operator and determine how exactly the data will be merged. We remove the use id attribute as key checkbox, since the union takes place across the currency field, a new key attributes parameter will appear on its left click on the Edit list , in the Add entry dialog and in both fields we will write - currency . Save the changes. We can see what happened, in the same way as it was done above by clicking on the button

. The result will be

We are getting closer to our cherished goal - to find out how much we spent in rubles on our tasks. There is the final touch, the actual conversion itself. Add the Generate Attributes operator (Data Transformation -> Attribute Set Reduction and Transformation -> Generation) to the process and connect its input with the output of the Join operator, and the first output near which is written exp (abbreviated as ExampleSet ) to the output of the process. As is clear from the name of the operator, his task is to add a new attribute, to do this, click on the operator and on the right in its settings on the Edit list , the button opposite function descriptions . Give the name of the attribute and how to count it

Save the changes and execute the process, our result

Hooray! Here it is a treasured figure of costs in rubles that we incurred at the rate of the Central Bank on the date of payment. It is possible to develop this task very far, for example, to make a conclusion of information for the month, grouped by type of work, performer or dates. In general, plenty of imagination.

Useful materials

Source: https://habr.com/ru/post/269427/

All Articles