
Many companies need analytics systems today, but the high cost and excessive complexity of such software usually force them to abandon the idea of building their own analytical system in favor of the familiar Excel. On top of that come additional costs for training employees, maintaining expensive storage systems, and so on. This is where open-source solutions can come to the rescue: there are not many of them, but some are very capable, and one of these is RapidMiner.
RapidMiner (hereinafter simply “the miner”) is a tool created for data mining, built on the idea that the analyst should not have to program to do the job. Since mining requires data, the tool ships with a fairly good set of operators that solve a wide range of tasks for retrieving and processing information from various sources (databases, files, etc.), so we can say with confidence that it is also a full-fledged ETL tool.
In addition to the miner itself, there is RapidMiner Server (called RapidAnalytics before version 6), which can be used as a repository for storing and executing miner processes (including on a schedule), for sharing connections to data sources between users, and for exposing data from miner processes as a web service.
Unfortunately for you and me, starting with version 6 the developers decided to make money on sales of this software and changed the license from AGPL to Business Source. Nevertheless, version 5 remains under the AGPL, and we can use it freely and without restrictions, so it is the version covered in this article. Note also that version 6 adds relatively few new operators and functions (perhaps the most interesting is cloud support), and for most tasks RapidMiner 5 Community will suffice.
Installation
Not long ago the download links for RapidMiner 5 were removed from the official site, so we will build RM from the source code, which we take from the project's official GitHub repository.
To build RapidMiner from the repository we need git and Ant, the tools used in the commands below.

Open a console, go to the directory where you would like to put the miner, and clone the repository:
git clone https://github.com/rapidminer/rapidminer-5.git
The next step is to build the project:
ant build
ant release.makePlatformIndependent
Now let's run the miner. On Windows:
.\scripts\RapidMinerGUI.bat
On Linux, respectively:
./scripts/RapidMinerGUI.sh
You will see a window like the one in the image on the right. Click New Process and move on.
Basic concepts
Before walking through the basic principles of working with RapidMiner by example, here is a short introduction to its basic concepts.
Process
A set of operators interconnected in a predetermined order to perform the required data analysis / processing task.
Operator

A logical unit of the process. An operator performs some action on data. It has inputs and outputs (the so-called "ports"): data arrives at the input, and the data processed by the operator leaves through the output. This lets us build data-processing chains, for example: read client transactions from a database, find the largest ones, convert them to dollars, and return the result. Chains can also run side by side: in one we read transactions from different databases, in another we look up customer data, and then we merge the two and get the result (they can even be executed in parallel!).
In the program's interface, operators live on the Operators tab, where they are grouped hierarchically by function. To use an operator, click on it and drag it to the process workspace.
Repository

RM's storage space. It can be local or remote (RapidMiner Server). A remote repository additionally allows server-side process execution, multi-user access to processes and database connections, scheduled process runs, and publishing data as a web service.
On the Repositories tab in RM you will at first see only Samples, DB, and Local Repository. The first, as the name implies, is a set of example processes; DB holds the database connections currently available in the miner (defined via Tools -> Manage Database Connections); and Local Repository is the place to store your own processes on your computer.
Process context

The context corresponds to the Context tab, where we can see three sections:
- Process input - data passed to the input of the process. Here you can specify the path to the data inside the repository.
- Process output - the path in the repository where the result of the process will be saved.
- Macros - global variables available anywhere in the process. A macro can only take a string or a number as its value.
Note that Process input and Process output are marked in the process by circles on the process boundary, labeled inp and res. To use data from the input or to save a result, connect the corresponding circle to the input or output of an operator.
The best training is practice. Let's build a small process that shows the basic principles of working with the miner.
Small task
You are the director of a small company that builds websites, does industrial design, and so on. Quite often, because of the large number of orders and a shortage of staff, you hire freelancers from different countries (your customers come from all over the world) and regularly record the work performed in an Excel table, noting the contractor's name, the type of work, the payment date, the amount, and the payment currency. At some point you want to know how much you spent, in rubles at the Central Bank rate, by type of work on a specific date (more interesting cases, such as breakdowns by month or by employee, are left for your own experiments).

The first thing we will do is save our Excel file in CSV format and open it for reading in RapidMiner. To do this, take the Read CSV operator (Import -> Data -> Read CSV) and drag it to the process work area. Next, click on it and look at the operator settings on the right. Click the open-folder icon and, in the dialog box, select the file we need (the CSV used in the example can be downloaded via the link).
Note the pressed expert mode button. In expert mode, additional operator parameters become available; they are almost always needed and are shown in italics.
Set the parameters as in the picture on the right and click Edit list to the right of data set meta data information below. Fill everything in as in the picture below.
As you can guess, here we set the column names, a check mark that includes or excludes the column from the parsing result, the type, and the role. Roles other than attribute may be needed in mining; in ordinary cases they are not required.
Click Apply and go to the next step. Add the Filter Examples operator (Data Transformation -> Filtering), connect its input to the Read CSV output, and connect its output to the process output, marked by a circle labeled res. You get a picture like this:

With the added operator we will select only the records for a given date, which we will declare as a process macro. Go to the Context tab of the process, find the Macros section there, and click the button for adding a macro. In the Macro column write date, and in Value the desired date; let it be 06/30/2012.
The Context tab at this step will look like the one on the right. We have defined the macro (remember, a global variable), and now we will use it to filter the records from our CSV file by date. Click on the Filter Examples operator. Select attribute_value_filter in condition class and write in parameter string: date = %{date}. On the left is the name of the column being filtered, in the center the equality check, and on the right the value taken from the macro.
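For readers who think in code, the Read CSV plus Filter Examples pair behaves roughly like the Python sketch below. This is only an illustration, not what RapidMiner runs internally. The column names, the semicolon separator, and the sample rows are assumptions, since the actual CSV is not reproduced here.

```python
import csv
import io

# Hypothetical sample standing in for the exported Excel table;
# column names, separator, and rows are assumptions for illustration.
SAMPLE_CSV = """performer;work_type;date;amount;currency
Ivanov;design;06/30/2012;100;USD
Smith;website;06/30/2012;200;EUR
Petrov;website;06/29/2012;5000;RUB
"""

def filter_by_date(csv_text, macros):
    """Keep rows whose 'date' column equals the 'date' macro (date = %{date})."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    return [row for row in reader if row["date"] == macros["date"]]

# The macro plays the role of the entry in the Context tab.
macros = {"date": "06/30/2012"}
filtered = filter_by_date(SAMPLE_CSV, macros)
for row in filtered:
    print(row["performer"], row["amount"], row["currency"])
```

Changing the value in `macros` is the equivalent of editing the macro on the Context tab: the filter itself stays untouched.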
Let's see what happened. Click the process start button, and the miner, switching to the Result perspective (if this does not happen, click the Result perspective button), will display the data filtered to June 30, 2012.
The first result is in, but we would like to see the costs in rubles at the rate of the Central Bank of the Russian Federation. Switch to the Design perspective by clicking its button and add the Open File operator (Utility -> Files -> Open File). Click on it and set its parameters, where url is:
http://www.cbr.ru/scripts/XML_daily.asp?date_req=%{date}
Note that we substituted the macro in the operator parameter.
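Macro substitution is plain text replacement performed before the parameter is used. A minimal Python sketch of the idea follows. The function name expand_macros and the choice to raise an error on an undefined macro are assumptions of this sketch, not a description of RapidMiner's internals.

```python
import re

def expand_macros(text, macros):
    """Replace every %{name} in text with the value of the macro 'name'."""
    def lookup(match):
        name = match.group(1)
        if name not in macros:
            # Sketch-only behaviour: fail loudly on an unknown macro.
            raise KeyError(f"undefined macro: {name}")
        return str(macros[name])
    return re.sub(r"%\{([^}]+)\}", lookup, text)

macros = {"date": "06/30/2012"}
url = expand_macros(
    "http://www.cbr.ru/scripts/XML_daily.asp?date_req=%{date}", macros
)
condition = expand_macros("date = %{date}", macros)
print(url)
print(condition)
```

The same macro feeds both the filter condition and the URL, so changing the date in one place updates every operator that refers to it.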
We will receive the data, but something has to convert it into an ExampleSet, i.e. a table of data. In the first case this role was played by Read CSV, and now, as you might guess, we will use Read XML (Import -> Data -> Read XML). Drag the operator in, connect its input to the output of the Open File operator, and make the following settings (if you have difficulty with XPath, use the import wizard by clicking Import Configuration Wizard).

Note that parse numbers is ticked and the comma is set as the separator between the integer and fractional parts.
You need to define which attributes RapidMiner will take for the ExampleSet. Click Edit enumeration to the right of xpath for attributes and add two entries:
Value[1]/text() - the value in rubles of a unit of currency
CharCode[1]/text() - alphabetic currency code
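The effect of these two XPath entries together with the comma decimal separator can be sketched in Python as follows. The Value and CharCode element names come from the entries above. The surrounding ValCurs and Valute elements and the numbers themselves are assumptions made for illustration, not a captured response.

```python
import xml.etree.ElementTree as ET

# Hypothetical response fragment; wrapper element names and values
# are assumptions, only Value and CharCode come from the article.
SAMPLE_XML = """<ValCurs>
  <Valute><CharCode>USD</CharCode><Value>32,8169</Value></Valute>
  <Valute><CharCode>EUR</CharCode><Value>41,3230</Value></Valute>
</ValCurs>"""

def parse_rates(xml_text):
    """Return {currency code: rate in rubles}, reading comma as the decimal mark."""
    root = ET.fromstring(xml_text)
    rates = {}
    for valute in root.findall("Valute"):
        code = valute.findtext("CharCode")
        value = valute.findtext("Value")
        # Same job as 'parse numbers' with a comma separator.
        rates[code] = float(value.replace(",", "."))
    return rates

rates = parse_rates(SAMPLE_XML)
print(rates)
```

Without the comma handling the values would stay text, and the later multiplication by the payment amount would not be possible.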
Now set the value types for the attributes. To do this, click Edit list to the right of data set meta data information and set it as in the picture below.
At this stage, the process should look like this.
It is time to convert the currencies in the data filtered by date. To do this, as you can guess, we need to somehow combine the quotes with the data. The Join operator (Data Transformation -> Set Operations -> Join) will help here. Take the output of the Filter Examples operator, which is currently connected to the process output, and connect it to the Join operator instead; do the same with the output of the Read XML operator.
Now click on the Join operator and define exactly how the data will be merged. Clear the use id attribute as key checkbox, since the join is on the currency field. A new key attributes parameter will appear; click its Edit list button, then Add entry in the dialog, and enter currency in both fields. Save the changes. We can look at what happened the same way as before, by clicking the process start button. The result is the payment data joined with the exchange rates.
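Conceptually, joining on the currency key works like the Python sketch below. It assumes an inner join (the article does not say which join type is selected), and the rows and the rate column name are hypothetical.

```python
def inner_join(left, right, key):
    """Inner join two lists of dicts on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        # Rows without a partner on the right side are dropped.
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined

# Hypothetical rows; names and values are assumptions for illustration.
payments = [
    {"performer": "Ivanov", "amount": 100.0, "currency": "USD"},
    {"performer": "Smith", "amount": 200.0, "currency": "EUR"},
    {"performer": "Lee", "amount": 50.0, "currency": "XXX"},
]
rates = [
    {"currency": "USD", "rate": 32.0},
    {"currency": "EUR", "rate": 41.0},
]
joined = inner_join(payments, rates, "currency")
for row in joined:
    print(row)
```

In this sketch a payment whose currency has no matching quote simply disappears from the result, which is worth keeping in mind when checking totals.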
We are getting closer to our cherished goal: finding out how much we spent in rubles on our tasks. One final touch remains, the conversion itself. Add the Generate Attributes operator (Data Transformation -> Attribute Set Reduction and Transformation -> Generation) to the process, connect its input to the output of the Join operator, and connect its first output, labeled exa (short for ExampleSet), to the process output. As the operator's name suggests, its task is to add a new attribute. To do this, click on the operator and, in its settings on the right, click the Edit list button next to function descriptions. Give the attribute a name and the expression used to compute it.
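The exact expression was shown in a screenshot that is not reproduced here. Assuming it multiplies the payment amount by the exchange rate, the step amounts to the following Python sketch. The attribute names amount, rate, and amount_rub and the sample values are hypothetical.

```python
def generate_attribute(rows, name, func):
    """Return copies of rows with a new column 'name' computed by func(row)."""
    return [{**row, name: func(row)} for row in rows]

# Hypothetical joined rows; names and values are assumptions.
joined = [
    {"work_type": "design", "amount": 100.0, "rate": 32.0},
    {"work_type": "website", "amount": 200.0, "rate": 41.0},
]

# Assumed expression: amount in the payment currency times the ruble rate.
with_rub = generate_attribute(
    joined, "amount_rub", lambda r: r["amount"] * r["rate"]
)
total = sum(r["amount_rub"] for r in with_rub)
for row in with_rub:
    print(row["work_type"], row["amount_rub"])
print("total:", total)
```

The original columns are kept and the new one is appended, so the result can be checked row by row against the source data.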
Save the changes and run the process. Our result:
Hooray! Here is the cherished figure: the costs in rubles that we incurred, at the Central Bank rate on the payment date. This task can be taken much further, for example by producing a report for a month grouped by type of work, contractor, or date. In general, there is plenty of room for imagination.
Useful materials