
RapidMiner: Data Mining and Big Data at home, quickly and (almost) without preparation

While marketers smear themselves with Big Data and parade around press conferences in that state, I suggest simply downloading a free tool that comes with test datasets and process templates, and getting to work.

Downloading, installing and getting the first results takes 20 minutes at most.
I'm talking about RapidMiner, an open-source environment that, despite being free, holds its own against commercial competitors. To be fair, I'll say up front that the developers do sell it, and only the previous-to-latest versions are released as open source. You can try it at home because there are also completely free builds with the full logic and only two restrictions: at most 1 GB of memory can be used, and only regular files (csv, xls, etc.) can serve as a data source. Naturally, for a small business this is not a problem either.

What you need to know about RapidMiner

Here is the interface. You drop the data in and then simply drag operators into the GUI to build the data processing flow. All that is required of you is an understanding of what you are doing. The environment takes care of all the code. You can, of course, climb "under the hood", but in most cases it is simply not necessary.

Important features


RapidMiner vs IBM SPSS Modeler

RM has much broader processing capabilities: it simply has more nodes. On the other hand, SPSS has "autopilot" modes. The auto-models (Auto Numeric, Auto Classifier) run through several candidate models with different parameters and select the best ones. A not very experienced analyst can build an adequate model this way. It will almost certainly be less accurate than one built by an experienced specialist, but the fact remains: you can build a model without understanding the details. RM has a counterpart (Loop and Deliver Best), but it still requires you to choose at least the models and the criterion for selecting the best one. Automatic data preprocessing (Auto Data Prep) is another well-known SPSS feature; in RapidMiner it is implemented differently and is a bit more tedious.

In SPSS, data is prepared by a single Automated Data Preparation node, where you tick checkboxes for what should be done with the data. In RapidMiner, the preparation is assembled from atomic nodes in an arbitrary sequence.
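The "try several candidates and keep the best" idea behind the auto-models and Loop and Deliver Best can be sketched in plain Python. This is only an illustration of the principle, not RapidMiner or SPSS code: the toy threshold models, the validation data and the function names are all assumptions made up for the example. Note that, as in RM, the candidates and the selection criterion still have to be supplied by hand.

```python
from typing import Callable, Dict, List, Tuple

Sample = Tuple[float, int]          # (feature value, class label 0 or 1)
Model = Callable[[float], int]      # a model maps a feature value to a label


def accuracy(model: Model, data: List[Sample]) -> float:
    """Selection criterion: share of correctly classified samples."""
    hits = sum(1 for x, y in data if model(x) == y)
    return hits / len(data)


def make_threshold_model(threshold: float) -> Model:
    """Hypothetical toy model: predicts 1 when the feature reaches the threshold."""
    return lambda x: 1 if x >= threshold else 0


def loop_and_deliver_best(candidates: Dict[str, Model],
                          validation: List[Sample],
                          criterion=accuracy):
    """Score every candidate and return the name of the best one plus all scores."""
    scores = {name: criterion(model, validation) for name, model in candidates.items()}
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Made-up validation data, for illustration only.
validation = [(0.1, 0), (0.4, 0), (0.45, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
candidates = {f"threshold={t}": make_threshold_model(t) for t in (0.2, 0.5, 0.7)}

best_name, scores = loop_and_deliver_best(candidates, validation)
print(best_name, scores)
```

Swapping `accuracy` for another criterion changes which candidate is delivered, which is exactly the choice the analyst has to make.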

RapidMiner vs SAS

In terms of "doing anything you want", RM ranks higher, but in the end, with some swearing and some extra effort, you can get the same result in SAS. The approach, however, is completely different: if you are used to SAS, you will have to relearn. SAS also provides many vertical solutions, for example for banks and retail. The platform speaks to users in their business language. RM is more abstract, so you will have to spell out what is what yourself.

RapidMiner vs Demantra

It's not quite fair to compare these two packages, but the comparison is useful for illustrating how RM works. Oracle Demantra (and, very roughly, every similar product for a specific industry or task) is a complete package tailored to specific procurement and supply tasks. It has specific operations: you load the sales data and get a forecast for purchasing goods. One model, a lot of ready-made templates. It is expensive, impressive, and aimed at big business.

In RM, on the other hand, you can reproduce the same thing, but half of the logic will have to be reinvented. This is very convenient for data scientists in terms of customization and flexibility of the final solution, but extremely difficult for business users: they simply will not see familiar words and tools.



So, we have an open field for solving any kind of problem. The most frequent ones solved with such tools in Russia are:

And this is my (and not only my) favorite topic: metamodeling. For those who are not close to the subject: different models often find different relationships and produce different results on the same sample. They also often make mistakes in different places. This should be exploited by building an ensemble of models (Model Ensemble). For example, the Vote operator takes into account the "opinions" of all the models in the ensemble and returns the result that received the most "votes". Another method, one of the most popular among "advanced" data scientists, is Bagging (Bootstrap Aggregation): several models are trained on different subsamples of the initial data, and their results are then averaged.
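Both ensemble ideas can be shown in a few lines of plain Python. This is a minimal sketch of the principles, not the RapidMiner operators themselves: the toy classifiers, the nearest-neighbor base model, the data and all function names are assumptions invented for the example.

```python
import random
from collections import Counter
from statistics import mean


def vote(models, x):
    """Majority vote: return the label predicted by the most models."""
    predictions = [model(x) for model in models]
    return Counter(predictions).most_common(1)[0][0]


def fit_nearest_neighbor(sample):
    """Toy base model: predict the target of the closest training point."""
    def predict(x):
        nearest = min(sample, key=lambda point: abs(point[0] - x))
        return nearest[1]
    return predict


def bagging(data, fit, n_models=10, seed=0):
    """Train n_models on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(data) for _ in data]  # sampling with replacement
        models.append(fit(bootstrap))
    return lambda x: mean(model(x) for model in models)


# Three hypothetical classifiers that disagree near their thresholds.
classifiers = [
    lambda x: "churn" if x > 0.3 else "stay",
    lambda x: "churn" if x > 0.5 else "stay",
    lambda x: "churn" if x > 0.7 else "stay",
]
print(vote(classifiers, 0.6))  # two of the three models say "churn"

# Made-up (feature, target) pairs for the bagging part.
data = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0), (4.0, 30.0)]
bagged = bagging(data, fit_nearest_neighbor)
print(bagged(3.5))
```

A single nearest-neighbor model jumps abruptly between 11.0 and 30.0 around x = 3.5; averaging over bootstrap resamples smooths that jump, which is the practical reason for bagging unstable models.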


What I can say from the experience of several migrations to RapidMiner: from a Data Science point of view, the impressions are positive. Technologically it is slightly worse: data cleansing is harder, because we had become accustomed to the paradigm and simplicity of SPSS and SAS. Here we had to rewire our brains more, since everything is done very differently. The architectures are very different, so I will say right away that migrating on your own will be quite hard in terms of staff competence. You need to learn again. But for us and our customers the result was worth it.

There are a lot of nice little features. For example, it is worth mentioning "macros": these are process parameters that can be used at any point in the process. A macro can hold, for example, a file name, its creation date, the average value of some data attribute, the best accuracy achieved, the iteration number, or the time the process last started. They often come to the rescue when building non-trivial processing steps. For example, a macro can be used to limit the running time of an operation, where the time threshold is not fixed but computed: it depends on the size of the data and the time of day (nighttime optimizations can be allowed to run longer).
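The computed time limit can be illustrated with a small Python sketch. It does not use RapidMiner's macro syntax; the dictionary standing in for process macros, the function name, the rate of 2 seconds per 1000 rows, the night hours and the multiplier are all assumptions chosen for the example.

```python
from datetime import datetime


def time_limit_seconds(n_rows: int, now: datetime,
                       seconds_per_1000_rows: float = 2.0,
                       night_multiplier: float = 3.0) -> float:
    """Compute a time budget from the data size and the time of day."""
    base = n_rows / 1000 * seconds_per_1000_rows
    is_night = now.hour >= 22 or now.hour < 6   # assumed definition of "night"
    return base * (night_multiplier if is_night else 1.0)


# A plain dict plays the role of the process-wide macros here.
macros = {"n_rows": 50_000, "started_at": datetime(2015, 3, 1, 23, 30)}
macros["time_limit"] = time_limit_seconds(macros["n_rows"], macros["started_at"])

day_limit = time_limit_seconds(50_000, datetime(2015, 3, 1, 14, 0))
print(macros["time_limit"], day_limit)
```

Any later step of the process can then read `macros["time_limit"]` instead of relying on a hard-coded threshold.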

A recent example: we built a model to forecast passenger traffic. Here we used RM for 100% of the work, because everything was built from scratch and there was nothing to look back at; there was no need to port existing processes and try to reproduce them in another tool.

What to do to get started

Take a fresh donor, that is, grab any data, for example sales data. If you do not have your own, that is not a problem: even the free Starter edition comes with several demo datasets. Try looking at your data through the accelerator with preset processes. There are 4 ready-made processes, and they build the handlers on a built-in model. Play with the data right in the GUI and see how cool it is. Experiment.

Here is a link to download the prebuilt release from the official site.

If you have little data, just use it for as long as you like: the company is well aware that only medium and large businesses buy the full version. If the free version is not enough for your data, it is worth knowing that prices are fixed and do not depend on the customer.

If you feel that the tool is cool but want to learn it quickly, come to our training center. We are an official RapidMiner partner, and certificates are issued upon completing the courses. You will need basic knowledge of statistics (at least enough to understand what outliers, the mean, the normal distribution and variance are) and basic computer skills. If you do not have your own data, we will provide our datasets from a German telecom (or bring yours, also anonymized), and we will put together a customer churn forecasting case. Then we will build a model that takes into account how much money is available to retain those customers. For example, with 10 thousand rubles and 100,000 customers, you need to choose those who are cheaper to retain and who will bring the company more money in the future. The goal is to target the most likely clients and maximize the final benefit (by the way, this is called Uplift Modeling or, if you are more used to SAS terminology, Incremental Response Modeling).
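The budget-constrained choice described above can be sketched as follows. This is a simplified illustration, not the course material: the `Customer` fields, the numbers and the greedy ranking by gain per ruble are assumptions. In a real case the uplift values would come from a trained uplift model, and the greedy pass is only a heuristic for what is formally a knapsack problem.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Customer:
    name: str
    uplift: float        # assumed gain in retention probability if we make an offer
    future_value: float  # assumed future revenue (rubles) if the customer stays
    offer_cost: float    # assumed cost (rubles) of the retention offer


def expected_gain(c: Customer) -> float:
    """Expected extra money from making the offer, net of its cost."""
    return c.uplift * c.future_value - c.offer_cost


def pick_customers(customers: List[Customer], budget: float) -> List[Customer]:
    """Greedy heuristic: take the best gain per ruble first while the budget lasts."""
    profitable = [c for c in customers if expected_gain(c) > 0]
    ranked = sorted(profitable,
                    key=lambda c: expected_gain(c) / c.offer_cost,
                    reverse=True)
    chosen, spent = [], 0.0
    for c in ranked:
        if spent + c.offer_cost <= budget:
            chosen.append(c)
            spent += c.offer_cost
    return chosen


# Made-up customers, for illustration only.
customers = [
    Customer("A", uplift=0.30, future_value=20_000, offer_cost=2_000),
    Customer("B", uplift=0.02, future_value=20_000, offer_cost=2_000),
    Customer("C", uplift=0.20, future_value=50_000, offer_cost=5_000),
    Customer("D", uplift=0.25, future_value=40_000, offer_cost=4_000),
    Customer("E", uplift=0.10, future_value=30_000, offer_cost=2_500),
]

chosen = pick_customers(customers, budget=10_000)
print([c.name for c in chosen])
```

Customer B is dropped because the offer costs more than it is expected to bring back, and C is skipped because it no longer fits into the remaining budget after A and D.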

And once again: the Starter version is full-featured in terms of analytical functionality, so you can build a proof-of-concept for your company absolutely free.

Source: https://habr.com/ru/post/254467/
