
Data Mining Hub, through the eyes of scientists

Hi, Habr!

We have launched Data Mining Hub and want to tell you what it is and why it may be useful to you.

Data Mining Hub (DMH) is a platform for developing data mining and machine learning algorithms, built on an iterative approach. It is also a business tool that helps analyze large amounts of data and extract useful information from it.
How DMH differs from similar resources such as Kaggle and Algomost:


There are two sides to DMH. The first is the customer, who describes the task, and the second is the scientist, who tries to solve it.

DMH gives scientists an opportunity to take part in solving interesting problems, compete with other participants and, of course, get paid if the Customer chooses their algorithm. If an algorithm is not selected in one iteration, it can still be selected in the next. DMH automatically carries the results over from the last iteration to the new one if the source data has not changed. It is also possible to improve the algorithm and get paid for the improved version in the next iteration.
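The carry-over rule can be sketched in a few lines of Python. This is only an illustration: the article does not say how DMH decides that the data is unchanged, so the content hash, the record layout, and the function names below are all assumptions.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest used to detect whether the data changed."""
    return hashlib.sha256(data).hexdigest()


def results_for_new_iteration(previous, new_data, run_algorithm):
    """Reuse the previous result when the source data is unchanged.

    `previous` is a hypothetical record {"fingerprint": ..., "result": ...}
    kept from the last iteration, or None if there was no earlier iteration.
    """
    current = fingerprint(new_data)
    if previous is not None and previous["fingerprint"] == current:
        # Same data: carry the old result over without recomputing.
        return {"fingerprint": current, "result": previous["result"], "reused": True}
    # Data changed (or first iteration): the algorithm has to run again.
    return {"fingerprint": current, "result": run_algorithm(new_data), "reused": False}
```

Under this assumption, a scientist who does nothing between iterations still takes part in the next one, and a recalculation is needed only when the customer changes the data or the scientist uploads an improved algorithm.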

For a customer, DMH is a single point of integration with a large number of scientists and an easy way to use different algorithms on the same data.

Briefly, the principle of DMH can be described as follows:


Anyone can go to www.datamininghub.com/invite/me and ask DMH for an invitation by simply entering an email address.

Let us look at what a scientist needs to do to take part in solving a task. In principle, everything is quite simple: choose a task, create an algorithm for it, and test it on the source data. If the result is satisfactory, the scientist can then set the cost of using the algorithm.

Let us go through the details.



After you log in to datamininghub.com, a page opens listing all the tasks that need to be solved. Select a task you like and download the source data in the Data Set section.



Next, develop an algorithm using any development tools. The only requirement is that the algorithm is packaged as a jar file (or several) that can be run as a job on Hadoop.

A small example algorithm in Scala is available at: github.com/datamininghub/example-algorithm

A real example that solves an existing task (www.datamininghub.com/task/1), also written in Scala, is available here: github.com/datamininghub/example-bill-status-prediction

To upload your algorithm:
  1. Log in to DMH.
  2. In the menu, select Algorithms. A page will open listing all the algorithms created by this user.
  3. Click add new algorithm.


  4. If an AWS account has not been linked to the user profile before, the system will ask you to do so at this stage:



    If you do not have an AWS account, you will need to register one.
    At http://aws.amazon.com/free/ you can register a new account and use the free usage limits for a year.
    After that, follow the Sign up for Amazon S3 - Find my keys link and create the keys, which then need to be entered into DMH.

  5. Once the AWS account is linked, the Algorithm details page will appear, showing the default algorithm name, DataMiningHub algorithm N for Hadoop 1.0.3. Click Edit on this page:


  6. On the Algorithm edit page that appears, you can rename the algorithm and change the Hadoop version used. Then click Add step to add a step, that is, to add a jar file containing the algorithm's code and define the arguments it will be launched with:


  7. On the Add file page that appears, select the jar file to upload and click the Upload button, or specify an S3 link to the file.



    In this example, a file named bill-status-prediction.jar is used.
    Note: the file upload may take some time!
  8. Now, on the Step algorithm edit page, specify the arguments this jar file will be launched with and click the Save button:



    For example, the following arguments are used: -o {output} --events {events} --bill_deputy {bill_deputy} -f

  9. After the arguments are set, the Algorithm edit page will reappear, now showing the step you have entered. If necessary, you can upload additional jar files: just click Add step and repeat steps 6 through 8.
  10. Now, on the Algorithm details page, click bet in the navigation bar to set the cost of using the algorithm and run the calculation:


  11. On the Algorithm bet page you need to select the task in which the algorithm will be used:



    In this example, only one iteration is available: Prediction if a bill becomes the law in the future.
  12. On the Add new bet use algorithm %algorithm_name% page that appears, set the cost of using the algorithm and click the bet it button:


  13. On the Edit calculation page that appears, in the Mappings section, map every argument name from every step to the source data: click assign next to each argument name, select the desired data source, and then click calculate:



    If necessary, you can save this calculation by clicking the Save button.
  14. After that, the Calculation details page will appear, showing the status of the calculation. Once the calculation is complete, its result will be sent to the email address associated with this profile.

    An example of a calculation in progress:



    An example of a completed calculation:


  15. When the calculation is complete, its result and the cost of using the algorithm will appear in the task description, and the Customer will be able to choose this algorithm as the solution to the task.
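The link between the arguments from step 8 and the mappings from step 13 can be illustrated with a short Python sketch. It is an assumption about how the `{name}` placeholders could be resolved, not a description of DMH's implementation: the function names and the S3 locations are hypothetical, and only the argument string comes from the example above.

```python
import re
import shlex
from typing import Dict, List

# A placeholder is a name in curly braces, e.g. {events}.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def placeholders(template: str) -> List[str]:
    """List the distinct placeholder names in an argument string, in order."""
    seen: List[str] = []
    for name in PLACEHOLDER.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def resolve_arguments(template: str, mappings: Dict[str, str]) -> List[str]:
    """Replace each {name} with its mapped location and split into argv tokens."""
    missing = [n for n in placeholders(template) if n not in mappings]
    if missing:
        # Every argument name must be assigned a data source first.
        raise KeyError("unmapped arguments: " + ", ".join(missing))
    filled = PLACEHOLDER.sub(lambda m: mappings[m.group(1)], template)
    return shlex.split(filled)


# Argument string from step 8; the locations below are made up for illustration.
template = "-o {output} --events {events} --bill_deputy {bill_deputy} -f"
mappings = {
    "output": "s3://example-bucket/out",
    "events": "s3://example-bucket/events.csv",
    "bill_deputy": "s3://example-bucket/bill_deputy.csv",
}
print(resolve_arguments(template, mappings))
```

Seen this way, the names in curly braces are what shows up in the Mappings section, and flags without braces, such as -f, are passed to the jar unchanged.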





You can test the algorithm on any data before setting its cost by clicking try it in the navigation panel on the Algorithm details page. The Edit calculation page will appear; in its Mappings section, load the data for the calculation and click calculate in the navigation bar.

P.S. Special thanks to Eugenia for her invaluable contribution to this text!

Source: https://habr.com/ru/post/236581/

