
More examples of using R to solve practical business problems

Since the last publication I have had to take on a number of different tasks related in one way or another to data processing. The tasks are completely different, but in every case the R toolset allowed them to be solved elegantly and effectively. Below are the actual cases (no pictures).


Case #1: Web Scraping


The project brief is simple. An expert analysis has to be carried out on documents that are available only through requests to a web portal, as a preliminary step in preparing an analytical report. It sounds simple, if not for the following nuances:


  1. The document database is not accessible directly.
  2. The portal has no API.
  3. Documents cannot be reached by a direct link, only through the mechanism of a session connection and search.
  4. The full text of a document is not available; its display is rendered page by page with JS scripts.
  5. For each relevant document, a summary of its attributes has to be compiled from the plain text of the first page.
  6. ~100 thousand documents, of which only ~0.5% are actually relevant to the task; time to the first release ~1 week.

A group of "classical" analysts rolled up their sleeves and went into overtime reading and manual document processing. Neither the time nor the desire to convince of the inferiority of this approach was not.


We went the other way, namely web scraping with R. There is plenty of material on web scraping on the internet and on Habr, so I will not repeat it. An example of such a publication on Habr: "Web Scraping using python".


In the classical approach Perl/Python is used for such tasks, but we decided not to mix tools and to battle-test R instead. The problem was solved more than successfully, with the following final metrics:


  1. 0.5 days to analyze the portal (structure, requests, responses, navigation) with the developer tools built into Chrome/Firefox.
  2. 1 day to develop the R scripts for preparing the lists of documents and then enriching them with attributes.
  3. ~8 hours for setting up the jobs, automated data collection and building the tabular view.
  4. ~2 days for a manual review of the list, cutting off the irrelevant items and reading the potentially relevant documents. After reading, only a couple of dozen documents turned out to be irrelevant.

Even though the documents were generated dynamically (JS), we managed to find a loophole for fetching the first page, which contained all the required attributes, and thus do without Selenium/PhantomJS. This greatly accelerated the data collection.
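Purely as an illustration, here is a minimal sketch of what such a session-based fetch might look like with httr/rvest. The portal address, request structure, XPath and attribute names are all invented for the example; only the general approach matches what was done.

library(httr)
library(rvest)

# Hypothetical portal address and request structure (placeholders only).
base_url <- "https://portal.example.com"

# httr reuses one handle per host, so cookies set by the search page persist
# across later requests, which is enough to emulate the session connection.
invisible(GET(paste0(base_url, "/search")))

# Run a search query and pull the document identifiers out of the result list.
resp <- GET(paste0(base_url, "/search"), query = list(q = "keyword", page = 1))
doc_ids <- resp %>%
  content(as = "text", encoding = "UTF-8") %>%
  read_html() %>%
  html_nodes(xpath = "//a[contains(@class, 'doc-link')]") %>%
  html_attr("data-doc-id")

# The "loophole": the first page of a document turned out to be available as
# plain HTML by a direct GET inside the session, with no JS rendering needed.
first_page_html <- content(
  GET(paste0(base_url, "/doc/", doc_ids[1], "/page/1")),
  as = "text", encoding = "UTF-8"
)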


Moreover, since the web requests run asynchronously, the task was parallelized (by means of R). The ~8 hours mentioned above is the execution time on 4 cores; otherwise it would have taken roughly three times longer.
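The parallel collection itself can be sketched roughly like this; fetch_attributes() is a hypothetical helper standing in for the per-document request-and-parse step from the sketch above:

library(foreach)
library(doParallel)

# Hypothetical per-document helper: fetches the first page inside a session
# and returns a one-row data frame with the extracted attributes.
fetch_attributes <- function(doc_id) {
  # ... GET + parsing as in the sketches above ...
  data.frame(doc_id = doc_id, title = NA_character_, stringsAsFactors = FALSE)
}

# Spread the requests over 4 cores; each worker loads the scraping packages.
cl <- makeCluster(4)
registerDoParallel(cl)

docs <- foreach(id = doc_ids, .combine = rbind,
                .packages = c("httr", "rvest", "stringr")) %dopar% {
  fetch_attributes(id)
}

stopCluster(cl)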


The group of "classical" analysts continues to work hard ...


Technical details


  1. Various combinations of XPath + regexp techniques were used to parse the pages and pick out the required content (a sketch follows the package list below).
  2. The whole code came to ~200 lines, of which 50% are comments and line breaks added to follow code conventions.
  3. Explicitly loaded set of packages:

library(lubridate)
library(dplyr)
library(tidyr)
library(tibble)
library(readr)
library(stringi)
library(stringr)
library(jsonlite)
library(magrittr)
library(curl)
library(httr)
library(xml2)
library(rvest)
library(iterators)
library(foreach)
library(doParallel)
library(future)
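And a rough illustration of the XPath + regexp combination mentioned in point 1, applied to the first page of a document (the node classes, the patterns and the first_page_html variable come from the invented example above):

library(xml2)
library(rvest)
library(stringr)
library(lubridate)

page <- read_html(first_page_html)   # raw HTML of a document's first page

# XPath selects the block with the document header...
header_text <- page %>%
  html_nodes(xpath = "//div[contains(@class, 'doc-header')]") %>%
  html_text(trim = TRUE)

# ...and regular expressions pick individual attributes out of the plain text.
doc_number <- str_extract(header_text, "(?<=No\\.\\s?)\\d+")
doc_date   <- dmy(str_extract(header_text, "\\d{2}\\.\\d{2}\\.\\d{4}"))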

Case #2: Predicting the quality of factory products


The problem statement is also very simple.


There is a production line that turns raw material into a product. During the technological process the raw material undergoes various kinds of physical treatment: pressing, molding, chemical treatment, heating, and so on. Metrics for the individual stages are measured at different intervals and in different ways, some automatically, some manually, for example laboratory control of the raw material. In total there are ~50-60 metrics. The duration of the process is 1.5-2 hours.


The task is to obtain output products with the specified physical parameters.


The difficulties are as follows:


  1. There is no exact analytical model of the technological process.
  2. The parameters of the raw material vary from batch to batch.
  3. A made-to-order production model is used: the parameters of the output product must change in accordance with the current order.
  4. "Experiments" with retuning the line parameters based on feedback come at far too high a price for the business. During the 1.5-2 hours of a full cycle before products appear at the output, more than a ton of raw material passes through; if the line parameters are set incorrectly, the output is defective and the order has to be redone.

The line parameters therefore have to be set correctly right from the start of producing an order and adjusted during the process to smooth out possible fluctuations in production.


What we had as input:


  1. A description of the technological process.
  2. Excel "sheets" with elements of VBA automation, containing the parameter values recorded when an order was produced. Different shifts and different plants mean slightly different filling styles.

A task that is hard with the usual tools is solved with R in a few steps:


  1. Adaptive import of the Excel data into a structured form.
  2. Building a predictive model with the RandomForest algorithm (a sketch follows this list).
  3. Reducing the parameter list based on the results of the RandomForest runs and general physical considerations about the process.
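A minimal sketch of steps 2 and 3, assuming the adaptive import has already produced a table df with one row per production run; the response column target, the number of retained variables and the new_settings table are illustrative assumptions, not the real names:

library(randomForest)

# df: one row per production run, ~50-60 process metrics plus the measured
# physical parameter of the output (called "target" here); names are invented.
set.seed(42)
fit <- randomForest(target ~ ., data = df, ntree = 500, importance = TRUE)

# Step 3: rank the metrics by importance and keep only the strongest ones
# (the real cut also took physical considerations about the process into account).
imp <- importance(fit, type = 1)                     # %IncMSE per predictor
top_vars <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:15]

fit_reduced <- randomForest(
  reformulate(top_vars, response = "target"),
  data = df, ntree = 500
)

# Predicting the output parameter for a new combination of line settings.
predict(fit_reduced, newdata = new_settings)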

The achieved accuracy of forecasting the output parameters, around 3-5%, turned out to be quite sufficient and adequate to the standards.


To be fair, it should be noted that all of this was done in proof-of-concept mode. After the solvability of the "unsolvable" problem had been demonstrated and the solution worked through in a bit more detail, the emphasis shifted to improving the procedures for measuring the process parameters. This is exactly the case where a quick application of R made it possible to clear secondary digital tasks out of the way and return to the primary problems of the physical world.


Technical details


  1. The entire code came to ~150 lines, of which 30% are comments and line breaks added to follow code conventions.
  2. Explicitly loaded set of packages:

library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(magrittr)
library(purrr)
library(stringi)
library(stringr)
library(tibble)
library(readxl)
library(iterators)
library(foreach)
library(doParallel)
library(zoo)
library(randomForest)

Case #3: Consolidated analytics and verification in a heterogeneous production environment


Formulation of the problem:


  1. There are hundreds of SAP exports in Excel format, broken down by hierarchical objects. The export formats are oriented towards human visual perception and are extremely inconvenient for machine processing.
  2. There are Excel exports from third-party systems, plus manually entered/digitized data.
  3. There is data in Access databases.

In total there may be several hundred thousand to a million records across all the objects to be analyzed.


It is necessary:


  1. Reconcile the data within an individual report against custom multi-parameter rules.
  2. Cross-reconcile data from the different sources against custom multi-parameter rules, with the discrepancies handed over to analysts.
  3. Prepare monthly consolidated analytical reports based on this entire array of information: hundreds of pages with various graphical and tabular views.
  4. Give analysts tools for quick ad-hoc manipulation of the data according to rules not known in advance.

Solutions considered before starting the R-based PoC:



Yes, the task is large-scale simply because there is a lot of data and many reports; it is not a two-day job.
To prove feasibility we spent a week on a proof of concept based on R.


The PoC covered:


  1. Importing the Excel data with support for "floating" data formats (a sketch follows this list).
  2. Importing the Access data.
  3. Implementing a subset of the checks (numeric, text, combined).
  4. Generating graphical/tabular views.
  5. Generating html/doc views with graphics, text and tables.
  6. An analyst interface for working with the data.
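For illustration, a rough sketch of points 1 and 3 using readxl and dplyr. The directory layout, the anchor cell "Object ID", the column names and the external_totals table (standing in for the third-party data) are all invented; the Access import done in the PoC is left out here:

library(readxl)
library(dplyr)
library(purrr)

# 1. Import hundreds of SAP/Excel exports with a "floating" layout: the header
#    row is searched for in every file instead of being assumed to be fixed.
read_export <- function(path) {
  raw <- read_excel(path, col_names = FALSE)
  header_row <- which(raw[[1]] == "Object ID")[1]    # invented anchor cell
  read_excel(path, skip = header_row - 1) %>%
    mutate(source_file = basename(path))
}

sap_data <- list.files("exports/sap", pattern = "\\.xlsx$", full.names = TRUE) %>%
  map_dfr(read_export)

# 3. One of the cross-reconciliation rules: per-object totals in the SAP
#    exports must match the totals coming from the third-party systems.
discrepancies <- sap_data %>%
  group_by(object_id) %>%
  summarise(sap_total = sum(amount, na.rm = TRUE)) %>%
  full_join(external_totals, by = "object_id") %>%
  filter(abs(sap_total - ext_total) > 0.01)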

Overall, the pilot's results confirmed the technical feasibility of all the requirements, outlined the development potential, and produced estimates of the labor costs of an R-based implementation. The estimates are an order of magnitude or more lower than with SAP, while the opportunities for automation and customization far exceed all the other solutions.


Technical details of the PoC


  1. The entire code is ~500 lines, of which 20% are comments and line breaks added to follow code conventions.
  2. Import time for the data arrays: seconds.
  3. Execution time of the checks: seconds.
  4. Time to generate the output reports (html/doc): fractions of a minute.
  5. Explicitly loaded set of packages:
     library(dplyr)
     library(lubridate)
     library(ggplot2)
     library(tidyr)
     library(magrittr)
     library(purrr)
     library(stringi)
     library(stringr)
     library(tibble)
     library(readxl)
     library(iterators)
     library(foreach)
     library(doParallel)
     library(zoo)
     library(gtable)
     library(grid)
     library(gridExtra)
     library(RColorBrewer)
     library(futile.logger)

Conclusion


Facts are facts. You can believe in R or not, you can doubt its capabilities, you can double-check a hundred times and wait until someone announces from the rostrum that this is how things must be done... But those who find the courage and patience to replace the old data-processing approaches with the R platform will get multiple savings and advantages here and now.


Previous post: “Using R for preparing and transmitting live analytics to other business units”



Source: https://habr.com/ru/post/315870/

