Since the last publication, I have had occasion to try R on a number of quite different tasks, all related in one way or another to data processing. The tasks are completely different, yet in every case R tools allowed them to be solved elegantly and effectively. Below are the actual cases (no pictures).
The formulation of the project problem is simple: conduct an expert analysis of documents that are available only via requests to a web portal, and prepare a primary report as input to an analytical report. It sounds simple, if not for the following nuances:
A group of "classical" analysts rolled up their sleeves and plunged into overtime reading and manual processing of the documents. There was neither the time nor the desire to convince them of the inferiority of this approach.
We went the other way, namely web scraping by means of R. There is plenty of material on web scraping on the Internet and on Habr, so I won't repeat it. An example of such a publication on Habr: "Web Scraping using python".
In the classical approach, Perl/Python is used for such tasks, but we decided not to mix tools and to field-test R instead. The problem was solved more than successfully, with the following final metrics:
Even though the documents were generated dynamically (JS), we managed to find a loophole: the first page could be fetched directly, and it contained all the necessary attributes. This let us do without Selenium/PhantomJS and greatly accelerated data collection.
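To illustrate the idea, here is a minimal sketch of that kind of direct request with httr and rvest (both appear in the library list below). The URL, query parameters, and CSS selectors are hypothetical, stand-ins for the real portal:

# A sketch of the non-Selenium approach: request the first page directly
# over HTTP and pull the attributes out of the static HTML.
# The URL, query parameters and CSS selectors below are hypothetical.
library(httr)
library(rvest)
library(dplyr)

get_doc_attrs <- function(doc_id) {
  resp <- GET("https://portal.example.com/docs",
              query = list(id = doc_id, page = 1))
  stop_for_status(resp)
  page <- read_html(content(resp, as = "text", encoding = "UTF-8"))
  tibble(
    id    = doc_id,
    title = page %>% html_node(".doc-title") %>% html_text(trim = TRUE),
    date  = page %>% html_node(".doc-date")  %>% html_text(trim = TRUE)
  )
}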
Moreover, since web requests run in asynchronous mode, the task was parallelized (by means of R). The execution time of ~8 hours refers to 4 cores; run sequentially, it would have taken about 3 times longer.
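The parallelization itself takes only a few lines with foreach/doParallel (both in the library list below); a sketch, where the vector doc_ids and the helper get_doc_attrs() from the previous snippet are assumptions:

# Parallel collection over 4 cores with foreach/doParallel.
# doc_ids and get_doc_attrs() are hypothetical, as above.
library(dplyr)
library(foreach)
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

docs <- foreach(id = doc_ids, .combine = bind_rows,
                .packages = c("httr", "rvest", "dplyr")) %dopar%
  get_doc_attrs(id)

stopCluster(cl)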
The group of "classical" analysts continues to work hard ...
library(lubridate)
library(dplyr)
library(tidyr)
library(tibble)
library(readr)
library(stringi)
library(stringr)
library(jsonlite)
library(magrittr)
library(curl)
library(httr)
library(xml2)
library(rvest)
library(iterators)
library(foreach)
library(doParallel)
library(future)
The problem statement here is also very simple.
There is a technological line producing a raw product. During the technological process the raw material undergoes various kinds of physical treatment: pressing, molding, chemical processing, heating... Metrics for the individual processes are measured at different intervals and in different ways, some automatically, some manually (for example, laboratory control of the raw materials). In total there are ~50-60 metrics. The duration of the process is 1.5-2 hours.
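Because the metrics arrive at different intervals, the first practical step is to align them onto a common time grid. A sketch of one way to do this with zoo (which appears in the library list below); the data frame raw and its columns time, pressure, temp are illustrative:

# Aligning irregularly sampled process metrics onto a common 1-minute grid.
# The data frame layout (time, pressure, temp) is illustrative.
library(zoo)
library(lubridate)

z <- zoo(raw[, c("pressure", "temp")], order.by = raw$time)
grid <- seq(floor_date(min(raw$time), "minute"),
            ceiling_date(max(raw$time), "minute"),
            by = "1 min")
# Linear interpolation of each metric at the grid points
z_aligned <- na.approx(z, xout = grid)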
The task is to obtain output with the specified physical parameters.
Difficulties arise as soon as we dig into the details:
It is necessary to set the line parameters correctly right from the start of production of an order and to adjust them as the process requires, to smooth out possible fluctuations in the production run.
What we had as input:
The task, intractable by usual means, was solved by means of R in 2 steps:
The obtained accuracy of forecasting the output parameters, a 3-5% error, turned out to be quite sufficient and adequate to the standards.
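Since randomForest appears in the library list below, the predictive step can be sketched along these lines; the training frame train, the hold-out frame test, and the column output_param are hypothetical names:

# Forecasting an output parameter from aligned line metrics
# with a random forest. train/test/output_param are hypothetical.
library(randomForest)

set.seed(42)
fit <- randomForest(output_param ~ ., data = train,
                    ntree = 500, importance = TRUE)

# Which line metrics drive the output parameter the most?
varImpPlot(fit)

# Mean relative error on a hold-out set, comparable to the 3-5% figure
pred <- predict(fit, newdata = test)
mean(abs(pred - test$output_param) / test$output_param)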
To be fair, it should be noted that all of this was done in "proof-of-concept" mode. After demonstrating the solvability of the "unsolvable" problem and studying the solution in somewhat more detail, the emphasis shifted toward improving the procedures for measuring the technical parameters. This is exactly the case where the prompt application of R made it possible to clear the mind of secondary digital tasks and return to the primary problems of the physical world.
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(magrittr)
library(purrr)
library(stringi)
library(stringr)
library(tibble)
library(readxl)
library(iterators)
library(foreach)
library(doParallel)
library(zoo)
library(randomForest)
Formulation of the problem:
In aggregate, there may be several hundred thousand to a million records across all the objects to be analyzed.
It is necessary:
Solutions considered before starting the R-based PoC:
Yes, the task is large-scale, simply because there is a lot of data and many reports; it is not something for 2 days.
To prove feasibility, we spent a week on a proof of concept based on R.
The PoC covered:
In general, the pilot's results confirmed the technical feasibility of all requirements, outlined the development potential, and produced labor-cost estimates for an implementation based on R. The estimates are lower by an order of magnitude or more compared to SAP, and the opportunities for automation and configuration far exceed all the other solutions.
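As a flavor of the reporting side of the PoC, here is a minimal sketch of assembling a multi-panel report page with ggplot2 and gridExtra (both in the library list below); the data frame kpi and its columns date, unit, value are illustrative:

# Assembling a two-panel report page with ggplot2 + gridExtra.
# The data frame `kpi` and its columns are illustrative.
library(ggplot2)
library(gridExtra)

p1 <- ggplot(kpi, aes(date, value)) +
  geom_line() +
  labs(title = "KPI over time")

p2 <- ggplot(kpi, aes(unit, value)) +
  geom_col() +
  labs(title = "KPI by business unit")

# Write both panels side by side into a PDF report
pdf("report.pdf", width = 10, height = 6)
grid.arrange(p1, p2, ncol = 2)
dev.off()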
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(magrittr)
library(purrr)
library(stringi)
library(stringr)
library(tibble)
library(readxl)
library(iterators)
library(foreach)
library(doParallel)
library(zoo)
library(gtable)
library(grid)
library(gridExtra)
library(RColorBrewer)
library(futile.logger)
Facts are facts. You can believe in R or not believe in it; you can doubt its possibilities; you can check a hundred times and wait until you are told from the stands that this is the way to act... But those who summon the courage and patience to replace the old data-processing approaches with the R platform will receive multiple savings and advantages here and now.
Previous post: “Using R for preparing and transmitting live analytics to other business units”
Source: https://habr.com/ru/post/315870/