By now it is safe to say that the hype around Big Data and Data Science has subsided somewhat, and the expectation of a miracle, as usual, has been heavily corrected by the reality of the physical world. It is time for constructive work. A search on Habré for various relevant keywords turned up an extremely scarce collection of articles, so I decided to share the experience we have gained in the practical application of Data Science tools and approaches to solving everyday tasks in a company.
What is the connection between routine and integration tasks?
- Throughout the working day, both ordinary users and managers of various ranks mostly use IT as a means of making decisions and executing a set of ritual actions within a business process.
- Users are surrounded by several piecewise-integrated information systems, and to make a decision they have to look at a dozen sources, “massage” the data a bit, think, and then do something in Excel to the extent their familiarity with MS Office and mathematics allows.
- Any reaction more complicated than “filling in five screens and clicking Next-Next-Next” means going cap in hand to management to launch a mini-project for a week or two just to make corrective actions.
The classic approach to automating such tasks: bring in business-process consultants; form proposals for moving to a single platform with global integration; run analysis and selection, RFI/RFP, tenders; then a long implementation, ending with some result for a lot of money on a platform that has become obsolete during the rollout.
Of course, I am exaggerating a bit, but even the time and money spent on endless group meetings while working out a solution cost tens of millions of rubles in payroll, and by the end of the project many of the initiators are already working somewhere else.
Paradoxically, to satisfy the initial needs acceptably, it would often have been enough simply to perform a prompt local “stitching” of the data, process it, and visualize it conveniently. When you switch to real-world analogies and speak in terms of home repair and construction, everything is obvious to everyone, and nobody proposes building a new house just because the cat tore the wallpaper.
Therefore, we decided to use tools that already exist in the Data Science community to solve such problems. The minimum set that fully satisfied us: the R language; RStudio as the IDE; DeployR as the integration gateway; and Shiny as the client web application server. As for visualization, this is naturally not about pie charts but about modern ergonomic principles of presenting information, including interactive JS elements.
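To give a sense of how compact this stack is in practice, here is a minimal Shiny sketch; the sales.csv file and its columns are purely hypothetical, not data from any of our projects:

```r
# app.R -- a minimal Shiny sketch; sales.csv and its columns are hypothetical
library(shiny)
library(ggplot2)

# assumed columns: region (character), month (numeric), revenue (numeric)
sales <- read.csv("sales.csv")

ui <- fluidPage(
  titlePanel("Revenue by region"),
  selectInput("region", "Region:", choices = unique(sales$region)),
  plotOutput("trend")
)

server <- function(input, output) {
  output$trend <- renderPlot({
    # re-filters and redraws whenever the user picks another region
    ggplot(subset(sales, region == input$region),
           aes(x = month, y = revenue)) +
      geom_line() +
      labs(x = "Month", y = "Revenue")
  })
}

shinyApp(ui = ui, server = server)
```

Running `shiny::runApp()` in this directory is enough to get an interactive web page; no HTML or JS has to be written by hand.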
It is important that at the initial stage all products can be used in their open-source or community editions. If a task turns out to be solved so successfully that it needs to be scaled up and accelerated, each component has a commercial version at a very reasonable cost that removes the limitations of the free products.
What about Big Data?
While solving practical problems, we became convinced once again that the Big Data world is quite narrow and is in demand mainly by large IT or network companies. The original interpretation of the term Big Data as a volume of data that does not fit into a computer's RAM is, given the development of computing hardware, losing its meaning for ordinary tasks. You can put 16 GB into a laptop, ~500 GB into a server, and in the cloud you can even order a machine with 2 TB of DDR4 RAM + 4 TB of SSD (Amazon EC2 X1).
For convenience, within our workflow we adopted the term Compact Data for data that seems large but still fits into the computer's RAM. And in the real problems of ordinary companies, Compact Data is enough to make decisions with the required accuracy and speed.
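To illustrate the point: a few gigabytes of CSV are processed comfortably entirely in RAM, for example with data.table. A minimal sketch, where the file name and columns are hypothetical:

```r
library(data.table)

# fread() pulls multi-gigabyte CSVs into RAM in minutes;
# "transactions.csv" and its columns are hypothetical
dt <- fread("transactions.csv")   # e.g. ~10 GB fits on a 16 GB laptop

# typical "stitching": aggregate by key entirely in memory
summary_dt <- dt[, .(total = sum(amount), n = .N),
                 by = .(customer_id, month)]
```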
For reference, colleagues from Google shift the conversation from the spatial dimension to the temporal one altogether:
“For me, the term Big Data is not about the size of the data. It's about turning hours of hard analytical work into processing that takes seconds,” says Felipe Hoffa, a Google software engineer.
R success stories
As the first success story: just this week we displaced yet another BI system. Quite unexpectedly, it turned out that management was dissatisfied with the reporting available from the existing systems. So, over the course of six months, a survey, analysis, and even a pilot of the short-listed BI system were conducted; the supply and implementation contract was already on management's table. At the last moment we got our foot in the door and asked for 3-4 days to build an alternative stand based on the R toolset. In those few days we managed to roughly reproduce all of the rather poor functionality of the BI pilot (partly on synthetic data from third-party IS), add a lot of extra analytics to the dashboards, uncover a couple of holes in a unit's performance, and bolt on predictive analytics. As a result, a week later the BI contract went where it belonged (the trash can), and we received carte blanche for implementation. Six months later the project reached a frozen, mature state, having satisfied everything management and users wanted. Along the way we also derailed another tender for expanding an existing system (worth, for the record, almost ~$400k) and did everything the business needed ourselves.
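The predictive analytics bolted onto that stand was nothing exotic. As a rough illustration of the kind of forecasting involved, here is a sketch on a built-in R dataset; the forecast package is one common choice, not necessarily the one used on the stand:

```r
library(forecast)

# AirPassengers is a built-in monthly time series; on the real stand
# the input came from corporate IS instead
fit <- auto.arima(AirPassengers)  # automatic ARIMA model selection
fc  <- forecast(fit, h = 12)      # 12-month horizon
plot(fc)                          # point forecast with confidence bands
```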
The next Data Science case arose in the context of the fashionable topic of “smart farming”, namely controlling the watering of plants. The simple question “How many liters should we pour?” spawns a whole bunch of tasks: calibrating and collecting data from a variety of sensors whose readings arrive irregularly and are extremely inaccurate (soil moisture, for example, is notoriously hard to measure); optimizing the geolocation of those sensors; building a weighted weather forecast from freely available data; and a complex physico-mathematical model of the plant's water exchange depending on current conditions. All of this also has to be displayed clearly, intelligibly, and interactively on the agronomist's computer. After about three months of work the prototype was assembled, all of it built with the aforementioned R + bash toolset.
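As a sketch of one of those sub-tasks, here is how irregular, noisy soil-moisture readings might be smoothed before feeding them into the water-exchange model. The data below is simulated, and loess stands in for whatever calibration procedure was actually used:

```r
# simulate hypothetical noisy soil-moisture readings at irregular timestamps
set.seed(42)
t        <- sort(runif(200, 0, 72))            # observation times, hours
moisture <- 30 + 5 * sin(t / 12) + rnorm(200, sd = 3)  # % moisture + noise

# robust local smoothing to suppress sensor noise and outliers
fit      <- loess(moisture ~ t, span = 0.3, family = "symmetric")
grid     <- seq(0, 72, by = 1)
smoothed <- predict(fit, newdata = data.frame(t = grid))

plot(t, moisture, col = "grey", pch = 16,
     xlab = "Hours", ylab = "Soil moisture, %")
lines(grid, smoothed, lwd = 2)
```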
What makes R more attractive than the various point-and-click tools?
- It is a full-fledged programming language. The latest packages from Hadley Wickham have lifted R's usability to a near-space orbit, and support for functional programming keeps expanding (see the sketch after this list).
- A wide range of mathematical packages and algorithms.
- It fits naturally into DevOps: sources in git, an autotest mechanism, self-documentation capabilities (R Markdown), collaboration, and agile methodologies.
- StackOverflow community.
- ... and many other goodies.
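A small taste of what the Wickham toolchain and R's functional style look like in practice; dplyr and purrr on a hypothetical data frame:

```r
library(dplyr)
library(purrr)

# hypothetical per-department quarterly revenue
sales <- tibble(
  dept    = rep(c("A", "B", "C"), each = 4),
  quarter = rep(1:4, times = 3),
  revenue = runif(12, 100, 500)
)

# pipeline style: readable, testable steps
sales %>%
  group_by(dept) %>%
  summarise(total  = sum(revenue),
            best_q = quarter[which.max(revenue)])

# functional style: fit one trend model per department
models <- sales %>%
  split(.$dept) %>%
  map(~ lm(revenue ~ quarter, data = .x))
map_dbl(models, ~ coef(.x)[["quarter"]])   # per-department trend slopes
```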
Findings
Work is still in progress on a couple of tasks, but the experience gained so far lets us confidently take on almost any local “stitching” task. In general, the way ordinary users perceive R's capabilities can be summed up with a picture like this:
To summarize the experience: such “stitching” is in demand almost everywhere. The main thing is to look at the problem with fresh eyes (recall the TRIZ literature and the inventions that arrived a century or two late), and management should not be afraid to take a chance. The fundamental thesis at the start of such activity is to advance in small steps.
In the ideal case, the work results in a small component (see the sketch after the list) that:
- collects data from all the necessary sources and performs complex processing behind the scenes;
- gives the user an attractive interactive picture (a wow effect is desirable, but not an end in itself);
- in addition to the picture, provides a detailed interactive report and recommendations on choosing the optimal solution;
- where possible, independently performs the required changes in other information systems (the beginnings of operational analytics).
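A skeleton of such a component, as it might look when launched from cron or bash. Every name here (database, table, report file, endpoint) is a hypothetical placeholder, not a description of any specific deployment:

```r
# etl_report.R -- hypothetical skeleton of a "stitching" component,
# run e.g. from cron: Rscript etl_report.R

library(DBI)
library(rmarkdown)

# 1. collect data from the necessary sources (connection details assumed)
con <- dbConnect(RPostgres::Postgres(), dbname = "erp")
raw <- dbGetQuery(con,
  "SELECT * FROM orders WHERE ts > now() - interval '1 day'")
dbDisconnect(con)

# 2. behind-the-scenes processing (placeholder for the real logic)
enriched <- transform(raw, margin = revenue - cost)

# 3. render an interactive report with recommendations (report.Rmd assumed)
render("report.Rmd", params = list(data = enriched))

# 4. where possible, push changes back to other IS (hypothetical endpoint)
# httr::POST("https://crm.example.com/api/update", body = enriched)
```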
The scope of work is deliberately limited to two months at most. Iterative and interactive development usually makes it possible, within this term, to conceptually solve local problems in the gaps between various IS. After the work is complete, the resulting component has to be “road-tested” in real business processes and its effect compared with expectations. If tasks remain, or new ones appear, prioritize them and begin a new iteration.
It is important that each iteration:
- is based on real business needs;
- brings a real business effect;
- is finished and self-sufficient.
At the same time there is no overhead of heavyweight project management, the task stays observable in scale, and only the minimum necessary documentation is created.
Once again: the Data Science involved is hardly about complex mathematical algorithms applied to Big Data. Real business tasks are much more prosaic, but the benefit of solving them can be very, very large, and R tools together with Data Science approaches can be a great help here.
The most remarkable thing is that the intrigue lasts to the very end. You never know in advance what the next step or the next request will be, and competent hands plus a bright head can not only fix current shortcomings but also suggest new business opportunities.
Next post: “The R ecosystem as a tool for automating business tasks”.