Harnessing R for business, as easy as "1-2-3"

This post is essentially a summary of the previous "technical" publications [1, 2, 3, 4, 5] and of the discussions they prompted. Those discussions showed that there are a great many tasks in which R could be of real help to business. However, even where R is already used, its modern capabilities are not always put to work.

The niche for applying R in business is open and highly relevant, both in the West and in Russia.

Why is this statement particularly interesting for Russia?

1. An active process of import substitution is underway, controlled at the level of ministries and the government. Many will prefer to use software from the approved list rather than invent justifications and prove the need to acquire foreign software, even if it is slightly better than its Russian counterparts. And state-owned companies are scrutinized an order of magnitude more closely.

2. Unfortunately, there are no reports that the crisis has ended and recovery has begun. Belt-tightening, on the other hand, is very much in evidence. So budgets for expensive toys are not really in sight. At the same time, nobody has cancelled the need to solve problems; if anything, the number and ambition of the tasks only grow.

3. Business rightly believes that if IT does not create competitive advantages, then it should at least help to quickly prepare the information field for manual or automated decision making. At the same time, requests from business are often too prosaic and modest to warrant bringing in Nobel laureates to solve them.

In fact, to navigate the “digital” sea with confidence, a business only needs local “stitching” of its information space and interactive presentations that simplify decision making for a very limited set of questions and processes.

In general, the decision making situation can be described as follows:

In every company, every employee makes many important business decisions every day.
There is little time to make a decision (seconds to hours).
The question is not always formulated in a clear and unambiguous form.
Decision making may require complex mathematical processing of data.

Technologically, this process is described by the chain “Collection - Processing - Modeling and Analysis - Visualization / Export”.
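
As a rough illustration of this chain, here is a minimal sketch. The post's own stack is R; Python is used here only to keep the example self-contained. The sample data, column names and function names are all hypothetical and do not come from the project.

```python
import csv
import io
from statistics import mean

# Hypothetical raw input: soil moisture readings with some dirty rows.
RAW = """sensor,moisture
s1,31.5
s1,
s2,28.0
s1,32.5
s2,not_a_number
"""


def collect(text):
    """Collection: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def process(rows):
    """Processing: keep only rows whose moisture value parses as a number."""
    clean = []
    for row in rows:
        try:
            clean.append((row["sensor"], float(row["moisture"])))
        except (TypeError, ValueError):
            continue
    return clean


def analyze(readings):
    """Modeling and analysis: here, just the mean moisture per sensor."""
    by_sensor = {}
    for sensor, value in readings:
        by_sensor.setdefault(sensor, []).append(value)
    return {sensor: mean(values) for sensor, values in by_sensor.items()}


def present(summary):
    """Visualization / export: format the result as text lines."""
    return [f"{sensor}: {value:.1f}" for sensor, value in sorted(summary.items())]


report = present(analyze(process(collect(RAW))))
```

Each stage takes the previous stage's output and nothing else, so any one of them can be replaced (a different source, a different model, a different output) without touching the others.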

Because the “stitching” is local, using powerful industrial ETL / BI / Big Data solutions is completely unjustified, both technically and economically.

To plant a bed of carrots, you do not need to plow a dozen hectares of land.

On the other hand, such a context is a very comfortable setting for the R ecosystem, and tasks of this kind are handled with ease. For business, the “1-2-3” approach can be summarized as follows (business likes pictures):

When using R, it makes almost no technological difference what the data sources and formats are, how clean the data is, or what to draw and how. Almost anything is possible. The main thing is to have a clearly formulated business problem.

Back to the practical example

As a demonstration of the applicability of this approach, let us return to the topic raised earlier in the post “Data Science Tools as an Alternative to the Classical Integration of IT Systems”, namely the agronomist console, in one of the subtasks of the modern field of “Precision Agriculture”.

The subtask itself sounds quite prosaic: “Optimize field irrigation taking into account the characteristics of the cultivated crop, phenological phases and climatic conditions (past, present, forecast) in order to improve crop quality and reduce costs.”

Naturally, the IT analytics subsystem is just one of several. The full system also covers: choosing the optimal method for, and directly measuring, the physical indicators of soil moisture (difficult in itself) and environmental parameters; autonomous sensor operation and telemetry transmission over a radio channel at field scale (units to tens of kilometers); low cost, compactness and operation without a battery change for the whole season; optimizing sensor placement and protecting the sensors from various influences, including the heightened interest of local residents; and accounting for the water balance in plants (roughly, absorption versus evaporation). All of these tasks are beyond the scope of this publication.

So, the agronomist console. Everything is built on R + Shiny + DeployR. A working version of the console is shown in the following screenshot:

Everything looks simple and trivially smooth until you get into the details. And it is precisely in the details that the proposed approach to local data stitching shows itself.

1. There is no global repository with a rigid data model containing all of the information. Instead, there is a set of autonomous or semi-autonomous subsystems, each holding a subset of the information in its own format.

2. Since the agronomist console and the information it displays are needed only when someone is watching, the application itself acts as the infinite loop for the message dispatcher. The console is dynamic: the need for recalculation is checked on a timer, and the elements are updated automatically using the reactive elements of the Shiny platform. Meanwhile, the autonomous operational analytics runs on the R server independently of the console.

3. Current weather. The data is taken from several sources, including web sources (REST API) and actual sensors in the field (log / csv + git). Since the sensors in the field conserve battery power and report on their own schedule, their data reaches the console asynchronously. A git repository is used as the field data store.

4. For online analysis, the interface includes, among other things, controls that select the displayed data slices. Everything is recalculated whenever the settings change.

5. GIS map with the sensors installed in the field. A multi-layered map (here with OpenStreetMap as the base layer) with the field sensor infrastructure overlaid, together with dynamically recalculated indicators for those sensors: current status, current readings, time of the last reading. Sensor metadata is obtained from the cloud-based inventory system for IoT equipment. Because of the rather complex internal logical structure of the IoT platform's objects, obtaining data about the sensors requires a chain of 3-4 REST API requests with intermediate processing.

6. Tabular view for event information: readings, logs, recommendations, problems, forecasts. Each type of output is either obtained from a separate source (connection, collection, parsing, preprocessing) or is the result of mathematical algorithms (for example, forecasts and recommendations).

7. The data area (on the right) combines information obtained and processed from a dozen different sources into a single console:

Historical weather data. Data from field sensors (txt + git) and weather data from open web sources are used. Because free accounts (judging by a preliminary analysis of several web sources) offer nowhere near a deep history, and the idea of paying $100-150 a month for access to weather data does not appeal to agricultural producers, a separate process was set up to accumulate historical data from the web sources by monitoring current values (REST API -> txt + git). And, of course, when data from different sources conflict, the conflict has to be resolved. As one of the main sources, we settled on Open Weather Map (OWM).
The forecast part also raised a number of questions. Different sources provide different information at different granularities. Not all sources give a precipitation forecast in mm. Of those that do, not all give it hourly. Some return aggregates over several hours. These have to be reconciled somehow.
In particular, when precipitation is requested, OWM returns a 3-hour aggregate in mm, starting from the moment the reading was recorded. For past data, that moment can itself be arbitrary. The result is an irregular time series of 3-hour aggregates with many repetitions, from which the hourly picture has to be restored.
Sensor data arrives through various channels. The sensors themselves “live” in asynchronous mode (to save battery), so their data arrives as a stream, with no possibility of forced polling. Unreliable communication channels (everything is out in the field, sometimes in an area of poor coverage) and different versions of the sensor hardware platforms mean that, for analysis, data has to be collected from all potential stores. Currently, sensor data arrives in git (structured data and logs) and in the cloud-based IoT device management platform.
The sensor data (and there are not 2 and not 5 sensors in the field) undergoes preliminary mathematical processing. Because of the specifics of measuring soil moisture and the impossibility of direct measurement (with a certain caveat for NMR or radiometric methods), the result of an indirect measurement depends strongly on the structural properties of the soil. The reliability of each sensor's readings has to be determined from its individual calibration curves as well as from historical data, expected values, data on irrigation performed and information from other sensors in the field.
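
One possible way to restore an hourly picture from repeated, irregularly timed 3-hour precipitation aggregates, as described above for OWM, is sketched below. This is not the method used in the project, which is built in R and is not described in detail; Python is used here only to keep the example self-contained. The sketch assumes that each record is a (window start, total mm) pair, that rain falls uniformly within a window, and that overlapping windows can simply be averaged. The function name and record layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=3)
HOUR = timedelta(hours=1)


def restore_hourly(records):
    """Estimate hourly precipitation (mm) from overlapping 3-hour totals.

    records: iterable of (window_start: datetime, total_mm: float).
    Returns {hour_start: estimated_mm}, ordered by hour.
    """
    unique = set(records)  # drop verbatim repetitions
    weighted_sum = defaultdict(float)   # hour -> sum(rate * overlap)
    overlap_total = defaultdict(float)  # hour -> sum(overlap)

    for start, total_mm in unique:
        rate = total_mm / 3.0  # mm per hour, ASSUMING uniform rainfall
        end = start + WINDOW
        hour = start.replace(minute=0, second=0, microsecond=0)
        while hour < end:
            # Fraction of this clock hour covered by the window.
            covered_from = max(hour, start)
            covered_to = min(hour + HOUR, end)
            overlap = (covered_to - covered_from).total_seconds() / 3600.0
            weighted_sum[hour] += rate * overlap
            overlap_total[hour] += overlap
            hour += HOUR

    # Overlap-weighted mean rate per clock hour, read as mm for that hour.
    return {
        hour: weighted_sum[hour] / overlap_total[hour]
        for hour in sorted(weighted_sum)
    }


observations = [
    (datetime(2016, 9, 1, 0, 0), 3.0),
    (datetime(2016, 9, 1, 0, 0), 3.0),  # repeated record
    (datetime(2016, 9, 1, 2, 0), 6.0),  # overlaps the first window at 02:00
]
hourly = restore_hourly(observations)
```

In this example the 02:00 hour is covered by both windows, so its estimate is the average of the two implied rates. A real implementation would also have to decide how to treat hours with no coverage and records that contradict each other.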

Conclusion

In the West, the R community, like the range of tasks it tackles, is growing exponentially. Open source is advancing actively. To keep up with what is new in R, the R-bloggers aggregator is a good starting point. For example, a very interesting recent business post: “Using R to detect fraud at 1 million transactions per second”.

In Russia, all the prerequisites for using R in business tasks are in place, but the community is still relatively weak. On the other hand, Habr's active and inquisitive audience is the best conduit for modern IT in our country.

It is time to try solving the problems in your companies in a new way, with new tools, and to start sharing your experience. Open discussion of questions and unclear points will only help.

P.S. By the way, dplyr semantics are now available for working with Apache Spark. The sparklyr package has been released, providing this transparency.

Previous post: "Is R not fast enough for you? Looking for hidden reserves"
Next post: "Using R for preparing and transmitting live analytics to other business units"

Source: https://habr.com/ru/post/311330/
