Big Data technologies were created as an answer to the question "how do we process a lot of data?" But what if the sheer volume of information is not the only problem? In industry and other demanding applications one often has to deal with big data of complex and variable structure, scattered across disparate stores. There are problems whose solution method is not known in advance, and the analyst needs tools for exploring the source data, or the results of computations over it, without involving a programmer. What is needed are tools that combine the functional power of BI systems (or, better, surpass it) with the ability to process huge amounts of information.
One way to obtain such a tool is to build a logical data mart. In this article we will discuss the concept behind this solution and demonstrate a software prototype.
For our story we need a simple example of a complex task. Consider an industrial facility with a huge amount of equipment fitted with various sensors that regularly report information on its condition. For simplicity, we will consider only two units, a boiler and a tank, and three sensors: boiler temperature, tank temperature, and boiler pressure. The sensors are monitored by process control systems from different manufacturers and feed their data into different stores: information about the boiler's temperature and pressure goes to HBase, while data about the tank's temperature is written to log files located in HDFS. The following diagram illustrates the data collection process.

Besides the specific sensor readings, the analysis requires a list of the sensors and of the devices on which they are installed. Let us estimate the order of magnitude of the information entities we would deal with at a real enterprise:
Entity | Number of records | Storage
---|---|---
Equipment units | Thousands | Master data system
Sensors | Hundreds of thousands | PostgreSQL database
Sensor readings | Tens of billions per year (the question of retention depth is not considered in this article) | Files in HDFS, HBase
The ways of storing data of different types depend on their volume, structure, and the required access mode. Here we deliberately chose these particular technologies to create a motley mix, but at real enterprises there is usually no freedom to choose them: everything depends on the established IT landscape. The analytical system has to bring this whole "zoo" under one roof.
Suppose we want to give the analyst the ability to ask queries of this kind:
- Which units of oil-filled equipment operated at temperatures above 300 degrees during the last week?
- Which equipment is currently in a state outside its operating range?
It is impossible to build and program all such queries in advance. Executing any of them requires joining data from various sources, including sources outside our model example: for instance, reference data on the operating ranges of temperature and pressure for different equipment types, faceted classifiers that determine which equipment is oil-filled, and so on may come from outside. The analyst formulates all such queries in terms of the conceptual model of the subject area, that is, in exactly the terms in which he thinks about the operation of his enterprise. For representing conceptual models in electronic form there is a stack of semantic technologies: the OWL language, triple stores, and the SPARQL query language. Since we cannot discuss them in detail in this article, we refer the reader to a Russian-language source.
So, our analyst will formulate queries in terms familiar to him and receive data sets in return, no matter which source the data is extracted from. Let us consider an example of a simple query that can be answered with our set of information. Suppose the analyst is interested in the **equipment** on which the installed **sensors** simultaneously **measured** a *temperature* above 400 °C and a *pressure* above 5 MPa during a specified period of time. In this phrase we have highlighted in bold the words corresponding to the entities of the information model: equipment, sensor, measurement. The attributes and relationships of these entities are in italics. Our query can be represented as the following graph (under each data type we have indicated the store in which it resides):

Looking at this graph, the query execution plan becomes clear. First, we need to filter the temperature measurements of the specified period with values above 400 °C, and the pressure measurements with values above 5 MPa; then we need to find among them those that were made by sensors installed on the same piece of equipment and that were taken at the same time. This is exactly what the data mart will do.
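To make this concrete, here is a rough sketch of how such a query could be expressed in SPARQL and executed over the temporary result graph with Apache Jena's ARQ engine. All class and property names (the ex: vocabulary), as well as the date range, are invented for illustration; a real conceptual model would define its own terms.

```java
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;

public class EquipmentQueryExample {
    // Hypothetical vocabulary: a real domain model would define its own class and property names.
    static final String QUERY =
        "PREFIX ex:  <http://example.org/plant#>\n" +
        "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n" +
        "SELECT DISTINCT ?equipment WHERE {\n" +
        "  ?tSensor ex:installedOn ?equipment ; ex:measures ex:Temperature .\n" +
        "  ?pSensor ex:installedOn ?equipment ; ex:measures ex:Pressure .\n" +
        "  ?tMeas ex:madeBy ?tSensor ; ex:value ?t ; ex:timestamp ?ts .\n" +
        "  ?pMeas ex:madeBy ?pSensor ; ex:value ?p ; ex:timestamp ?ts .\n" +   // same ?ts = simultaneous
        "  FILTER (?t > 400 && ?p > 5)\n" +
        "  FILTER (?ts >= \"2019-04-01T00:00:00\"^^xsd:dateTime &&\n" +
        "          ?ts <  \"2019-04-08T00:00:00\"^^xsd:dateTime)\n" +
        "}";

    public static void main(String[] args) {
        // The temporary graph assembled by the data mart; empty here, just to show the API.
        Model temporaryGraph = ModelFactory.createDefaultModel();
        try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(QUERY), temporaryGraph)) {
            ResultSetFormatter.out(System.out, qe.execSelect());
        }
    }
}
```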
The scheme of our system will be as follows:

The system operates in the following order (a minimal code sketch of this loop is given after the list):
- the analyst formulates a query;
- the logical data mart represents it as a query against the graph;
- the mart determines in which sources the data needed to answer this query resides;
- the mart issues sub-queries for the source data to the different sources, applying the necessary filters;
- it receives the answers and integrates them into a single temporary graph;
- it post-processes the graph, for example by applying inference rules;
- it executes the original query against the graph and returns the answer to the analyst.
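The listing below is a minimal, purely illustrative sketch of this loop in Java with Apache Jena. The SourceAdapter interface, the way sub-queries (fragments) are passed in, and all names are assumptions made for the example; splitting the analyst's query into fragments is left out of the sketch.

```java
import java.util.List;
import org.apache.jena.query.*;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

/** Hypothetical adapter: wraps one physical source (HBase, files in HDFS, PostgreSQL, a service, ...). */
interface SourceAdapter {
    boolean canAnswer(Query fragment);   // does this source hold data relevant to the fragment?
    Model fetch(Query fragment);         // execute a source-specific query and return the result as RDF
}

class LogicalDataMart {
    private final List<SourceAdapter> sources;
    private final List<Rule> rules;      // inference rules taken from the model repository

    LogicalDataMart(List<SourceAdapter> sources, List<Rule> rules) {
        this.sources = sources;
        this.rules = rules;
    }

    ResultSet answer(Query analystQuery, List<Query> fragments) {
        // steps 3-5: query only the sources that can contain the data, merge the answers into a temporary graph
        Model temporaryGraph = ModelFactory.createDefaultModel();
        for (Query fragment : fragments) {
            for (SourceAdapter source : sources) {
                if (source.canAnswer(fragment)) {
                    temporaryGraph.add(source.fetch(fragment));
                }
            }
        }
        // step 6: post-processing - apply the inference rules to the temporary graph
        InfModel enriched = ModelFactory.createInfModel(new GenericRuleReasoner(rules), temporaryGraph);
        // step 7: execute the original query over the enriched graph and return the answer
        try (QueryExecution qe = QueryExecutionFactory.create(analystQuery, enriched)) {
            return ResultSetFactory.copyResults(qe.execSelect());
        }
    }
}
```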
All the "magic", of course, happens inside the mart. Let us list the key points.
1. In the Apache Jena triple store (any other could be used) we store both the domain model itself and the settings for mapping it to the data sources. Thus, through the information model editor we define the set of terms in which queries are built (device, sensor, etc.), as well as the service information about where the actual data behind them is obtained. The following image shows the class tree of the demo model in our ontology editor (on the left) and one of the forms for configuring the data mapping to a source (on the right).

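To give an idea of how the domain model and the mapping settings can live side by side in the triple store, here is a hedged sketch using Jena TDB2. The map: vocabulary (HBaseMapping, forClass, table, columnFamily) is invented for the example; the real editor maintains its own service ontology.

```java
import org.apache.jena.query.Dataset;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.tdb2.TDB2Factory;
import org.apache.jena.vocabulary.OWL;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class ModelRepositoryExample {
    public static void main(String[] args) {
        String EX  = "http://example.org/plant#";     // domain model namespace (assumed)
        String MAP = "http://example.org/mapping#";   // invented mapping vocabulary

        Dataset dataset = TDB2Factory.connectDataset("target/model-store");
        dataset.begin(ReadWrite.WRITE);
        try {
            Model m = dataset.getDefaultModel();
            // Domain model: a term the analyst sees in the query builder.
            m.createResource(EX + "TemperatureMeasurement")
             .addProperty(RDF.type, OWL.Class)
             .addProperty(RDFS.label, "Temperature measurement");
            // Mapping settings: where the actual records behind this term live.
            m.createResource(EX + "boilerTemperatureMapping")
             .addProperty(RDF.type, m.createResource(MAP + "HBaseMapping"))
             .addProperty(m.createProperty(MAP + "forClass"), m.createResource(EX + "TemperatureMeasurement"))
             .addProperty(m.createProperty(MAP + "table"), "sensor_readings")
             .addProperty(m.createProperty(MAP + "columnFamily"), "boiler_t");
            dataset.commit();
        } finally {
            dataset.end();
        }
    }
}
```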
2. In our example, data of the same type (temperature measurements) is stored in two different sources at once: HBase and a text file in HDFS. However, to execute the query above there is no need to access the file at all, since it certainly contains no useful information: the file stores measurements of the tank's temperature, and pressure in the tank is not measured. This gives an idea of how the query optimizer should work.
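A hedged sketch of such pruning: before dispatching a sub-query, the mart consults the mapping metadata to see whether a source can, in principle, contain the requested combination of measured quantity and equipment unit. All names here are illustrative, and the source descriptions are assumed to come from the mapping information described above.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical source description extracted from the mapping metadata (Java 16+ record). */
record SourceDescriptor(String name, List<String> quantities, List<String> units) {}

class SourcePruner {
    /** Keep only the sources that can, in principle, hold measurements of the given quantity on the given unit. */
    static List<SourceDescriptor> relevant(List<SourceDescriptor> all, String quantity, String unit) {
        return all.stream()
                  .filter(s -> s.quantities().contains(quantity) && s.units().contains(unit))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SourceDescriptor> sources = List.of(
            new SourceDescriptor("HBase",     List.of("Temperature", "Pressure"), List.of("Boiler")),
            new SourceDescriptor("HDFS logs", List.of("Temperature"),             List.of("Tank")));
        // A request for boiler pressure never touches the HDFS log file.
        System.out.println(relevant(sources, "Pressure", "Boiler"));
    }
}
```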
3. The data mart not only assembles and links information from various sources, but also draws logical conclusions from it according to specified rules. Automating logical inference is one of the main practical advantages of semantics. In our example, rules are used to solve the problem of deriving conclusions about the state of a device from the measurement data. Temperature and pressure arrive in two different entities of the "Measurement" type, and to describe the state of the device they have to be combined. The logical rules are applied to the contents of the temporary result graph and generate new information in it that was not present in the sources.
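As an illustration, a rule of this kind could look roughly as follows in Jena's generic rule syntax (wrapped in Java). The property names and the idea of matching measurements by an identical timestamp are assumptions for the example; the real model may combine measurements differently.

```java
import java.util.List;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class DeviceStateRuleExample {
    // Illustrative rule: a temperature and a pressure measurement taken on the same device
    // at the same moment are combined into a new "device state" node in the temporary graph.
    static final String RULES =
        "@prefix ex: <http://example.org/plant#> .\n" +
        "[deviceState:\n" +
        "  (?tm ex:measuredOn ?device) (?tm ex:quantity ex:Temperature)\n" +
        "  (?tm ex:timestamp ?ts)      (?tm ex:value ?t)\n" +
        "  (?pm ex:measuredOn ?device) (?pm ex:quantity ex:Pressure)\n" +
        "  (?pm ex:timestamp ?ts)      (?pm ex:value ?p)\n" +
        "  makeTemp(?state)\n" +
        "  ->\n" +
        "  (?state ex:ofDevice ?device) (?state ex:atTime ?ts)\n" +
        "  (?state ex:temperature ?t)   (?state ex:pressure ?p)\n" +
        "]";

    /** Wrap the temporary result graph in an inference model that adds the derived device states. */
    public static InfModel withDeviceStates(Model temporaryGraph) {
        List<Rule> rules = Rule.parseRules(RULES);
        return ModelFactory.createInfModel(new GenericRuleReasoner(rules), temporaryGraph);
    }
}
```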
4. Data sources can be not only storage systems, but also services. In our example, we hid behind a service the computation of prerequisites for an emergency condition, using one of the Spark MLlib algorithms. The service receives information about the state of a device as input and evaluates whether the prerequisites for an accident are present (retrospective data on the conditions that preceded actual accidents is used for training; one should take into account not only the instantaneous values of the device's physical characteristics, but also master data elements, such as the degree of wear).
This capability is very important, as it allows the analyst to launch computational modules prepared by programmers, passing data arrays to them as input. In this case, the analyst no longer works with the original data, but with the results of calculations based on it.
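A very rough sketch of what such a service might do with the Spark MLlib Java API is shown below. The choice of logistic regression, the feature set (temperature, pressure, degree of wear), the column names, and the HDFS paths are all assumptions for illustration, not the actual model used in the prototype.

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AccidentPrecursorService {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("accident-precursor-demo").master("local[*]").getOrCreate();

        // Retrospective device states labelled with whether an accident followed (assumed layout and path).
        Dataset<Row> history = spark.read().parquet("hdfs:///demo/device_state_history");

        VectorAssembler features = new VectorAssembler()
                .setInputCols(new String[]{"temperature", "pressure", "wear_degree"})
                .setOutputCol("features");
        LogisticRegression lr = new LogisticRegression()
                .setLabelCol("accident_followed")
                .setFeaturesCol("features");

        PipelineModel model = new Pipeline()
                .setStages(new PipelineStage[]{features, lr})
                .fit(history);

        // Scoring: the data mart passes the current device states to the service and gets back,
        // for each of them, a probability that accident prerequisites are present.
        Dataset<Row> currentStates = spark.read().parquet("hdfs:///demo/current_device_states");
        model.transform(currentStates)
             .select("device_id", "probability")
             .show();
    }
}
```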
5. The analyst builds queries using the interfaces of our Knowledge Management System, which include several variants of a formal query builder as well as a search interface in controlled natural language. The following figure shows, on the left, the form for building a query in the controlled language, and on the right, an example of the results of another query.

Of course, every barrel of honey comes with its fly in the ointment: the architecture described above will not be particularly fast on really big data. On the other hand, when the analyst works in "free search" mode, speed is usually not critical for him; in any case, the mart will produce results much faster than the programmer to whom, without it, the analyst would have to turn for the manual execution of each of his queries.
Many interesting topics remained outside the scope of this story:
- how the collection of sensor data into HBase is organized using Flume;
- queries to data sources can be executed not only asynchronously, but even without an online connection to them; in that case a special mechanism is used for sending the query and receiving the response;
- query results can not only be shown to the user as a table or exported to Excel, but also fed directly into a BI system as a data set for further analysis;
- how identifiers and object references are translated between different sources, the issues of message transport between system components, and much more.
Of course, real industrial applications of logical data marts are much more complicated than the example described here. Our goal in this article was to demonstrate that using semantic technologies and conceptual modeling together with Big Data tools broadens the range of data available to analysts and makes it possible to solve applied analysis problems even under the most unusual conditions. Combined with the ability to control access rights to the triple store and to execute triggers on it, which we have described previously, the tools presented here make it possible to meet very sophisticated functional requirements.