
The Glitter and Poverty of Big Data

The revolution associated with the analysis of big data has brought not only remarkable achievements but also certain difficulties, including methodological ones. Let us consider some of them in more detail.



When speaking about the analysis of big data [1], it is often overlooked that this label covers several tasks that are completely different in nature. Here we will touch on only a few. The first type comprises tasks such as the complete, rather than partial, sequencing of the genome of each individual person. It is clear that solving such problems gives rise to revolutions in the relevant fields.



For example, total genome sequencing is revolutionizing medicine. Perhaps these revolutions are not unfolding as quickly as Steve Jobs hoped, but they are nevertheless inevitable. The second type of task, in which the processing of samples is replaced by the famous BIG DATA formula (N = ALL), involves processing all data of a given kind, for example for forecasting purposes.



Here the results, while still revolutionary to some extent, fade somewhat. For example, if, instead of the Gallup Institute's sample polls on the eve of a US presidential election, one surveyed all US voters, the accuracy of the forecast would increase, but probably only slightly. The third type of task is of particular interest: the total analysis of semi-structured data. The simplest variant of such weak structuring is fragmentary structuring. We can illustrate fragmentary structuring with data containing the results of psychological research on a specific topic, gathered from all the different kinds of surveys on that topic available on the web. The problem of knowledge extraction that arises here is fundamental in nature and therefore deserves special attention.


As is well known, modern science, originally Western and now worldwide, arose from the recognition of new intelligible entities: tables of the "object-feature" type [2]. The analysis of semi-structured data for the purpose of knowledge extraction does not reduce directly to the analysis of such tables. However, since the creation of a new fundamental science is not foreseen in the foreseeable future, the only way out of this situation is to reduce such non-tabular data to tabular form one way or another. This is, of course, understood to some extent by the theorists of BIG DATA and is expressed in their key thesis "the more data, the lower its accuracy". Thus, BIG DATA paints a vast panorama, but one can see this panorama only as if through fogged glass. In other words, there is a certain informational analogue of the Heisenberg uncertainty relation. The optimistic claim of specialists that the big data revolution will replace the establishment of causality with the simple counting of correlations is doubly wrong.



First, science, strictly speaking, has never set itself the task of answering the question "why", that is, of establishing causality; it has been content, on the basis of "the laws of nature", with statements of the form "if this, then that", which are essentially correlations.



Second, correlations, even when estimated from all the data, no matter how colossal N may be, may reflect the real relationship poorly because of the inevitably low accuracy of the data. This raises two problems: the first is to minimize the loss of accuracy when aggregating semi-structured data, and the second is to increase the efficiency of extracting knowledge from these inaccurate data.
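The point that sheer volume does not compensate for low data accuracy can be illustrated with the classical attenuation of correlation by measurement error. The sketch below is not from the original article; it is a minimal simulation, with arbitrary noise and correlation values chosen for illustration, showing that even with a million observations the observed correlation stays far below the true one.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1_000_000          # a "colossal N" -- size alone does not help against noise
true_r = 0.8           # true correlation between the latent quantities
noise_sd = 1.5         # measurement noise, i.e. low data accuracy

# Latent (true) variables with correlation true_r
x = rng.normal(size=n)
y = true_r * x + np.sqrt(1 - true_r**2) * rng.normal(size=n)

# Observed variables: latent values distorted by measurement error
x_obs = x + noise_sd * rng.normal(size=n)
y_obs = y + noise_sd * rng.normal(size=n)

r_obs = np.corrcoef(x_obs, y_obs)[0, 1]

# Classical attenuation formula: r_obs ~ true_r * sqrt(rel_x * rel_y),
# where reliability rel = var(latent) / var(observed) = 1 / (1 + noise_sd**2).
# Both reliabilities are equal here, so sqrt(rel_x * rel_y) = rel.
rel = 1 / (1 + noise_sd**2)
print(f"true r = {true_r:.2f}, observed r = {r_obs:.3f}, "
      f"predicted attenuated r = {true_r * rel:.3f}")
```

With this noise level the observed correlation is about 0.25 instead of 0.8, and increasing N only makes the estimate of that attenuated value more precise, not closer to the true relationship.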



Let us start with the first problem. Given the very nature of "object-feature" tables, the data must first be organized into rubrics, with each rubric mapped to a specific "object". The content of these rubrics may differ in volume and character, but each must possess an inherent quality that allows the data to be attributed to its "object". It often turns out that, in addition to these primary rubrics and primary objects, secondary internal rubrics and, accordingly, secondary objects must be introduced. In order not to complicate the presentation, we omit the frequently arising need to synthesize constructed objects from several secondary objects. Let us illustrate this with the aforementioned example of psychological research: here the primary rubrics contain the data of individual studies (surveys), while individual completed questionnaires serve as secondary objects. We now turn to the question of features. In our view, much here is determined by the specifics of the field to which the data belong and by the task facing the researcher. For example, in our psychological research it is usually necessary to build certain integral features of the primary objects; averaging these features over all the primary objects, or over some clusters of them, yields the required "knowledge".
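A minimal sketch of this rubrication, assuming hypothetical record types (`Survey` as a primary object, `Questionnaire` as a secondary object) and an illustrative integral feature (the mean answer score), might look as follows; the actual choice of features would, as the text says, depend on the field and the research task.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Questionnaire:            # secondary object: one completed questionnaire
    answers: dict[str, float]   # item -> score on some scale

@dataclass
class Survey:                   # primary object: one individual study (rubric)
    topic: str
    questionnaires: list[Questionnaire]

def integral_feature(survey: Survey) -> float:
    """An illustrative integral feature of a primary object:
    the mean score over all items of all its questionnaires."""
    scores = [v for q in survey.questionnaires for v in q.answers.values()]
    return mean(scores) if scores else float("nan")

def knowledge(surveys: list[Survey]) -> float:
    """Averaging the integral feature over all primary objects
    (or, in a fuller treatment, over clusters of them)."""
    return mean(integral_feature(s) for s in surveys)

# Toy usage: two surveys on the same topic gathered from the web.
surveys = [
    Survey("anxiety", [Questionnaire({"q1": 3.0, "q2": 4.0}),
                       Questionnaire({"q1": 2.0, "q2": 5.0})]),
    Survey("anxiety", [Questionnaire({"q1": 4.0, "q2": 4.0})]),
]
print(knowledge(surveys))   # mean of 3.5 and 4.0 -> 3.75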



As for the second problem, although a large N is a factor that facilitates reliable decision making, the inaccuracy of the data, which in many tasks grows along with N, points to the need to develop decision-making algorithms more powerful than those DATA MINING can offer today.



In our opinion, it is this third type of task, associated with weakly structured big data, that truly deserves the name BIG DATA, since it is here that qualitatively new data processing must be created, rather than merely exploiting the growing power of computers.



REFERENCES



1. Viktor Mayer-Schönberger, Kenneth Cukier. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Moscow: Mann, Ivanov and Ferber, 2013.

2. Michel Foucault. The Order of Things: An Archaeology of the Human Sciences. Moscow: Progress, 1977.

Source: https://habr.com/ru/post/280474/
