
Big Data vs Data Mining

Lately, both within my team and outside it, I keep running into diverging interpretations of the concepts “Big Data” and “Data Mining”. Because of this, misunderstanding grows between the Contractor and the Customer about the technologies being proposed and the result both parties expect.
The situation is aggravated by the absence of clear definitions from any generally accepted standards body, as well as by the difference in the expected cost of the work in the eyes of a potential buyer.

There is an opinion on the market that “Data Mining” is when the Contractor is handed a data dump, finds a couple of trends in it, generates a report and collects his million rubles. With “Big Data” everything is far more interesting: people assume it is some kind of black magic, and magic is expensive.

The objectives of this article are to show that there is no significant difference between the interpretations of these concepts, and to clear up the main blind spots in the understanding of the subject.

What is Big Data?


This is what Wikipedia gives us at ru.wikipedia.org/wiki/Big_Data:
Big data (English: “big data”) in information technology is a series of approaches, tools and methods for processing structured and unstructured data of huge volume and considerable diversity in order to obtain results perceptible to humans; these methods are effective under continuous data growth and distribution across numerous nodes of a computer network, took shape in the late 2000s, and are an alternative to traditional database management systems and Business Intelligence solutions.

What do we see? A definition that ought to describe an object (a big bicycle, a small tree, a scooter, etc.) in fact describes a set of methods and goals, that is, a range of processes. Can we accept such a definition, any more than we could agree to call jogging (a process) a kettle (an object)? Hard to say; let us try to decompose the definition.

Big Data is:

- a series of approaches, tools and methods;
- for processing structured and unstructured data;
- of huge volume and considerable diversity;
- in order to obtain results perceptible to humans.
In this decomposition it is unclear what exactly is meant by:

- unstructured data;
- huge volume (size).
The tasks solved by Big Data methods include:

- collection of data;
- storage of data;
- search in and analysis of data.
Unstructured data

This is what Wikipedia gives us at en.wikipedia.org/wiki/Unstructured_data:

Unstructured data (or unstructured information) refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

In other words, they are trying to tell us that data without structure exists... and the killer example they give of such data is text. I wonder what my teacher of Russian language and literature would say if she learned that Russian text has no clear structure, and that the years spent studying it were therefore pointless, since we were learning rules which, according to some people, do not exist.

To understand my point of view, consider an example: a text field in Postgres into which I put JSON. For version 8 it is just text (unstructured data); for version 9 (the json type appeared in 9.2) it is already JSON (structured data). So the same data is structured and unstructured at the same time? Unthinkable dualism again, as in physics lectures? The answer is simple: unstructured data does not exist; it is just that for some types of data there are no generally accepted, widespread methods of working with them.
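To make this concrete, here is a minimal Python sketch (the column value and field names are made up): the same string is “unstructured” only until some tool interprets its structure.

```python
import json

# Hypothetical value read from a Postgres text column.
raw = '{"user": "alice", "age": 30}'

# Treated as plain text (the version-8 situation): an opaque string;
# the only "structure" we can see is its length.
print(len(raw))  # 28

# Treated as JSON (what the json type does since 9.2): the very same
# bytes suddenly expose fields, types and nesting.
doc = json.loads(raw)
print(doc["user"], doc["age"])  # alice 30
```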

An astute reader will, of course, object: what about video data? Any video is a sequence of frames, and any frame is an image. Images come in two types:

- vector;
- raster.
It is extremely difficult to call vector images unstructured: read up, at the very least, on the SVG format, which is essentially XML. Raster images, in turn, are an array of pixels, each of which is described by a perfectly clear data structure.
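As a minimal sketch of that last claim (toy values, no particular image format assumed):

```python
# A toy 2x2 RGB "raster image": a grid in which every pixel is a
# fixed record of three channel values -- a perfectly clear structure.
image = [
    [(255, 0, 0), (0, 255, 0)],      # row 0: red, green
    [(0, 0, 255), (255, 255, 255)],  # row 1: blue, white
]

height, width = len(image), len(image[0])
r, g, b = image[0][1]                # pixel at row 0, column 1
print(height, width, (r, g, b))      # 2 2 (0, 255, 0)
```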

Bottom line: unstructured data does not exist.

Huge size

Here I have no disagreement with public opinion. As soon as the data starts causing you problems (hard to receive, hard to store, hard to process, etc.), you have a huge size (of data). The notion is rather subjective; for me, huge size is measured in number of records, and my lower bound for Big Data is a million of them. Justification: queries to the DBMS with complexity on the order of Θ(n²) take several minutes at that volume, which is too long for me.

For other people the rationale / criterion may be different, and so their lower bound for “huge size” will differ too.
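A back-of-the-envelope check of that justification (the throughput figure below is an assumption, not a benchmark): at a million records, a quadratic plan lands in the minutes-to-tens-of-minutes range.

```python
# Why a Θ(n^2) query hurts at a million records: a nested-loop or
# unindexed self-join plan touches ~n^2 row pairs.
n = 1_000_000             # the author's lower bound for "Big Data"
ops = n ** 2              # ~10^12 pairwise operations
ops_per_sec = 1e9         # assumed: ~a billion simple ops/sec per core
print(ops / ops_per_sec)  # 1000.0 seconds, i.e. about 17 minutes
```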

What is Data Mining?


This is what Wikipedia gives us at ru.wikipedia.org/wiki/Data_mining:

Data Mining (in Russian usage: data extraction, intelligent data analysis, in-depth data analysis) is a collective name used to denote a set of methods for discovering in data previously unknown, non-trivial, practically useful and interpretable knowledge needed for decision-making in various spheres of human activity. The term was introduced by Gregory Piatetsky-Shapiro in 1989.

Translated into plain language: you already have a data array that has been processed somehow before; you now process it again, perhaps differently than before, and obtain useful conclusions that you then use to turn a profit.
It turns out that, by the Wikipedia definition, the decomposition of “Data Mining” includes:

- a set of methods for discovering knowledge in data;
- that knowledge being previously unknown, non-trivial, practically useful and interpretable;
- that knowledge being needed for decision-making.
The tasks solved by Data Mining methods include (one of them is sketched right after this list):

- classification;
- clustering;
- regression;
- search for association rules;
- anomaly detection.
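As an illustration, here is a minimal Python sketch of the association-rule flavor of Data Mining on hypothetical purchase data: counting which pairs of items are bought together most often.

```python
from collections import Counter

# Hypothetical shopping baskets -- the "already collected" data array.
purchases = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
]

# Count co-occurring item pairs; the most frequent pair is a
# "previously unknown, practically useful" trend in miniature.
pairs = Counter(
    frozenset((a, b))
    for basket in purchases
    for a in basket
    for b in basket
    if a < b
)
print(pairs.most_common(1))  # e.g. [(frozenset({'bread', 'butter'}), 3)]
```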
Conclusions


Judging by the definitions given above, Data Mining “beats” Big Data thanks to its more democratic attitude to data volume: it imposes no lower bound on size.

Judging by the lists of tasks solved by Big Data and Data Mining methods, Big Data “wins” in its turn, because it also covers the problems of data collection and storage.

Thus, if we accept that analyzing small volumes of data is hardly worthwhile in the first place, the meaning of the concept Data Mining is entirely contained in the meaning of the concept Big Data. Consequently, those who say that a given task is merely “Data Mining” and not the magical “Big Data” are saying, in effect, “this is not a bird, it is just a dove”; since every dove is a bird, that statement is false from the standpoint of the formal logic we all respect so much.

As for the price: in both fields an identical stack of technologies, tools and methods is applied to the overlapping tasks, so the cost of the work should be of the same order as well.

In conclusion, it is worth adding that many people try to tell these concepts apart from each other and from other concepts (for example, from highload tasks, as the author did at habrahabr.ru/company/beeline/blog/218669) by the software stack: say, if an RDBMS is used, then it is already 100% not Big Data.

I cannot agree with this point of view: modern RDBMSs operate on impressive volumes of data and can store almost any type of data; properly indexed, that data aggregates quickly and is delivered to the application level, and nothing stops you from writing your own indexing mechanism.
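A minimal sketch of that last idea, with made-up records and no particular storage engine assumed: a hand-rolled index turns a full scan into a cheap lookup.

```python
from collections import defaultdict

# Hypothetical rows, as they might come out of any storage engine.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bob",   "action": "view"},
    {"user": "alice", "action": "view"},
]

# A hand-rolled "indexing mechanism": map each user to its row ids,
# built in one pass over the data.
index = defaultdict(list)
for row_id, event in enumerate(events):
    index[event["user"]].append(row_id)

# Aggregation via the index touches only the matching rows instead of
# scanning everything.
alice_actions = [events[i]["action"] for i in index["alice"]]
print(alice_actions)  # ['click', 'view']
```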

In general, classifying a class of tasks by its software and hardware stack is wrong: any unique task requires a unique approach, assembled from whatever tools are most effective for that particular task.

Source: https://habr.com/ru/post/267827/

