
Big Data - why is it so fashionable?

Big Data technologies are very popular today: at the moment it is one of the most frequently used terms in IT publications. Just look at the search statistics for the phrase “Big Data” in well-known search engines such as Google or Yandex, and it becomes clear that Big Data can rightfully be called one of the most popular and interesting areas of information technology development.

So what is the secret of the popularity of these technologies, and what does the term Big Data actually mean?
First of all, the term “Big Data” refers to a huge amount of information, so much that processing it with conventional software and hardware becomes extremely difficult. In other words, Big Data is a problem: the problem of storing and processing huge amounts of data.

Where do these volumes come from? Think about how much information each of us generates every day. We talk on the phone, write messages and blog posts, buy things, take photos, send them to friends, receive something in response, and so on. In the end, we are talking about gigabytes of information. All of it leaves a trace in the information space; all of it is stored somewhere and processed somehow. At some point there is simply too much information, and extracting anything useful from it becomes too difficult.
On the other hand, the sheer volume of information is only the tip of the iceberg. It is worth recalling the classic definition of Big Data through “Volume, Velocity, Variety”, which means, on the one hand, huge amounts of data (which we have already mentioned) and, on the other, the need to work with information very quickly. For example, checking the balance on your card when withdrawing cash takes milliseconds: these are the requirements the market dictates. The third side of the issue is variety and lack of structure: increasingly, we have to operate on media content, blog entries, poorly structured documents, and so on.

Thus, when we talk about Big Data, we understand that it involves three aspects: a large volume of information, its variety, and the need to process the data very quickly.

On the other hand, the term often refers to a very specific set of approaches and technologies designed to solve these problems. One such approach is based on distributed computing, where data is processed not by a single high-performance machine but by a whole group of such machines combined into a cluster.

There are several approaches to building systems for distributed data processing. One of the most popular is the MapReduce paradigm, in which data processing is divided into a large number of elementary tasks executed on different nodes of the cluster, whose partial results are ultimately reduced to a single final result. This model was developed at Google and allows petabytes of data to be processed on clusters of computers.
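To make the idea concrete, here is a minimal sketch of the paradigm in plain Python. The function names and the in-process "shuffle" are illustrative, not Hadoop's actual API; on a real cluster the framework runs the map and reduce phases on different nodes and moves the intermediate pairs between them.

```python
# A toy MapReduce word count: map emits key/value pairs, a shuffle
# groups them by key, and reduce folds each group into one result.
from collections import defaultdict

def map_phase(document: str):
    """Map: split one chunk of text into (word, 1) pairs."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key (on a real cluster
    the framework does this, routing pairs between nodes)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: fold all values for one key into a final count."""
    return key, sum(values)

# Each node would process its own chunk; here we simulate two chunks.
chunks = ["big data is big", "data about data"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

The appeal of the model is that map and reduce tasks are independent of each other, so the framework can spread them across as many machines as are available.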

Today a fairly large number of different projects use the MapReduce paradigm, and interest in distributed computing technologies grows higher every day. Among the existing implementations of this model, the Hadoop project, currently managed by the Apache Software Foundation, stands out.

The Hadoop project has been in development since 2005 and is used all over the world, for example at such giants as Amazon, Google, and Facebook, and it is rapidly gaining popularity in Russia as well. It is worth noting that Hadoop is not the only implementation of the MapReduce paradigm: on Wikipedia you can find links to at least 15 projects that use this approach in one way or another.

But what explains this popularity, and what exactly does such an approach give us? The main advantage of distributed systems is the ability to increase performance almost without limit through linear scaling. In addition, a high-performance cluster can be built from low-end machines, which means its cost will be significantly lower than that of a single server of comparable capacity. The third important point, in my view, is the reliability and fault tolerance of such a system. Because the cluster consists of a large number of nodes and the system automatically redistributes the data stored on them, the risk of losing any information when one or several machines fail is minimized.
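To illustrate the fault-tolerance point, here is a toy Python sketch of block replication. The node names and the round-robin placement scheme are invented for the example; real systems use more elaborate placement strategies, though HDFS, for instance, does default to three replicas per block.

```python
# Why replication makes a cluster fault-tolerant: each block lives on
# several nodes, so losing one node leaves every block readable.
import itertools

NODES = ["node1", "node2", "node3", "node4"]
REPLICATION = 3  # e.g. HDFS defaults to 3 copies of each block

blocks = ["block-a", "block-b", "block-c", "block-d"]
placement = {}
ring = itertools.cycle(NODES)
for block in blocks:
    # Place each block on REPLICATION distinct nodes, round-robin style.
    placement[block] = [next(ring) for _ in range(REPLICATION)]

def survives(failed_node: str) -> bool:
    """True if every block still has at least one live replica."""
    return all(any(n != failed_node for n in replicas)
               for replicas in placement.values())

print(placement)
print(survives("node2"))  # True: no block lived only on node2
```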

It would be wrong of me to ignore such an area as NoSQL databases. In my opinion, this is the second significant trend that is often associated with Big Data technologies.

And not without reason. NoSQL databases have been growing rapidly because processing large amounts of data with the traditionally popular relational databases is becoming an increasingly complex and resource-intensive task. At the same time, the cost of the hardware required to solve such problems with relational databases increasingly calls into question the cost-effectiveness of this approach.

The term NoSQL itself implies approaches that differ from the familiar relational database management systems built on ACID principles. The methodological basis of NoSQL databases is the observation, known as the CAP theorem, that a distributed system cannot simultaneously guarantee data consistency, availability, and tolerance to partitioning into isolated parts.
For this reason, such databases sacrifice strict consistency of data in favor of high availability and partition tolerance. Most of them are organized on the “key-value” principle, which gives high flexibility and speed of information retrieval. On the other hand, because of this inertia, NoSQL databases are, as a rule, not used for processing rapidly changing data. But when you need to quickly retrieve a small piece of information from a huge volume of data, such a solution is a godsend.
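As a rough illustration of the key-value pattern, here is a deliberately simplified, in-memory Python sketch; the class, keys, and record layout are invented for the example and do not reflect the API of any particular NoSQL store.

```python
# The key-value access pattern behind many NoSQL stores: a value is
# fetched by its key in a single hash lookup, with no table scans
# or joins; the key alone determines where the value lives.
class KeyValueStore:
    def __init__(self):
        self._data = {}  # a real store shards this map across nodes

    def put(self, key: str, value: dict) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)  # O(1) on average

store = KeyValueStore()
store.put("user:42", {"name": "Alice", "balance": 100})
# Retrieving one small record stays fast no matter how large the
# dataset grows, which is exactly the "godsend" case described above.
print(store.get("user:42"))  # {'name': 'Alice', 'balance': 100}
```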

Special attention should be paid to the fact that such DBMSs can be either distributed or non-distributed. Examples of distributed NoSQL databases include Cassandra, HBase, MongoDB, CouchDB, Riak, Scalaris, and Voldemort. As for non-distributed ones, in my opinion, using a non-distributed NoSQL DBMS is a half measure: it forgoes the additional performance gain that scaling out in a distributed system provides. Although it cannot be denied that such solutions have their own uses.

It remains to understand who might be interested in such solutions for processing large amounts of data in the first place. Naturally, interest arises in those industries where such volumes exist and where difficulties in processing them have already appeared. Examples include Internet search engines, social networks, online auctions, and so on. Outside the Internet environment, these may be banks and telecommunications companies, which, in the experience of the experts at DIS Group, face the greatest difficulties in processing the huge amounts of data accumulated over decades.

Let's summarize: what can attract a corporate user to Big Data technologies, and what can distributed computing offer businesses?

First of all, it offers a high-performance system, proven at such Internet-industry giants as Google, Yahoo, and Facebook, capable of operating on terabytes of data in real time. With such solutions, the time-consuming construction of data warehouses, the difficulty of processing poorly structured information, and high hardware costs become a thing of the past, and businesses can fully use all the data accumulated in the company.

On the other hand, because inexpensive equipment can be used to build the system, these solutions allow financing to be redistributed and cash flow to be directed toward the immediate needs of the business rather than toward maintaining the organization's own infrastructure.

Source: https://habr.com/ru/post/160219/

