
You're Misunderstanding Hadoop

- We receive more than a million tweets per day, and our server simply can't keep up with processing them. So we want to set up a Hadoop cluster and distribute the processing.



The conversation was about computationally heavy sentiment analysis, so I could believe that a single server really didn't have enough CPU to cope with a large stream of tweets.



- And what are you going to do with the data once it has been processed?
- Most likely we will put it into MySQL, as we did before, or simply delete it.
- Then you definitely don't need Hadoop.



My former colleague was not the first to bring up distributed computing on Hadoop. And every time, I saw a complete misunderstanding of what this platform was invented and developed for.





Generally speaking, there are several typical data-processing situations, and each of them calls for different approaches and tools. Let's consider the main ones.



Point processing of active data


The most common scenario we all run into is storing "active" data: information about users, a list of products, comments on articles, and so on; everything that can change frequently. In this case we want to process the data point by point: retrieve the desired object by index, process it, and write it back. This is exactly the functionality that most DBMSs, both relational and NoSQL, provide.



When scaling, the main problem is the total amount of data stored in the database. However, even if the DBMS in use does not support a distributed setup, the problem is easily solved by partitioning (sharding) at the application level.
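A minimal sketch of application-level partitioning, assuming a hash-based key-to-shard mapping; the shard connection strings and the `users` table are hypothetical and only serve to illustrate the idea:

```python
# Application-level sharding sketch: the application picks the shard by
# hashing the key, so no distributed-DBMS support is required.
import hashlib

# Hypothetical shard connection strings.
SHARDS = [
    "postgresql://db-shard-0/app",
    "postgresql://db-shard-1/app",
    "postgresql://db-shard-2/app",
]

def shard_for(key: str) -> str:
    """Map a key to one of the shards deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The query itself stays point-wise: fetch one object by its key,
# process it, and write it back to the same shard, e.g.:
#   conn = connect(shard_for("user:42"))
#   row = conn.execute("SELECT * FROM users WHERE id = %s", ["user:42"])
```

As long as every lookup goes through `shard_for`, each shard only ever sees its own slice of the keys, and the application can grow simply by adding shards (with a one-time rebalancing of existing keys).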



Real-time stream processing


Sometimes, however, the emphasis is not on storage but on processing. This is exactly the situation my former colleague working on sentiment analysis was facing. He needed to receive tweets in real time, analyze them, and display the result on a dynamically updated chart. A million tweets per day was not a problem in itself: after all, that is no more than 140 million characters, or about 280 MB in UTF-16 encoding. The problem was analyzing those tweets in real time.



Fortunately, most streaming algorithms (whether sentiment analysis of tweets, accumulation of running statistics, or online machine learning) operate on small, independent pieces of data. This makes it easy to parallelize the processing by simply adding more compute nodes and putting a load balancer in front of them.
In the simplest case, a message broker (such as RabbitMQ or ZeroMQ) can act as the balancer; in more complex setups you can use ready-made stream-processing frameworks such as Storm. Either way, the code that actually processes the data stays practically unchanged compared to the single-server version.
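A minimal sketch of such a worker, assuming a RabbitMQ broker on localhost and a queue named "tweets" (both are assumptions for illustration, not from the article); the `analyze_sentiment` stub stands in for the actual CPU-heavy model. Requires the `pika` client library:

```python
import json
import pika

def analyze_sentiment(text: str) -> float:
    """Hypothetical stub for the real, CPU-heavy sentiment model."""
    return 0.0

def on_message(channel, method, properties, body):
    tweet = json.loads(body)
    score = analyze_sentiment(tweet["text"])
    # ... push `score` to the dashboard / metrics store ...
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tweets", durable=True)
channel.basic_qos(prefetch_count=1)   # the broker load-balances across workers
channel.basic_consume(queue="tweets", on_message_callback=on_message)
channel.start_consuming()
```

To scale out, you simply start more copies of this same process: the broker hands each message to exactly one idle worker, and the processing code itself never changes.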



Batch processing of historical data


Besides active and streaming data, there is one more important kind: historical data, i.e. data that was generated once and is unlikely to ever change. This includes event logs, financial indicators, document indices for a given day, and in general any data tied to some point in the past. Most often such data accumulates in large volumes and is later used by analysts to solve business problems. A distinctive feature here is that by the time of processing, the data has already been collected and spread across many servers (if the data fits on one server, it simply isn't that big).



Imagine we have data on all purchases in a large supermarket chain for the past six months. We want to analyze this data: compute the average efficiency of each supermarket, the effect of promotions, the correlation between purchased products, and many other metrics. How do we organize the work so that computing these metrics takes a reasonable amount of time?



We could load all the data into a distributed Oracle database and work with it the same way as with active data. But then the application server would sequentially pull records from the database and process them one by one, which is extremely inefficient.



We could also set up a stream-processing pipeline, distributing the load across application servers. But sooner or later we would hit the bandwidth limit of the communication channel between the processing nodes and the data nodes.



The only way to spread the load without saturating the communication channel is to minimize data movement between nodes. And for that, as much computation as possible must be performed locally, on the machines where the data being processed actually resides. It is this principle of data locality that underlies the MapReduce paradigm and all of Hadoop.



More about MapReduce


MapReduce jobs are commonly described as consisting of two phases: the map phase and the reduce phase. In reality things are a bit more involved: the data is first split into chunks (splits), the map function is applied to each of them, the results are then sorted, combined, sorted again, and finally passed to the reduce function. Nevertheless, it is map and reduce that capture the main idea of the paradigm.



All the work that can be done locally is done in the map phase. For example, locally you can count the words in a line, compute a metric for a single record of a CSV file, or analyze a single tweet. Map gives you transparent parallelization and a minimal load on the communication channel.
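For the supermarket example above, a Hadoop Streaming mapper might look like the sketch below. The CSV layout (store_id, timestamp, product_id, amount) is an assumption made for illustration:

```python
#!/usr/bin/env python3
# Hadoop Streaming mapper sketch: emit (store_id, purchase amount) pairs.
# Runs locally on the node that holds the input split; no data leaves
# the node at this stage.
import csv
import sys

for row in csv.reader(sys.stdin):
    store_id, _timestamp, _product_id, amount = row
    print(f"{store_id}\t{amount}")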



But a local map alone is not enough for most tasks: to count the words in a whole text, you need to add up the counts from each line, and to compute the average value of a metric, you need to combine the results for every record in the CSV file. That is exactly the job of the reduce function. At the same time, the amount of data sent over the network is significantly reduced (the combiner function, which acts as a local reduce, helps a lot here too).
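Continuing the hypothetical supermarket example, a matching Hadoop Streaming reducer could average the purchase amount per store. Hadoop delivers the mapper output sorted by key, so all records for one store arrive contiguously:

```python
#!/usr/bin/env python3
# Hadoop Streaming reducer sketch: average purchase amount per store.
import sys
from itertools import groupby

def keyed_lines(stream):
    for line in stream:
        key, value = line.rstrip("\n").split("\t", 1)
        yield key, float(value)

for store_id, group in groupby(keyed_lines(sys.stdin), key=lambda kv: kv[0]):
    amounts = [value for _, value in group]
    print(f"{store_id}\t{sum(amounts) / len(amounts):.2f}")
```

Note that for an average a combiner cannot simply emit local averages (a mean of means is wrong when groups differ in size); it would have to emit partial (sum, count) pairs that the reducer then adds up.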



So why was my colleague wrong in wanting to use Hadoop for sentiment analysis of tweets? After all, as noted above, each tweet can be analyzed separately in the map phase, ignoring the reduce phase entirely! The answer lies in the infrastructure. First, the tweets would still have to be transported to the compute nodes, which means losing the advantage of data locality. Second, Hadoop is poorly suited for online processing: it works in batches, so you would have to collect the tweets first and only then launch the MapReduce job. Even if you set up the map task to read tweets from the source constantly and endlessly, after a while Hadoop would kill the whole job as failed. And third, if you hear that Hadoop is fast, remember that its performance comes from minimizing data movement, while the MapReduce jobs themselves, especially on small amounts of data, can take quite a long time because of overhead (JVM startup, backup (speculative) tasks, writing intermediate results to disk, and so on).



So use the right tools for the right tasks, and you will be happy!

Source: https://habr.com/ru/post/194314/

