
Hadoop: to be or not to be?

Hello, dear readers!

Some time ago we published a translation of the seminal O'Reilly book on the Hadoop framework.


Now the editors face a difficult choice: translate the new 4th edition of this book, or reprint the existing one.

So we decided to publish a translation of an article by Ananda Krishnaswami that appeared on the Thoughtworks blog back in 2013, in which the author analyzes when it is appropriate to use Hadoop and when it is not.

We hope you find the material interesting, that it sparks some debate, and that you will share your impressions of working with Hadoop and take part in the survey.



Hadoop is often positioned as a universal framework that will help your organization deal decisively with any problem. Just mention "big data" or "analytics" and the answer comes right back: "Hadoop!" However, the framework was designed to solve a fairly specific class of problems; for other cases it is, to put it mildly, a poor fit, and sometimes using Hadoop is an outright mistake. Data transformation (more broadly, ETL operations: extract, transform, load) is indeed greatly accelerated by Hadoop, but if your business has even one of the five properties listed below, then Hadoop is probably not for you.

1. Thirst for big data

Many companies are inclined to believe that the data at their disposal qualifies as "big," but in most cases this estimate is, unfortunately, inflated. The research paper "Nobody Ever Got Fired For Buying a Cluster" gives a sense of how much data is actually involved in typical jobs. Its authors note that Hadoop was built to process tera- and petabyte volumes of data, whereas in most practical tasks the input does not exceed 100 GB (the median job size at Microsoft and Yahoo is under 14 GB, and 90% of Facebook jobs are well under 100 GB). Accordingly, the authors suggest it is often more sensible to run such jobs on a single server and scale it up occasionally, rather than scale out onto the horizontally distributed infrastructure that Hadoop requires.

Ask yourself:

• Do we have several terabytes of data or more?
• Do we have a steady, high-volume inflow of data?
• How much of that data will we actually operate on?

2. Waiting in line

The minimum latency for a submitted Hadoop job is on the order of a minute. So if the system uses Hadoop to react to a customer's order and produce recommendations, it will take a minute or more to respond, and only a very loyal and patient customer will stare at the screen for 60+ seconds waiting. The alternative is to pre-compute related items for every item in the catalog ahead of time (offline, using Hadoop) and give the website or mobile application near-instant, sub-second access to the stored result. Hadoop is an excellent engine for this kind of pre-computation over big data. Of course, the more dynamic such a response needs to be, the less effective and complete pre-computing the results becomes.
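For illustration, here is a minimal sketch of such an offline pre-computation in the Hadoop Streaming style (plain Python; the input layout, file names, and field format are assumptions, not anything from the original article). The mapper emits pairs of items viewed in the same session, and the reducer counts them:

```python
#!/usr/bin/env python
"""Sketch of a Hadoop Streaming job that pre-computes "related items".

Assumption (not from the article): each input line is already grouped per
session as "session_id<TAB>item1,item2,...".  Can be tried locally as:

  cat sessions.tsv | python related_items.py map | sort | python related_items.py reduce
"""
import sys
from itertools import combinations, groupby


def mapper():
    # Emit every pair of items seen together in one session with a count of 1.
    for line in sys.stdin:
        if "\t" not in line:
            continue
        _, items = line.rstrip("\n").split("\t", 1)
        for a, b in combinations(sorted(set(items.split(","))), 2):
            print("%s,%s\t1" % (a, b))


def reducer():
    # Hadoop delivers mapper output sorted by key, so identical pairs are adjacent.
    for pair, lines in groupby(sys.stdin, key=lambda l: l.split("\t", 1)[0]):
        total = sum(int(l.split("\t", 1)[1]) for l in lines)
        print("%s\t%d" % (pair, total))


if __name__ == "__main__":
    {"map": mapper, "reduce": reducer}[sys.argv[1]]()
```

The reducer's output (item pair and co-occurrence count) is exactly the kind of table you would compute overnight, load into a fast key-value store, and then serve from the website in milliseconds.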

Ask yourself:

• What response time do users expect from the application?
• Which computations can be done ahead of time, in batch?

3. Your call will be answered in ...

Hadoop is not intended for cases that require a real-time response to requests. Jobs pass through the map and reduce phases and also spend time in the shuffle phase in between, and the duration of these phases is unbounded, which makes building real-time applications on top of Hadoop genuinely difficult. Trading based on the volume-weighted average price is a practical example where the system must respond quickly enough to execute trades.

Analysts cannot do without SQL. Hadoop is not very good at ad-hoc access to data sets (even with Hive, which simply generates MapReduce jobs from your query). Google's Dremel architecture (and, of course, BigQuery) was designed specifically to return spontaneous queries over gigantic row sets within a few seconds, and SQL also lets you express relationships between tables. Other promising alternatives include Shark from the AMPLab at UC Berkeley and the Stinger initiative led by Hortonworks.
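As a rough illustration (the table, partition, and connection URL below are hypothetical), this is the kind of ad-hoc query an analyst would run against Hive; under the hood Hive compiles it into MapReduce jobs, which is why even a modest aggregation rarely comes back in interactive time:

```python
# Sketch: submit an ad-hoc analyst query to Hive via the beeline CLI.
# Table name, partition column and JDBC URL are assumptions for illustration.
import subprocess

QUERY = """
SELECT product_id, COUNT(*) AS views
FROM clickstream
WHERE dt = '2015-04-01'
GROUP BY product_id
ORDER BY views DESC
LIMIT 20;
"""

subprocess.run(
    ["beeline", "-u", "jdbc:hive2://localhost:10000/default", "-e", QUERY],
    check=True,
)
```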

Ask yourself:

• How interactively do users and analysts need to work with my data?
• Is interactivity required with terabytes of data, or with only a small subset of information?

Note also that Hadoop works in batch mode. This means that when new information arrives, the job must re-sift the entire data set, so the analysis takes ever longer. Meanwhile, fragments of data (individual updates or small changes) can arrive in real time, and a business often has to make decisions based on those events. No matter how quickly new data is loaded into the system, Hadoop will still process it in batches. Perhaps YARN will eventually help with this. Twitter's Storm is already a popular and accessible alternative, and combining Storm with a distributed messaging system such as Kafka opens up a range of possibilities for stream aggregation and processing. However, Storm lacks load balancing, which Yahoo's S4 does provide.
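To make the contrast concrete, here is a minimal single-process sketch of stream aggregation over events arriving through Kafka, using the kafka-python client rather than Storm itself; the topic name, broker address, and message format are assumptions, and a real Storm topology would spread this logic across parallel spouts and bolts:

```python
# Sketch: consume click events from Kafka and keep a rolling per-item count,
# flushing the aggregate every 10 seconds.  Illustrative only; not Storm.
import time
from collections import Counter

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "click-events",                      # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumption: local broker
    value_deserializer=lambda b: b.decode("utf-8"),
)

counts = Counter()
last_flush = time.time()

for message in consumer:
    item_id = message.value.strip()      # assumption: one item id per message
    counts[item_id] += 1
    if time.time() - last_flush >= 10:
        # In a real pipeline this window would be written to a store or
        # pushed downstream instead of printed.
        print(counts.most_common(10))
        counts.clear()
        last_flush = time.time()
```

The point is that results update continuously as events arrive, instead of waiting for the next batch run.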

Ask yourself:

• What is the “shelf life” of my data?
• How quickly does my business need to extract value from incoming data?
• How important is it for my business to respond to changes or updates in real time?

Real-time advertising and sensor-data monitoring require streaming input to be processed in real time, and Hadoop and the tools built on top of it are not the only options there. For example, SAP's in-memory HANA database was used, together with MATLAB, in the ATLAS analytics toolkit of the McLaren team during a recent Indy 500 race to run models and react to telemetry as the race unfolded. Many analysts believe that the future of Hadoop itself lies in interactivity and real-time operation.

4. You've just closed your account on your favorite social network

Hadoop, and MapReduce in particular, is best suited to data that can be decomposed into key-value pairs without the risk of losing context or implicit relationships. Graphs are full of implicit relationships (edges, subtrees, parent-child links, weights, and so on), and far from all of them can live on a single node. For that reason most graph algorithms need to process the whole graph, or large parts of it, on each iteration, which is often impossible or very awkward to express in MapReduce; there is also the problem of choosing a strategy for partitioning the data across nodes. If your core data structure is a graph or a network, you are probably better off with a graph database such as Neo4j or Dex, or with newer developments such as Google's Pregel or Apache Giraph.
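To see why per-iteration processing hurts, here is a hedged sketch of a single PageRank superstep expressed as one Hadoop Streaming map/reduce pass (the adjacency-list input format is an assumption). Every iteration has to re-read and re-write the entire graph, which is exactly the overhead that Pregel-style systems avoid by keeping vertex state resident between supersteps:

```python
#!/usr/bin/env python
"""One PageRank superstep as a Hadoop Streaming pass (illustrative sketch).

Assumed input, one vertex per line: "node<TAB>rank<TAB>neighbor1,neighbor2,..."
The reducer's output has the same layout, so it can feed the next iteration,
and each iteration is a full pass over the whole graph.
"""
import sys
from itertools import groupby

DAMPING = 0.85  # the usual damping factor


def mapper():
    for line in sys.stdin:
        node, rank, neighbors = line.rstrip("\n").split("\t")
        targets = neighbors.split(",") if neighbors else []
        # Re-emit the graph structure so the reducer can rebuild it.
        print("%s\tGRAPH\t%s" % (node, neighbors))
        share = float(rank) / len(targets) if targets else 0.0
        for t in targets:
            print("%s\tRANK\t%s" % (t, share))


def reducer():
    for node, lines in groupby(sys.stdin, key=lambda l: l.split("\t", 1)[0]):
        neighbors, incoming = "", 0.0
        for l in lines:
            _, kind, value = l.rstrip("\n").split("\t")
            if kind == "GRAPH":
                neighbors = value
            else:
                incoming += float(value)
        new_rank = (1 - DAMPING) + DAMPING * incoming
        print("%s\t%s\t%s" % (node, new_rank, neighbors))


if __name__ == "__main__":
    {"map": mapper, "reduce": reducer}[sys.argv[1]]()
```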

Ask yourself:

• Is the underlying structure of my data just as important as the data itself?
• Are the insights I am looking for drawn from that structure as much as, or even more than, from the data itself?

5. The MapReduce model

Some tasks and algorithms simply do not fit the MapReduce programming model. One such class of problems has already been discussed above. Another consists of tasks where the results of intermediate stages are needed to compute the final result (an academic example is computing the Fibonacci series). Some machine learning algorithms (for example, those based on gradient descent or expectation maximization) also fit the MapReduce paradigm poorly. Researchers have proposed various optimization strategies and workarounds for each of these problems (global state, passing data structures by reference, and so on), but implementing them remains more complicated and less intuitive than one would like.
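A small sketch of why iterative algorithms are an awkward fit (plain Python on synthetic data; no Hadoop involved, purely conceptual): each gradient-descent step is itself a map over the records followed by a reduce that sums the partial gradients, so on Hadoop every iteration would become a separate job submission with its minute-scale overhead, and the model state would have to be shipped between jobs.

```python
# Gradient descent for a one-parameter linear model, written as repeated
# map/reduce passes over the data.  Locally this loop is cheap; on Hadoop
# each pass would be a separate job, paying scheduling and I/O costs on
# every one of the (possibly hundreds of) iterations.
import random

random.seed(0)
# Synthetic data: y = 3x + noise (purely illustrative).
data = [(x, 3.0 * x + random.gauss(0, 0.1)) for x in range(100)]

w = 0.0              # single model parameter, "broadcast" to every mapper
learning_rate = 1e-4

for iteration in range(50):
    # "Map": each record contributes a partial gradient.
    partial_gradients = ((w * x - y) * x for x, y in data)
    # "Reduce": sum the partials; on a cluster this is the reducer's job.
    gradient = sum(partial_gradients) / len(data)
    w -= learning_rate * gradient        # driver updates the shared state

print("estimated slope:", round(w, 3))   # converges to roughly 3.0
```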

Ask yourself:

• Does the business depend heavily on specialized algorithms or domain-specific processing?
• Would the engineering team handle the analytics better if those algorithms were adapted to MapReduce, or not?

We should also mention practical cases where the data set is not all that large, or where the total volume is large but consists of billions of small files that cannot be concatenated (for example, scanning a huge collection of image files to select those containing a particular shape). As noted above, if a task does not fit MapReduce's "divide and aggregate" paradigm, using Hadoop for it is a dubious idea.

So, having studied when Hadoop may not be the best solution, let's discuss when it is appropriate to use it.

Ask yourself:

Is your organization going to ...
1. Extract information from huge volumes of text logs?
2. Convert mostly unstructured or poorly structured data into a convenient organized format?
3. Solve tasks that involve processing the entire data set, running the work overnight (much as credit card companies process the day's transactions at night)?
4. Rely on conclusions drawn from a single processing run that remain valid until the next scheduled run (unlike, say, stock quotes, which change far more often than once per trading day)?

In such cases, you almost certainly should pay attention to Hadoop.
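To ground case 1 from the list above, here is the canonical kind of job Hadoop is genuinely built for: a Hadoop Streaming pass that extracts per-status-code counts from large web-server logs. The log layout and field positions are assumptions for the sake of the sketch.

```python
#!/usr/bin/env python
"""Sketch: count HTTP status codes in web-server logs with Hadoop Streaming.

Assumes the common/combined log format, where the status code is the
second-to-last field on lines like:
  1.2.3.4 - - [01/Apr/2015:00:00:01 +0000] "GET / HTTP/1.1" 200 5123
"""
import sys
from itertools import groupby


def mapper():
    for line in sys.stdin:
        fields = line.split()
        if len(fields) >= 2:
            print(fields[-2] + "\t1")    # key: status code, value: 1


def reducer():
    for status, lines in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
        print(status + "\t" + str(sum(int(l.split("\t")[1]) for l in lines)))


if __name__ == "__main__":
    {"map": mapper, "reduce": reducer}[sys.argv[1]]()
```

Summarizing terabytes of such logs overnight is squarely within Hadoop's comfort zone.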

There is a range of business problems that map well onto the Hadoop model (although practice shows that even these are far from trivial to solve). As a rule, they boil down to processing huge volumes of unstructured or semi-structured data, and consist either in summarizing its content or in converting the extracted observations into a structured form for later use by other components of the system. In such cases the Hadoop model is very helpful. If the data you collect contains elements that can naturally serve as identifiers for associated values (in Hadoop such pairs are called key-value pairs), that simple association can drive several kinds of aggregation at once.

So the most important thing is to have a clear picture of the resources your business has and a clear understanding of the problem you are trying to solve. I hope these considerations, along with the recommendations above, will help you find exactly the tools that fit your business.

It is likely that this will be Hadoop.

Source: https://habr.com/ru/post/257233/

