[@tsafin - Turing Award winner Michael Stonebraker needs no introduction: he and his students from Berkeley and MIT have created, in essence, most of the relational and non-relational databases of the past couple of decades. Ingres and Postgres, C-Store and Vertica, H-Store and VoltDB are just a few of the projects and companies that Michael and his students have directly influenced, and there are many more forks and derivatives...
So when he criticizes something, be it NoSQL or Hadoop, the industry should at least listen, and better yet try to change.
I found his point of view on Hadoop, expressed in articles from 2012 and 2014, interesting, and it was instructive to watch how this "classical" point of view evolved over such a short period.
The first article, "Possible Hadoop Trajectories," published in Communications of the ACM (http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories/fulltext) in May 2012, was co-authored with Jeremy Kepner, who at the time was a senior member of the technical staff at MIT Lincoln Laboratory and a researcher at the MIT Mathematics Department and the MIT Computer Science and AI Lab. This jointly written article seems bolder and more spirited than the second one, written by Stonebraker alone two years later (and, IMHO, the first article is the better written of the two), but I publish them together because the context has changed dramatically over the past couple of years, and it would be unfair to leave that unnoted when discussing the Hadoop/HDFS ecosystem.
By and large, the criticism of Hadoop in the first article concerns only the implementation of the MapReduce API, and in the years since, the Hadoop industry has done a lot to address those problems. That, nevertheless, has not brought it much closer to solving the problems of the HPC computing applications mentioned in the first article.]
Possible Hadoop Trajectories
Michael Stonebraker and Jeremy Kepner, May 2, 2012
Over the past few years, Hadoop has become the parallel computing platform for the Java community. As such, it has fully achieved the goal of bringing parallel computing to the practice of millions of programmers in the Java world. Note that previous attempts to do this (Java Grande, HPJava) did not get very far, and we applaud Hadoop for its success, which is due in large part to the simplicity and accessibility of the environment it created.
Nevertheless, we see many improvements that must be made before it can be used seriously in production installations, at least in the domain of scientific use at laboratories such as Lincoln Laboratory, where one of us works. In our environment, Hadoop is mostly used for parallel computing (scientific analytic tools, information aggregation) and for managing data sets.
Let's take a closer look at these two usage scenarios.
Scientific computing on Hadoop
Often, in code that performs scientific computations, the nodes are organized into 2-D (or 3-D, or N-D) rectangular partitioned grids, and each node executes something like:
until final condition {
    perform a local computation on the local data partition
    emit the [updated] state
    send/receive data to/from the subset of nodes holding the other data partitions
}
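For concreteness, here is a minimal single-process sketch of this template, assuming a Jacobi-style 2-D stencil (the grid size, point source, and convergence threshold are all illustrative); in a real deployment the partitions would live on different nodes, and the send/receive step would carry only the partition boundaries across the network:

    // A Jacobi-style 2-D stencil: each cell is repeatedly replaced by the
    // average of its four neighbors until the update falls below a threshold.
    public class JacobiSketch {
        public static void main(String[] args) {
            int n = 64;                           // grid size (illustrative)
            double[][] grid = new double[n][n];
            double[][] next = new double[n][n];
            grid[n / 2][n / 2] = 100.0;           // a point source

            double delta = Double.MAX_VALUE;
            // "until final condition"
            for (int iter = 0; iter < 200_000 && delta > 1e-6; iter++) {
                delta = 0.0;
                // local computation on the local data partition
                for (int i = 1; i < n - 1; i++) {
                    for (int j = 1; j < n - 1; j++) {
                        next[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                           + grid[i][j - 1] + grid[i][j + 1]);
                        delta = Math.max(delta, Math.abs(next[i][j] - grid[i][j]));
                    }
                }
                // "send/receive": distributed runs would exchange only the
                // boundary rows/columns of each partition at this point
                double[][] tmp = grid; grid = next; next = tmp;  // emit updated state
            }
            System.out.println("converged, residual = " + delta);
        }
    }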
This template describes most computational fluid dynamics (CFD) algorithms, all atmospheric and ocean simulation models, linear algebra operations, operations on sparse graphs, image processing, and signal processing. Considering this class of problems on Hadoop, we see the following problems that need to be solved:
Local computations carry a large state from iteration to iteration. Maintaining state between MapReduce steps requires writing it to the file system, which is often very expensive. Such algorithms also often require direct communication between nodes, which the MapReduce infrastructure does not support. And they frequently assume that the code for a given grid partition is pinned to the same node across iterations of the algorithm. Again, this model is not directly supported by MapReduce.
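To make the file system round-trip concrete, here is a sketch of how such a loop is typically driven on vanilla Hadoop MapReduce (the 2.x "mapreduce" API): one Job per iteration, a map-only identity step standing in for the real computation, and hypothetical /data/state-N paths. Every iteration rewrites the entire state to the file system and re-reads it:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
        /** One simulation step; identity here, real code would update the partition. */
        public static class StepMapper
                extends Mapper<LongWritable, Text, LongWritable, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(key, value);            // local computation on one record of state
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path state = new Path("/data/state-0");   // hypothetical initial state on HDFS
            for (int iter = 1; iter <= 50; iter++) {  // or until convergence
                Job job = Job.getInstance(conf, "step-" + iter);
                job.setJarByClass(IterativeDriver.class);
                job.setMapperClass(StepMapper.class);
                job.setNumReduceTasks(0);             // map-only step, for brevity
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);
                FileInputFormat.addInputPath(job, state);
                Path next = new Path("/data/state-" + iter);
                FileOutputFormat.setOutputPath(job, next);
                if (!job.waitForCompletion(true)) System.exit(1);
                state = next;  // the ENTIRE state round-trips through the file system
            }
        }
    }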
Our estimate is that MapReduce will work for only 5% of Lincoln Laboratory users. The remaining 95% would have to force their algorithms into the Procrustean bed of the MapReduce model, paying a one-to-two order-of-magnitude slowdown as a result. Few scientists will accept such a price, except perhaps on very small data sets.
Many may argue that in the long run performance does not matter [@tsafin - WTF?]. That may be true at the low end, but for the data volumes we see at Lincoln Laboratory and at the other research labs we know, performance matters a great deal, and there is never enough computing capacity. Moreover, our organization, for example, is investing $100 million to place its next-generation supercomputing center next to a hydroelectric dam and thereby cut carbon emissions by an order of magnitude. The performance lost by introducing Hadoop would be an absolutely unacceptable additional cost.
Even at smaller scales, using systems as inefficient as Hadoop is a very unfriendly step toward the environment and often simply a waste of energy. In short, we have observed the following trajectory when Hadoop is used in a [scientific] computing environment:
- Step 1. Try Hadoop in pilot projects;
- Step 2. Scale Hadoop up for production use;
- Step 3. Hit the wall because of the problems mentioned above;
- Step 4. Rework the solution to get around the restrictions.
At Lincoln Laboratory, we have projects at each of these four stages.
For Hadoop to survive in our environment, its parallel computing model will need a major overhaul, preferably combined with the ongoing work in present-day Hadoop on replacing the task scheduler. Our expectation is that solving these problems will leave future systems unrecognizable relative to today's Hadoop.
We fully accept that in other shops the mix of tasks relevant to their users may turn out to be more compatible with the MapReduce infrastructure. However, our sense is that we are closer to the norm than the exception. Google's move away from MapReduce to other models confirms this suspicion. We therefore expect dramatic changes in the Hadoop infrastructure.
Data Management in Hadoop
Forty years of DBMS research and industrial practice confirm the thesis Ted Codd stated back in 1970: both the programmer and the system are generally more efficient when data manipulation is expressed in a high-level language operating on sets of records, and less efficient when one has to work in record-at-a-time languages. Although Hadoop is quite high-level compared to record-at-a-time interfaces, it is still easier to express data queries in Hive than in MapReduce directly. It therefore seems likely to us that all Hadoop data management tools will move toward higher-level languages, such as SQL and SQL-like languages.
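To illustrate Codd's point with Hadoop's own tools: a grouped average that HiveQL expresses in one line, something like SELECT dept, AVG(salary) FROM employees GROUP BY dept (the table and column names are made up), requires a hand-written mapper/reducer pair, plus a driver not shown here, when coded against the MapReduce API directly:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class AvgSalary {
        // parse one "dept,salary" line and emit (dept, salary)
        public static class M extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable key, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",");
                ctx.write(new Text(f[0]), new DoubleWritable(Double.parseDouble(f[1])));
            }
        }
        // average all salaries seen for one dept
        public static class R extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text dept, Iterable<DoubleWritable> vals, Context ctx)
                    throws IOException, InterruptedException {
                double sum = 0; long n = 0;
                for (DoubleWritable v : vals) { sum += v.get(); n++; }
                ctx.write(dept, new DoubleWritable(sum / n));
            }
        }
    }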
For example, according to reports by David DeWitt [1-1], the Hadoop cluster at Facebook is almost entirely programmed in Hive, a high-level data access language very similar to SQL. At Lincoln Laboratory the trajectory is much the same, although other high-level languages (not Hive) have been chosen, with fairly high-level algebraic interfaces for accessing sparse data [1-2, 1-3].
In effect, MapReduce appears to be becoming an internal interface [encapsulated inside] a DBMS. In other words, Hive users do not much care what happens inside their HiveQL query, and the MapReduce interface becomes invisible, sinking into the depths of the DBMS internals. After all, how many of you care which network protocol a parallel database uses for communication between the executors on different nodes?
Incidentally, one of the authors of this article has built five parallel DBMSs and is, of course, quite familiar with the communication protocols between a query coordinator and the executors on the various nodes. He also knows that executor nodes must talk to one another to pass intermediate data around. In this scenario, building a high-performance system requires the following system characteristics:
- Executor nodes must be able to retain state, so that data can be reused between iterations of the distributed query plan.
- Direct communication between nodes must be supported.
- It must be possible to bind query processing to the data local to a node.
In general, a parallel DBMS requires essentially the same set of capabilities as the scientific computing algorithms mentioned above. We thus end up with MapReduce as the internal interface of a parallel DBMS, but an interface that is not very well suited to a DBMS.
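A hypothetical sketch, not any real system's API, of what these three requirements imply for an executor node; every name here is invented for illustration:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Invented peer interface: requirement 2, direct node-to-node exchange.
    interface Peer {
        void send(int nodeId, List<Object[]> rows);
        List<Object[]> receive(int nodeId);
    }

    class ExecutorNode {
        // Requirement 1: state (e.g., a built hash table) kept in memory across
        // iterations of the distributed query plan instead of being flushed to disk.
        private final Map<Object, Object[]> hashTable = new HashMap<>();
        private final Peer network;

        ExecutorNode(Peer network) { this.network = network; }

        // Requirement 3: each plan fragment runs against the partition that is
        // local to this node; only small intermediate results cross the network.
        void runFragment(List<Object[]> localPartition, int peerId) {
            for (Object[] row : localPartition) {
                hashTable.put(row[0], row);       // reusable on the next iteration
            }
            List<Object[]> intermediate = Collections.emptyList();  // placeholder
            network.send(peerId, intermediate);   // exchange only intermediate rows
        }
    }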
One of us co-wrote a 2009 article comparing parallel database technologies with Hadoop [1-4]. Roughly speaking, a parallel DBMS is one to two orders of magnitude faster than Hadoop. The advantage comes from using indexes over the data, from sending queries only to the nodes where the data resides (and not the other way around), from better compression, and from better inter-node protocols. As far as we can tell, the situation in 2012 has not changed much since 2009: by informal estimates, Hadoop is still one to two orders of magnitude slower. For example, one large Web project ran a 5-petabyte Hadoop cluster on 2,700 nodes, while another, with a similar 5-petabyte installation managed by a commercial DBMS, needed only 200 nodes, 13.5 times fewer.
In short, at the moment we see the following trajectory when data is managed with Hadoop:
- Step 1. Try Hadoop in pilot projects.
- Step 2. Scale Hadoop up for production use.
- Step 3. Run into unacceptable performance overhead.
- Step 4. Rework the solution around a parallel DBMS.
Most Hadoop installations are now between steps 2 and 3, and the "hit the wall" stage is just around the corner. Either Hadoop will grow into a real parallel DBMS within a limited amount of time, or users will move to other solutions, replacing parts of their Hadoop deployments, using interfaces from Hadoop that feed its data to external systems, or in some other way. Given the modest progress observed over the last three years, our bet is on the second outcome [a move to other solutions].
In conclusion: the Gartner Group once formulated its well-known "hype cycle" curve [1-5] [@tsafin - rendered in Russian as the "technology maturity curve," sic], which describes the evolution of new technologies from their very inception. The current state of the Hadoop ecosystem is still being presented as "the greatest thing since sliced bread," but we hope that over time the limitations we have mentioned will be lifted and we will come a little closer to what was promised.
Links
- [1-1] http://searchsqlserver.techtarget.com/news/2240126683/Cloaked-in-secrecy-Microsoft-project-aims-to-wed-SQL-NoSQL-databases
- [1-2] Jeremy Kepner et al., "Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System," ICASSP, March 25-30, 2012.
- [1-3] http://accumulo.apache.org
- [1-4] Andrew Pavlo et al., "A Comparison of Approaches to Large-Scale Data Analysis," Proc. SIGMOD 2009, Providence, RI, June 2009.
- [1-5] http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp
Hadoop at a Crossroads?
Michael Stonebraker, August 5, 2014
[@tsafin - the second article was written two years later, in August 2014, and published in the same journal, Communications of the ACM: http://cacm.acm.org/blogs/blog-cacm/177467-hadoop-at-a-crossroads/fulltext]
Much water has flowed under the bridge since our previous article on this topic, co-written with Jeremy Kepner in 2012 [2-1]. I think it worth pointing out several developments since then and reporting a couple of important announcements. I will conclude with predictions about where the Hadoop stack may move in the future.
The first announcement worth mentioning is the release of a new DBMS, Cloudera Impala [2-2], which runs on top of HDFS. Simply put, Impala is built like every other shared-nothing parallel SQL DBMS [@tsafin - there is no well-established Russian translation of the term "shared-nothing"; we suggest following Sergey Kuznetsov's rendering, "architecture without shared resources"] and is aimed at the data warehouse market. Particularly noteworthy is that the MapReduce layer has been removed, and removed deliberately. As many of us have pointed out over the years, MapReduce is not a good internal interface for a SQL (or Hive) DBMS [2-3, 2-4]. Impala was built by developers who knew this. In fact, Impala-like activity is already under way at both Hortonworks and Facebook. This puts Hadoop vendors in a dilemma: historically, "Hadoop" meant the open-source implementation of MapReduce created at Yahoo. Yet Impala has thrown that layer out of its solution stack.
How can one remain a Hadoop vendor when Hadoop is no longer at the heart of one's stack?
The answer is simple: redefine what "Hadoop" means, and that is exactly what the Hadoop vendors have done. "Hadoop" now refers to the entire stack: HDFS at the bottom, with Impala, MapReduce, or other systems running above it, and higher-level solutions such as Mahout running on top of those. The term "Hadoop" now applies to the whole collection of resulting solutions.
Another recent announcement came from Google, which declared MapReduce a thing of the past, saying it had moved on to better solutions, building its systems on the likes of Dremel, BigTable, and F1/Spanner [2-5]. Google must be laughing heartily right now: it invented MapReduce to support the crawler for its search engine in 2004, but a few years ago replaced it with a BigTable-based implementation, because it needed an interactive storage system and MapReduce was batch-only. So the main driving force behind MapReduce abandoned it some time ago, and now Google reports that MapReduce will not be needed in the future. It is ironic enough that Hadoop adopted this paradigm about five years after Google abandoned it; the rest of the world followed Google into Hadoop with a lag of roughly ten years, even though Google itself gave it up long ago. I wonder how long it will take the rest of the world to catch on this time?
In the end, it turns out that the Hadoop vendors are now on a collision course with the data warehouse vendors. They are currently implementing (or have already implemented) essentially the same architecture the data warehouse vendors have. Once they have had a couple of years to harden their implementations, they will be able to show adequate performance. Note that most data warehouse vendors already support HDFS, and many offer support for semi-structured data. I am therefore convinced that the data warehouse market and the Hadoop vendor market will soon merge. And may the best system win in this head-to-head street fight!
Now let us turn to HDFS, which has become one of the main building blocks of the Hadoop stack. Note that HDFS is a file system, capable of storing bytes of data, which is natural to expect on any computing platform. There are two possible views of where HDFS may move in the future.
Seen through the eyes of the file system world, users want a common distributed file system, and from this point of view HDFS is the ideal candidate.
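From that point of view, HDFS usage looks like ordinary file I/O against a shared store. A minimal client sketch using the standard org.apache.hadoop.fs API (the cluster URI and path are illustrative):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBytes {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"),
                                           new Configuration());
            Path p = new Path("/shared/results.bin");
            try (FSDataOutputStream out = fs.create(p, true)) {  // write bytes
                out.writeUTF("any bytes at all");
            }
            try (FSDataInputStream in = fs.open(p)) {            // read them back from any node
                System.out.println(in.readUTF());
            }
        }
    }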
From the point of view of a parallel SQL/Hive DBMS, however, HDFS is a "fate worse than death." A DBMS always and everywhere wants to send the query (a few kilobytes) to the data (many gigabytes), and never the other way around. Hiding the location of data from the DBMS engine is therefore deadly, and a DBMS will try very, very hard to get around such a restriction. All parallel DBMSs, from the data warehouse vendors to the Hadoop vendors, will turn off location transparency, turning HDFS into a collection of Linux file systems, one file system per node.
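HDFS does, in fact, expose block placement, which is exactly the hook an engine uses to defeat location transparency and ship the query to the data. A sketch using the real FileSystem.getFileBlockLocations call (the paths and the scheduling decision itself are illustrative):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class QueryToData {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"),
                                           new Configuration());
            FileStatus table = fs.getFileStatus(new Path("/warehouse/orders/part-00000"));
            BlockLocation[] blocks =
                fs.getFileBlockLocations(table, 0, table.getLen());
            for (BlockLocation b : blocks) {
                // a scheduler would send the (kilobyte-sized) plan fragment to one
                // of these hosts, instead of pulling gigabytes across the network
                System.out.println(b.getOffset() + " -> " + String.join(",", b.getHosts()));
            }
        }
    }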
Similarly, no database engine wants to work with file-system-level replicas. A detailed discussion of this can be found in [2-6]. In short, for load distribution, query optimization, and transaction handling, the replication schemes used inside DBMSs are preferable in every respect.
If over time the database vendors' point of view prevails in the market, HDFS will wither away, because the database vendors will stop using it. In their world, each node already has a local file system, the parallel DBMS supports a high-level query language, and on top of all this sit numerous tools and extensions defined through user-defined functions. In this scenario, Hadoop turns into a standard shared-nothing database market, with a number of competing database vendors fighting for your dollar.
If, on the other hand, the file system point of view prevails, HDFS survives, with a broad range of tools running on top of it. Features standard in DBMS environments, such as load balancing, auditing, resource management, data independence, data integrity, high availability, concurrency control, and quality of service, will then come slowly to file system users. In this scenario there will be no high-level standard interfaces. In other words, the DBMS view of the world offers a large set of additional useful services, and users would be well advised to be extremely careful about programming against low-level interfaces.
In either scenario, the file system remains the common ground, and Hadoop vendors will sell tools based on the file system or on the database (or both). In doing so, they join the ranks of the software vendors selling software or services. And may the best products win!
Links
- [2-1] http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories/fulltext
- [2-2] http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
- [2-3] http://dl.acm.org/citation.cfm?id=1629197
- [2-4] Pavlo, A., et al., "A Comparison of Approaches to Large-Scale Data Analysis," Proc. SIGMOD 2009.
- [2-5] http://www.datacenterknowledge.com/archives/2014/06/25/google-dumps-mapreduce-favor-new-hyper-scale-analytics-system/
- [2-6] Stonebraker, M., et al., "Enterprise Data Applications and the Cloud: A Difficult Road Ahead," Proc. IEEE IC2E, Boston, MA, March 2014.