
Should I pay for Apache Hadoop?





In 2010, Apache Hadoop, MapReduce, and their associated technologies fueled the spread of a new phenomenon in information technology called "big data." Understanding what the Apache Hadoop platform is, why it is needed, and what it can be used for began to occupy the minds of experts all over the world. Starting as a one-person idea and quickly growing to industrial scale, Apache Hadoop has become one of the most widely discussed platforms for distributed computing, as well as for storing unstructured or weakly structured information. In this article, I would like to look more closely at the Apache Hadoop platform itself, at the commercial implementations provided by third-party companies, and at how they differ from the freely distributed version of Apache Hadoop.



Before turning to commercial implementations, I would like to dwell on the history of the origin and development of the Apache Hadoop platform. The creators and ideological inspirers of Apache Hadoop are Doug Cutting and Michael Cafarella, who in 2002 began developing a search engine, part-time, within the Nutch project. In 2004, the then newly developed MapReduce distributed computing paradigm was added to the Nutch project, along with a distributed file system. At the same time, Yahoo was considering developing a distributed computing platform for indexing and searching pages on the Internet. Yahoo's competitors were not idle either, which is why there are now quite a few platforms for distributed computing, for example Google App Engine, Appistry CloudIQ, Heroku, and Pervasive DataRush. Yahoo rightly decided that developing and supporting its own proprietary distributed computing platform would cost more, and yield a lower-quality result, than investing in an open source platform. It therefore began searching for suitable solutions among open source projects, reasoning that the harsh and unbiased opinion of the community would improve quality while reducing the cost of maintaining the platform, since development would be carried out not only by Yahoo but by the entire free IT community. Fairly quickly they came across Nutch, which at that time stood out from the competition thanks to the results it had already demonstrated, and decided to invest in the project. To this end, in 2006 they invited Doug Cutting to lead a dedicated team on a new project, Hadoop, whose goal was to develop a distributed computing infrastructure. In the same year, Apache Hadoop was released as a separate open source project.


It is fair to say that from its decision to invest in the development of an open source platform for distributed computing, Yahoo received quite tangible benefits, just as Apache Hadoop received an impetus for development from a huge corporation. Apache Hadoop helped Yahoo attract scientists from around the world and create an advanced research and development center, now one of the leading centers for search, advertising, spam detection, personalization, and many other things related to the Internet. Yahoo did not have to develop many things from scratch; it took advantage of third-party work, for example using Apache HBase and Apache Hive to solve its problems. Since Apache Hadoop is an open platform, Yahoo does not need to train specialists in-house: people with Hadoop experience can already be found on the labor market, whereas a proprietary platform would have required training specialists inside the company. Apache Hadoop has become something of an industry standard, developed by many companies and third-party developers, so Yahoo saved on continuous investment in the platform's development and escaped the problem of constant software obsolescence. All this allowed Yahoo to launch Yahoo WebSearch in 2008 on an Apache Hadoop cluster of 4,000 machines.

However, the development of Apache Hadoop at Yahoo was not cloudless throughout its journey. In September 2009, Doug Cutting, failing to find a common language with Yahoo's leadership, left for the California startup Cloudera, which is engaged in the commercial development and promotion of Apache Hadoop in the big data market. Honestly, I have no information on what exactly they disagreed about, but the fact remains: stung by Doug Cutting's departure, in 2011 Yahoo put up the money to create a company called Hortonworks, whose main activity is likewise the commercialization and promotion of the Apache Hadoop platform. It is these two companies that will be discussed further in this article. I will try to compare the distributed computing solutions they supply, and also try to figure out why anyone pays for Apache Hadoop.



Cloudera Inc.







In October 2008, three engineers from Google, Facebook, and Yahoo, and one manager from Oracle, founded a new American company, Cloudera. They bet on distributed computing systems built on a massively parallel (MPP) architecture. Reasoning that the amount of data in the world needing analysis grows every day, and that the number of companies needing tools for such analysis would grow constantly, they counted on the fact that a company with sufficient expertise and qualifications in this field would be able to earn quite a lot. Since they had no product of their own, and no time to develop one, they decided to take an open source project and build their business around it. Apache Hadoop fit perfectly for several reasons: they all knew it, had worked with it, and understood that the project had great potential. So in March 2009, Cloudera announced Cloudera's Distribution including Apache Hadoop, abbreviated CDH, a distribution of Apache Hadoop (HDFS, MapReduce, Hadoop Common) that includes a number of related programs and libraries, such as Apache Flume, Apache Hive, Hue, Apache Mahout, Apache Oozie, Apache Pig, Apache Sqoop, Apache Whirr, and Apache ZooKeeper.



However, a distribution consisting of an assembly of open source libraries and programs cannot be sold to anyone by itself, so Cloudera decided to develop its own software around Apache Hadoop. The creators of Hadoop, Doug Cutting and Michael Cafarella, were brought into the company, and work began on a tool for deploying, monitoring, and managing an Apache Hadoop cluster: Cloudera Manager. This tool automates the deployment of an Apache Hadoop cluster, provides real-time monitoring of current activity and the status of individual nodes, draws heatmaps, can generate alerts for certain events, manages user access, stores historical information about cluster usage, and collects logs from nodes and lets you view them.



All this has allowed Cloudera to launch a service package called Cloudera Enterprise, consisting of three products:







where:



CDH is an Apache Hadoop distribution (HDFS, MapReduce and MapReduce2, Hadoop Common) that includes a number of related programs and libraries, such as Apache Flume, Apache Hive, Hue, Apache Mahout, Apache Oozie, Apache Pig, Apache Sqoop, Apache Whirr, and Apache ZooKeeper.



Cloudera Manager is a tool for deploying, monitoring and managing an Apache Hadoop cluster.



Cloudera Support - professional support provided by Cloudera experts on issues related to CDH and Cloudera Manager.



All this is sold as a subscription and is quite expensive: Cloudera Manager, for example, costs $4,000 per node. Still, for some companies the expense is reasonable, since Apache Hadoop has a high cost of support and administration. In particular, writing MapReduce jobs requires a staff of qualified Java specialists, who command high salaries on the labor market. Even so, only a limited number of companies use Cloudera's services; most try to manage on their own. This is because, in essence, Cloudera's only proprietary development is Cloudera Manager, and even that costs far more than it should. In my opinion, the Cloudera Enterprise package is currently not worth the money, since essentially the only useful thing it provides is Cloudera Manager; a sufficiently qualified specialist with enough time can figure out everything else independently. The main advantage Cloudera exploits is the limited number of Apache Hadoop specialists in the world, which allows it to charge a premium for technical expertise on Apache Hadoop.
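To illustrate why Java expertise matters here: even the canonical "word count" task requires hand-written mapper and reducer logic. Below is a minimal sketch of the MapReduce idea in plain JDK Java. To keep it self-contained and runnable without a cluster, it deliberately simulates the map, shuffle, and reduce phases locally rather than using the real Hadoop `Mapper`/`Reducer` API; an actual Hadoop job would look similar in spirit but be submitted to the cluster.

```java
import java.util.*;
import java.util.stream.*;

// Local, single-process sketch of the MapReduce paradigm (word count).
// Uses only the JDK, not the Hadoop API, so it runs anywhere.
public class WordCountSketch {

    // "Map" phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Shuffle" + "reduce" phases: group pairs by key, sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        // In a real job each line would come from an HDFS block on some node.
        List<String> input = List.of("to be or not to be", "to do is to be");
        List<Map.Entry<String, Integer>> pairs = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.toList());
        Map<String, Integer> counts = reduce(pairs);
        System.out.println(counts.get("to")); // 4
        System.out.println(counts.get("be")); // 3
    }
}
```

Trivial as it looks, this is the level of abstraction MapReduce offers out of the box, which is precisely why higher-level tools such as Apache Hive and Apache Pig (both bundled in CDH) exist.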



Be that as it may, on May 23, 2012, Apache Hadoop 2.0.0 Alpha became available for download from hadoop.apache.org, and on June 5, 2012, Cloudera announced with great fanfare the fourth version of CDH, the first in the world built on the Apache Hadoop 2.0.0 Alpha codebase. By most accounts, Apache Hadoop 2.0.0 Alpha is "raw" and unstable in operation, and some companies prefer to wait until a stabilization period has passed and most of the errors have been corrected. Nevertheless, Apache Hadoop 2.0.0 has some advantages over the first version, the main ones being the following:



  1. High Availability for NameNode
  2. YARN / MapReduce2
  3. HDFS Federation
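To give a flavor of the first item, NameNode HA in Hadoop 2.x is configured declaratively. A condensed, illustrative fragment of `hdfs-site.xml` might look like the following (the nameservice name `mycluster`, the logical names `nn1`/`nn2`, and the hostnames are placeholders, and a real setup needs additional settings such as shared edits storage and fencing):

```xml
<!-- hdfs-site.xml: illustrative NameNode HA fragment; names are placeholders -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

The point is that failover is handled inside HDFS itself, rather than by an external virtualization layer, which matters for the comparison with Hortonworks below.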




As I wrote earlier, all of this is still extremely unstable and is not recommended for installation in a production environment. However, the laurels of a pioneer would not let Cloudera sleep peacefully, prompting it to be the first in the world to ship Apache Hadoop 2.0, in CDH4. Cloudera thereby declared its leadership in providing a platform for distributed computing, since no one else has a distribution based on Apache Hadoop 2.0. So what does Cloudera's main competitor in this area, Hortonworks, offer?



Hortonworks









The emergence of Cloudera led many people to think about the market prospects in this area; plenty of them wanted to become the leaders who set the main vector of development and, accordingly, hold the most complete expertise and qualifications in the field. So in 2011 Hortonworks was founded, by engineers mostly from Yahoo, who were able to attract funding from Yahoo and the investment fund Benchmark Capital. The company set out to do the same thing as Cloudera: commercialize Apache Hadoop. Quite recently, on June 12, 2012, one day before Hadoop Summit 2012, Hortonworks announced the Hortonworks Data Platform, or HDP, its distribution based on Apache Hadoop 1.0. The architecture of this platform is shown in the picture below:







In brief, this platform provides everything Cloudera CDH4 does, but on the Apache Hadoop 1.0 codebase. There is one notable difference: as part of HDP, Hortonworks supplies the Hortonworks Management Center (HMC), based on Apache Ambari, which performs the same functions as Cloudera Manager but is completely free. This is an obvious advantage, since Cloudera Manager, for reasons unclear, costs a lot of money (to be fair, there is a free version of Cloudera Manager with reduced functionality and a limit of 50 nodes). Hortonworks, for some reason, also lists as an advantage of HDP the ability to download Talend's Talend Open Studio for Big Data as an ETL/ELT add-on. It must be said that this solution can be downloaded completely free as an ETL/ELT tool for Cloudera CDH4 as well, so this is not an advantage unique to HDP. I have some acquaintance with Talend Open Studio and can say that as both an ETL and an ELT solution it is a good choice, with rich functionality and stable, predictable behavior.



Since HDP uses Apache Hadoop 1.0 as its base, it lacks some of the benefits CDH4 has: in particular, HA for the NameNode, YARN/MapReduce2, and HDFS Federation. To address the lack of HA for the NameNode, Hortonworks proposes an add-on based on the VMware vSphere platform, which can provide virtual machine-level fault tolerance for the NameNode and JobTracker. In my opinion, this is a non-trivial solution with dubious benefits that leads to additional costs.



Hortonworks also decided to build its business on paid support for its HDP platform. Support is sold as an annual subscription and is divided into levels. It is hard to judge the quality of Hortonworks support, since I have not yet found a single client who uses it. There are negative reviews of Cloudera support, chiefly very long response times, but the same can be said of almost any vendor's support.



Apache Hadoop and its related software have now been transformed from an open source project into a complete solution developed by several companies around the world. It has already outgrown the walls of the laboratory and is ready to prove its applicability in practice as an enterprise solution for analyzing and storing extremely large amounts of data. At the moment Hortonworks is playing catch-up, while Cloudera, with its CDH4 platform, is undoubtedly the leader. Many other companies have by now assessed the prospects of this market and are trying to gain a foothold with their own solutions built on Apache Hadoop or able to work with it, but all of them are far behind the two leaders. Today, then, the most complete and working distributions, including all the necessary libraries and programs, belong to two companies: Cloudera with CDH3 and CDH4, and Hortonworks with HDP. These solutions deserve a place as enterprise analysis tools in companies that need them. For now, with very few specialists on the market, deploying and configuring Apache Hadoop on your own is a lengthy process with an uncertain outcome, essentially a long journey of trial and error through the various methods open source provides. With CDH4 and HDP, by contrast, you are working with solutions that have already proven themselves, with support available when needed. So the question of whether to pay for Apache Hadoop largely answers itself: if you plan to use it for experimental purposes, or if the company is ready to invest time and money in training its own specialists, then of course you should not pay for it.
However, if Apache Hadoop will be used as an enterprise solution, it is better to have support backed by an accumulated knowledge base on solving various problems and a deep understanding of how it works.



Solutions from various companies that commercialize Hadoop in one way or another

Cloudera Inc website

Hortonworks website

A brief history of Apache Hadoop from the creator

Source: https://habr.com/ru/post/151062/


