Windows Azure and Hadoop: “friendship”, ready for Enterprise

Over the past six months, there have been 3 IT events lying in the plane of Big Data , Cloud Computing, and their symbiosis . By a strange coincidence, these events were left without proper attention from both the habrosoobschestva and the few professional networking communities on LinkedIn and Facebook.

The events in question are the Strata + Hadoop World conference, the release of the stable version of Hadoop 2.2.0 and the Windows Azure HDInsight cloud service. The indirect and direct interrelation of these events will be discussed below.

Windows Azure HDInsight 2.1 Ecosystem

Why so many links?

In view of the limited length of the article, I will provide links to resources that will be more useful than any free retelling of the contents of these resources.

The first event is the Strata + Hadoop World conference , which was held from 28.10 to 30.10. Speakers talk about Data Science and the Hadoop platform, use Python and R. Among the long list of sponsors of the event, Cloudera, Hortonworks, MapR, SAP, IBM, Intel, VMware, Microsoft are noted.
')
Vision of the latter (Microsoft) was presented by 3 reports ( 1 , 2 , 3 ). At the same time, the transition of the service providing Hadoop as a Service in Windows Azure, Windows Azure HDInsight to a stable version (Generally Available, GA) was announced .

In HDInsight GA, there are expectedly no new features compared to the CTP version. Another thing is more important: Microsoft is consistent (it often happens with development platforms), and everything that was available to developers in the CTP version remained in the final release. Among the "features" of using HDInsight is that it is a cloud service (pay for use), integration with Microsoft BI tools and the ability to write Job'y in javascript and C # (.NET).

One of the most important strategic advantages of the service, in my opinion, is that HDInsight GA, as in earlier versions, provides a 100% Hadoop-compatible solution.

This is important because it becomes clear that Microsoft will provide a service by subscription, from which you can "leave" without having to radically rewrite the code. This is a strategically important opportunity for startups and small research teams.

Thus, the Hive / Pig / Java code that was written for on-demand jobs (Windows Azure) can be run on the premise Hadoop cluster without any changes.

As for the Hadoop platform itself - on October 15, the release of a stable version of Apache Hadoop 2.x was announced . Apache Hadoop 2.2.0 innovations I wrote earlier ; I will repeat only that the most expected feature of the release is the YARN framework.

Also among the announced changes to Hadoop 2.2.0 was support for running Hadoop on Windows ~~With a candle over the engineers at Hortonworks I was not, but~~ I have a feeling that integration with Windows Server is precisely the fruit of the efforts of Hortonworks engineers with whom Microsoft is working closely.

But Hortonworks still has Apache Hadoop 1.2 enabled in HDP for Windows , which (let the Hadoop community correct me if I'm wrong) does not support the YARN computing framework .

Considering the release of stable Hadoop 2 discussed above, the lack of support for the YARN framework is a big limitation of Windows Azure HDInsight. Another limitation of the HDInsight service is the maximum available cluster size - 32 nodes .

Create HDInsight cluster

Although, in fairness, it is worth noting that the stable version of Hadoop came out 2 weeks ago, Hortonworks is already working on supporting the latest Hadoop distribution in HDP for Windows (HDP 2 for Windows), and the limitation on the number of nodes is likely solved by a support request Windows Azure.

The limitations of Hadoop 2.x are not yet understood by the community. I note that most features in version 2 are only a circumvention of the limitations of version 1 . Separately, I will highlight the computational framework YARN, which not only allowed to get rid of the strong connection with the map / reduce program model , in particular, but also represents a qualitative leap in the development of the Hadoop platform , as a whole.

Yarn

Looking at Google’s work in the field in Big Data, I’ll add to the potential limitations of the platform - the lack of geodistribution support (both replication and execution). The disadvantages of working with the community (primarily academic) are the lack of pdf-paper, similar to those “released” by Google and Microsoft’s research divisions, Berkeley UC researchers, describing the principles and architectural approaches used in designing a particular computer system ( One famous example: MapReduce: Simplied Data Processing on Large Clusters ).

Instead of a conclusion. How to try?

Hadoop 2.0

Cloudera: Apache Hadoop 2.x distribution kit (+ additional components of the Hadoop ecosystem) - CDH 5 Beta .
Hortonworks: Apache Hadoop 2.x distribution kit (+ additional components of the Hadoop ecosystem) - Hortonworks Data Platform 2.0 .

Window Azure HDInsight

Cloud service: Hadoop as a Service in Windows Azure - Windows Azure HDInsight .
Locally: Windows Azure HDInsight Emulator - HDInsight Emulator for Windows Azure (install via the Web Platform Installer).

Source: https://habr.com/ru/post/200780/

All Articles

Windows Azure and Hadoop: “friendship”, ready for Enterprise

Instead of a conclusion. How to try?

Hadoop 2.0

Window Azure HDInsight

More articles: