Some time ago, at the Strata + Hadoop World conference, Microsoft announced the launch of Windows Azure HDInsight, a 100% Apache Hadoop-compatible cloud service. The history of the service and its capabilities are covered in this article on Habr, and the announcements made at Strata + Hadoop World are covered in another recent article.
This article covers the internal structure of HDInsight clusters, how to start working with them, and how to run your first jobs and Hive queries. It closes with real examples of Windows Azure HDInsight in use at large international organizations.
Windows Azure HDInsight offers the following benefits to its users:
Windows Azure HDInsight is a 100% Apache Hadoop-compatible distribution available on the Windows Azure platform as a service. Instead of building its own distribution, Microsoft partnered with Hortonworks to port Apache Hadoop to the Windows platform. Microsoft has invested more than 6,000 man-hours and contributed over 25,000 lines of code to various Apache Hadoop ecosystem projects.
The architecture of a cluster obtained on request as a cloud service is shown in the following figure:
The figure shows the following elements:
Secure Role, or Gateway Node, is a reverse proxy that acts as the gateway to your Hadoop cluster. Secure Role handles authentication and authorization and exposes endpoints for WebHCat, Ambari, HiveServer, HiveServer2, and Oozie on port 443. To connect to the cluster, you use the credentials you specified when creating it;
Head Node is a node running on an Extra Large virtual machine (8 cores, 14 GB RAM). In HDInsight, the Head Node takes on the key functions of the Hadoop cluster: NameNode, Secondary NameNode, and JobTracker. The Head Node hosts and runs the following operational and data services:
Worker Nodes are nodes running on Large virtual machines (4 cores, 7 GB RAM). Worker Nodes are responsible for running the services that support task scheduling, task execution, and data access:
Windows Azure Storage-Blob (WASB): the default file system of your Windows Azure HDInsight cluster is Windows Azure Blob Storage. Microsoft implemented a thin layer on top of Blob Storage that presents the storage as an HDFS file system; this layer is called Windows Azure Storage-Blob, or WASB.
The great news is that you can interact with WASB using DFS commands, the Blob Storage REST API, or any of numerous popular utilities.
Another remarkable feature of WASB is that all the data you store in it is available to the HDInsight cluster and remains intact after the cluster is deleted. If you delete a cluster and later create a new one, you can simply point the new cluster at the old data and keep using it.
Storing data in WASB is also cheaper: you pay neither for outgoing traffic nor for the computing power that local HDFS storage inside virtual machines requires.
Finally, storing data in WASB will allow you to share it with other cloud services or applications running outside of your cluster. The reverse is also true: data stored in Windows Azure Storage from other services can be easily obtained in an HDInsight cluster.
Details on the benefits of using WASB can be found at the following link.
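As a rough illustration of how the WASB layer maps blob containers onto file system paths, here is a minimal Python sketch that builds and parses WASB-style URIs. It assumes the commonly documented form wasb://<container>@<account>.blob.core.windows.net/<path>. The container name "data" and the account name "myaccount" are hypothetical, and the helper functions are illustrative only; they are not part of any HDInsight SDK.

```python
from urllib.parse import urlparse

# Assumed host suffix for Blob Storage accounts.
BLOB_HOST_SUFFIX = "blob.core.windows.net"


def build_wasb_uri(container, account, path, secure=False):
    """Build a WASB URI for a blob path (hypothetical helper)."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.{BLOB_HOST_SUFFIX}/{path.lstrip('/')}"


def parse_wasb_uri(uri):
    """Split a WASB URI into (container, account, path)."""
    parts = urlparse(uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {uri!r}")
    # The container sits where a user name would be in an ordinary URL.
    container = parts.username
    # The storage account is the first label of the host name.
    account = parts.hostname.split(".")[0]
    return container, account, parts.path


uri = build_wasb_uri("data", "myaccount", "/example/data/gutenberg")
print(uri)
print(parse_wasb_uri(uri))
```

Because the data is addressed by storage account and container rather than by cluster, the same URI stays valid after the cluster that wrote the data has been deleted.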
Local HDFS: in addition to WASB, the HDInsight cluster also gives you access to local HDFS storage. Its use is discouraged, however, because it is more expensive (it exists only while the cluster is running, and you pay for that time) and because everything stored in local HDFS is deleted along with the cluster.
Application versions available in HDInsight. Today, a Hadoop cluster in the HDInsight service provides the following versions of applications and services:
Component | Version |
Apache Hadoop | 1.2.0 |
Apache Hive | 0.11.0 |
Apache Pig | 0.11 |
Apache Sqoop | 1.4.3 |
Apache Oozie | 3.2.2 |
Apache HCatalog | Merged with Hive |
Apache Templeton | 0.1.4 |
SQL Server JDBC Driver | 3.0 |
Ambari | API v1.0 |
You can always find up-to-date version information for the HDInsight cluster components at the following link.
An HDInsight cluster can be created from the Windows Azure Management Portal by selecting HDInsight from the Data Services menu. To create a cluster, you specify its name, its size as a number of data nodes, and a password for accessing the cluster.
A cluster must have at least one associated Windows Azure Storage account, which serves as its permanent storage location; the cluster and the storage account must be located in the same region. You can associate additional storage accounts with the cluster using the custom cluster creation option.
Deploying and configuring a cluster in the cloud takes only a few minutes. Once it is deployed, you can go to the welcome page, which links to useful resources and code examples that you can use to practice working with Hadoop.
On the Dashboard tab, you will see information about the current status of your cluster, including various metrics: node usage, job history, and associated storage accounts.
Before you run your first job on the cluster, you need to prepare your environment for the PowerShell cmdlets. To use them, install Windows Azure PowerShell and HDInsight PowerShell by following the links in "Step 1" of your cluster's welcome page in the Windows Azure management portal.
On the welcome page, you can also find sample commands for both Hive and MapReduce jobs. We will start with MapReduce.
Run the following commands to create a job definition. A job definition contains all the information the job needs, for example which mappers and reducers to use, which data to use as input, and where to put the output. The sample code uses a MapReduce program and a test file that are already present on the cluster. We also set a directory in which to save the output.
$jarFile = "/example/jars/hadoop-examples.jar"
$className = "wordcount"
$statusDirectory = "/samples/wordcount/status"
$outputDirectory = "/samples/wordcount/output"
$inputDirectory = "/example/data/gutenberg"
$wordCount = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName $className -Arguments $inputDirectory, $outputDirectory -StatusFolder $statusDirectory
Run the following commands to get your Windows Azure subscription information and to start the MapReduce program. MapReduce jobs usually run for a long time, so this example shows how to start a job asynchronously and then wait for it to finish.
$subscriptionId = (Get-AzureSubscription -Current).SubscriptionId
$wordCountJob = $wordCount | Start-AzureHDInsightJob -Cluster HadoopIsAwesome -Subscription $subscriptionId | Wait-AzureHDInsightJob -Subscription $subscriptionId
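The start-then-wait pattern used in the pipeline above can be pictured with a small Python sketch. This is not how Wait-AzureHDInsightJob is implemented; it is a generic polling loop under stated assumptions. The state names "Completed" and "Failed" and the get_status callable are hypothetical stand-ins for whatever the job service actually reports.

```python
import time

# Hypothetical terminal state names; a real service may use different ones.
TERMINAL_STATES = {"Completed", "Failed"}


def wait_for_job(get_status, poll_interval=5.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still {status!r} after {waited} seconds")
        sleep(poll_interval)
        waited += poll_interval


# Usage with a fake job that finishes on the fourth status check.
fake_statuses = iter(["Initializing", "Running", "Running", "Completed"])
print(wait_for_job(lambda: next(fake_statuses), poll_interval=0.01))
```

Submitting the job returns immediately, and only the waiting step blocks, which is why a long-running job can be started from one session and checked from another.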
Finally, run the following command to get and display all the results of the task.
Get-AzureHDInsightJobOutput -Subscription (Get-AzureSubscription -Current).SubscriptionId -Cluster HadoopIsAwesome -JobId $wordCountJob.JobId -StandardError
The command prints its result, together with the job's execution log, to the terminal.
The job's output is placed in your storage account in the "/samples/wordcount/output" directory. Open the storage browser in the Windows Azure portal and navigate to the output file to download it and examine its contents.
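To make clear what the wordcount job computes, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases. It is an illustration of the technique, not the code inside hadoop-examples.jar; it assumes words are split on whitespace and counted case-sensitively, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


print(word_count(["to be or not to be", "to do"]))
```

On a real cluster the mappers and reducers run in parallel on the worker nodes, and the input and output live in storage rather than in memory.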
The welcome page also has sample commands for connecting to a cluster and running Hive jobs. Click the Hive button in the Job Type switch to see the example.
Run the following command to connect to your cluster.
Use-AzureHDInsightCluster HadoopIsAwesome (Get-AzureSubscription -Current).SubscriptionId
Then run the following command to submit a HiveQL query to the cluster. The query uses a test table that is placed on the cluster when it is created.
Invoke-Hive "select country, state, count(*)
This is a simple query that uses select and group by. Once it has run, the results appear in the PowerShell window.
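For readers new to HiveQL, the effect of selecting country, state, and count(*) with a group by can be pictured in plain Python. This is a sketch under assumptions: it assumes the grouping is on country and state, and the sample rows are invented for illustration and are not taken from the cluster's test table.

```python
from collections import Counter


def count_by_country_state(rows):
    """Count rows per (country, state) pair, like count(*) with a group by on both columns."""
    return Counter((row["country"], row["state"]) for row in rows)


# Hypothetical rows standing in for the test table.
sample_rows = [
    {"country": "United States", "state": "California"},
    {"country": "United States", "state": "California"},
    {"country": "United States", "state": "Texas"},
    {"country": "France", "state": "Ile-de-France"},
]

counts = count_by_country_state(sample_rows)
for (country, state), records in counts.most_common():
    print(country, state, records)
```

Hive compiles such a query into MapReduce jobs, so the grouping plays the role of the shuffle phase and the count plays the role of the reducer.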
In this article, we saw how easy it is to create and start an HDInsight cluster and begin analyzing your data. HDInsight offers many more features to explore, such as uploading your own data sets, running complex jobs, and analyzing the results. To learn more about working with HDInsight, visit the documentation page or use the following direct links to articles (in English):
Articles in Russian are available on the portal AzureHub.ru:
For pricing information on the Windows Azure HDInsight service, visit this page.
This article draws in part on information from an article on the official blog.
Although the service became commercially available only recently, many companies and organizations had already tried it during the preview stage. Notable examples include:
The city of Barcelona chose Microsoft's business intelligence and big data processing tools, including Windows Azure HDInsight. Announcement and detailed description of the example;
Virginia Polytechnic Institute uses HDInsight to process genome data. Detailed description of the example;
The Danish research organization Chr. Hansen, a developer of natural ingredients for the food, pharmaceutical, and agricultural industries, uses Windows Azure HDInsight to process data 100 times faster than with its previous method. Detailed description of the example;
343 Industries, the developer of Halo 4, uses Windows Azure HDInsight to analyze data from more than 50 million copies of Halo games sold in order to improve its online services. Detailed description of the example;
Ascribe Ltd, a UK medical company and a leader in its field, uses Windows Azure HDInsight to improve the quality of clinical research by giving researchers a much faster way to process large volumes of data from many sources. Detailed description of the example.
If you already develop on Windows Azure or are looking for developers for your service, visit appprofessionals.ru.
We will be happy to answer your questions at azurerus@microsoft.com. We also look forward to seeing you in the Windows Azure Community on Facebook, where you will find experts (don't forget to ask them questions), photos, and plenty of news.
Source: https://habr.com/ru/post/200750/