Is it possible to deploy a Hadoop cluster in Windows Azure? Of course it is, and as you will see, it is not difficult at all.
In this article, I will demonstrate how to create a typical cluster with a Name Node server, a Job Tracker server, and a managed number of Slave servers. You can dynamically change the number of Slave servers using the Azure Management Portal. I will leave the description of the mechanics of this control to the next post.
Follow these steps to create an Azure package for your Hadoop cluster from scratch:
The cluster-config.zip file that you downloaded earlier contains all the files necessary for configuring your Hadoop cluster. In it you will find the familiar [core|hdfs|mapred]-site.xml files. Do not worry about them yet; I will describe their purpose in the next article. Edit the *-site.xml files according to the cluster configuration parameters you require, but make sure you only add new properties and do not modify existing ones.
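For example, if you wanted to lower the HDFS replication factor, you could append a property like the following inside the <configuration> element of hdfs-site.xml. The property name dfs.replication is standard Hadoop; the value here is purely illustrative:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>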
Re-create the cluster-config.zip archive if you made any changes to the cluster configuration.
Create a container named bin and upload into it all the zip archives created earlier. Use your favorite tool, for example, ClumsyLeaf's CloudXplorer. After this, you should have the following files in your container:
Unzip the Visual Studio 2010 project. Next, you can use either Visual Studio 2010 or any text editor. I included the batch file in the template for those of you who will use the command line.
If you are using Visual Studio, you only need to modify one file, NameNode\SetEnvironment.cmd, which sits among the other project files. If you are not using Visual Studio, you will need to modify this file in three other paths: NameNode\bin\Debug, JobTracker\bin\Debug, and Slaves\bin\Debug. Build a connection string using the access key of your Azure storage account, then replace the [your connection string] placeholder in the first line of the SetEnvironment.cmd files with it. The Azure connection string has the following format:
DefaultEndpointsProtocol=http;AccountName=[your_account_name];AccountKey=[key]
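Note that there are no spaces around the = and ; separators. A filled-in string would look something like this (the account name and key below are placeholders, not real credentials):

DefaultEndpointsProtocol=http;AccountName=myhadoopstorage;AccountKey=BASE64KEY==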
If you used the same component versions as I did, you will not need to make any further changes.
The Azure deployment is configured to use one Large role for the Name Node, one Large role for the Job Tracker, and four Extra Large roles for the Slave nodes. If you are satisfied with this configuration, go to the next step. If you want to change the role configuration instead, use Visual Studio's role configuration options or manually edit the HadoopAzure\ServiceDefinition.csdef and HadoopAzure\ServiceConfiguration.cscfg files to set the size and number of roles, as sketched below.
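For instance, to scale out to eight Slave instances you would change the instance count in ServiceConfiguration.cscfg. The role name below is my assumption based on the project folder names; use whatever name appears in your own file:

<Role name="Slaves">
  <Instances count="8" />
</Role>

The VM size itself lives in ServiceDefinition.csdef, in the vmsize attribute of the corresponding role element, for example vmsize="ExtraLarge".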
Create a new hosted service for Hadoop in Azure. The project is configured to allow remote access to the machines in the cluster. If you did not change the project configuration, you need to upload the AzureHadoop.pfx certificate from the project root to your service. The certificate password is hadoop. The deployment will not complete successfully without this certificate.
If you are using Visual Studio 2010, you can deploy the project simply by right-clicking on it and choosing the Deploy command. If you are not using VS2010, simply run buildPackage.cmd from the project root in the Windows Azure SDK Command Prompt. You will get a Hadoop.cspkg package that you can deploy through the Azure Management Portal.
Deploy your service to Azure. Wait for the deployment to finish and you will see something similar to this structure:
Now that everything is up and running, you can go to the Name Node Summary page. The URL for this page will be
https://<your service name>.cloudapp.net:50070
If you click the “Browse the filesystem” link, Hadoop will build a URL for you with the IP address of one of your Slave nodes. In this form the URL is not reachable from outside, so you need to replace the IP address in it with <your service name>.cloudapp.net. After that you can browse the file system structure:
Let's run one of the demo jobs that ship with Hadoop to make sure the cluster is working. With the current configuration, you need to log in to the Job Tracker to start a new job. In the following articles I will talk about alternatives to this step (hint: Azure Connect).
Return to the Azure Management Portal and connect to the Job Tracker over RDP: select it and click Connect on the toolbar. Use the login hadoop and the password H1Doop. After you connect, open a command prompt and execute the following commands:
E:\AppRoot\SetEnvironment.cmd
cd /d %HADOOP_HOME%
Now you can run a job. I wrapped the Hadoop scripts so that you do not have to deal with Cygwin when running jobs; the command syntax is the same as when executing the regular scripts. Let's run a simple command that estimates pi, where the two arguments are the number of map tasks and the number of samples per map:
bin\hadoop jar hadoop-mapred-examples-0.21.0.jar pi 20 200
If you now go to the Job Tracker page at https://<your service name>.cloudapp.net:50030, you will see that the job is running.
Congratulations, you have just launched your first Hadoop job on Windows Azure!
The cluster is fully functional, and you can run any job you wish. In addition, you can use the Azure Management Portal to dynamically change the number of Slave nodes. Hadoop will detect the new or removed nodes and reconfigure the cluster accordingly.
I added an additional slave node:
And my cluster has changed accordingly:
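If you prefer to verify the change from the command line instead of the web UI, one option (my suggestion, not part of the original walkthrough) is the standard dfsadmin report, run from the same Job Tracker session as before; it lists the live Data Nodes and their capacity:

bin\hadoop dfsadmin -report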
If you have already used Hadoop in practice, you know that a few additional steps are required to prepare the Name Node, mainly to ensure fault tolerance. This is a separate topic that I plan to discuss in the next article. If you do not want to wait, setting up a Backup Node and/or a Checkpoint Node may be part of the solution. Using Azure Drive may be another part of the solution.
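For reference, these are standard Hadoop 0.21 commands rather than anything specific to this Azure setup: a Checkpoint Node is started with bin\hdfs namenode -checkpoint, and a Backup Node with bin\hdfs namenode -backup.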
Let me know about your experience using Hadoop in Windows Azure.
Source: https://habr.com/ru/post/119497/