Is it possible to deploy a Hadoop cluster in Windows Azure? Of course it is, and as you will see, it is not difficult at all.
In this article, I will demonstrate how to create a typical cluster with a Name Node server, a Job Tracker server, and a managed number of Slave servers. You can dynamically change the number of Slave servers using the Azure Management Portal. I will leave the description of the mechanics of this control to the next post.
Follow these steps to create an Azure package for your Hadoop cluster from scratch:
The cluster-config.zip file that you downloaded earlier contains all the files necessary for configuring your Hadoop cluster. In it you will find the familiar [core|hdfs|mapred]-site.xml files. Do not worry about them yet; I will describe their purpose in the next article. Edit the *-site.xml files according to the cluster configuration parameters you require, but make sure you only add new properties and do not modify existing ones.
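For example, if you wanted to lower the HDFS replication factor, you could append a property like the following inside the <configuration> element of hdfs-site.xml. The property name dfs.replication is standard Hadoop; the value here is purely illustrative:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>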
Re-create the cluster-config.zip archive if you made any changes to the cluster configuration.
Create a container named bin and upload into it all the zip archives created earlier. Use your favorite tool, for example, ClumsyLeaf's CloudXplorer. After this, you should have the following files in your container:
Unzip the Visual Studio 2010 project. Next, you can use either Visual Studio 2010 or any text editor. I included the batch file in the template for those of you who will use the command line.
If you are using Visual Studio, you only need to modify one file, NameNode\SetEnvironment.cmd, which sits among the other project files. If you are not using Visual Studio, you will need to modify this file in three other paths: NameNode\bin\Debug, JobTracker\bin\Debug, and Slaves\bin\Debug. Build a connection string using the access key of your Azure storage account, then replace the [your connection string] placeholder in the first line of the SetEnvironment.cmd files with it. The Azure connection string has the following format:
DefaultEndpointsProtocol=http;AccountName=[your_account_name];AccountKey=[key]
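Note that there are no spaces around the = and ; separators. A filled-in string would look something like this (the account name and key below are placeholders, not real credentials):

DefaultEndpointsProtocol=http;AccountName=myhadoopstorage;AccountKey=BASE64KEY==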
If you used the same component versions as I did, you will not need to make any further changes.
The Azure deployment is configured to use one Large role for the Name Node, one Large role for the Job Tracker, and four Extra Large roles for the Slave nodes. If you are satisfied with this configuration, go to the next step. If you want to change the role configuration instead, use Visual Studio's role configuration options or manually edit the HadoopAzure\ServiceDefinition.csdef and HadoopAzure\ServiceConfiguration.cscfg files to set the size and number of roles, as sketched below.
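For instance, to scale out to eight Slave instances you would change the instance count in ServiceConfiguration.cscfg. The role name below is my assumption based on the project folder names; use whatever name appears in your own file:

<Role name="Slaves">
  <Instances count="8" />
</Role>

The VM size itself lives in ServiceDefinition.csdef, in the vmsize attribute of the corresponding role element, for example vmsize="ExtraLarge".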
Create a new hosted service for Hadoop in Azure. The project is configured to allow remote access to the machines in the cluster. If you did not change the project configuration, you need to upload the AzureHadoop.pfx certificate from the project root to your service. The certificate password is hadoop. The deployment will not complete successfully without this certificate.
If you are using Visual Studio 2010, you can deploy the project simply by right-clicking on it and choosing the Deploy command. If you are not using VS2010, simply run buildPackage.cmd from the project root in the Windows Azure SDK Command Prompt. You will get a Hadoop.cspkg package that you can deploy through the Azure Management Portal.
Deploy your service to Azure. Wait for the deployment to finish and you will see something similar to this structure:
Now that everything is up and running, you can go to the Name Node Summary page. The URL for this page will be
https://<your service name>.cloudapp.net:50070
If you click the “Browse the filesystem” link, Hadoop will build a URL for you with the IP address of one of your Slave nodes. In this form the URL is not reachable from outside, so you need to replace the IP address in it with <your service name>.cloudapp.net. After that you can browse the file system structure:
Let's run one of the demo jobs that ship with Hadoop to make sure the cluster is working. With the current configuration, you need to log in to the Job Tracker to start a new job. In the following articles I will talk about alternatives to this step (hint: Azure Connect).
Return to the Azure Management Portal and connect to the Job Tracker over RDP: select it and click Connect on the toolbar. Use the login hadoop and the password H1Doop. After you connect, open a command prompt and execute the following commands:
E:\AppRoot\SetEnvironment.cmd
cd /d %HADOOP_HOME%
Now you can run a job. I wrapped the Hadoop scripts so that you do not have to deal with Cygwin when running jobs; the command syntax is the same as when executing the regular scripts. Let's run a simple command that estimates pi, where the two arguments are the number of map tasks and the number of samples per map:
bin\hadoop jar hadoop-mapred-examples-0.21.0.jar pi 20 200
If you now go to the Job Tracker page at https://<your service name>.cloudapp.net:50030, you will see that the job is running.
Congratulations, you have just launched your first Hadoop job on Windows Azure!
The cluster is fully functional, and you can run any job you wish. In addition, you can use the Azure Management Portal to dynamically change the number of Slave nodes. Hadoop will detect the new or removed nodes and reconfigure the cluster accordingly.
I added an additional slave node:
And my cluster has changed accordingly:
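If you prefer to verify the change from the command line instead of the web UI, one option (my suggestion, not part of the original walkthrough) is the standard dfsadmin report, run from the same Job Tracker session as before; it lists the live Data Nodes and their capacity:

bin\hadoop dfsadmin -report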
If you have already used Hadoop in practice, you know that a few additional steps are required to prepare the Name Node, mainly to ensure fault tolerance. This is a separate topic that I plan to discuss in the next article. If you do not want to wait, setting up a Backup Node and/or a Checkpoint Node may be part of the solution. Using Azure Drive may be another part of the solution.
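For reference, these are standard Hadoop 0.21 commands rather than anything specific to this Azure setup: a Checkpoint Node is started with bin\hdfs namenode -checkpoint, and a Backup Node with bin\hdfs namenode -backup.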
Let me know about your experience using Hadoop in Windows Azure.
Source: https://habr.com/ru/post/119497/