Well, Habrazhiteli, it's time to sum up the series of articles (part 1 and part 2) on my adventure with automating the deployment of a Hadoop cluster.

My project is almost ready; all that remains is to test the process, and then I can put another notch on the fuselage.
In this article I will talk about bringing up the “driving force” of our cluster, the Slave nodes, and I will also sum things up and share useful links to the resources I used throughout the project. Some readers may have found the previous articles short on source code and implementation details, so at the end of this article I will provide a link to GitHub.
Out of habit, I will start with the architectural scheme that I managed to deploy to the cloud.

In our case, given the test nature of the run, only 2 Slave nodes were used, but in real conditions there would be dozens of them. Below I will briefly describe how their deployment was organized.
Deploying Slave Nodes
As you might guess from the architecture, a Slave node consists of 2 parts, each of which handles the interaction with its counterpart in the Master part of the architecture.
The DataNode is the point of interaction between a Slave node and the NameNode, which coordinates distributed data storage.
The DataNode process connects to the service on the NameNode node, after which Clients can perform file operations directly against the DataNode nodes. It is also worth noting that DataNode nodes communicate with each other for data replication, which in turn lets us avoid RAID arrays, since the replication mechanism is already built in.
The DataNode deployment process is quite simple (a rough shell sketch of these steps follows the list):
- Installing the prerequisites, namely Java;
- Adding the repositories with the Hadoop distribution packages;
- Creating the skeleton of directories needed by the DataNode;
- Generating the configuration files from templates and cookbook attributes;
- Installing the distribution package (hadoop-hdfs-datanode);
- Starting the DataNode process with service hadoop-hdfs-datanode start;
- Registering the status of the deployment process.
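For illustration only, here is a rough shell equivalent of these steps (the real work is done by the Chef cookbook); it assumes a yum-based system, and the Java package and data directory are placeholders rather than values taken from my cookbook:
 yum install -y java-1.7.0-openjdk            # prerequisite: Java
 # add the distribution repository file to /etc/yum.repos.d/ (see docs.hortonworks.com for the repo URL)
 mkdir -p /hadoop/hdfs/data                   # directory skeleton for DataNode storage (dfs.datanode.data.dir)
 # configuration files (core-site.xml, hdfs-site.xml) are generated from templates and cookbook attributes
 yum install -y hadoop-hdfs-datanode          # install the distribution package
 service hadoop-hdfs-datanode start           # start the DataNode process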
As a result, if all the data is correct and the configuration has been applied, you can see the added Slave nodes in the NameNode's web interface. This means that the DataNode is now available for file operations related to distributed data storage. Copy a file to HDFS and see for yourself.
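For example, a quick check from one of the Client nodes might look like this (the file name and target path are arbitrary):
 hdfs dfs -mkdir -p /tmp/test
 hdfs dfs -put ./somefile.txt /tmp/test/
 hdfs dfs -ls /tmp/test
 hdfs fsck /tmp/test/somefile.txt -files -blocks -locations   # shows which DataNodes hold the block replicas
 hdfs dfsadmin -report                                        # lists the live DataNodes and their capacity (may require the HDFS superuser)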
The NodeManager, in turn, is responsible for interacting with the ResourceManager, which manages the tasks and the resources available for running them. The NodeManager deployment process is similar to the DataNode one, the difference being the name of the package and service to install (hadoop-yarn-nodemanager).
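In shell terms, compared to the sketch above only the package and service name change, roughly:
 yum install -y hadoop-yarn-nodemanager       # install the distribution package
 service hadoop-yarn-nodemanager start        # start the NodeManager process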
After successfully completing the deployment of the Slave nodes, our cluster can be considered ready. It is worth paying attention to the files that set environment variables (hadoop_env, yarn_env, etc.): the values of those variables must correspond to the real values in the cluster. You should also double-check the variables that specify the domain names and ports on which particular services run.
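As a purely illustrative example of the kind of values worth double-checking (the property names are standard Hadoop 2.x ones, while the hostnames, port, and JDK path below are placeholders):
 # core-site.xml: the NameNode address that DataNodes and Clients connect to
 #   fs.defaultFS = hdfs://namenode.example.local:8020
 # yarn-site.xml: where the NodeManagers look for the ResourceManager
 #   yarn.resourcemanager.hostname = resourcemanager.example.local
 # hadoop-env.sh / yarn-env.sh: JAVA_HOME must point to the JDK actually installed on the node
 export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk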
How can we check that the cluster actually works? The most accessible option is to run a task from one of the Client nodes. For example, like this:
 hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5 
where hadoop-mapreduce-examples-2.2.0.jar is the file with the task descriptions (available in the base installation), pi selects the task to run (a MapReduce job in this case), and 2 and 5 control how the work is distributed (the number of map tasks and the number of samples per map; more details here).
Once all the calculations are done, the result is either printed to the terminal together with job statistics, or written to an output file (the nature of the data and the output format depend on the task described in the .jar file).
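For jobs that write their result to HDFS rather than the terminal, the output can be inspected with something like this (the output path is whatever the job was given; part-r-00000 is the conventional name of the first reducer's output file):
 hdfs dfs -ls /user/hadoop/output
 hdfs dfs -cat /user/hadoop/output/part-r-00000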
So those are the clusters and cakes, dear Habrazhiteli. At this stage I do not claim this solution is perfect, since the stages of testing and making improvements and edits to the cookbook code are still ahead. I wanted to share my experience and describe another approach to deploying a Hadoop cluster, not the easiest or most orthodox approach, I would say. But it is in unconventional conditions like these that steel is tempered. My end goal is a modest analogue of the Amazon Elastic MapReduce service for our private cloud.
I warmly welcome advice from everyone who pays attention to this series of articles (special thanks to ffriend, who paid attention and asked questions, some of which led me to new ideas).
Links to materials
As promised, here is the list of materials that, along with my colleagues, helped bring the project to an acceptable form:
- Detailed documentation on the HDP distribution - docs.hortonworks.com
- The wiki from the fathers of Apache Hadoop - wiki.apache.org/hadoop
- Documentation from the same source - hadoop.apache.org/docs/current
- A slightly outdated (in places) article describing the architecture - here
- A good tutorial in 2 parts - here
- An adapted translation of a tutorial by martsen - habrahabr.ru/post/206196
- The community "Hadoop" cookbook, on the basis of which I built my project - Hadoop cookbook
- Finally, my humble project as it is (an update is on the way) - GitHub

Thank you all for your attention! Comments are welcome! If I can help with anything, please get in touch! Until we meet again.
UPD: Added the link to the article on Habr, a translation of the tutorial.