Hello!

I am continuing my "merry" series of articles on getting to know Hadoop and automating cluster deployment. In the first part, I briefly described what needed to be achieved, what cluster architecture was to be built, and what a Hadoop cluster looks like from an architectural point of view. I also covered what is probably the simplest part of the cluster - the Clients, which are responsible for submitting jobs, providing data for computation, and retrieving results.
Now it is time to talk about the Masters part of the cluster architecture - namely, HDFS and YARN.
Let me show once again the architecture that was to be deployed in a private cloud. It was intended solely for testing; no real data load was expected.

So, it was assumed that we would have 2 nodes in the NameNode role, with HA + Failover configured between them on top of ZooKeeper. The NameNode is responsible for coordinating data in our distributed file system, HDFS: the NameNodes own the directory tree and keep track of the files distributed across the cluster. The NameNode nodes themselves do not store data - that is what the Slaves are for.
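To make the HA setup more concrete, here is a minimal sketch of the HDFS HA properties expressed as cookbook attributes, assuming the community cookbook's convention of rendering node['hadoop']['hdfs_site'] and node['hadoop']['core_site'] into hdfs-site.xml and core-site.xml. The nameservice name, host names and ports are my illustrative assumptions, not values from the original setup:

# attributes/default.rb - illustrative values only
default['hadoop']['hdfs_site']['dfs.nameservices'] = 'testcluster'
default['hadoop']['hdfs_site']['dfs.ha.namenodes.testcluster'] = 'nn1,nn2'
default['hadoop']['hdfs_site']['dfs.namenode.rpc-address.testcluster.nn1'] = 'namenode-01.example.com:8020'
default['hadoop']['hdfs_site']['dfs.namenode.rpc-address.testcluster.nn2'] = 'namenode-02.example.com:8020'
default['hadoop']['hdfs_site']['dfs.ha.automatic-failover.enabled'] = 'true'
default['hadoop']['core_site']['ha.zookeeper.quorum'] = 'zk-01.example.com:2181,zk-02.example.com:2181,zk-03.example.com:2181'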
The JournalNode, in turn, is needed if we implement High Availability based on QJM (Quorum Journal Manager). The idea is that dedicated virtual machines (JournalNodes) keep the Active and Standby NameNodes in sync by storing the log of changes made to HDFS (the edit log). This log is available to both NameNodes, so after any failover our NameNode nodes end up synchronized.
Another high-availability option is NFS / NAS. The idea is roughly the same: a storage volume is "mounted" over the network, and the so-called shared edits - the HDFS change log mentioned above - are written to it.
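The difference between the two options essentially comes down to one property, dfs.namenode.shared.edits.dir. A minimal sketch in the same attribute style (host names and paths are, again, made-up examples):

# QJM variant: a quorum of JournalNodes
default['hadoop']['hdfs_site']['dfs.namenode.shared.edits.dir'] =
  'qjournal://journal-01.example.com:8485;journal-02.example.com:8485;journal-03.example.com:8485/testcluster'
# NFS/NAS variant: a directory on storage mounted on both NameNodes
# default['hadoop']['hdfs_site']['dfs.namenode.shared.edits.dir'] = 'file:///mnt/nfs/shared-edits'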
In our case, YARN is represented by a single node - the ResourceManager, which manages jobs and computations, distributes computing resources between jobs, and is also responsible for accepting jobs and scheduling them for execution.
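For reference, the ResourceManager side of the configuration can be sketched the same way, assuming the cookbook renders node['hadoop']['yarn_site'] into yarn-site.xml; the host name is a made-up example:

# yarn-site.xml attributes - illustrative values
default['hadoop']['yarn_site']['yarn.resourcemanager.hostname'] = 'resourcemanager-01.example.com'
default['hadoop']['yarn_site']['yarn.nodemanager.aux-services'] = 'mapreduce_shuffle'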
Deploying HDFS
<Lyrical digression> It should be noted that at first the idea of automating the deployment of a Hadoop cluster looked rather exotic, given how "delicate" the cluster is (though that may just be my personal experience...). What do I mean? During automation it turned out that the slightest deviation from the documentation led either to a pile of questions or to running into problems already documented in JIRA by the developers of the Hadoop distributions.
In short, the cookbook turned out to be not the closest thing to "best practices" as I understand them, since it contains a fair amount of Bash code and execute resources, and only occasionally (where possible) uses the Ruby DSL and Chef resources.
I was also haunted by the thought that Apache Ambari, the native Hadoop cluster management system, already exists. But again, that is not our way - I wanted to understand the process without the help of third-party software.
</Lyrical digression>
The time has come to dig deeper into the deployment process and the cookbook code responsible for automating it.
As I mentioned in the first article, the cookbook I wrote is based on community work, around which I created a wrapper. In fact, not everything went smoothly: the version of the community cookbook available when I started had several problems, sometimes very annoying ones - a typo here, a wrong command there, and other trifles (credit to the developer, who responded very quickly to my modest contributions and nagging questions).
Still, the community's work was an excellent foundation to start from, adding my own solutions where possible.
Well, I started with the NameNode deployment and the HA + Failover mechanism. For Linux nodes the process can be described as follows (a Chef sketch follows the list):
- Installing prerequisites in the form of Java;
- Adding the repositories with the Hadoop distribution packages;
- Creating the skeleton of directories needed for the NameNode installation;
- Generating configuration files from templates and cookbook attributes;
- Installing the distribution packages (hdfs-namenode, zookeeper-server, etc.);
- Exchanging SSH keys between the NameNodes (required for controlled failover);
- Depending on the node's role (Active / Standby), starting the processes needed for cluster operation (more on that below);
- Registering the status of the deployment process.
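To show roughly what this looks like, here is a heavily simplified Chef recipe sketch. Resource names, repository URL, paths and package names are illustrative assumptions; the real cookbook wraps the community recipes rather than declaring everything inline like this:

# recipes/namenode.rb - a heavily simplified, illustrative sketch
include_recipe 'java'   # prerequisite JVM (community java cookbook)

# Repository with the Hadoop distribution packages (URL is a placeholder)
yum_repository 'hadoop' do
  baseurl 'http://repo.example.com/hadoop'
  gpgcheck false
end

# Skeleton of directories for the NameNode
%w(/data/hadoop/hdfs/namenode /etc/hadoop/conf.chef).each do |dir|
  directory dir do
    owner 'hdfs'
    group 'hdfs'
    recursive true
  end
end

# Configuration files rendered from a template and cookbook attributes
template '/etc/hadoop/conf.chef/hdfs-site.xml' do
  source 'hdfs-site.xml.erb'
  variables(properties: node['hadoop']['hdfs_site'])
end

# Distribution packages (exact names depend on the distribution)
%w(hadoop-hdfs-namenode zookeeper-server).each do |pkg|
  package pkg
end

# SSH key exchange between the NameNodes and starting the services
# are handled as separate steps (see below).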
Starting the processes needed for the NameNode to function consisted of the following steps (a sketch of how these commands can be guarded in Chef follows the list):
- chown of the directories related to the Hadoop components (for example, the NameNode HDFS directories) - a very important step, because skipping it leads to problems in most cases;
- hdfs namenode -format - formats and initializes the directory specified in dfs.namenode.name.dir;
- service zookeeper-server start - starts the ZooKeeper server used during failover;
- hdfs zkfc -formatZK - creates a znode (a ZooKeeper data node, in other words a participant in the failover process);
- $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config /etc/hadoop/conf.chef/ --script hdfs start namenode - starts the Active NameNode process, or
- hdfs namenode -bootstrapStandby - brings up the Standby NameNode process on the node;
- $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config /etc/hadoop/conf.chef/ start zkfc - starts the ZooKeeper Failover Controller process;
- node.set['hadoop_services']['already_namenode'] = true - Ruby code that sets a node attribute to register the status of the process.
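A minimal sketch of how such one-off commands can be wrapped in Chef execute resources with guards, so they are not re-run on every Chef run. The guard conditions, paths and the attribute name mirror the steps above but are simplified assumptions:

# One-off commands guarded so they do not run again on every Chef run
execute 'format-namenode' do
  command 'hdfs namenode -format -nonInteractive'
  user 'hdfs'
  not_if 'test -d /data/hadoop/hdfs/namenode/current'   # already formatted
  not_if { node['hadoop_services'] && node['hadoop_services']['already_namenode'] }
end

execute 'format-zkfc-znode' do
  command 'hdfs zkfc -formatZK -nonInteractive'
  user 'hdfs'
  not_if { node['hadoop_services'] && node['hadoop_services']['already_namenode'] }
end

# Register the deployment status on the node, as in the last step above
ruby_block 'mark-namenode-deployed' do
  block { node.set['hadoop_services']['already_namenode'] = true }
end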
After Chef has successfully run through these steps on the 2 nodes, you can start verifying the installation. Several verification options are worth using:
- Open the NameNode DFS Health web page, available by default at FQDN:50070, which shows the node's role (Active or Standby) and the available Slaves, as well as various system information and logs;
- Use the jps utility (shipped with the JDK, analogous to ps for JVM processes); its output must include the NameNode, DFSZKFailoverController and QuorumPeerMain processes;
- Run hdfs haadmin -checkHealth and hdfs haadmin -getServiceState locally (both take a NameNode service id as an argument); by design, their results correspond to what the web page shows, in more detail;
- To verify the failover process, you can trigger a controlled failover with hdfs haadmin -failover (passing the ids of the current Active and Standby NameNodes), after which the NameNode nodes should swap the Active / Standby roles.
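As a small illustration, the same checks can be scripted; a rough Ruby sketch that shells out to jps and hdfs haadmin (the service ids nn1 and nn2 are assumptions matching the attribute sketch earlier):

#!/usr/bin/env ruby
# Rough post-deployment check for a NameNode host (illustrative only)
required = %w(NameNode DFSZKFailoverController QuorumPeerMain)
running  = `jps`.lines.map { |line| line.split.last }

missing = required - running
if missing.empty?
  puts 'All expected JVM processes are running'
else
  puts "Missing processes: #{missing.join(', ')}"
end

# Ask HDFS which role each NameNode currently holds (nn1/nn2 are assumed ids)
%w(nn1 nn2).each do |id|
  state = `hdfs haadmin -getServiceState #{id}`.strip
  puts "#{id}: #{state}"
end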
A successful result is a pair of 2 nodes, one taking the Active NameNode role and the other the Standby NameNode role; ZooKeeper-based control is set up between them (via znodes) and is able to carry out the failover process; and both have access to the Slave nodes and the file system (deploying the Slaves will be covered in the next article).
As already mentioned, high availability for our cluster can be achieved in 2 ways: using NFS / NAS or using dedicated JournalNode nodes.
In the NFS / NAS case, all we need is the ability to "mount" the storage over the network on our NameNodes. The storage should (highly desirably) be available 24/7, reachable with low latency, and respond quickly to read / write operations over the network.
When JournalNodes are used, you need to pick the nodes on which the JournalNode package from the Hadoop distribution is installed. In the basic configuration there are 2 parameters:
dfs.journalnode.http-address - the FQDN and port on which the JournalNode service listens;
dfs.journalnode.edits.dir - the directory to which the log of HDFS changes is written.
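In the attribute style used above, these two parameters might look like this (8480 is the usual default HTTP port for the JournalNode; the path is a made-up example):

# JournalNode configuration - illustrative values
default['hadoop']['hdfs_site']['dfs.journalnode.http-address'] = '0.0.0.0:8480'
default['hadoop']['hdfs_site']['dfs.journalnode.edits.dir'] = '/data/hadoop/hdfs/journalnode'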
Deploying YARN
The YARN part of the cluster is represented in our architecture by the ResourceManager - a node that handles jobs and the resources for running them.
Deploying this node is simpler than deploying the NameNode (a sketch follows the list):
- Installing prerequisites in the form of Java;
- Adding the repositories with the Hadoop distribution packages;
- Creating the skeleton of directories needed for the ResourceManager installation;
- Generating configuration files from templates and cookbook attributes;
- Installing the distribution package (hadoop-yarn-resourcemanager);
- Starting the ResourceManager process with service hadoop-yarn-resourcemanager start;
- Registering the status of the deployment process.
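And a correspondingly shorter recipe sketch, again with assumed paths and template names:

# recipes/resourcemanager.rb - a simplified, illustrative sketch
include_recipe 'java'

package 'hadoop-yarn-resourcemanager'

template '/etc/hadoop/conf.chef/yarn-site.xml' do
  source 'yarn-site.xml.erb'
  variables(properties: node['hadoop']['yarn_site'])
end

service 'hadoop-yarn-resourcemanager' do
  action [:enable, :start]
end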
To verify a successful deployment, you can do the following:
- Open the ResourceManager web page, available by default at FQDN:8088, which shows the available Slaves as well as various information about jobs and the resources allocated to them;
- Use the jps utility; its output must include the ResourceManager process.
That is about it for the Masters part. At the end of the next article, I plan to include a section describing the minimum cluster settings that are REQUIRED for it to run. In the next article I will also publish links to my modest project and to the useful documentation I relied on while creating it.
Thank you all for your attention! Comments, and especially corrections, are very welcome!