
Hadoop From Scratch

This article is a practical guide for beginner administrators: how to build Hadoop from source, do the initial configuration, start it, and verify that everything works as it should. You will not find any theory here. If you have not worked with Hadoop before and don't know which parts it consists of and how they interact, here are a couple of useful links to the official documentation:

hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
hadoop.apache.org/docs/r2.7.3/hadoop-yarn/hadoop-yarn-site/YARN.html

Why not just use a ready-made distribution?
- Training. Similar articles often begin by recommending that you download a virtual machine image with a Cloudera or Hortonworks distribution. A distribution is, as a rule, a complex ecosystem with many components, and it is hard for a beginner to figure out what is what and how it all interacts. Starting from scratch lowers the entry threshold a bit, because we can look at the components one at a time.

- Functional tests and benchmarks. There is a small lag between the release of a new version of a product and the moment it appears in a distribution. If you need to test new features of a version that has just come out, a ready-made distribution will not help. It is also hard to compare the performance of two versions of the same software, since ready-made distributions usually give you no way to upgrade a single component while leaving everything else as it is.

- Just for fun.

Why build from source? After all, Hadoop binary builds are also available.

Part of the Hadoop code is written in C/C++. I don't know which system the development team builds on, but the C libraries shipped with the Hadoop binary builds depend on a libc version that is present neither in RHEL nor in Debian/Ubuntu. Non-working Hadoop C libraries are generally not critical, but some features will not work without them.

Why re-describe everything that is already in the official documentation?

The goal of this article is to save you time. The official documentation contains no quickstart, no "do this and it will work" instructions. If for one reason or another you need to build "vanilla" Hadoop but don't have time for trial and error, you've come to the right place.

Build


We will use CentOS 7 for the build. According to Cloudera, most clusters run on RHEL and its derivatives (CentOS, Oracle Linux). Version 7 is the most convenient, since its repositories already contain the protobuf library of the required version. If you want to use CentOS 6, you will have to build protobuf yourself.

We will do the build and all further experiments as root (so as not to complicate the article).

Roughly 95% of the Hadoop code is written in Java. For the build we need the Oracle JDK and Maven.

Download the latest JDK from the Oracle site and unpack it into /opt. Also set the JAVA_HOME variable (used by Hadoop) and add /opt/java/bin to the PATH of the root user (for convenience):

cd ~
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u112-b15/jdk-8u112-linux-x64.tar.gz
tar xvf ~/jdk-8u112-linux-x64.tar.gz
mv ~/jdk1.8.0_112 /opt/java
echo "PATH=\"/opt/java/bin:\$PATH\"" >> ~/.bashrc
echo "export JAVA_HOME=\"/opt/java\"" >> ~/.bashrc

Install Maven. It is needed only at the build stage, so we will put it in the home directory (when the build is done, everything left in the home directory can be deleted).

cd ~
wget http://apache.rediris.es/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz
tar xvf ~/apache-maven-3.3.9-bin.tar.gz
mv ~/apache-maven-3.3.9 ~/maven
echo "PATH=\"/root/maven/bin:\$PATH\"" >> ~/.bashrc
source ~/.bashrc
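A quick sanity check that the PATH and JAVA_HOME changes were picked up never hurts (the version strings will of course match whatever you actually downloaded):

java -version   # expect java version "1.8.0_112"
mvn -version    # expect Apache Maven 3.3.9 and Java home /opt/java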

Roughly 4-5% of the Hadoop code is written in C/C++. Install the compiler and the other packages required for the build:

  yum -y install gcc gcc-c++ autoconf automake libtool cmake 

We will also need some third-party libraries:

 yum -y install zlib-devel openssl openssl-devel snappy snappy-devel bzip2 bzip2-devel protobuf protobuf-devel 

The system is ready. Download, build, and install Hadoop into /opt:

cd ~
wget http://apache.rediris.es/hadoop/common/hadoop-2.7.3/hadoop-2.7.3-src.tar.gz
tar -xvf ~/hadoop-2.7.3-src.tar.gz
mv ~/hadoop-2.7.3-src ~/hadoop-src
cd ~/hadoop-src
mvn package -Pdist,native -DskipTests -Dtar
tar -C/opt -xvf ~/hadoop-src/hadoop-dist/target/hadoop-2.7.3.tar.gz
mv /opt/hadoop-* /opt/hadoop
echo "PATH=\"/opt/hadoop/bin:\$PATH\"" >> ~/.bashrc
source ~/.bashrc
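Since the whole point of building from source was to get working native libraries, it is worth checking right away that they were built and can be loaded:

hadoop version          # should report 2.7.3
hadoop checknative -a   # hadoop, zlib, snappy, bzip2, openssl should be reported as true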

Initial configuration


Hadoop has about a thousand parameters. Fortunately, to get Hadoop running and take the first steps, around 40 of them are enough, and the rest can be left at their defaults.

Let's get started. If you remember, we installed Hadoop into /opt/hadoop. All configuration files live in /opt/hadoop/etc/hadoop. In total we will need to edit 6 of them. All the configs below are given in the form of commands, so that anyone building their Hadoop by following this article can simply paste them into the console.

First, we set the JAVA_HOME environment variable in hadoop-env.sh and yarn-env.sh, so that all components know which Java installation to use.

sed -i '1iJAVA_HOME=/opt/java' /opt/hadoop/etc/hadoop/hadoop-env.sh
sed -i '1iJAVA_HOME=/opt/java' /opt/hadoop/etc/hadoop/yarn-env.sh

Configure the HDFS URL in core-site.xml. It consists of the hdfs:// prefix, the name of the host on which the NameNode runs, and the port. If you skip this, Hadoop will not use the distributed file system at all, but will work with the local file system of your machine (the default URL is file:///).

cat << EOF > /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://localhost:8020</value></property>
</configuration>
EOF
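As a side note, values from the config files can be double-checked at any time with hdfs getconf (it reads the configs directly, no daemons need to be running); for example:

hdfs getconf -confKey fs.defaultFS     # should print hdfs://localhost:8020
hdfs getconf -confKey dfs.replication  # still 3 (the default) until we set it in hdfs-site.xml below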

In hdfs-site.xml we configure 4 parameters. The number of replicas is set to 1, since our "cluster" consists of a single node. We also configure the directories where the NameNode, DataNode, and SecondaryNameNode will store their data.

cat << EOF > /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property><name>dfs.replication</name><value>1</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/data/dfs/nn</value></property>
  <property><name>dfs.datanode.data.dir</name><value>/data/dfs/dn</value></property>
  <property><name>dfs.namenode.checkpoint.dir</name><value>/data/dfs/snn</value></property>
</configuration>
EOF

We are done configuring HDFS. We could already start the NameNode and DataNode and work with the file system, but let's leave that for the next section. Now on to the YARN configuration.

cat << EOF > /opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
  <property><name>yarn.resourcemanager.hostname</name><value>localhost</value></property>
  <property><name>yarn.nodemanager.resource.memory-mb</name><value>4096</value></property>
  <property><name>yarn.nodemanager.resource.cpu-vcores</name><value>4</value></property>
  <property><name>yarn.scheduler.maximum-allocation-mb</name><value>1024</value></property>
  <property><name>yarn.scheduler.maximum-allocation-vcores</name><value>1</value></property>
  <property><name>yarn.nodemanager.vmem-check-enabled</name><value>false</value></property>
  <property><name>yarn.nodemanager.local-dirs</name><value>/data/yarn</value></property>
  <property><name>yarn.nodemanager.log-dirs</name><value>/data/yarn/log</value></property>
  <property><name>yarn.log-aggregation-enable</name><value>true</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
  <property><name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name><value>org.apache.hadoop.mapred.ShuffleHandler</value></property>
</configuration>
EOF

There are quite a few parameters. Let's go through them in order.

The yarn.resourcemanager.hostname parameter indicates which host the ResourceManager service is running on.

The parameters yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores are perhaps the most important. In them, we tell the cluster how much memory and CPU cores each node can use in total to run containers.

The parameters yarn.scheduler.maximum-allocation-mb and yarn.scheduler.maximum-allocation-vcores indicate how much memory and cores can be allocated to a single container. It is easy to see that with this configuration in our β€œcluster” consisting of one node, 4 containers can be launched simultaneously (with 1GB of memory each).
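Spelling out that arithmetic (just the bounds imposed by the two resources, not an official scheduler formula):

# by memory:  4096 MB per node / 1024 MB per container = 4 containers
# by vcores:  4 vcores per node / 1 vcore per container = 4 containers
# => min(4, 4) = 4 containers can run on this node simultaneously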

The yarn.nodemanager.vmem-check-enabled parameter, set to false, disables the check of the amount of virtual memory used. As you can see from the previous paragraph, each container does not get much memory, and with such a configuration almost any application would immediately exceed its virtual memory limit.

The yarn.nodemanager.local-dirs parameter specifies where temporary container data will be stored (the jar with the application bytecode, configuration files, temporary data generated during execution, and so on).

The yarn.nodemanager.log-dirs parameter specifies where the logs of each task will be stored locally.

The yarn.log-aggregation-enable parameter tells YARN to keep logs in HDFS. After an application completes, its logs are moved from yarn.nodemanager.log-dirs on each node to HDFS (by default, into the /tmp/logs directory).

The yarn.nodemanager.aux-services and yarn.nodemanager.aux-services.mapreduce_shuffle.class parameters specify the third-party shuffle service for the MapReduce framework.

That is probably all for YARN. I will also give the configuration for MapReduce (one of the possible distributed computing frameworks). Although it has lost some popularity with the arrival of Spark, it is still widely used.

cat << EOF > /opt/hadoop/etc/hadoop/mapred-site.xml
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>localhost:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>localhost:19888</value></property>
  <property><name>mapreduce.job.reduce.slowstart.completedmaps</name><value>0.8</value></property>
  <property><name>yarn.app.mapreduce.am.resource.cpu-vcores</name><value>1</value></property>
  <property><name>yarn.app.mapreduce.am.resource.mb</name><value>1024</value></property>
  <property><name>yarn.app.mapreduce.am.command-opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx768m</value></property>
  <property><name>mapreduce.map.cpu.vcores</name><value>1</value></property>
  <property><name>mapreduce.map.memory.mb</name><value>1024</value></property>
  <property><name>mapreduce.map.java.opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx768m</value></property>
  <property><name>mapreduce.reduce.cpu.vcores</name><value>1</value></property>
  <property><name>mapreduce.reduce.memory.mb</name><value>1024</value></property>
  <property><name>mapreduce.reduce.java.opts</name><value>-Djava.net.preferIPv4Stack=true -Xmx768m</value></property>
</configuration>
EOF

The mapreduce.framework.name parameter indicates that we will run MapReduce jobs in YARN (the default value, local, is used only for debugging: all tasks run in a single JVM on the same machine).

The mapreduce.jobhistory.address and mapreduce.jobhistory.webapp.address parameters specify the name of the node on which the JobHistory service will run.

The mapreduce.job.reduce.slowstart.completedmaps parameter tells the framework not to start the reduce phase until at least 80% of the map tasks have completed.

The remaining parameters set the maximum values of memory, CPU cores, and JVM heap for mappers, reducers, and application masters. As you can see, they must not exceed the corresponding values for YARN containers that we defined in yarn-site.xml. The JVM heap values are usually set to 75% of the corresponding *.memory.mb parameters.
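Checking the 75% rule against the values above:

# 1024 MB (*.memory.mb) * 0.75 = 768 MB  =>  -Xmx768m in *.java.opts and yarn.app.mapreduce.am.command-opts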

Start


Create the /data directory, which will store the HDFS data as well as the temporary files of YARN containers.

 mkdir /data 

Format HDFS:

 hadoop namenode -format 

And finally, we will launch all services of our β€œcluster”:

/opt/hadoop/sbin/hadoop-daemon.sh start namenode
/opt/hadoop/sbin/hadoop-daemon.sh start datanode
/opt/hadoop/sbin/yarn-daemon.sh start resourcemanager
/opt/hadoop/sbin/yarn-daemon.sh start nodemanager
/opt/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver

If everything went well (check the logs in /opt/hadoop/logs for error messages), Hadoop is deployed and ready to go...
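Another quick way to see that all five daemons are actually up is the jps utility from the JDK; on this single-node setup the output should look roughly like this (PIDs will of course differ):

jps
# 2345 NameNode
# 2456 DataNode
# 2567 ResourceManager
# 2678 NodeManager
# 2789 JobHistoryServer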

Health check


Let's look at the Hadoop directory structure:

/opt/hadoop/
β”œβ”€β”€ bin
β”œβ”€β”€ etc
β”‚   └── hadoop
β”œβ”€β”€ include
β”œβ”€β”€ lib
β”‚   └── native
β”œβ”€β”€ libexec
β”œβ”€β”€ logs
β”œβ”€β”€ sbin
└── share
    β”œβ”€β”€ doc
    β”‚   └── hadoop
    └── hadoop
        β”œβ”€β”€ common
        β”œβ”€β”€ hdfs
        β”œβ”€β”€ httpfs
        β”œβ”€β”€ kms
        β”œβ”€β”€ mapreduce
        β”œβ”€β”€ tools
        └── yarn

Hadoop itself (the executable Java bytecode) lives in the share directory and is split by component (hdfs, yarn, mapreduce, etc.). The lib directory contains the libraries written in C.

The purpose of the other directories is intuitive: bin holds command-line utilities for working with Hadoop, sbin the startup scripts, etc the configs, logs the logs. We are primarily interested in two utilities from the bin directory: hdfs and yarn.

If you remember, we have already formatted HDFS and started all the necessary processes. Let's see what we have in HDFS:

hdfs dfs -ls -R /
drwxrwx--- - root supergroup 0 2017-01-05 10:07 /tmp
drwxrwx--- - root supergroup 0 2017-01-05 10:07 /tmp/hadoop-yarn
drwxrwx--- - root supergroup 0 2017-01-05 10:07 /tmp/hadoop-yarn/staging
drwxrwx--- - root supergroup 0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history
drwxrwx--- - root supergroup 0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history/done
drwxrwxrwt - root supergroup 0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history/done_intermediate

We obviously did not create this directory structure ourselves; it was created by the JobHistory service (the last daemon we started: mr-jobhistory-daemon.sh start historyserver).

Let's see what is in the /data directory:

/data/
β”œβ”€β”€ dfs
β”‚   β”œβ”€β”€ dn
β”‚   β”‚   β”œβ”€β”€ current
β”‚   β”‚   β”‚   β”œβ”€β”€ BP-1600342399-192.168.122.70-1483626613224
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ current
β”‚   β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ finalized
β”‚   β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ rbw
β”‚   β”‚   β”‚   β”‚   β”‚   └── VERSION
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ scanner.cursor
β”‚   β”‚   β”‚   β”‚   └── tmp
β”‚   β”‚   β”‚   └── VERSION
β”‚   β”‚   └── in_use.lock
β”‚   └── nn
β”‚       β”œβ”€β”€ current
β”‚       β”‚   β”œβ”€β”€ edits_inprogress_0000000000000000001
β”‚       β”‚   β”œβ”€β”€ fsimage_0000000000000000000
β”‚       β”‚   β”œβ”€β”€ fsimage_0000000000000000000.md5
β”‚       β”‚   β”œβ”€β”€ seen_txid
β”‚       β”‚   └── VERSION
β”‚       └── in_use.lock
└── yarn
    β”œβ”€β”€ filecache
    β”œβ”€β”€ log
    β”œβ”€β”€ nmPrivate
    └── usercache

As you can see, in /data/dfs/nn the NameNode has created the fsimage file and the first edits file. In /data/dfs/dn the DataNode has created a directory for storing data blocks, but there is no data yet.
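If you are curious what the NameNode actually keeps inside fsimage, the offline image viewer can dump it to XML; a minimal sketch, assuming the file name shown above and an arbitrary output path:

hdfs oiv -p XML -i /data/dfs/nn/current/fsimage_0000000000000000000 -o /tmp/fsimage.xml
less /tmp/fsimage.xml   # lists every inode the NameNode knows about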

Copy some file from a local file system to HDFS:

hdfs dfs -put /var/log/messages /tmp/
hdfs dfs -ls /tmp/messages
-rw-r--r-- 1 root supergroup 375974 2017-01-05 09:33 /tmp/messages

Look again at the contents of /data:

/data/dfs/dn
β”œβ”€β”€ current
β”‚   β”œβ”€β”€ BP-1600342399-192.168.122.70-1483626613224
β”‚   β”‚   β”œβ”€β”€ current
β”‚   β”‚   β”‚   β”œβ”€β”€ finalized
β”‚   β”‚   β”‚   β”‚   └── subdir0
β”‚   β”‚   β”‚   β”‚       └── subdir0
β”‚   β”‚   β”‚   β”‚           β”œβ”€β”€ blk_1073741825
β”‚   β”‚   β”‚   β”‚           └── blk_1073741825_1001.meta
β”‚   β”‚   β”‚   β”œβ”€β”€ rbw
β”‚   β”‚   β”‚   └── VERSION
β”‚   β”‚   β”œβ”€β”€ scanner.cursor
β”‚   β”‚   └── tmp
β”‚   └── VERSION
└── in_use.lock

Hooray! The first block and its checksum have appeared.
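The same block can be cross-checked from the HDFS side with fsck; its output should mention one block for our file, with an id matching the blk_* file on disk (shown here only approximately):

hdfs fsck /tmp/messages -files -blocks
# /tmp/messages 375974 bytes, 1 block(s):  OK
# 0. BP-1600342399-192.168.122.70-1483626613224:blk_1073741825_1001 len=375974 repl=1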

Let's run some application to make sure that YARN works as expected. For example, pi from the hadoop-mapreduce-examples.jar package:

yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 3 100000
…
Job Finished in 37.837 seconds
Estimated value of Pi is 3.14168000000000000000

If you look at the contents of /data/yarn while the application is running, you can learn a lot of interesting things about how YARN applications are executed:

/data/yarn/
β”œβ”€β”€ filecache
β”œβ”€β”€ log
β”‚   └── application_1483628783579_0001
β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000001
β”‚       β”‚   β”œβ”€β”€ stderr
β”‚       β”‚   β”œβ”€β”€ stdout
β”‚       β”‚   └── syslog
β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000002
β”‚       β”‚   β”œβ”€β”€ stderr
β”‚       β”‚   β”œβ”€β”€ stdout
β”‚       β”‚   └── syslog
β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000003
β”‚       β”‚   β”œβ”€β”€ stderr
β”‚       β”‚   β”œβ”€β”€ stdout
β”‚       β”‚   └── syslog
β”‚       └── container_1483628783579_0001_01_000004
β”‚           β”œβ”€β”€ stderr
β”‚           β”œβ”€β”€ stdout
β”‚           └── syslog
β”œβ”€β”€ nmPrivate
β”‚   └── application_1483628783579_0001
β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000001
β”‚       β”‚   β”œβ”€β”€ container_1483628783579_0001_01_000001.pid
β”‚       β”‚   β”œβ”€β”€ container_1483628783579_0001_01_000001.tokens
β”‚       β”‚   └── launch_container.sh
β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000002
β”‚       β”‚   β”œβ”€β”€ container_1483628783579_0001_01_000002.pid
β”‚       β”‚   β”œβ”€β”€ container_1483628783579_0001_01_000002.tokens
β”‚       β”‚   └── launch_container.sh
β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000003
β”‚       β”‚   β”œβ”€β”€ container_1483628783579_0001_01_000003.pid
β”‚       β”‚   β”œβ”€β”€ container_1483628783579_0001_01_000003.tokens
β”‚       β”‚   └── launch_container.sh
β”‚       └── container_1483628783579_0001_01_000004
β”‚           β”œβ”€β”€ container_1483628783579_0001_01_000004.pid
β”‚           β”œβ”€β”€ container_1483628783579_0001_01_000004.tokens
β”‚           └── launch_container.sh
└── usercache
    └── root
        β”œβ”€β”€ appcache
        β”‚   └── application_1483628783579_0001
        β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000001
        β”‚       β”‚   β”œβ”€β”€ container_tokens
        β”‚       β”‚   β”œβ”€β”€ default_container_executor_session.sh
        β”‚       β”‚   β”œβ”€β”€ default_container_executor.sh
        β”‚       β”‚   β”œβ”€β”€ job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar
        β”‚       β”‚   β”œβ”€β”€ jobSubmitDir
        β”‚       β”‚   β”‚   β”œβ”€β”€ job.split -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/12/job.split
        β”‚       β”‚   β”‚   └── job.splitmetainfo -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/10/job.splitmetainfo
        β”‚       β”‚   β”œβ”€β”€ job.xml -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/13/job.xml
        β”‚       β”‚   β”œβ”€β”€ launch_container.sh
        β”‚       β”‚   └── tmp
        β”‚       β”‚       └── Jetty_0_0_0_0_37883_mapreduce____.rposvq
        β”‚       β”‚           └── webapp
        β”‚       β”‚               └── webapps
        β”‚       β”‚                   └── mapreduce
        β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000002
        β”‚       β”‚   β”œβ”€β”€ container_tokens
        β”‚       β”‚   β”œβ”€β”€ default_container_executor_session.sh
        β”‚       β”‚   β”œβ”€β”€ default_container_executor.sh
        β”‚       β”‚   β”œβ”€β”€ job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar
        β”‚       β”‚   β”œβ”€β”€ job.xml
        β”‚       β”‚   β”œβ”€β”€ launch_container.sh
        β”‚       β”‚   └── tmp
        β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000003
        β”‚       β”‚   β”œβ”€β”€ container_tokens
        β”‚       β”‚   β”œβ”€β”€ default_container_executor_session.sh
        β”‚       β”‚   β”œβ”€β”€ default_container_executor.sh
        β”‚       β”‚   β”œβ”€β”€ job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar
        β”‚       β”‚   β”œβ”€β”€ job.xml
        β”‚       β”‚   β”œβ”€β”€ launch_container.sh
        β”‚       β”‚   └── tmp
        β”‚       β”œβ”€β”€ container_1483628783579_0001_01_000004
        β”‚       β”‚   β”œβ”€β”€ container_tokens
        β”‚       β”‚   β”œβ”€β”€ default_container_executor_session.sh
        β”‚       β”‚   β”œβ”€β”€ default_container_executor.sh
        β”‚       β”‚   β”œβ”€β”€ job.jar -> /data/yarn/usercache/root/appcache/application_1483628783579_0001/filecache/11/job.jar
        β”‚       β”‚   β”œβ”€β”€ job.xml
        β”‚       β”‚   β”œβ”€β”€ launch_container.sh
        β”‚       β”‚   └── tmp
        β”‚       β”œβ”€β”€ filecache
        β”‚       β”‚   β”œβ”€β”€ 10
        β”‚       β”‚   β”‚   └── job.splitmetainfo
        β”‚       β”‚   β”œβ”€β”€ 11
        β”‚       β”‚   β”‚   └── job.jar
        β”‚       β”‚   β”‚       └── job.jar
        β”‚       β”‚   β”œβ”€β”€ 12
        β”‚       β”‚   β”‚   └── job.split
        β”‚       β”‚   └── 13
        β”‚       β”‚       └── job.xml
        β”‚       └── work
        └── filecache

42 directories, 50 files

In particular, we see that the logs are written to /data/yarn/log (the yarn.nodemanager.log-dirs parameter from yarn-site.xml).

After the application finishes, /data/yarn returns to its original state:

/data/yarn/
β”œβ”€β”€ filecache
β”œβ”€β”€ log
β”œβ”€β”€ nmPrivate
└── usercache
    └── root
        β”œβ”€β”€ appcache
        └── filecache

If we look again at the contents of HDFS, we see that log aggregation is working (the logs of the just-executed application were moved from the local FS /data/yarn/log to HDFS /tmp/logs).

We also see that the JobHistory service has saved information about our application in /tmp/hadoop-yarn/staging/history/done.

hdfs dfs -ls -R /
drwxrwx--- - root supergroup 0 2017-01-05 10:12 /tmp
drwxrwx--- - root supergroup 0 2017-01-05 10:07 /tmp/hadoop-yarn
drwxrwx--- - root supergroup 0 2017-01-05 10:12 /tmp/hadoop-yarn/staging
drwxrwx--- - root supergroup 0 2017-01-05 10:07 /tmp/hadoop-yarn/staging/history
drwxrwx--- - root supergroup 0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done
drwxrwx--- - root supergroup 0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017
drwxrwx--- - root supergroup 0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01
drwxrwx--- - root supergroup 0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05
drwxrwx--- - root supergroup 0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05/000000
-rwxrwx--- 1 root supergroup 46338 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05/000000/job_1483628783579_0001-1483629144632-root-QuasiMonteCarlo-1483629179995-3-1-SUCCEEDED-default-1483629156270.jhist
-rwxrwx--- 1 root supergroup 117543 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done/2017/01/05/000000/job_1483628783579_0001_conf.xml
drwxrwxrwt - root supergroup 0 2017-01-05 10:12 /tmp/hadoop-yarn/staging/history/done_intermediate
drwxrwx--- - root supergroup 0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/history/done_intermediate/root
drwx------ - root supergroup 0 2017-01-05 10:12 /tmp/hadoop-yarn/staging/root
drwx------ - root supergroup 0 2017-01-05 10:13 /tmp/hadoop-yarn/staging/root/.staging
drwxrwxrwt - root supergroup 0 2017-01-05 10:12 /tmp/logs
drwxrwx--- - root supergroup 0 2017-01-05 10:12 /tmp/logs/root
drwxrwx--- - root supergroup 0 2017-01-05 10:12 /tmp/logs/root/logs
drwxrwx--- - root supergroup 0 2017-01-05 10:13 /tmp/logs/root/logs/application_1483628783579_0001
-rw-r----- 1 root supergroup 65829 2017-01-05 10:13 /tmp/logs/root/logs/application_1483628783579_0001/master.local_37940
drwxr-xr-x - root supergroup 0 2017-01-05 10:12 /user
drwxr-xr-x - root supergroup 0 2017-01-05 10:13 /user/root
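There is no need to read the aggregated logs straight out of HDFS: the yarn logs utility fetches and concatenates them for you. Using the application id from the listing above:

yarn logs -applicationId application_1483628783579_0001 | less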

Testing in a distributed cluster


You may have noticed that so far I have been putting the word "cluster" in quotes: everything has been running on a single machine. Let's correct this unfortunate misunderstanding and test our Hadoop in a real distributed cluster.

First of all, let's tweak the Hadoop configuration. At the moment the host name in the Hadoop configuration is localhost. If we simply copied this configuration to other nodes, each node would try to find the NameNode, ResourceManager, and JobHistory services on its own host. So we decide in advance which host will run these services and change the configs accordingly.

In my case, all the master services listed above (NameNode, ResourceManager, JobHistory) will run on master.local. Replace localhost with master.local in the configuration:

cd /opt/hadoop/etc/hadoop
sed -i 's/localhost/master.local/' core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml
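A quick grep confirms that the substitution reached every config that used to mention localhost (hdfs-site.xml never referenced a host name, so it will not show up):

grep -l master.local *.xml
# core-site.xml
# mapred-site.xml
# yarn-site.xml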

Now I simply clone the virtual machine on which the build was done twice to get two slaves. On the slave nodes you need to set unique host names (in my case slave1.local and slave2.local). Also, on all three nodes of our cluster, configure /etc/hosts so that every machine in the cluster can reach the others by host name. In my case it looks like this (the same content on all three machines):

cat /etc/hosts
…
192.168.122.70 master.local
192.168.122.59 slave1.local
192.168.122.217 slave2.local

Additionally, on slave1.local and slave2.local you need to clear the contents of /data/dfs/dn (otherwise the cloned DataNodes would come up with identical storage IDs copied from the original machine):

 rm -rf /data/dfs/dn/* 

Everything is ready. On master.local we start all the services:

/opt/hadoop/sbin/hadoop-daemon.sh start namenode
/opt/hadoop/sbin/hadoop-daemon.sh start datanode
/opt/hadoop/sbin/yarn-daemon.sh start resourcemanager
/opt/hadoop/sbin/yarn-daemon.sh start nodemanager
/opt/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver

On slave1.local and slave2.local, we run only the DataNode and NodeManager:

/opt/hadoop/sbin/hadoop-daemon.sh start datanode
/opt/hadoop/sbin/yarn-daemon.sh start nodemanager

Let's check that our cluster now consists of three nodes.

For HDFS, let's look at the output of the dfsadmin -report command and make sure that all three machines are included in the Live datanodes list:

hdfs dfsadmin -report
...
Live datanodes (3):
…
Name: 192.168.122.70:50010 (master.local)
...
Name: 192.168.122.59:50010 (slave1.local)
...
Name: 192.168.122.217:50010 (slave2.local)

Or go to the NameNode web page:

master.local:50070/dfshealth.html#tab-datanode


For YARN, let's look at the output of the node -list command:

yarn node -list -all
17/01/06 06:17:52 INFO client.RMProxy: Connecting to ResourceManager at master.local/192.168.122.70:8032
Total Nodes:3
           Node-Id      Node-State  Node-Http-Address  Number-of-Running-Containers
slave2.local:39694         RUNNING  slave2.local:8042                             0
slave1.local:36880         RUNNING  slave1.local:8042                             0
master.local:44373         RUNNING  master.local:8042                             0

Or go to the ResourceManager webpage

master.local:8088/cluster/nodes


All nodes must be listed as RUNNING.

Finally, let's make sure that a MapReduce application we start uses resources on all three nodes. Run the already familiar Pi application from hadoop-mapreduce-examples.jar:

 yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 30 1000 

While the application is running, look once again at the output of yarn node -list -all:

...
           Node-Id      Node-State  Node-Http-Address  Number-of-Running-Containers
slave2.local:39694         RUNNING  slave2.local:8042                             4
slave1.local:36880         RUNNING  slave1.local:8042                             4
master.local:44373         RUNNING  master.local:8042                             4

Number-of-Running-Containers is 4 on each node.

We can also open master.local:8088/cluster/nodes and see how many cores and how much memory are being used by all applications on each node.



Conclusion


We built Hadoop from source, installed it, configured it, and verified that it works both on a single machine and in a distributed cluster. If the topic interests you and you would like to build other services from the Hadoop ecosystem in a similar way, here is a link to a script that I maintain for my own needs:

github.com/hadoopfromscratch/hadoopfromscratch

With its help you can install ZooKeeper, Spark, Hive, HBase, Cassandra, and Flume. If you find errors or inaccuracies, please write to me. I would be very grateful.

Source: https://habr.com/ru/post/319048/

