
In this article, we will walk step by step through setting up a small Hadoop cluster for experiments.
Although there is plenty of material on configuring and deploying Hadoop on foreign-language resources, most of it either describes the configuration of earlier versions (0.XX and 1.XX), or covers only single-node / pseudo-distributed mode and touches on fully distributed mode only in part. There is practically no material in Russian at all.
When I needed Hadoop myself, I was far from able to set everything up on the first try. The material was outdated, and the configs often used deprecated parameters that are best avoided. Even once everything was running, I still had plenty of questions to find answers to, and other people were asking similar questions.
If you are interested, welcome under the cut.
Preparation
As the operating system for our cluster I suggest Ubuntu Server 12.04.3 LTS, but with minimal changes all of the steps can be carried out on a different OS.
All nodes will run in VirtualBox. I kept the virtual machine settings modest: 8 GB of hard disk space, one core and 512 MB of memory. Each virtual machine has two network adapters: one NAT and one for the internal network.
After the operating system has been installed, update the system and install ssh and rsync:
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install ssh
sudo apt-get install rsync
Java
Hadoop works with either Java 6 or Java 7.
In this article we will work with OpenJDK version 7:
$ sudo apt-get install openjdk-7-jdk
You can also use the JDK from Oracle instead.
Create a separate account to run Hadoop
We will use a dedicated account to run Hadoop. This is not required, but it is recommended. We will also give the new user sudo rights to make life easier later on.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo usermod -aG sudo hduser
When creating a new user, you will need to enter a password.
/etc/hosts
We need all nodes to be able to reach each other easily. In a large cluster it is advisable to use a DNS server, but for our small configuration the hosts file will do. In it we map each node's IP address to its name on the network. For a single node, your file should look something like this:
127.0.0.1 localhost
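For master to be reachable by name (the ssh master check below relies on this), its internal-network address also needs an entry. With the 192.168.0.X addressing used later in this article, it might look like this (the exact IP is only an assumption; use the address of your own machine):
192.168.0.1 master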
SSH
Hadoop needs SSH access to manage the cluster nodes. Grant the newly created hduser user access to master.
First you need to generate a new ssh key:
ssh-keygen -t rsa -P ""
You may be asked for a passphrase while the key is being created; you can leave it empty for now.
The next step is to add the generated key to the list of authorized keys:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Check that it works by connecting to ourselves:
ssh master
Disable IPv6
If you do not disable IPv6, you may run into a lot of problems later.
To disable IPv6 in Ubuntu 12.04 / 12.10 / 13.04, you need to edit the sysctl.conf file:
sudo vim /etc/sysctl.conf
Add the following parameters:
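# the sysctl switches commonly used to disable IPv6 system-wide
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1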
Save and reboot the operating system.
But I need IPv6! To disable IPv6 only in Hadoop, you can add the following to the file etc/hadoop/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Install Apache Hadoop
Download the necessary files.
Current versions of the framework are located at:
www.apache.org/dyn/closer.cgi/hadoop/common
As of December 2013, the stable version is 2.2.0.
Create a downloads folder in the root directory and download the latest version:
sudo mkdir /downloads
cd /downloads
sudo wget http://apache-mirror.rbc.ru/pub/apache/hadoop/common/stable/hadoop-2.2.0.tar.gz
Unpack the contents of the package into /usr/local/, rename the folder and make the hduser user its owner:
sudo mv /downloads/hadoop-2.2.0.tar.gz /usr/local/
cd /usr/local/
sudo tar xzf hadoop-2.2.0.tar.gz
sudo mv hadoop-2.2.0 hadoop
sudo chown -R hduser:hadoop hadoop
Update $HOME/.bashrc
For convenience, let's add a list of variables to .bashrc:
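A typical set for Hadoop 2.2.0 installed in /usr/local/hadoop with OpenJDK 7 might look like this (the JAVA_HOME path assumes 64-bit Ubuntu; adjust both paths to your setup):
# where Java and Hadoop live (paths assume the setup described above)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
# make the Hadoop commands and scripts available on the PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin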
This completes the preliminary preparations.
Apache Hadoop setup
All subsequent work will be carried out from the /usr/local/hadoop folder.
Open etc/hadoop/hadoop-env.sh and set JAVA_HOME.
vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
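Inside hadoop-env.sh the line might look like this (assuming the OpenJDK 7 package on 64-bit Ubuntu; adjust the path to your Java installation):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64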
In the file etc/hadoop/slaves we list the nodes that will be in the cluster:
master
This file needs to exist only on the master node. Every new node must be listed here.
The main system settings are located in etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
HDFS settings are in etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/hdfs/datanode</value>
  </property>
</configuration>
Here the dfs.replication parameter sets the number of replicas stored in the file system. By default its value is 3; it cannot be greater than the number of nodes in the cluster.
The dfs.namenode.name.dir and dfs.datanode.data.dir parameters specify where the HDFS metadata and data will physically be stored. The tmp folder must be created in advance.
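For example, the folders can be created like this (a sketch assuming the paths from the config above):
mkdir -p /usr/local/hadoop/tmp/hdfs/namenode
mkdir -p /usr/local/hadoop/tmp/hdfs/datanode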
Let our cluster know that we want to use YARN. To do this, change etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
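Note that in Hadoop 2.2.0 this file may not exist out of the box; in that case it can be created from the bundled template before editing:
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml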
All YARN settings are described in the file etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
</configuration>
The resourcemanager settings are needed so that all nodes in the cluster can be seen in the control panel.
Format HDFS:
bin/hdfs namenode -format
Start the Hadoop services:
sbin/start-dfs.sh
sbin/start-yarn.sh
* In previous versions of Hadoop the sbin/start-all.sh script was used, but since version 2.*.* it has been deprecated.
Make sure that the following Java processes are running:
hduser@master:/usr/local/hadoop$ jps
4868 SecondaryNameNode
5243 NodeManager
5035 ResourceManager
4409 NameNode
4622 DataNode
5517 Jps
You can test the cluster using standard examples:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
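Running the jar without arguments prints the list of bundled examples. As a quick sanity check you could, for instance, run the pi estimator (the arguments here, 2 map tasks with 5 samples each, are just an illustration):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5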
Now we have a ready-made image, which will serve as the basis for creating a cluster.
Then you can create the required number of copies of our image.
On the copies you need to configure the network: generate new MAC addresses for the network interfaces and assign them the required IP addresses. In my example I use addresses of the form 192.168.0.X.
Adjust the /etc/hosts file on all nodes of the cluster so that it contains all of the mappings.
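With the 192.168.0.X addressing from my example, the file on every node might look like this (the IPs are assumptions; use whatever addresses you assigned to your machines):
127.0.0.1   localhost
192.168.0.1 master
192.168.0.2 slave1
192.168.0.3 slave2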
For convenience, change the names of the new nodes to slave1 and slave2.
How? Two files need to be changed: /etc/hostname and /etc/hosts.
Generate new SSH keys on the nodes and add them all to the list of authorized ones on the master node.
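One possible way to do this from each slave, assuming password authentication is still enabled for hduser on master (ssh-copy-id appends the slave's public key to master's authorized_keys):
# run as hduser on each slave
ssh-keygen -t rsa -P ""
ssh-copy-id hduser@master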
On each node in the cluster, change the value of the dfs.replication parameter in etc/hadoop/hdfs-site.xml. For example, set it to 3 everywhere.
etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
Add the new nodes to the etc/hadoop/slaves file on the master node:
master
slave1
slave2
Once all the settings are in place, we can start our cluster from the master node.
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
The following processes should start on the slaves:
hduser@slave1:/usr/local/hadoop$ jps
1748 Jps
1664 NodeManager
1448 DataNode
Now we have our own mini cluster.
Let's run the Word Count task.
To do this, we need to load several text files into HDFS.
For example, I took books in txt format from the free ebooks site Project Gutenberg.
Test files
cd /home/hduser
mkdir books
cd books
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
wget http://www.gutenberg.org/cache/epub/972/pg972.txt
wget http://www.gutenberg.org/cache/epub/132/pg132.txt
wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt
wget http://www.gutenberg.org/cache/epub/19699/pg19699.txt
Transfer our files to HDFS:
cd /usr/local/hadoop
bin/hdfs dfs -mkdir /in
bin/hdfs dfs -copyFromLocal /home/hduser/books/* /in
bin/hdfs dfs -ls /in
Run Word Count:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /in /out
You can monitor the job in the console or through the ResourceManager's web interface at master:8088/cluster/apps/.
Upon completion, the result will be located in the /out folder in HDFS.
To copy it to the local file system, run:
bin/hdfs dfs -copyToLocal /out /home/hduser/
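Alternatively, the result can be inspected directly in HDFS; for wordcount the reducer output normally ends up in a file named part-r-00000:
bin/hdfs dfs -cat /out/part-r-00000 | head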
If you have any questions, ask them in the comments.