
In this article, we will walk step by step through setting up a small Hadoop cluster for experiments.
Although there is plenty of material on configuring and deploying Hadoop on foreign-language resources, most of it either describes the configuration of earlier versions (0.XX and 1.XX), or covers only single-node / pseudo-distributed mode and touches on fully distributed mode only in part. There is practically no material in Russian at all.
When I needed Hadoop myself, I was far from able to set everything up on the first try. The material was outdated, and the configs often used deprecated parameters that are best avoided. Even once everything was running, I still had plenty of questions to find answers to, and other people were asking similar questions.
If you are interested, welcome under the cut.
Preparation
As the operating system for our cluster I suggest Ubuntu Server 12.04.3 LTS, but with minimal changes all of the steps can be carried out on a different OS.
All nodes will run in VirtualBox. I kept the virtual machine settings modest: 8 GB of hard disk space, one core and 512 MB of memory. Each virtual machine has two network adapters: one NAT and one for the internal network.
After the operating system has been installed, update the system and install ssh and rsync:
sudo apt-get update && sudo apt-get upgrade
sudo apt-get install ssh
sudo apt-get install rsync
Java
Hadoop works with either Java 6 or Java 7.
In this article we will work with OpenJDK version 7:
$ sudo apt-get install openjdk-7-jdk
You can also use the JDK from Oracle instead.
Create a separate account to run Hadoop
We will use a dedicated account to run Hadoop. This is not required, but it is recommended. We will also give the new user sudo rights to make life easier later on.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo usermod -aG sudo hduser
When creating a new user, you will need to enter a password.
/etc/hosts
We need all nodes to be able to reach each other easily. In a large cluster it is advisable to use a DNS server, but for our small configuration the hosts file will do. In it we map each node's IP address to its name on the network. For a single node, your file should look something like this:
127.0.0.1 localhost
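For master to be reachable by name (the ssh master check below relies on this), its internal-network address also needs an entry. With the 192.168.0.X addressing used later in this article, it might look like this (the exact IP is only an assumption; use the address of your own machine):
192.168.0.1 master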
SSH
Hadoop needs SSH access to manage the cluster nodes. Grant the newly created hduser user access to master.
First you need to generate a new ssh key:
ssh-keygen -t rsa -P ""
You may be asked for a passphrase while the key is being created; you can leave it empty for now.
The next step is to add the generated key to the list of authorized keys:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Check that it works by connecting to ourselves:
ssh master
Disable IPv6
If you do not disable IPv6, you may run into a lot of problems later.
To disable IPv6 in Ubuntu 12.04 / 12.10 / 13.04, you need to edit the sysctl.conf file:
sudo vim /etc/sysctl.conf
Add the following parameters:
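# the sysctl switches commonly used to disable IPv6 system-wide
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1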
Save and reboot the operating system.
But I need IPv6! To disable IPv6 only in Hadoop, you can add the following to the file etc/hadoop/hadoop-env.sh:
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
Install Apache Hadoop
Download the necessary files.
Current versions of the framework are located at:
www.apache.org/dyn/closer.cgi/hadoop/common
As of December 2013, the stable version is 2.2.0.
Create a downloads folder in the root directory and download the latest version:
sudo mkdir /downloads
cd /downloads
sudo wget http://apache-mirror.rbc.ru/pub/apache/hadoop/common/stable/hadoop-2.2.0.tar.gz
Unpack the contents of the package into /usr/local/, rename the folder and make the hduser user its owner:
sudo mv /downloads/hadoop-2.2.0.tar.gz /usr/local/
cd /usr/local/
sudo tar xzf hadoop-2.2.0.tar.gz
sudo mv hadoop-2.2.0 hadoop
sudo chown -R hduser:hadoop hadoop
Update $HOME/.bashrc
For convenience, let's add a list of variables to .bashrc:
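A typical set for Hadoop 2.2.0 installed in /usr/local/hadoop with OpenJDK 7 might look like this (the JAVA_HOME path assumes 64-bit Ubuntu; adjust both paths to your setup):
# where Java and Hadoop live (paths assume the setup described above)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
# make the Hadoop commands and scripts available on the PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin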
This completes the preliminary preparations.
Apache Hadoop setup
All subsequent work will be carried out from the /usr/local/hadoop folder.
Open etc/hadoop/hadoop-env.sh and set JAVA_HOME.
vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
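Inside hadoop-env.sh the line might look like this (assuming the OpenJDK 7 package on 64-bit Ubuntu; adjust the path to your Java installation):
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64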
In the file etc/hadoop/slaves we list the nodes that will be in the cluster:
master
This file needs to exist only on the master node. Every new node must be listed here.
The main system settings are located in etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
HDFS settings are in etc/hadoop/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/hdfs/datanode</value>
  </property>
</configuration>
Here the dfs.replication parameter sets the number of replicas stored in the file system. By default its value is 3; it cannot be greater than the number of nodes in the cluster.
The dfs.namenode.name.dir and dfs.datanode.data.dir parameters specify where the HDFS metadata and data will physically be stored. The tmp folder must be created in advance.
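For example, the folders can be created like this (a sketch assuming the paths from the config above):
mkdir -p /usr/local/hadoop/tmp/hdfs/namenode
mkdir -p /usr/local/hadoop/tmp/hdfs/datanode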
Let our cluster know that we want to use YARN. To do this, change etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
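Note that in Hadoop 2.2.0 this file may not exist out of the box; in that case it can be created from the bundled template before editing:
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml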
All YARN settings are described in the file etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>master:8033</value>
  </property>
</configuration>
The resourcemanager settings are needed so that all nodes in the cluster can be seen in the control panel.
Format HDFS:
bin/hdfs namenode -format
Start the Hadoop services:
sbin/start-dfs.sh
sbin/start-yarn.sh
* In previous versions of Hadoop the sbin/start-all.sh script was used, but since version 2.*.* it has been deprecated.
Make sure that the following Java processes are running:
hduser@master:/usr/local/hadoop$ jps
4868 SecondaryNameNode
5243 NodeManager
5035 ResourceManager
4409 NameNode
4622 DataNode
5517 Jps
You can test the cluster using standard examples:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar
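Running the jar without arguments prints the list of bundled examples. As a quick sanity check you could, for instance, run the pi estimator (the arguments here, 2 map tasks with 5 samples each, are just an illustration):
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 5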
Now we have a ready-made image, which will serve as the basis for creating a cluster.
Then you can create the required number of copies of our image.
On the copies you need to configure the network: generate new MAC addresses for the network interfaces and assign them the required IP addresses. In my example I use addresses of the form 192.168.0.X.
Adjust the /etc/hosts file on all nodes of the cluster so that it contains all of the mappings.
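With the 192.168.0.X addressing from my example, the file on every node might look like this (the IPs are assumptions; use whatever addresses you assigned to your machines):
127.0.0.1   localhost
192.168.0.1 master
192.168.0.2 slave1
192.168.0.3 slave2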
For convenience, change the names of the new nodes to slave1 and slave2.
How? Two files need to be changed: /etc/hostname and /etc/hosts.
Generate new SSH keys on the nodes and add them all to the list of authorized ones on the master node.
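One possible way to do this from each slave, assuming password authentication is still enabled for hduser on master (ssh-copy-id appends the slave's public key to master's authorized_keys):
# run as hduser on each slave
ssh-keygen -t rsa -P ""
ssh-copy-id hduser@master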
On each node in the cluster, change the value of the dfs.replication parameter in etc/hadoop/hdfs-site.xml. For example, set it to 3 everywhere.
etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
Add the new nodes to the etc/hadoop/slaves file on the master node:
master
slave1
slave2
Once all the settings are in place, we can start our cluster from the master node.
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
The following processes should start on the slaves:
hduser@slave1:/usr/local/hadoop$ jps
1748 Jps
1664 NodeManager
1448 DataNode
Now we have our own mini cluster.
Let's run the Word Count task.
To do this, we need to load several text files into HDFS.
For example, I took books in txt format from the free ebooks site Project Gutenberg.
Test files
cd /home/hduser
mkdir books
cd books
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
wget http://www.gutenberg.org/cache/epub/972/pg972.txt
wget http://www.gutenberg.org/cache/epub/132/pg132.txt
wget http://www.gutenberg.org/cache/epub/1661/pg1661.txt
wget http://www.gutenberg.org/cache/epub/19699/pg19699.txt
Transfer our files to HDFS:
cd /usr/local/hadoop
bin/hdfs dfs -mkdir /in
bin/hdfs dfs -copyFromLocal /home/hduser/books/* /in
bin/hdfs dfs -ls /in
Run Word Count:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /in /out
You can monitor the job in the console or through the ResourceManager's web interface at master:8088/cluster/apps/.
Upon completion, the result will be located in the /out folder in HDFS.
To copy it to the local file system, run:
bin/hdfs dfs -copyToLocal /out /home/hduser/
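Alternatively, the result can be inspected directly in HDFS; for wordcount the reducer output normally ends up in a file named part-r-00000:
bin/hdfs dfs -cat /out/part-r-00000 | head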
If you have any questions, ask them in the comments.