
Data volumes keep growing, and so does the rate at which data is generated, which makes processing and storing it an ever harder problem. Not surprisingly, "Big Data" is one of the most talked-about topics in today's IT community.
There is plenty of material on the theory of "big data" in specialized journals and on websites today, but theoretical publications rarely make it clear how these technologies can be applied to specific practical problems.
One of the best-known and most widely discussed projects in the field of distributed computing is Hadoop, a freely distributed set of utilities, libraries, and a framework for developing and running distributed programs, maintained by the Apache Software Foundation.
We have been using Hadoop for quite a while to solve our own practical problems, and the results of that work are worth sharing. This article is the first in a series about Hadoop. Today we will talk about the history and structure of the Hadoop project and show, using the Cloudera distribution as an example, how a cluster is deployed and configured.
Careful: there is a lot of traffic under the cut.
A bit of history
Hadoop was created by Doug Cutting, the author of Apache Lucene, the famous text search library. The project is named after the name Doug's son gave to his plush yellow elephant toy.
Cutting created Hadoop while working on Nutch, an open-source web search system. The Nutch project was launched in 2002, but its developers soon realized that the existing architecture would not scale to billions of web pages. In 2003, a paper was published describing GFS (Google File System), the distributed file system used in Google's projects. Such a system could easily handle the storage of the large files produced by crawling and indexing websites. In 2004, the Nutch team set out to build an open-source equivalent, NDFS (Nutch Distributed File System).
In 2004, Google introduced the MapReduce model to a wide audience. By early 2005, the Nutch developers had created a full-fledged implementation of MapReduce within Nutch; shortly thereafter, all the main Nutch algorithms were adapted to use MapReduce and NDFS. In 2006, Hadoop was split out into an independent subproject under Lucene.
In 2008, Hadoop became one of the top-level Apache projects. By that time, it was already used successfully at companies such as Yahoo!, Facebook, and Last.fm. Today, Hadoop is widely used both in commercial companies and in scientific and educational institutions.
Hadoop project structure
The Hadoop project includes the following subprojects:
- Common - a set of components and interfaces for distributed file systems and general I/O;
- MapReduce - a distributed computing model for parallel processing of very large (up to several petabytes) data sets; a minimal example of launching such a job is shown right after this list;
- HDFS - a distributed file system that runs on large clusters of commodity machines.
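To give a feel for how a MapReduce job is actually launched, here is a minimal sketch using Hadoop Streaming, which lets any program that reads stdin and writes stdout act as a mapper or reducer. The /user/demo paths are placeholders, and the location of the streaming jar varies between Hadoop/CDH versions, so treat the paths as assumptions:
# hadoop fs -mkdir -p /user/demo/input        # create an input directory in HDFS
# hadoop fs -put ./*.txt /user/demo/input/    # upload some local text files
# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/demo/input -output /user/demo/output \
    -mapper /bin/cat -reducer /usr/bin/wc     # trivial job: count lines, words and bytes
# hadoop fs -cat /user/demo/output/part-*     # read the reducers' output
The key point is that the framework itself splits the input, runs mappers on the nodes where the data blocks live, sorts and shuffles the intermediate results, and feeds them to the reducers.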
Previously, Hadoop included other subprojects that have since become stand-alone Apache Software Foundation projects:
- Avro - a serialization system for cross-language RPC and persistent data storage;
- Pig - a dataflow language and execution environment for analyzing large data sets;
- Hive - a distributed data warehouse; it manages data stored in HDFS and provides an SQL-like query language for working with that data;
- HBase - a non-relational distributed database;
- ZooKeeper - a distributed coordination service that provides primitives for building distributed applications;
- Sqoop - a tool for transferring data between structured data stores and HDFS;
- Oozie - a service for scheduling and running workflows of Hadoop jobs.
Hadoop distributions
Today Hadoop is a complex system consisting of a large number of components, and installing and configuring it all by hand is a nontrivial task. That is why many companies now offer ready-made Hadoop distributions that include deployment, administration, and monitoring tools.
Hadoop distributions are available under both commercial licenses (products from companies such as Intel, IBM, EMC, and Oracle) and free ones (products from Cloudera, Hortonworks, and MapR). Below we take a closer look at the Cloudera Hadoop distribution.
Cloudera Hadoop
Cloudera Hadoop is a completely open distribution created with the active participation of the Apache Hadoop developers Doug Cutting and Mike Cafarella. It is available both as a free version and as a paid one known as Cloudera Enterprise.
At the time we became interested in Hadoop, Cloudera provided the most complete and mature solution among the open-source Hadoop distributions. In all that time we have not run into a single significant problem, and the cluster has successfully survived several major upgrades, all of them fully automatic. Now, after almost a year of experiments, we can say that we are satisfied with our choice.
Cloudera Hadoop includes the following main components:
- Cloudera Hadoop (CDH) - the actual Hadoop distribution;
- Cloudera Manager is a tool for deploying, monitoring, and managing a Hadoop cluster.
The components of Cloudera Hadoop are distributed as binary packages called parcels. Compared to standard packages and package managers, parcels have the following advantages:
- ease of download: each parcel is a single file that bundles all the necessary components;
- internal consistency: all components inside a parcel are thoroughly tested, debugged, and matched to each other, so the likelihood of compatibility problems is very low;
- separation of distribution and activation: parcels can first be distributed to all managed nodes and then activated with a single action, which makes upgrades fast and keeps downtime minimal (see the example after this list);
- rolling upgrades: when a minor version is updated, all newly started processes (tasks) automatically run under the new version, while tasks that are already running continue in the old environment until they finish; upgrading to a new major version, however, requires a full restart of all cluster services and therefore of all running tasks;
- easy rollback: if any problems arise with a new CDH version, it can easily be rolled back to the previous one.
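To illustrate how distribution and activation are separated in practice, here is what it looks like on a node managed by Cloudera Manager. This is a sketch based on our observations: the version string is only an example, and the exact layout may differ between CDH releases.
# ls -l /opt/cloudera/parcels/         # distributed parcel versions live side by side here
# readlink /opt/cloudera/parcels/CDH   # the active version is just a symlink, e.g. CDH-4.3.0-1.cdh4.3.0.p0.22
Activating a different version simply repoints this symlink, which is why switching versions (and rolling back) is so quick.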
Hardware requirements
Hardware requirements for deploying Hadoop are a fairly complex topic: different nodes within the cluster have different requirements. You can read more about this, for example, in Intel's recommendations or in the Cloudera blog. The general rule: more memory and more disks. RAID controllers and other enterprise perks are unnecessary, because the architecture of Hadoop and HDFS is designed to run on ordinary commodity servers. 10GbE network cards become justified at data volumes above 12 TB per node.
The Cloudera blog gives the following list of hardware configurations for different workload profiles:
- "Light" configuration (1U) - 2 six-core processors, 24-64 GB of memory, 8 hard drives with a capacity of 1-2 TB;
- rational configuration (1U) - 2 six-core processors, 48-128 GB of memory, 12-16 hard drives (1 or 2 TB) connected directly through the controller of the motherboard;
- "Heavy" configuration for storage (2U): 2 six-core processors, 48-96 GB of memory, 16-24 hard drives. With multiple failures in the nodes in this configuration, a sharp increase in network traffic occurs;
- configuration for intensive computing: 2 six-core processors, 64-512 GB of memory, 4-8 hard drives with a capacity of 1-2 TB.

Note that with rented servers the cost of a poorly chosen configuration is not as painful as with purchased hardware: if necessary, leased servers can be reconfigured or replaced with ones better suited to your tasks.
Now let's proceed directly to installing our cluster.
Installing and configuring the OS
For all servers we will use CentOS 6.4 in a minimal installation, although other distributions can be used as well: Debian, Ubuntu, RHEL, etc. The required packages are publicly available at archive.cloudera.com and are installed with standard package managers.
On the Cloudera Manager server we recommend software or hardware RAID1 and a single root partition; /var/log/ can optionally be placed on a separate partition. On the servers that will be added to the Hadoop cluster we recommend creating the following partitions:
- "/", 50-100 GB in size, for the OS and the Cloudera Hadoop software;
- "/dfs" on LVM spanning all available disks, for HDFS data storage (a manual LVM sketch of this layout follows the list);
- swap is best kept very small, around 500 MB. Ideally the servers should not swap at all, but if it does happen, a small swap will save the processes from the OOM killer.
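For reference, the same /dfs layout can be created by hand with LVM instead of kickstart; below is a rough sketch under the assumption that the data disks are /dev/sdb and /dev/sdc and that a dedicated volume group is used (all names are placeholders):
# pvcreate /dev/sdb /dev/sdc                     # mark the data disks as LVM physical volumes
# vgcreate vg_dfs /dev/sdb /dev/sdc              # group them into one volume group
# lvcreate -l 100%FREE -n dfs vg_dfs             # one logical volume over all free space
# mkfs.ext4 /dev/vg_dfs/dfs                      # create the filesystem for HDFS data
# mkdir -p /dfs && mount /dev/vg_dfs/dfs /dfs    # mount it at /dfs
# echo "/dev/vg_dfs/dfs /dfs ext4 defaults,noatime 0 0" >> /etc/fstab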
On all servers, including the Cloudera Manager server, SELinux and the firewall need to be disabled. You can of course leave them on, but then you will have to spend a lot of time and effort fine-tuning security policies. For security, it is recommended to isolate the cluster from the outside world as much as possible at the network level, for example with a hardware firewall or an isolated VLAN (with access to package mirrors through a local proxy).
# vi /etc/selinux/config   # turn off SELinux
SELINUX=disabled
# system-config-firewall-tui   # turn off the firewall and save the settings
# reboot
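If you prefer to script these steps rather than edit files interactively, roughly the same result can be achieved like this (a sketch for CentOS 6):
# sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config   # disable SELinux on the next boot
# setenforce 0                                                   # switch it to permissive right away
# chkconfig iptables off && service iptables stop                # disable the IPv4 firewall
# chkconfig ip6tables off && service ip6tables stop              # and the IPv6 one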
Below are examples of ready-made kickstart files for automated installation of the Cloudera Manager server and the cluster nodes.
Example cloudera_manager.ks
install
text
reboot
### General
url --url http://mirror.selectel.ru/centos/6.4/os/x86_64
# disable SELinux for CDH
selinux --disabled
rootpw supersecretpasswrd
authconfig --enableshadow --enablemd5
# Networking
firewall --disabled
network --bootproto=static --device=eth0 --ip=1.2.3.254 --netmask=255.255.255.0 --gateway=1.2.3.1 --nameserver=188.93.16.19,109.234.159.91,188.93.17.19 --hostname=cm.example.net
# Regional
keyboard us
lang en_US.UTF-8
timezone Europe/Moscow
### Partitioning
zerombr yes
bootloader --location=mbr --driveorder=sda,sdb
clearpart --all --initlabel
part raid.11 --size 1024 --asprimary --ondrive=sda
part raid.12 --size 1 --grow --asprimary --ondrive=sda
part raid.21 --size 1024 --asprimary --ondrive=sdb
part raid.22 --size 1 --grow --asprimary --ondrive=sdb
raid /boot --fstype ext3 --device md0 --level=RAID1 raid.11 raid.21
raid pv.01 --fstype ext3 --device md1 --level=RAID1 raid.12 raid.22
volgroup vg0 pv.01
logvol swap --vgname=vg0 --size=12288 --name=swap --fstype=ext4
logvol / --vgname=vg0 --size=1 --grow --name=vg0-root --fstype=ext4
%packages
@Base
wget
ntp
%post --erroronfail
chkconfig ntpd on
wget -q -O /etc/yum.repos.d/cloudera-manager.repo http://archive.cloudera.com/cm4/redhat/6/x86_64/cm/cloudera-manager.repo
rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
yum -y install jdk
yum -y install cloudera-manager-daemons
yum -y install cloudera-manager-server
yum -y install cloudera-manager-server-db
Example node.ks
install
text
reboot
### General
url --url http://mirror.selectel.ru/centos/6.4/os/x86_64
# disable SELinux for CDH
selinux --disabled
rootpw nodeunifiedpasswd
authconfig --enableshadow --enablemd5
# Networking
firewall --disabled
network --bootproto=static --device=eth0 --ip=1.2.3.10 --netmask=255.255.255.0 --gateway=1.2.3.1 --nameserver=188.93.16.19,109.234.159.91,188.93.17.19 --hostname=node.example.net
# Regional
keyboard us
lang en_US.UTF-8
timezone Europe/Moscow
### Partitioning
zerombr yes
bootloader --location=mbr --driveorder=sda
clearpart --all --initlabel
part /boot --fstype ext3 --size 1024 --asprimary --ondrive=sda
part pv.01 --fstype ext3 --size 1 --grow --asprimary --ondrive=sda
# repeat for every hard drive (each physical volume needs its own id)
part pv.02 --fstype ext3 --size 1 --grow --asprimary --ondrive=sdb
part pv.03 --fstype ext3 --size 1 --grow --asprimary --ondrive=sdc
volgroup vg0 pv.01 pv.02 pv.03
logvol swap --vgname=vg0 --size=512 --name=swap --fstype=ext4
logvol / --vgname=vg0 --size=51200 --name=vg0-root --fstype=ext4
logvol /dfs --vgname=vg0 --size=1 --grow --name=dfs --fstype=ext4
%packages
@Base
wget
ntp
%post --erroronfail
chkconfig ntpd on
Installing Cloudera Manager
Let's start by installing Cloudera Manager, which will then deploy and configure our future Hadoop cluster on the servers.
Before installation, you must make sure that:
- all servers in the cluster are reachable via SSH and share the same root password (or have your public SSH key installed);
- all nodes have access to the standard package repositories (either Internet access or a local repository/proxy);
- all servers in the cluster have access to archive.cloudera.com or to a local repository with the necessary installation files;
- NTP is installed on all servers and time synchronization is configured;
- all cluster nodes and the CM server have forward DNS and PTR records configured (or all hosts are listed in /etc/hosts on every server); a quick way to check these prerequisites from the shell is sketched below.
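A rough way to verify these prerequisites on each node (a sketch; cm.example.net and 1.2.3.254 are the placeholder names and addresses from the kickstart examples above, and the host utility comes from the bind-utils package):
# service ntpd status && ntpq -p          # NTP is running and has picked a time source
# getenforce && service iptables status   # SELinux and the firewall are off
# host $(hostname -f)                     # forward DNS record for this node
# host 1.2.3.254                          # PTR record for the CM server
# host cm.example.net                     # forward record for the CM server
# curl -sI http://archive.cloudera.com/cm4/redhat/6/x86_64/cm/ | head -1   # the repository is reachable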
Add the Cloudera repository and install the necessary packages:
# wget -q -O /etc/yum.repos.d/cloudera-manager.repo http://archive.cloudera.com/cm4/redhat/6/x86_64/cm/cloudera-manager.repo
# rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera
# yum -y install jdk
# yum -y install cloudera-manager-daemons
# yum -y install cloudera-manager-server
# yum -y install cloudera-manager-server-db
Once the installation finishes, we start the bundled database (we will use it for simplicity, although any third-party database can be connected) and the Cloudera Manager service itself; a quick check that the services came up follows the commands:
# /etc/init.d/cloudera-scm-server-db start
# /etc/init.d/cloudera-scm-server start
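A quick way to make sure both services started and the web interface is answering (the first start can take a couple of minutes while the embedded database initializes; the log path is the usual Cloudera Manager default):
# service cloudera-scm-server-db status
# service cloudera-scm-server status
# curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7180/    # expect 200 (or a redirect)
# tail -n 50 /var/log/cloudera-scm-server/cloudera-scm-server.log    # the first place to look if it is not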
Deploying the Cloudera Hadoop cluster
After installing Cloudera Manager you can forget about the console: all further interaction with the cluster happens through the Cloudera Manager web interface. By default Cloudera Manager listens on port 7180; you can use either the DNS name or the IP address of the server. Enter this address in the browser's address bar.
A login window will appear. The default credentials are admin/admin; of course, they should be changed immediately.
A window opens offering a choice of Cloudera Hadoop edition: free, a 60-day trial, or a paid license:

Select the free (Cloudera Standard) edition. A trial or paid license can be activated later at any time, once you are comfortable working with the cluster.
During installation, the Cloudera Manager service connects over SSH to the servers in the cluster; it performs all actions on the servers on behalf of the user specified in the menu (root by default).
Next, Cloudera Manager will ask you to specify the addresses of the hosts where Cloudera Hadoop will be installed:

Addresses can be specified as a list or with a range mask, for example:
- 10.1.1.[1-4] means the cluster will include the nodes with IP addresses 10.1.1.1, 10.1.1.2, 10.1.1.3 and 10.1.1.4;
- host[07-10].example.com expands to host07.example.com, host08.example.com, host09.example.com, host10.example.com.
Then click the Search button. Cloudera Manager will detect the specified hosts and display them in a list:

Check once again that all the required hosts are in the list (more hosts can be added by clicking the New Search button), then click Continue. The repository selection window will open:

As the installation method we recommend parcels, whose advantages we described earlier. Parcels are installed from the archive.cloudera.com repository. In addition to the CDH parcel, the same repository provides the Solr search tool and Impala, an SQL query engine for Hadoop.
After selecting the parcels to install, click Continue. In the next window, specify the SSH access parameters (login, password or private key, and the connection port):

Then click Continue. The installation process will start:

Upon completion of the installation, a table with a summary of the installed components and their versions will be displayed on the screen:

Check once more that everything is in order and click Continue. A window will appear asking which Cloudera Hadoop components and services to install:

For this example we will install everything by selecting the "All Services" option; individual services can be added or removed later. Next you need to specify which Cloudera Hadoop components go on which hosts. We recommend trusting the default assignment; detailed recommendations on placing roles on nodes can be found in the documentation for each service.

Click the Continue button and proceed to the next step - setting up the database:

By default, all monitoring and management data is stored in the PostgreSQL database that we installed alongside Cloudera Manager. Other databases can also be used; in that case select Use Custom Database from the menu. After setting the necessary parameters, verify the connection with the "Test Connection" button and, if it succeeds, click "Continue" to proceed to configuring the cluster components:

Click Continue to start the cluster configuration process. Its progress is displayed on the screen:

When all components have been configured, we can open the cluster dashboard. For example, here is the dashboard of our test cluster:

Instead of a conclusion
In this article we tried to walk you through installing a Hadoop cluster and to show that, with a ready-made distribution such as Cloudera Hadoop, it takes very little time and effort. To continue getting to know Hadoop, we recommend Tom White's book "Hadoop: The Definitive Guide" (a Russian edition is also available).
Working with Cloudera Hadoop in specific usage scenarios will be covered in the following articles of this series. The next publication will be devoted to Flume, a universal tool for collecting logs and other data.
Those who cannot comment on posts on Habr are welcome to visit our blog.