
Introduction
For someone with a not entirely stable psyche, one glance at a diagram like this is enough to trigger a panic attack. But I decided to suffer alone: the purpose of this article is to make Hadoop look less scary.
What this article covers:
- we will examine what the framework consists of and why it is needed;
- discuss how to deploy a cluster painlessly;
- walk through a concrete example;
- touch on the new features of Hadoop 2 (NameNode Federation, MapReduce v2).
What this article does not cover:
- this is an overview article, so nothing overly complicated;
- we will not get into the finer points of the ecosystem;
- we will not dig deep into the API jungle;
- we will not cover everything about devops tasks.
What is Hadoop and why is it needed?
Hadoop is not that complicated: the core consists of the HDFS file system and the MapReduce framework for processing the data stored in that file system.
If you look at the question of why Hadoop is needed from the point of view of a large enterprise, there are plenty of answers, ranging from “strongly for” to “strongly against”. I recommend the ThoughtWorks article.
If you look at the same question from a technical point of view, i.e. for which tasks it makes sense to use Hadoop, the answer is also not so simple. Manuals usually start with two basic examples: word count and log analysis. But what if my task is neither word count nor log analysis?
It would be nice if the answer could be stated as simply as it can for SQL: use it when you have a lot of structured data and you want to interrogate that data with questions whose nature and format are not known in advance.
The long answer is to look at a number of existing solutions and gradually, almost subconsciously, absorb the conditions under which Hadoop is needed. You can poke around blogs; I can also recommend Mahmoud Parsian's book Data Algorithms: Recipes for Scaling up with Hadoop and Spark.
I will try to give a shorter answer. Hadoop should be used if:
- the computation is composable, in other words, you can run it on subsets of the data and then merge the results;
- you plan to process a large amount of unstructured data, more than fits on one machine (several terabytes or more); the advantage here is that a Hadoop cluster can be built from commodity hardware.
Hadoop should not be used:
- for non-composable tasks, for example recurrent computations where each step depends on the previous one;
- if the entire data set fits on one machine: processing it there will save a lot of time and resources;
- for real-time analysis: Hadoop is, on the whole, a batch processing system (here the Storm system comes to the rescue).
HDFS architecture and typical Hadoop cluster
HDFS is similar to traditional file systems in many ways: files are stored as blocks, there is a mapping between file names and blocks, the directory structure is a tree, a permission-based access model is supported, and so on.
How HDFS differs:
- It is designed to store a large number of huge (>10 GB) files. One corollary is a large block size compared to other file systems (>64 MB); a small client-side sketch of how block size and replication surface in the API follows this list.
- It is optimized for streaming data access (high-throughput sequential reads), so the performance of random reads suffers.
- It is geared towards a large number of inexpensive servers. In particular, the servers use JBOD (Just a Bunch Of Disks) rather than RAID: mirroring and replication are performed at the cluster level, not on the individual machine.
- Many of the traditional problems of distributed systems are built into the design: by default, the failure of individual nodes is completely normal, expected operation rather than something out of the ordinary.
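A minimal sketch of the two parameters mentioned above, assuming a Hadoop 2 client (for example the CDH libraries used later in this article) on the classpath; the property names dfs.blocksize and dfs.replication are the Hadoop 2 names, and the values shown are purely illustrative, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side overrides of the cluster defaults normally set in hdfs-site.xml
        // (older releases used dfs.block.size instead of dfs.blocksize).
        conf.set("dfs.blocksize", "134217728");  // 128 MB blocks
        conf.set("dfs.replication", "3");        // three replicas of every block
        FileSystem fs = FileSystem.get(conf);
        System.out.println("block size:  " + fs.getDefaultBlockSize(new Path("/")));
        System.out.println("replication: " + fs.getDefaultReplication(new Path("/")));
    }
}

In practice these defaults come from the cluster configuration; the client only overrides them for the files it writes itself.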
A Hadoop cluster consists of three types of nodes: NameNode, Secondary NameNode, and DataNode.
NameNode is the brain of the system. As a rule, there is one such node per cluster (more in the case of NameNode Federation, but we leave that case aside). It stores all the metadata of the system, in particular the mapping between files and blocks. With a single node it is also a Single Point of Failure; this problem is addressed in the second version of Hadoop with NameNode Federation.
Secondary NameNode: one node per cluster. It is often said that “Secondary NameNode” is one of the most unfortunate names in the entire history of software, because the Secondary NameNode is not a replica of the NameNode. The state of the file system is stored in the fsimage file plus an edits log file containing the latest changes (similar to the transaction log in the RDBMS world). The job of the Secondary NameNode is to periodically merge fsimage and edits, keeping the size of edits within reasonable limits. The Secondary NameNode is needed for quick manual recovery of the NameNode if the NameNode fails.
In a real cluster, the NameNode and Secondary NameNode are separate servers with demanding requirements for memory and disk; the advertised “commodity hardware” applies only to the DataNodes.
DataNode: there are many such nodes in a cluster. They store the file blocks themselves. A DataNode regularly sends the NameNode a heartbeat (showing that it is still alive) and an hourly block report with information about all the blocks stored on that node; this is what lets the NameNode maintain the desired replication level.
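As an illustration of the metadata the NameNode serves, here is a small sketch that asks for the block locations of a file through the standard FileSystem API; the path /user/cloudera/john.txt is hypothetical and simply matches the sample file used later in the article:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The NameNode answers this query from its in-memory file-to-block mapping.
        FileStatus status = fs.getFileStatus(new Path("/user/cloudera/john.txt"));
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", replicas on " + Arrays.toString(block.getHosts()));
        }
    }
}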
Let's see how data is written to HDFS:

1. The client cuts the file into block-sized chunks.
2. The client connects to the NameNode and requests a write operation, sending the number of blocks and the required replication level.
3. The NameNode responds with a chain of DataNodes.
4. The client connects to the first node in the chain (if that fails, to the second, and so on) and writes the first block to it; the first node forwards the block to the second, the second to the third, and so on.
5. Once the write completes, acknowledgements of the successful write are sent back in reverse order (4 -> 3, 3 -> 2, 2 -> 1, 1 -> client).
6. As soon as the client receives confirmation that the block has been written, it notifies the NameNode, receives a DataNode chain for the next block, and so on.
The client keeps writing blocks as long as it manages to write each block to at least one node; replication then works on the familiar “eventual” principle, with the NameNode compensating afterwards to reach the desired replication level. A minimal client-side write through the FileSystem API is sketched below.
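Here is that minimal sketch, assuming the cluster's configuration files (core-site.xml, hdfs-site.xml) are on the classpath; the path and contents are hypothetical. The whole pipeline described in steps 1-6 (asking the NameNode for a DataNode chain and streaming each block from node to node) happens inside the client library:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() triggers the write pipeline described in steps 1-6 above.
        FSDataOutputStream out = fs.create(new Path("/user/cloudera/john.txt"), true);
        try {
            out.write("to be or not to be\n".getBytes("UTF-8"));
        } finally {
            out.close();  // completed blocks are reported back to the NameNode
        }
    }
}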
To wrap up the review of HDFS and the cluster, let's note another nice feature of Hadoop: rack awareness. The cluster can be configured so that the NameNode knows which nodes are located in which racks, thereby providing better protection against failures.
MapReduce
The unit of work is a job: a set of map tasks (parallel processing of the data) and reduce tasks (combining the map outputs). Map tasks are executed by mappers, reduce tasks by reducers. A job consists of at least one mapper; reducers are optional.
The problem of splitting a task into map and reduce stages is discussed here. If the words “map” and “reduce” mean nothing to you, have a look at the classic article on the topic.
MapReduce Model

- Data enters and leaves the system as (key, value) pairs.
- Two functions are used: map: (K1, V1) -> list(K2, V2), which turns an input key-value pair into a set of intermediate key-value pairs, and reduce: (K2, list(V2)) -> list(K3, V3), which collapses the set of values sharing a common key into a smaller set of values.
- Shuffle and sort is needed to group the reducer input by key: it makes no sense to send two pairs that share the same key, say (K2, Va) and (K2, Vb), to two different reducers; they must be processed together. The flow is illustrated just below.
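To make the model concrete, here is how a hypothetical one-line word-count input flows through the stages (the input text is made up for illustration):

    map("to be or not to be")      -> (to,1) (be,1) (or,1) (not,1) (to,1) (be,1)
    shuffle & sort (group by key)  -> (be,[1,1]) (not,[1]) (or,[1]) (to,[1,1])
    reduce                         -> (be,2) (not,1) (or,1) (to,2)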
Let's look at the architecture of MapReduce v1. First, let's extend the picture of the Hadoop cluster with two new elements: the JobTracker and the TaskTrackers. The JobTracker accepts requests from clients and manages map/reduce tasks on the TaskTrackers. The JobTracker and the NameNode run on separate machines, while a DataNode and a TaskTracker run on the same machine.
The interaction between the client and the cluster is as follows:

1. The client sends a job to the JobTracker; the job is packaged as a jar file.
2. The JobTracker selects TaskTrackers based on data locality, i.e. it prefers those that already hold the relevant HDFS data, and assigns map and reduce tasks to them.
3. The TaskTrackers send progress reports back to the JobTracker.
Task failure is expected behavior: failed tasks are automatically restarted on other machines. A small submission-and-polling sketch follows below.
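A hedged sketch of this interaction from the client side, using the same old mapred API as the WordCount example later in the article; the mapper, reducer and input/output settings are omitted here and would be configured exactly as in that example:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndWatch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitAndWatch.class);
        // ... mapper/reducer/input/output settings as in the WordCount example below ...
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);   // hand the jar-packaged job to the JobTracker
        while (!job.isComplete()) {                // poll the progress the TaskTrackers report back
            System.out.printf("map %.0f%%, reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "done" : "failed");
    }
}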
In MapReduce v2 (Apache YARN) the JobTracker/TaskTracker terminology is no longer used. The JobTracker is split into a ResourceManager, responsible for resource management, and an ApplicationMaster, responsible for managing an individual application (MapReduce itself being one such application type). MapReduce v2 also uses a new API; a sketch of a mapper written against it is shown below.
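For contrast with the old-API WordCount later in this article, here is a sketch of the same word-count mapper written against the new org.apache.hadoop.mapreduce API; the class name is my own:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mappers extend the Mapper class instead of implementing the mapred interface,
// and emit pairs through a Context object instead of an OutputCollector.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}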
Setting up the environment
There are several Hadoop distributions on the market: Cloudera, Hortonworks, and MapR, in order of popularity. However, we will not dwell on choosing a specific distribution; a detailed comparison of distributions can be found here.
There are two ways to try Hadoop painlessly and with minimal effort.
1. An Amazon cluster: a full-blown cluster, but this option costs money.
2. Download a virtual machine (manual number 1 or manual number 2). The downside here is that all the cluster's servers spin inside a single machine.
Now for the painful ways. Hadoop version one on Windows requires installing Cygwin; the upside is excellent integration with development environments (IntelliJ IDEA and Eclipse). More details in this wonderful manual.
Starting with version two, Hadoop also supports Windows Server editions. However, I would advise against combining Hadoop and Windows, not just in production but anywhere outside the developer's machine, although there are special distributions for that. Windows 7 and 8 are not currently supported by the vendors, but people who love a challenge can try to make it work by hand.
I also note that for Spring fans there is a
Spring for Apache Hadoop framework.
We will take the simple route and run Hadoop in a virtual machine. First, download the CDH-5.1 distribution for a virtual machine (VMware or VirtualBox). The distribution is about 3.5 GB. Download it, unpack it, import it into the VM, and that's it: we have everything we need. Time to write everyone's favorite WordCount!
Specific example
We need sample data. I suggest downloading any dictionary used for brute-forcing passwords; my file will be called john.txt.
Now open Eclipse: a training project has already been created for us and contains all the libraries needed for development. Let's throw out the code carefully laid out by the folks at Cloudera and paste in the following:
package com.hadoop.wordcount;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
We get something like this:

At the root of the training project, add the file john.txt via the menu File -> New File. Result:

Click Run -> Edit Configurations and enter the input file (john.txt) and the output directory as Program Arguments:

Click Apply and then Run. The job completes successfully:

So where are the results? Refresh the project in Eclipse (the F5 key):

In the output folder you can see two files: _SUCCESS, which indicates that the job completed successfully, and part-00000 with the results themselves.
This code can, of course, be debugged and so on. We will end the conversation with a look at unit testing. At the moment, the only framework written for unit-testing Hadoop code is MRUnit (https://mrunit.apache.org/), and it lags behind Hadoop: versions up to 2.3.0 are currently supported, while the latest stable Hadoop release is 2.5.0. A test sketch using MRUnit is shown below.
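A sketch of such tests for the WordCount classes above, assuming MRUnit 1.x and JUnit on the classpath and using the driver classes from org.apache.hadoop.mrunit that target the old mapred API; treat the exact factory method names as an assumption to check against your MRUnit version:

package com.hadoop.wordcount;

import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver;
import org.apache.hadoop.mrunit.ReduceDriver;
import org.junit.Test;

public class WordCountTest {

    @Test
    public void mapperEmitsOnePerToken() throws Exception {
        // Feed one input record to the mapper and check the emitted (word, 1) pairs.
        MapDriver.newMapDriver(new WordCount.Map())
                .withInput(new LongWritable(0), new Text("hadoop hadoop"))
                .withOutput(new Text("hadoop"), new IntWritable(1))
                .withOutput(new Text("hadoop"), new IntWritable(1))
                .runTest();
    }

    @Test
    public void reducerSumsCounts() throws Exception {
        // Feed a key with a list of counts to the reducer and check the sum.
        ReduceDriver.newReduceDriver(new WordCount.Reduce())
                .withInput(new Text("hadoop"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("hadoop"), new IntWritable(2))
                .runTest();
    }
}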
Ecosystem Blitz Overview: Hive, Pig, Oozie, Sqoop, Flume
A little about everything, in a nutshell.
Hive & Pig. In most cases, writing a MapReduce job in pure Java is too time-consuming and heavyweight, and as a rule it only makes sense when you need to squeeze out every last bit of performance. Hive and Pig are two tools for this situation: Hive is loved at Facebook, Pig at Yahoo. Hive has a SQL-like syntax (similarities and differences with SQL-92). Many people have come to the Big Data camp with a background in business analysis or as DBAs, and for them Hive is often the tool of choice. Pig focuses on ETL.
Oozie is a workflow engine for jobs. It lets you chain together jobs on different platforms: Java, Hive, Pig, and so on.
Finally, very briefly, the frameworks that feed data into the system:
Sqoop handles integration with structured data sources (RDBMS);
Flume handles unstructured data.
Review of literature and video courses
There is not that much literature on Hadoop. As for the second version, I have come across only one book that concentrates on it: Hadoop 2 Essentials: An End-to-End Approach. Unfortunately, the book is not available in electronic form, and I did not manage to read it.
I do not cover the literature on the individual ecosystem components (Hive, Pig, Sqoop) because it is somewhat outdated and, more importantly, such books are unlikely to be read cover to cover; they are more of a reference, and for that you can always fall back on the documentation.
Hadoop: The Definitive Guide sits in the Amazon top list and has many positive reviews. The material is dated: it is from 2012 and describes Hadoop 1. In its favor are those reviews and fairly broad coverage of the entire ecosystem.
Lublinskiy B., Professional Hadoop Solutions is the book from which much of the material for this article was taken. It is somewhat involved, but it offers many real, practical examples and pays attention to the specific nuances of building solutions, which is much more useful than simply reading a list of product features.
Sammer E., Hadoop Operations: about half the book is devoted to describing Hadoop configuration. Given that it dates from 2012, it will become obsolete very soon. It is aimed primarily at devops, of course, but I am of the opinion that you cannot understand and feel a system if you only develop it and never operate it. The book seemed useful to me because it discusses the standard problems of cluster backup, monitoring, and benchmarking.
Parsian M., Data Algorithms: Recipes for Scaling up with Hadoop and Spark: the main focus is on the design of MapReduce applications, with a strong bias toward the scientific side. Useful for a comprehensive and deep understanding and application of MapReduce.
Owens J., Hadoop Real-World Solutions Cookbook: like many other Packt books with the word “Cookbook” in the title, it is essentially technical documentation cut up into questions and answers, which is not as easy to do well as it sounds (try it yourself). Worth reading for a broad overview, and useful as a reference.
It is worth paying attention to two video courses from O'Reilly.
Learning Hadoop: 8 hours. It seemed too superficial to me, but the accompanying materials added some value: when you want to play with Hadoop you need live data, and here is a good source of it.
Building Hadoop Clusters: 2.5 hours. As the title suggests, the emphasis is on building clusters on Amazon. I really liked the course: short and clear.
I hope that my humble contribution will help those who are just starting to learn Hadoop.