
Let's Talk About Hadoop

[image]

Introduction

For a person with a not-so-stable psyche, one glance at a picture like this is enough to trigger a panic attack. But I decided that I should be the only one who suffers. The purpose of this article is to make Hadoop look less scary.

What will happen in this article:


What will not be in this article:



What is Hadoop and why is it needed?

Hadoop is not that complicated: the core consists of the HDFS file system and the MapReduce framework for processing data stored in it.

If you look at the question “why do we need Hadoop?” from the point of view of a large enterprise, there are quite a lot of answers, and they range from “strongly for” to “strongly against”. I recommend the ThoughtWorks article.

If you look at the same question from a technical point of view (for which tasks does it make sense to use Hadoop?), it is also not so simple. Tutorials usually start with two basic examples: word count and log analysis. But what if I have neither word count nor log analysis?

It would be nice if the answer were as simple as it is for SQL: use SQL if you have a lot of structured data and you really want to talk to that data, asking it many questions whose nature and format are not known in advance.

The long answer is to look at a number of existing solutions and absorb, somewhere in the back of your mind, the conditions under which Hadoop is needed. You can dig through blogs; I can also recommend reading Mahmoud Parsian's book Data Algorithms: Recipes for Scaling up with Hadoop and Spark.

I'll try to give a shorter answer. Hadoop should be used if:



Hadoop should not be used if:


HDFS architecture and typical Hadoop cluster

HDFS is similar to other traditional file systems: files are stored as blocks, there is a mapping between blocks and file names, a tree structure is supported, a rights-based access model is supported, etc.

How HDFS differs:


A Hadoop cluster consists of three types of nodes: NameNode, Secondary NameNode and DataNode.

NameNode is the brain of the system. As a rule there is one such node per cluster (more in the case of NameNode Federation, but we leave that case aside). It stores all of the system's metadata, namely the mapping between files and blocks. If there is a single node, it is also a single point of failure. In the second version of Hadoop this problem is solved by NameNode High Availability (not to be confused with NameNode Federation, which addresses namespace scaling).

Secondary NameNode - one node per cluster. It is commonly said that “Secondary NameNode” is one of the most unfortunate names in the entire history of software, because the Secondary NameNode is not a replica of the NameNode. The state of the file system is stored in the fsimage file and in the edits log file, which contains the latest file system changes (similar to a transaction log in the RDBMS world). The job of the Secondary NameNode is to periodically merge fsimage and edits; it keeps the size of edits within reasonable limits. The Secondary NameNode is needed for quick manual recovery of the NameNode in case the NameNode fails.

In a real cluster, the NameNode and Secondary NameNode are separate servers that are demanding of memory and disk. The famous “commodity hardware” promise applies only to DataNodes.

DataNode - there are many such nodes in a cluster. They store the file blocks themselves. A DataNode regularly sends the NameNode a heartbeat (showing that it is still alive) and an hourly block report with information about all blocks stored on the node. This is needed to maintain the desired replication level.

Let's see how data is recorded in HDFS:
[diagram: writing data to HDFS]

1. The client splits the file into chunks of block size.
2. The client connects to the NameNode and requests a write operation, sending the number of blocks and the required replication level.
3. The NameNode responds with a chain of DataNodes.
4. The client connects to the first node in the chain (if that does not work, to the second, and so on). The client writes the first block to the first node, the first node forwards it to the second, and so on.
5. Upon completion of the write, acknowledgements are sent back in reverse order (4 -> 3, 3 -> 2, 2 -> 1, 1 -> back to the client).
6. As soon as the client receives confirmation that the block has been successfully written, it notifies the NameNode, receives the DataNode chain for the second block, and so on.

The client keeps writing blocks as long as it manages to write each block successfully to at least one node; replication then works on the familiar “eventual” principle: the NameNode will compensate later and bring the data up to the desired replication level.

Completing the review of HDFS and the cluster, let's note another nice Hadoop feature: rack awareness. The cluster can be configured so that the NameNode knows which nodes sit in which racks, thereby providing better protection against failures.
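
To make the write path less abstract, here is a minimal client-side sketch of writing a file to HDFS through the Java FileSystem API (the NameNode address and the path are made up for the example). The splitting into blocks and the DataNode pipeline described above happen inside the HDFS client library:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; on a real cluster it is taken from core-site.xml
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // The client code only sees a stream; splitting into blocks and the
        // DataNode replication pipeline are handled by the HDFS client library.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("hello hdfs\n");
        }
        fs.close();
    }
}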

MapReduce

The unit of work in MapReduce is a job: a set of map tasks (parallel data processing) and reduce tasks (combining the map outputs). Map tasks are executed by mappers, reduce tasks by reducers. A job consists of at least one mapper; reducers are optional. The problem of splitting work into map and reduce steps is discussed here. If the words “map” and “reduce” are completely unfamiliar to you, have a look at the classic article on the topic.
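
To get a feel for the model before touching Hadoop itself, here is a toy sketch in plain Java: the “map” step emits a (word, 1) pair for every word, the “reduce” step sums the values per key. The class name and the data are invented for illustration:

import java.util.*;

public class MapReduceToy {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "be quick");

        // "Map" phase: emit a (word, 1) pair for every word in every line
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Shuffle + "reduce" phase: group by key and sum the ones
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapped) {
            reduced.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }

        System.out.println(reduced); // {be=3, not=1, or=1, quick=1, to=2}
    }
}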

MapReduce Model

[diagram: the MapReduce model]



Let's look at the architecture of MapReduce 1. First, let's extend the notion of the Hadoop cluster by adding two new elements: JobTracker and TaskTracker. The JobTracker accepts requests from clients directly and manages map/reduce tasks on the TaskTrackers. The JobTracker and the NameNode are placed on different machines, while a DataNode and a TaskTracker run on the same machine.

The interaction between the client and the cluster is as follows:

[diagram: client-cluster interaction in MapReduce 1]

1. The client sends a job to the JobTracker. The job is a jar file.
2. The JobTracker picks TaskTrackers based on data locality, i.e. it prefers those that already store the needed HDFS data, and assigns map and reduce tasks to them.
3. The TaskTrackers send progress reports to the JobTracker.

Task failure is expected behavior; failed tasks are automatically restarted on other machines.
In MapReduce 2 (Apache YARN) the JobTracker/TaskTracker terminology is no longer used. The JobTracker is split into the ResourceManager, which manages resources, and the ApplicationMaster, which manages applications (MapReduce is just one of them). MapReduce v2 also uses a new API.
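
For comparison, here is a minimal sketch of what the mapper, reducer and job setup of WordCount (shown in full later with the old org.apache.hadoop.mapred API) look like in the new org.apache.hadoop.mapreduce API. The class names here are my own; only the API calls are Hadoop's:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountV2 {

    // In the new API, Mapper is a class to extend, not an interface to implement
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE); // Context replaces OutputCollector + Reporter
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-v2");
        job.setJarByClass(WordCountV2.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}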

Setting up the environment

There are several different Hadoop distributions on the market: Cloudera, Hortonworks, MapR - in order of popularity. However, we will not dwell on the choice of a specific distribution. A detailed analysis of distributions can be found here.

There are two ways to try Hadoop painlessly and with minimal effort.

1. An Amazon cluster - a full-fledged cluster, but this option costs money.

2. Download a virtual machine (manual number 1 or manual number 2). In this case the downside is that all the servers of the cluster run on a single machine.

Now for the painful ways. The first version of Hadoop on Windows requires installing Cygwin. The upside is excellent integration with development environments (IntelliJ IDEA and Eclipse). More details in this wonderful manual.

Starting with version two, Hadoop also supports Windows Server editions. However, I would not advise mixing Hadoop and Windows, not only in production but anywhere outside a developer's machine, even though special distributions exist for that. Windows 7 and 8 are not currently supported by the vendors, but people who love a challenge can try to set it up by hand.

I also note that for Spring fans there is a Spring for Apache Hadoop framework.

We will take the simple route and install Hadoop in a virtual machine. First, download the CDH-5.1 distribution for a virtual machine (VMware or VirtualBox). The distribution is about 3.5 GB. Download it, unpack it, load it into the VM, and that's it. We have everything. Time to write everyone's favorite WordCount!

Specific example

We need sample data. I suggest downloading any dictionary used for brute-forcing passwords. My file will be called john.txt.
Now open Eclipse: a training project has already been created for us, and it already contains all the libraries needed for development. Let's throw out all the code the Cloudera folks carefully prepared and copy in the following:

package com.hadoop.wordcount;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token of the input line
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sums the ones per word
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] - input path, args[1] - output directory (must not exist yet)
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

We get something like this:

[screenshot: the WordCount project in Eclipse]

At the root of the training project, add the file john.txt via the menu File -> New File. Result:

[screenshot]

Click Run -> Edit Configurations and in Program Arguments specify john.txt as the input and output as the output directory:

[screenshot: Run Configurations dialog]

Click Apply, then Run. The job completes successfully:

[screenshot]

And where are the results? Refresh the project in Eclipse (the F5 key):

[screenshot]

In the output folder you can see two files: _SUCCESS, which says that the job completed successfully, and part-00000 with the results themselves.
This code can, of course, be debugged and so on. We will wrap up the conversation with a look at unit tests. So far the only framework written for unit testing Hadoop code is MRUnit (https://mrunit.apache.org/), and it lags behind Hadoop: at the moment versions up to 2.3.0 are supported, while the latest stable Hadoop release is 2.5.0.
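
For illustration, a test for the mapper above might look roughly like this. It is a sketch that assumes MRUnit and JUnit are on the classpath; the test class name and the input string are invented for the example:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.MapDriver; // driver for the old (mapred) API
import org.junit.Test;

public class WordCountMapTest {

    @Test
    public void mapperEmitsOnePerWord() throws Exception {
        new MapDriver<LongWritable, Text, Text, IntWritable>()
                .withMapper(new WordCount.Map())
                .withInput(new LongWritable(0), new Text("hadoop is not scary hadoop"))
                // expected output pairs, in the order the mapper emits them
                .withOutput(new Text("hadoop"), new IntWritable(1))
                .withOutput(new Text("is"), new IntWritable(1))
                .withOutput(new Text("not"), new IntWritable(1))
                .withOutput(new Text("scary"), new IntWritable(1))
                .withOutput(new Text("hadoop"), new IntWritable(1))
                .runTest();
    }
}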

Ecosystem Blitz Overview: Hive, Pig, Oozie, Sqoop, Flume

In a nutshell, and about everything at once.

Hive & Pig . In most cases, writing a Map / Reduce job in pure Java is too time consuming and overworking, which makes sense, as a rule, only to pull out all possible performance. Hive and Pig are two tools for this case. Hive love in Facebook, Pig love Yahoo. Hive has a SQL-like syntax ( similarities and differences with SQL-92 ). Many people have moved to the Big Data camp with experience in business analysis, as well as DBA - for them Hive is often the tool of choice. Pig focuses on ETL.

Oozie is a workflow engine for jobs. It lets you chain together jobs on different platforms: Java, Hive, Pig, and so on.

Finally, the frameworks that feed data into the system, very briefly: Sqoop handles integration with structured data (RDBMS), Flume with unstructured data.

Review of literature and video courses

There is not that much literature on Hadoop. As for the second version, I have come across only one book that concentrates on it: Hadoop 2 Essentials: An End-to-End Approach. Unfortunately, the book is not available in electronic format, and I was not able to read it.

I do not consider the literature on individual components of the ecosystem - Hive, Pig, Sqoop - because it is somewhat outdated, and most importantly, such books are unlikely to be read cover to cover; rather, they will be used as a reference guide. And there is always the documentation.

Hadoop: The Definitive Guide is in the Amazon top lists and has many positive reviews. The material is dated: published in 2012 and describing Hadoop 1. On the plus side, it offers fairly broad coverage of the entire ecosystem.

Lublinskiy B. Professional Hadoop Solutions is the book from which much of the material for this article is taken. Somewhat complicated, but with a lot of real practical examples; attention is paid to the specific nuances of building solutions. Much nicer than just reading product feature descriptions.

Sammer E. Hadoop Operations - about half of the book is devoted to describing Hadoop configuration. Given that it is from 2012, it will become obsolete very soon. It is intended primarily for devops, of course. But I am of the opinion that it is impossible to understand and feel a system if you only develop it and never operate it. The book seemed useful to me because it discusses the standard problems of cluster backup, monitoring and benchmarking.

Parsian M. "Data Algorithms: Recipes for Scaling up with Hadoop and Spark" - the main focus is on the design of MapReduce applications, with a strong bias toward the scientific side. Useful for a comprehensive and deep understanding and application of MapReduce.

Owens J. Hadoop Real World Solutions Cookbook - like many other Packt books with the word “Cookbook” in the title, is essentially technical documentation cut up into questions and answers, which is also not so easy to do on your own. Worth reading for a broad overview, and for use as a reference.

It is worth paying attention to two video courses from O'Reilly.

Learning Hadoop - 8 hours. It seemed too superficial to me, but the supplementary materials added some value: I want to play with Hadoop, and for that some live data is needed, and here it is - a great source of data.

Building Hadoop Clusters - 2.5 hours. As the title suggests, the emphasis here is on building clusters on Amazon. I really liked the course: short and clear.
I hope that my humble contribution will help those who are just starting to learn Hadoop.

Source: https://habr.com/ru/post/234993/

