
Hadoop Tutorial. Writing your own grep

Good day, dear Habr community. Not long ago I started learning how to work with big data (MapReduce, NoSQL...) and very quickly came across the open-source framework Apache Hadoop, which I immediately set about studying.

This post is aimed at beginners who, like me, have only recently started exploring Hadoop. It walks through a small application built on this framework (a Hello World! of sorts). If you are interested, welcome under the cut.

This topic does not cover installation, configuration, or launch problems; resources for studying those are listed at the end. I used the following technologies in my work:

Since the word counter (also known as Word Count) is demonstrated in the overwhelming majority of tutorials, I decided to diversify the topic and walk through grep as an example.
Our implementation will receive as input:
- a directory with the text files to search;
- a directory for the results;
- a regular expression (the search pattern).

At the output we get a file (or files) containing the full paths to the files in which matches were found (the keys) and the lines with those matches (the values).
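For illustration only (the path and lines below are made up), one record of the output will look like this: the key (file path), a tab, and then the value with the matching lines (the tab at the end of the first line is invisible here):

/user/hduser/input/example.txt
first line with a match
second line with a match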

The whole data processing flow is built on the MapReduce paradigm. Its essence is that we split all the work into two stages: map and reduce. For our grep, the map stage turns every input line that matches the pattern into a (file path, matching line) pair, and the reduce stage glues together all the matching lines that belong to one file.
So let's get started.

Map


At this stage we receive a key and a value as arguments (with TextInputFormat these are the byte offset of a line and the line itself). The pairs are processed one by one, and map() emits intermediate key/value pairs for the reduce stage.
Our implementation of the map function:
/*
 * Our grep mapper. Uses the new API org.apache.hadoop.mapreduce.*
 */
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/*
 * LongWritable - type of the input key (byte offset of the line).
 * Text - type of the input value (the line itself).
 * Text - type of the output key (path to the file).
 * Text - type of the output value (the line containing a match).
 */
public class RegexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Pattern pattern;
    private Text keyOut; // the output key: the path to the file being processed

    /*
     * setup() is called once, before the first call to map() (per task).
     * Here we do the work that should not be repeated on every map() call.
     */
    @Override
    public void setup(Context context) throws IOException {
        /*
         * Compile the regular expression (the pattern),
         * which is passed in by the Driver class (shown below).
         */
        pattern = Pattern.compile(context.getConfiguration().get("regex"));

        /* Find out the path to the file the current lines (valueIn) come from. */
        Path filePath = ((FileSplit) context.getInputSplit()).getPath();
        keyOut = new Text(filePath.toString());
    }

    /*
     * The map() method itself. It is called once for every line of the input.
     * If the line matches the pattern, we emit a pair: the path to the file
     * (keyOut, computed in setup()) as the key and the line itself
     * (valueIn) as the value.
     */
    @Override
    public void map(LongWritable key, Text valueIn, Context context)
            throws IOException, InterruptedException {
        Matcher matcher = pattern.matcher(valueIn.toString());

        /*
         * We only need to know whether the line contains a match at all,
         * so a single find() is enough; the match positions do not matter.
         */
        if (matcher.find())
            context.write(keyOut, valueIn); // emit the intermediate pair
    }
}

That's all for the map stage. On to reduce.

Reduce


At the reduce stage, we receive as arguments one key and all the values produced for it at the output of the map method(s), for further processing. In our case the key is the path to a file in which text matching the given pattern was found, and the list of values holds the lines where those matches occurred.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/* All four type parameters are Text: the file path as the key,
 * the matching lines as the values. */
public class RegexReducer extends Reducer<Text, Text, Text, Text> {

    /* Glue all the matching lines of one file into a single value. */
    @Override
    public void reduce(Text keyIn, Iterable<Text> valuesIn, Context context)
            throws IOException, InterruptedException {
        /* Concatenate the lines with a StringBuilder. */
        StringBuilder valueOut = new StringBuilder();
        for (Text value : valuesIn)
            valueOut.append("\n" + value.toString());
        valueOut.append("\n");
        context.write(keyIn, new Text(valueOut.toString()));
    }
}

With map and reduce sorted out, it remains to pack everything into a driver class and run it.

Driver


In the driver class, the job is configured: the mapper and reducer classes are set, along with the input and output types, and so on.
In full, it looks like this:
import com.petrez.mappers.RegexMapper;
import com.petrez.reducers.RegexReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ToolRunner;

import java.io.IOException;

public class Grep {
    public static void main(String[] args) throws IOException,
                                                  ClassNotFoundException,
                                                  InterruptedException {
        if (args.length != 3) {
            System.out.println("Usage: <inDir> <outDir> <regex>");
            ToolRunner.printGenericCommandUsage(System.out);
            System.exit(-1);
        }

        Configuration config = new Configuration();
        /* Pass the regex to the map() method through the configuration
         * under the key "regex". */
        config.set("regex", args[2]);

        Job job = new Job(config, "grep");
        /*
         * Point Hadoop at a class inside our jar so it knows
         * which jar to ship to the cluster nodes.
         */
        job.setJarByClass(Grep.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        /*
         * Input format. TextInputFormat feeds the input files to the
         * map() method line by line; the lines are separated by "\n".
         */
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(RegexMapper.class);
        job.setReducerClass(RegexReducer.class);

        job.waitForCompletion(true);
    }
}
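Since the driver already calls ToolRunner.printGenericCommandUsage(), a natural next step is to implement the Tool interface, so that ToolRunner parses the generic Hadoop options (-D, -files, and so on) before our code sees the arguments. Below is a minimal sketch of that variant (my own, not from the post; the class name GrepTool is made up):

import com.petrez.mappers.RegexMapper;
import com.petrez.reducers.RegexReducer;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GrepTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 3) {
            System.out.println("Usage: <inDir> <outDir> <regex>");
            ToolRunner.printGenericCommandUsage(System.out);
            return -1;
        }
        /* getConf() already holds whatever -D options ToolRunner parsed. */
        getConf().set("regex", args[2]);

        Job job = new Job(getConf(), "grep");
        job.setJarByClass(GrepTool.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(RegexMapper.class);
        job.setReducerClass(RegexReducer.class);

        /* Exit code 0 on success, 1 on failure. */
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        /* ToolRunner strips the generic options before calling run(). */
        System.exit(ToolRunner.run(new GrepTool(), args));
    }
}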


All that remains is to run it.


Since this implementation is designed for beginners, it is simplified to the detriment of efficiency: a separate (key, value) pair is emitted for every matching line, and the reducer then collects all of a file's matches in memory. This greatly simplifies the mapper and the reducer, but it is very memory-intensive. Please take this into account.
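If that matters for your data, a less memory-hungry variant is to write each matching line out as soon as it is seen, instead of accumulating everything in a StringBuilder. A minimal sketch (my own, not from the original post; the class name StreamingRegexReducer is made up), at the price of repeating the file-path key on every output line:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/* Streaming variant of RegexReducer: holds at most one line in memory. */
public class StreamingRegexReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text keyIn, Iterable<Text> valuesIn, Context context)
            throws IOException, InterruptedException {
        for (Text value : valuesIn)
            context.write(keyIn, value); // one output record per match
    }
}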

Since I have packed everything into an executable jar file, our program can be run like this:
<path to hadoop>/bin/hadoop jar /home/hduser/HadoopGrep.jar <input directory> <output directory> <regex>
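For example (the installation path, directories, and pattern here are hypothetical):

/usr/local/hadoop/bin/hadoop jar /home/hduser/HadoopGrep.jar /user/hduser/books /user/hduser/grep-out "[Hh]adoop"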

The path for saving the results must point to a directory that does not yet exist. If you configured Hadoop in pseudo-distributed mode, the results end up in the HDFS file system, and you still need to pull them out of it.
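This can be done with the standard hadoop fs commands, for example (using the hypothetical directories from the run above; part-r-00000 is the default name of the first reducer's output file):

<path to hadoop>/bin/hadoop fs -cat /user/hduser/grep-out/part-r-00000
<path to hadoop>/bin/hadoop fs -get /user/hduser/grep-out /home/hduser/grep-out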

Materials for study:


Thanks to all.

UPD: Changes made following GreyCat's remarks.

Source: https://habr.com/ru/post/189798/

