
Hadoop, Java MapReduce: launching a job from an arbitrary web / EE container

There are quite a few examples on the Internet of how to launch MapReduce jobs from stand-alone Java applications.
But when you first start working with the elephant, it can be hard to figure out how to run a job from inside a Java application container.

For example, this tutorial, courtesy of ikrumping, contains the following code example:

  Job job = new Job(config, "grep");
  /*
   * Specify the JAR file that contains the classes
   * needed to run the job.
   */
  job.setJarByClass(Grep.class);


This code will work if you run it as a stand-alone application.
But if you run the same code from JBoss AS, WebSphere AS, GlassFish AS, etc., it will not work.
Why? Because the container unpacks your JAR file into its own cache and loads the classes from there.

If you are wondering why the setJarByClass method does not work inside a container, I invite you under the spoiler.
First of all, I suggest looking at the implementation of the setJarByClass method.

  public void setJarByClass(Class cls) {
    String jar = findContainingJar(cls);
    if (jar != null)
      setJar(jar);
  }

  // Walks the class loader's resources looking for the JAR that contains the class.
  private static String findContainingJar(Class my_class) {
    ClassLoader loader = my_class.getClassLoader();
    String class_file = my_class.getName().replaceAll("\\.", "/") + ".class";
    try {
      Enumeration itr = loader.getResources(class_file);
      while (itr.hasMoreElements()) {
        URL url = (URL) itr.nextElement();
        // Only URLs with the "jar" protocol are considered.
        if ("jar".equals(url.getProtocol())) {
          String toReturn = url.getPath();
          if (toReturn.startsWith("file:")) {
            toReturn = toReturn.substring("file:".length());
          }
          toReturn = toReturn.replaceAll("\\+", "%2B");
          toReturn = URLDecoder.decode(toReturn, "UTF-8");
          return toReturn.replaceAll("!.*$", "");
        }
      }
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    return null;
  }


As you can see, the findContainingJar method expects the protocol of the URL to be "jar".
But each application container uses its own protocol for classes it has unpacked into its cache.
As a result, the setJarByClass method works reliably only for stand-alone (desktop) applications.
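
A quick way to see this for yourself is to print the protocol of the URL that the class loader reports for one of your classes. This is only a sketch (Grep is the class from the example above):

  String classFile = Grep.class.getName().replace('.', '/') + ".class";
  java.net.URL url = Grep.class.getClassLoader().getResource(classFile);
  // Outside a container this usually prints "jar"; inside JBoss, WebSphere,
  // GlassFish, etc. the container reports its own protocol, so
  // findContainingJar returns null and no JAR is attached to the job.
  System.out.println(url.getProtocol() + " -> " + url);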



How do you launch a MapReduce job in a universal way, independent of the specific application container?

To do this, follow these steps:
  1. build a separate JAR containing all the classes used by the job
  2. upload it to the HDFS file system of the cluster where you are going to run MapReduce (a Java sketch of this step follows the list)
  3. add the JAR archive to the job's classpath
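
Step 2 can be done with the hadoop command line or from Java. Below is a minimal sketch using the FileSystem API; the local path /tmp/test.jar is illustrative, while the HDFS path matches the one used further down:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Upload the job JAR to HDFS so it can later be added to the job classpath.
  Configuration config = new Configuration();
  FileSystem fs = FileSystem.get(config);
  fs.copyFromLocalFile(new Path("/tmp/test.jar"),            // local JAR, illustrative path
                       new Path("/user/UserName/test.jar")); // target path in HDFS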


In the above example, you need to replace:

  job.setJarByClass(Grep.class); 

with
  DistributedCache.addFileToClassPath(new Path("/user/UserName/test.jar"), config); 


Here the first parameter of the addFileToClassPath method is the path to the JAR file within the HDFS distributed file system,
and the second is the Hadoop configuration (org.apache.hadoop.conf.Configuration).
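
Putting it together, here is a minimal sketch of how the "grep" job from the example above could be configured inside a container. The mapper/reducer classes and input/output paths are illustrative and not from the original article; the JAR path is the one discussed above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  Configuration config = new Configuration();

  // Instead of job.setJarByClass(Grep.class): attach the JAR already
  // uploaded to HDFS so the task JVMs on the cluster can load our classes.
  DistributedCache.addFileToClassPath(new Path("/user/UserName/test.jar"), config);

  Job job = new Job(config, "grep");
  job.setMapperClass(GrepMapper.class);    // illustrative mapper class
  job.setReducerClass(GrepReducer.class);  // illustrative reducer class
  FileInputFormat.addInputPath(job, new Path("/user/UserName/input"));     // illustrative input
  FileOutputFormat.setOutputPath(job, new Path("/user/UserName/output"));  // illustrative output
  job.waitForCompletion(true);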

There used to be two more ways to pass your JAR to Hadoop, but they are already outdated: blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job

Source: https://habr.com/ru/post/235253/

