No matter how often people claim that logging can completely replace debugging, alas, this is not quite so, and sometimes not so at all. Sometimes it simply does not occur to me that it was this particular variable that should have been written to the log; in debug mode, on the other hand, you can often inspect several data structures at once, or even stumble upon the problem area entirely by accident. So debugging is sometimes inevitable, and it can often save a lot of time.
Debugging a single-threaded Java application is simple. Debugging a multi-threaded Java application is a bit more complicated, but still easy. Debugging a multiprocess Java application? With processes running on different machines? That is definitely harder. This is why all the Hadoop manuals recommend turning to debugging only when the other options (read: logging) are exhausted and have not helped. The situation is often complicated by the fact that on large clusters you may not have access to specific map/reduce nodes (this is exactly what I ran into). But let's solve the problem in parts. So…
Scenario One: Local Hadoop
The easiest option of all. In a local installation of Hadoop everything runs on the same machine and, moreover, in the same process, just in different threads. Debugging it is equivalent to debugging an ordinary multithreaded Java application: what could be more trivial?
How to achieve this? Go to the directory where your local Hadoop is deployed (I assume you have either already done this or can read the relevant instructions and figure it out now).
$ cd ~/dev/hadoop
$ cp bin/hadoop bin/hdebug
$ vim bin/hdebug
Our task is to add one more JVM option, somewhere around lines 282-283 (the exact number may change between versions), right after the script has finished assembling $HADOOP_OPTS:
HADOOP_OPTS="$HADOOP_OPTS -Xdebug -Xrunjdwp:transport=dt_socket,address=1044,server=y,suspend=y"
What did this spell say? That we want to start the JVM with remote-debugger support: the debugger will have to connect to port 1044, and until it does, the program will be suspended (immediately after start). Simple, right? It remains to open Eclipse (or any other Java IDE) and add a remote debugging session there. For example, like this:

If you start debugging not on the local machine but on some server (which is quite normal, and I usually do exactly that: why load my laptop with such things when there are dedicated dev servers with a wild amount of memory and many processors?), then simply change localhost to the required host. Next, set a breakpoint in the program code (to begin with, in the main body, the so-called gateway code) and start Hadoop:
bin/hadoop jar myApplication.jar com.company.project.Application param1 param2 param3
Hadoop starts and does nothing; the JVM reports that it is waiting for a connection:
Listening for transport dt_socket at address: 1044
That's it: now connect to the debugging session from Eclipse and wait for your breakpoint to pop up. Within this scenario, debugging map/reduce classes is no different from debugging gateway code.
Scenario Two: Pseudo-Distributed Mode
In pseudo-distributed mode, Hadoop is closer to the state in which it will work in production (and we, accordingly, are closer to the headache we will have when something does break). The difference is that map/reduce tasks run in separate processes, HDFS is already "real" (not imitated by the local file system, although it lives on a single node), and debugging gets harder. You could, of course, try to apply the approach described above to each newly started process, but I have not tried this, and I suspect that without modifying Hadoop's own code it simply will not work.
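For reference, the classic pseudo-distributed setup from the Hadoop quickstart boils down to a few properties (the 9000/9001 ports are the conventional values from the documentation, not a requirement):

```xml
<!-- conf/core-site.xml: point the filesystem at a single-node HDFS -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run a real job tracker instead of the local runner -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```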
There are two main approaches to debugging in this mode: one uses a local task tracker, the other uses IsolationRunner. Let's say at once that the first option can give only a very rough approximation, since the nodes will differ and all the code will execute in one process (as in the previous scenario). The second option approximates real execution very closely, but alas, it is impossible if you do not have access to specific task nodes (which is very, very likely on large production clusters).
So, IsolationRunner is an auxiliary class that lets you simulate the execution of a specific task (Task) as accurately as possible, essentially by repeating its execution. Using it takes a few moves:
- Set keep.failed.tasks.files to true in the job configuration. Depending on how you build your job, this can be done either by editing the XML file or in the program code, but in any case it is not difficult. This instructs the task tracker not to delete the configuration and data of a task that failed.
- The official guide then recommends going to the node where the failed task ran. In our case this is still our local (or not so local, but still the only) host. The directory where the taskTracker data lives depends on the configuration, but in the default setup it will most likely be near hadoop/bin. Brazenly copying the example from the manual:
$ cd <local path>/taskTracker/${taskid}/work
$ bin/hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
In our case we can and should replace bin/hadoop with the bin/hdebug we created above (it is clear, of course, that the relative path will be somewhat different ;-)). As a result, we are debugging the failed task with exactly the data that led to its failure. Simple, beautiful, convenient.
- Connect with the debugger, act, find the error, be happy.
As I noted above, this method works in pseudo-distributed mode, but it will also work fine if you have access to the map/reduce nodes (an almost ideal situation). But there is also a third scenario, the most difficult one...
Scenario Three: Production Cluster
You are on a real, "combat" cluster of over a thousand machines, and no one will give you the right to log in to any of them except the gateway machines specially designated for this purpose (hence the name gateway code: it is the code that runs on those gateways). Everything is grown-up now; you may still be working on a "small" test data set of half a terabyte, but the debugging headache already rises to its full height. Here it is perhaps appropriate to repeat the advice about logging: if you have not done it yet, do it. At a minimum it will let you partially localize the problem, and perhaps you will then be able to apply one of the two methods above to identify and fix it. There is no simple debugging method here that gives predictable results. Strictly speaking, all you can do is make your job run locally, on the gateway: you will then work with the real production HDFS, but you will most likely not be able to reproduce behavior specific to particular nodes.
All you need to do to run the job locally is set mapred.job.tracker to local (in the job configuration). Now, with some luck, you can attach a debugger (though most likely not your favorite Eclipse, but something console-based running on the same network, such as jdb, or through an SSH tunnel if one is available) and run your code on the gateway. On the plus side, you work with the real data in this HDFS. On the minus side, if the bug is intermittent and reproduces, God forbid, only on one or two nodes of the cluster, you will hardly find it this way. The bottom line: this is the best you can do without talking to the cluster support team and requesting temporary access to a specific node.
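In XML form, this local-runner switch is just one property:

```xml
<!-- job configuration: run all map/reduce tasks in-process
     on the gateway instead of submitting to the cluster -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>
```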
Conclusion
So, here is a bit of food for thought about how difficult debugging map/reduce applications can be. Without retracting anything said above, I would recommend the following strategy:
- Scatter debug messages throughout the task code: first sparsely (roughly), then more densely once it becomes clear that the task fails, say, between lines 100 and 350
- Try to reproduce the problem on a local Hadoop: very often this succeeds, and you can figure out what is wrong
- Check whether you have access to specific map/reduce nodes. Find out whether it is realistic to get it (at least temporarily, if not permanently)
- Use debugging by staring! Seriously, it works, and it works even better if the eyes are not yours: a third party can often spot the bug in three seconds, while you have stared at it for so long that you no longer see it up close
- If all else fails, use various combinations of the above techniques. At least one of them is bound to work
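Regarding the first point: a cheap way to make those debug messages visible is to raise the log level for your own classes in Hadoop's conf/log4j.properties. The package name below is the hypothetical one from the examples above; substitute your own:

```properties
# raise verbosity for our application's classes only,
# leaving the rest of Hadoop at the default INFO level
log4j.logger.com.company.project=DEBUG
```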
Happy debugging!
P.S. Written based on my old post and a week of debugging a Hadoop application, a Java library, and the JNI code supporting it. Also, greetings to Umputun and Bobuk, whose episode 179 of RT and its tale of map/reduce reminded me that there are things (even if only a few) that I understand and can talk about ;-)