
The story of one garbage collection

This instructive story is about how important it is to develop your googling skills, and about my struggle with an hourly full garbage collection.

Brief description of the problem


After we migrated one of the system's components (the only one running on Tomcat) to a new version of Tomcat in production, the support team suddenly panicked on seeing one-and-a-half-second GC pauses in the logs.

In general, on our project (an FX trading platform in an investment bank) any GC run longer than 50 milliseconds (ms) worries the support team, which monitors, among other things, the GC logs, and a GC longer than 100 ms causes hysteria. So, having seen 1.45 seconds in the logs, they simply panicked. I happened to be on the "second line of defense" that day, so I started to figure out what was going on.

Investigation progress


To start with, I dug into the logs and saw:
[Full GC (System) 25.575: [CMS: 20700K->22393K(655360K) <...>

The key part here is the indication that this is a System collection, i.e. a full garbage collection triggered by a call to System.gc(). It would seem superfluous to say that we do not use this call in our code.
Lyrical digression
To be completely honest, that is not quite true: we do use System.gc(), and with a perfectly good purpose. During startup the system pre-caches a huge amount of "static data" - information about users, currencies, and so on - and this process generates quite a lot of garbage. So, once all the preparations are finished and just before we report that the system is up, we call System.gc() to get rid of that garbage and of unnecessary heap fragmentation in one go. We also do the same thing every night at 5:00 pm New York time, when the trading day ends and the whole system goes offline for 5 minutes (maintenance mode), to clean out the garbage accumulated during the day and defragment the memory. During trading hours we try to avoid old-gen collections entirely, since even ParNew collections taking 20-40 ms sometimes cause unacceptable delays.
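As an aside, the "(System)" tag in the log line above is exactly how an explicit System.gc() call shows up. Here is a minimal, purely illustrative sketch (nothing from our code base): run it with -verbose:gc on a JDK 6-era HotSpot and the resulting full collection is tagged "Full GC (System)"; newer JVMs print "Full GC (System.gc())" instead.

 // Illustrative only: an explicit System.gc() produces a log entry marked as an
 // explicit ("System") collection when the JVM is started with -verbose:gc.
 public class ExplicitGcDemo {
     public static void main(String[] args) {
         byte[] garbage = new byte[16 * 1024 * 1024]; // allocate something collectible
         garbage = null;                              // make it unreachable
         System.gc();                                 // shows up as "Full GC (System)"
     }
 }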

My first suspect, from old memory, was the DGC (the Distributed Garbage Collector - in short, the garbage collector that is activated when an application uses RMI). Once upon a time it was the DGC that caused hourly full collections on many of our systems, for the simple reason that we use RMI for JMX. So, one fine day many years ago, we added the following to the JVM launch parameters on all our systems:

 -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFFE -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFFE 

Sources of these settings: [1], [2]. All they are meant to do is tell the DGC that it may run once every 0x7FFFFFFFFFFFFFFE milliseconds (roughly once every 292,471,208 years).
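A quick back-of-the-envelope check of that figure (just arithmetic, nothing from the system itself):

 // 0x7FFFFFFFFFFFFFFE ms converted to years (using 365-day years, as the figure above does)
 public class DgcIntervalInYears {
     public static void main(String[] args) {
         long intervalMs = 0x7FFFFFFFFFFFFFFEL;                 // Long.MAX_VALUE - 1
         double years = intervalMs / 1000.0 / 60 / 60 / 24 / 365;
         System.out.printf("%.0f years%n", years);              // ~292,471,208 years
     }
 }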

After checking the Tomcat start script, I saw that there was a small mistake there: instead of 0x7FFFFFFFFFFFFFFE, it used 0x7FFFFFFFFFFFFFFF. That boundary value trips the following guard and results in an IllegalStateException (code from the sun.misc.GC class):
 if (this.latency == 9223372036854775807L) {
     throw new IllegalStateException("Request already cancelled");
 }
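Just to make the connection between the hex constant in the start script and the decimal literal in that guard explicit (a trivial check, assuming nothing beyond standard Java):

 public class BoundaryCheck {
     public static void main(String[] args) {
         // The mistyped value is exactly Long.MAX_VALUE, the sentinel in the guard above...
         System.out.println(0x7FFFFFFFFFFFFFFFL == 9223372036854775807L); // true
         System.out.println(0x7FFFFFFFFFFFFFFFL == Long.MAX_VALUE);       // true
         // ...while the intended value is one less, an ordinary valid latency.
         System.out.println(0x7FFFFFFFFFFFFFFEL == Long.MAX_VALUE - 1);   // true
     }
 }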

We need to go deeper!


However, having corrected this misunderstanding, I was very surprised to see the very same hourly Full GC (System) again after restarting Tomcat! I had to put my thinking cap on. To find out who was responsible for the System.gc() call, I quickly wrote my own implementation of the java.lang.Runtime class, copying the standard class and changing its gc() method (which is what System.gc() calls).
Original method:
 public native void gc(); 

Modified method:
 public void gc() { Thread.dumpStack(); } 

As you can see, all we do here is find out who is calling us by printing the stack trace, since all that Thread.dumpStack() does is create an exception and print its stack trace to the standard error stream:
 /**
  * Prints a stack trace of the current thread to the standard error stream.
  * This method is used only for debugging.
  *
  * @see Throwable#printStackTrace()
  */
 public static void dumpStack() {
     new Exception("Stack trace").printStackTrace();
 }
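For anyone who wants to see the effect without patching java.lang.Runtime, here is a tiny self-contained sketch of the same trick (the class and method names are purely illustrative):

 public class WhoCalledMe {
     static void gc() {              // stands in for the patched Runtime.gc()
         Thread.dumpStack();         // prints "java.lang.Exception: Stack trace" plus the callers to stderr
     }
     public static void main(String[] args) {
         gc();                       // the resulting trace shows main() as the caller
     }
 }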

We substitute our java.lang.Runtime for the JVM's own class with -Xbootclasspath/p:/tmp/runtime, restart... and what do we see?

 java.lang.Exception: Stack trace
     at java.lang.Thread.dumpStack(Thread.java:1249)
     at java.lang.Runtime.gc(Runtime.java:689)
     at java.lang.System.gc(System.java:926)
     at sun.misc.GC$Daemon.run(GC.java:92)

The same DGC! But how? Hadn't we set an effectively infinite interval between DGC runs? Here I had to strain all my googling skills, and however I searched, I could not find anything. I checked whether anything was overwriting my DGC settings; it turned out that by the time the GC ran, all the system properties still had exactly the values I had set. Surprise…
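The check itself is trivial; something along these lines (a sketch, not the actual code I used) is enough to confirm that the -D values survive into the running JVM:

 public class DgcPropertyCheck {
     public static void main(String[] args) {
         // Print the effective DGC interval properties as the running JVM sees them.
         System.out.println("sun.rmi.dgc.server.gcInterval = "
                 + System.getProperty("sun.rmi.dgc.server.gcInterval"));
         System.out.println("sun.rmi.dgc.client.gcInterval = "
                 + System.getProperty("sun.rmi.dgc.client.gcInterval"));
     }
 }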

... Deeper!


At this point my suspicion fell on Tomcat itself. And indeed, a lucky combination of words in a Google query gave me one single link that made me say things I am too ashamed to type here; my English colleagues (who had been helping me with the investigation as best they could, mostly by asking guiding questions like "Have you tried googling it?") looked at me very disapprovingly... Then they read what was behind the link, and repeated my curses.
So here it is, the link, which in turn leads to the Tomcat Bugzilla. In short, these brilliant guys from Apache use reflection to overwrite the interval at which the DGC should be called, right inside the sun.misc.GC class, so even though we set our -Dsun.rmi.dgc.* properties, they simply have no effect! We were running Tomcat 6.0.35, and this bug was fixed in the next version, 6.0.36.

Here is the code from the Tomcat class JreMemoryLeakPreventionListener, which is actually responsible for this behavior:
 /*
  * Several components end up calling:
  *     sun.misc.GC.requestLatency(long)
  *
  * Those libraries / components known to trigger memory leaks
  * due to eventual calls to requestLatency(long) are:
  * - javax.management.remote.rmi.RMIConnectorServer.start()
  */
 if (gcDaemonProtection) {
     try {
         Class<?> clazz = Class.forName("sun.misc.GC");
         Method method = clazz.getDeclaredMethod("requestLatency",
                 new Class[] {long.class});
         method.invoke(null, Long.valueOf(3600000));
     } catch (...) {
         ...
     }
 }
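This also suggests a way to confirm the diagnosis from inside the running JVM. The sketch below reflectively asks sun.misc.GC for its current latency target; I am assuming the internal class exposes a currentLatencyTarget() accessor, as the JDK 6/7-era sources did, so the call is wrapped defensively:

 import java.lang.reflect.Method;

 public class DgcLatencyProbe {
     public static void main(String[] args) {
         try {
             Class<?> gc = Class.forName("sun.misc.GC");
             Method m = gc.getDeclaredMethod("currentLatencyTarget");
             m.setAccessible(true);
             // With the listener's protection active, this should print 3600000 (one hour),
             // regardless of what -Dsun.rmi.dgc.*.gcInterval was set to.
             System.out.println("GC latency target (ms): " + m.invoke(null));
         } catch (Exception e) {
             System.out.println("sun.misc.GC not available or API differs: " + e);
         }
     }
 }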

On that same StackOverflow page, I also found a workaround: set gcDaemonProtection="false" on the JreMemoryLeakPreventionListener element in Tomcat's conf/server.xml, i.e. roughly:

 <Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener"
           gcDaemonProtection="false" />


Which is exactly what I did. As you can see from the code above, setting gcDaemonProtection to false simply disables the block of code responsible for the strange behavior. And, oh miracle, it helped! The hourly garbage collection sank into oblivion, the support team was happy, and I went off to drink tea.

Conclusions


And the conclusions, in general, are few:
  1. The most important thing: learn to formulate Google queries! I tried a dozen different combinations until I finally hit on the utterly banal query tomcat hourly full GC , which gave me the link I needed.
  2. Do not assume that third-party software, even very well-known software, has no bugs - it does! And this is not the first time we have run into them: last time it was a bug in the HotSpot CMS collector, which badly hurt our garbage collection times and which we dealt with by upgrading to a newer JVM.
  3. And, of course, reflection is evil, especially if you are the developer of a container or a library and you poke your dirty hands into system classes via reflection.


    Thank you for your attention; I hope my experience will be useful to someone.

Source: https://habr.com/ru/post/179309/

