In this post I describe my experience of hunting down one particular memory leak in a large project. Many mistakes were made along the way, but maybe it will be useful to someone.

For a long time I had not run into memory leaks, but the other day our C++ daemon started leaking. Worse, valgrind's memcheck showed nothing useful: under valgrind the server takes forever to start and cannot handle much load, so the watchdog that checks for server hangs under load quickly killed the server.
And it leaks badly: 7 GB in 3 days. Previously, in normal operation, it never needed more than 1 GB (it once ran for 2 weeks without a restart).
My first guesses: either cyclic references, or data not being deleted from some container.
Well, I decided to use the good old google perf tools to take a heap snapshot of the running process. Hmm. Under Ubuntu amd64, for some reason, only an old version is packaged. And it hangs on our multi-threaded application (even though google perf tools is built specifically for multi-threaded applications).
In the debugger you can see that all the threads are stuck on locks.
I decided to upgrade to the latest version, but I didn't feel like building it from source. I found packages on packages.debian.org, and, surprisingly, everything installed without a hitch.
But attached to the multi-threaded daemon, it hung again.
On top of that, the daemon itself takes 15-20 minutes to start, since it loads a database taken from production, and there is a lot of data.
I looked around for other decent debugging allocators, but the one I needed (a heap profiler like the google perf tools one, only without the hangs) didn't exist.
Well, maybe valgrind's memcheck has some flags to dump the memory state in the middle of a run rather than only at the end? No. But... there is valgrind's massif, which does exactly that. Hooray!
Ruthlessly pruning the database (leaving just enough for the load tests to work - they need a real game and at least some valid data), I started poking at valgrind's massif.
I ran it and drove the load tests: everything works, nothing leaks. Very interesting.
In short, the tool gave me no answer - apparently the synthetic load did not hit the code path with the leak. Time to switch the brain on ;)
Unfortunately, a lot of code had changed since the last stable release, and I was the one to blame: I had done a refactoring that touched almost half of the codebase ;)
I thought about dumping the memory and seeing what was in there, but digging through a 7 GB file did not appeal to me. And then a crazy idea came: analyze the statistics across the daemons (luckily we run 3 of them in the cluster) that I could collect both from the OS and from the daemons' internal monitoring.
Quite by chance I noticed a correlation between the amount of leaked memory and the number of executed commands of one particular type. The command was old, but I remembered that in this release we had started using it much more actively. I checked the ratio on a calculator (python in the shell ;) - it matched everywhere. This was what was eating the memory.
The problem is that this command used to be invoked only from billing, so there was no load test for it. And recently a similar command had been added that uses the same machinery as the first one but runs many times more often. It is used exclusively by the social-network features and was never added to the load test either. A bad situation all around.
After adding this command to the load test, it ate 50% of my laptop's memory in 10 seconds. From there it was a matter of technique - I didn't even need massif. I read the code, found the place where a cyclic reference could be created, changed the pointer policy from Strong to Weak, rebuilt, reran the test - hooray, the leaks were gone.
So, what conclusions did I draw from a day lost to debugging? Here they are:
- Run tests for memory leaks more often.
- Even if a leak is small, the code may start being used much more actively in the future, and then a small leak turns into a severe headache.
- You need a representative database that is functionally similar to production but with far less data. Without it, running under a memory debugger is next to impossible.
- I was reminded once again that the absence of bare pointers does not guarantee the absence of leaks (this applies to Java and C# just as much as to smart pointers in C++).
- When nothing else helps, chance can save the day.
- Accidents are not accidental © =) - if I hadn't gone looking for patterns in the statistics, I would not have found them, even though I wasn't really hoping to find anything.