
Recently I have come across quite a few articles promoting the latest garbage collector in Go in ways that trouble me. Some of them were written by the developers of the language itself, and their claims hint at a radical breakthrough in garbage collection technology.
Here is the initial announcement of the new collector, dated August 2015:
Go is building a garbage collector (GC) not only for 2015 but for 2025 and beyond... The GC in Go 1.5 heralds the arrival of a future in which GC pauses are no longer a barrier to moving to a safe language. It is a future in which applications scale effortlessly along with the hardware, and as the hardware becomes more powerful the GC is no longer an impediment to better, more scalable software. Go is a good language to use for at least the next ten years.
The creators claim that they did not merely solve the problem of garbage collection pauses, but went much further:
One high-level way to solve performance problems is to add GC settings (knobs), one for each problem, so the programmer can turn them in search of the best combination for their application. The downside of this approach is that with one or two new knobs introduced every year, after ten years you end up having to legally regulate the profession of people who turn these knobs. Go did not go down that path. Instead of a pile of settings, we kept a single one and called it GOGC.
Moreover, freed from the burden of supporting dozens of knobs, the developers can focus on improving the runtime itself.
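
To make this concrete, here is what that single knob looks like in practice. This is only a minimal sketch: GOGC can be set through the environment (for example, GOGC=50 ./server) or changed at runtime via the standard runtime/debug package. The default of 100 means the heap may grow to twice the live set before the next collection.

    package main

    import (
        "fmt"
        "runtime/debug"
    )

    func main() {
        // Collect once the heap grows 50% over the live data, trading
        // more CPU time spent on GC for lower heap overhead.
        old := debug.SetGCPercent(50)
        fmt.Println("previous GOGC value:", old)
    }
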
I have no doubt that many Go users were delighted to get a new runtime. But I have a problem with these statements: they read like misleading marketing bullshit. And since they are repeated over and over across the Web, the time has come to examine them in detail.
The reality is that Go's garbage collector does not implement any new ideas or research results. As the authors themselves admit, it is a concurrent mark/sweep collector based on ideas from the 1970s. It is notable only because it was designed to optimize pause duration at the expense of every other important characteristic of a collector. Go's technical and marketing materials never mention these tradeoffs, so many programmers are unaware that they exist. As a result, competing languages come across as badly engineered junk. And the Go authors only fuel that impression:
To create a garbage collector for the next decade, we turned to an algorithm from decades ago. Our GC implements the tri-color concurrent mark/sweep algorithm proposed by Dijkstra in 1978. This is a deliberate divergence from most "enterprise"-grade collectors of today, and one that we believe is well suited to the properties of modern hardware and the latency requirements of modern software.
Reading all this, you might come away thinking that nothing better has been proposed in the field of "enterprise" garbage collectors over the past 40 years.
Introduction to Garbage Collection Theory
When developing a garbage collection algorithm, a number of factors must be considered:
- Program throughput: how much does the algorithm slow the program down? This is sometimes expressed as a percentage: the ratio of CPU time spent on collection to CPU time spent on useful work.
- Collector throughput: how much garbage can the collector clear in a given amount of CPU time?
- Heap overhead: how much memory does the collector require beyond the theoretical minimum? If it allocates temporary structures during collection, does that cause sudden spikes in the program's memory consumption?
- Pause duration: for how long does the collector stop the program?
- Pause frequency: how often does the collector stop the program?
- Pause distribution: are most pauses very short with a few very long ones, or are pauses more uniform in duration?
- Allocation performance: is allocating new memory fast, slow, or unpredictable?
- Compaction: does the collector report an out-of-memory (OOM) error even though there is enough free space to satisfy the request, because that space is scattered across the heap in small chunks? If it does not compact, the program may slow down and eventually die even though there is enough memory to continue.
- Concurrency: how effectively does the collector use multi-core machines?
- Scaling: how well does the collector work as the heap grows?
- Tuning: how complicated is the collector's configuration, both out of the box and when seeking optimal performance?
- Warm-up time: does the algorithm self-adjust based on measurements of its own behavior? If so, how long does it take to reach optimal operation?
- Page release: does the algorithm return unused memory to the operating system? If so, when?
- Portability: does the collector work on CPU architectures with weaker memory consistency guarantees than x86?
- Compatibility: which languages and compilers does the collector work with? Can it be used with a language that was designed without garbage collection in mind, such as C++? Does it require compiler modifications? If so, does changing the collection algorithm require recompiling the whole program and all its dependencies?
As you can see, there are many different factors to weigh when designing a collector, and some of them influence the architecture of the wider ecosystem around your platform. And I am not even sure this list is complete.
Because the design space is so complex, garbage collection is a subfield of computer science that is rich in research papers. New algorithms are proposed and implemented regularly, in both academic and commercial settings. But, unfortunately, nobody has yet created an algorithm suitable for every occasion.
Nothing but tradeoffs
Let's look at some of these tradeoffs in more detail.
The first garbage collection algorithms were designed for single-processor machines and programs with small heaps. CPU and memory were expensive, and users were undemanding, tolerating visible pauses. The algorithms of that era aimed to minimize CPU usage and heap overhead. That meant the collector did nothing for as long as the program could still fit its data in memory. Then it paused the program to perform a full mark/sweep of the heap, freeing some memory as quickly as possible.
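
To illustrate, here is a minimal sketch of that classic stop-the-world mark/sweep cycle over a toy object graph. The Object and Heap types are invented purely for demonstration and have nothing to do with any real runtime:

    package main

    import "fmt"

    // Object is a toy heap object holding outgoing pointers.
    type Object struct {
        refs   []*Object
        marked bool
    }

    // Heap tracks every allocation plus the roots (stack and globals).
    type Heap struct {
        objects []*Object
        roots   []*Object
    }

    // mark flags everything reachable from o.
    func (h *Heap) mark(o *Object) {
        if o == nil || o.marked {
            return
        }
        o.marked = true
        for _, r := range o.refs {
            h.mark(r)
        }
    }

    // collect is the classic STW cycle: the program stays paused for
    // the full duration of both the mark and the sweep phases.
    func (h *Heap) collect() {
        for _, r := range h.roots {
            h.mark(r)
        }
        live := h.objects[:0]
        for _, o := range h.objects {
            if o.marked {
                o.marked = false // reset for the next cycle
                live = append(live, o)
            } // unmarked objects are dropped: they are "swept"
        }
        h.objects = live
    }

    func main() {
        a, b, c := &Object{}, &Object{}, &Object{}
        a.refs = []*Object{b}
        h := &Heap{objects: []*Object{a, b, c}, roots: []*Object{a}}
        h.collect() // c is unreachable and gets swept
        fmt.Println("live objects:", len(h.objects)) // prints 2
    }

Note that the pause is proportional to the size of the whole object graph, which is exactly why this scheme stops scaling as heaps grow.
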
Collectors of this kind have their advantages: they are simple; they do not slow the program down when they are not running; and they add no memory overhead. Conservative collectors such as the Boehm GC do not even require changes to the compiler or the language! This makes them suitable for desktop applications, which usually have small heaps, including AAA video games, where most of the memory is occupied by data files that never need to be scanned.
Stop-the-world (STW) mark/sweep is the algorithm most commonly taught in computer science courses. When interviewing, I sometimes ask candidates to tell me a little about garbage collection, and most often they either treat the collector as a black box with unknown internals or assume it uses this very old technique.
The problem is that simple STW mark/sweep scales extremely poorly. As you add cores and increase heap sizes and allocation rates, the algorithm stops working well. But when heaps are small, even simple algorithms do the job tolerably well! In such situations you can exploit this kind of collector and keep memory overhead to a minimum.
At the other extreme are heaps hundreds of gigabytes in size on machines with dozens of cores, for example servers handling exchange transactions or search queries. In these situations you want pauses to be as short as possible, and algorithms that slow the program down overall in exchange for background collection with very short pauses may be preferable.
Large batch jobs can also run on powerful systems. For them, pauses do not matter; only total execution time does. In such cases it is better to use an algorithm that maximizes throughput, that is, the ratio of useful work done to time spent on garbage collection.
Alas, no single algorithm is perfect on all counts. Nor can a language runtime tell whether your program is a batch job or a latency-sensitive interactive one. This, and not the stupidity of runtime developers, is what gave rise to "garbage collector settings". It is a consequence of fundamental limitations of computer science.
Generational hypothesis
In 1984 it was observed that most allocations "die young", that is, they become garbage very soon after being allocated. This observation, called the generational hypothesis, is one of the strongest empirical results in programming language development. It has held for decades across many different kinds of languages: functional and imperative, with and without value types.
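
To see the pattern the hypothesis describes, consider a hedged example of a typical parsing routine: everything it allocates becomes garbage the moment the call returns.

    package main

    import (
        "fmt"
        "strings"
    )

    // handle allocates temporaries that "die young": the slice and the
    // substrings produced by Split do not outlive the call.
    func handle(payload string) int {
        fields := strings.Split(payload, ",")
        total := 0
        for _, f := range fields {
            total += len(f)
        }
        return total // only one machine word survives; the rest is garbage
    }

    func main() {
        fmt.Println(handle("a,bb,ccc")) // prints 6
    }
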
The generational hypothesis proved useful because garbage collection algorithms could exploit it. Thus generational collectors appeared, which had a number of advantages over the old stop-the-world mark/sweep algorithms:
- Collector throughput: they can collect far more garbage, far faster.
- Allocation performance: allocating new memory no longer requires searching the heap for a free slot, so allocation became essentially free (see the sketch after this list).
- Program throughput: allocations are placed neatly next to each other, which greatly improves cache utilization. Generational collectors do require the program to do some extra work as it runs, but in practice this is outweighed by the cache benefits.
- Pause duration: most pauses, though not all, became shorter.
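
The "essentially free" allocation mentioned in the list comes from bump-pointer allocation into a contiguous young generation. A minimal sketch, with a made-up Eden type and sizes chosen purely for illustration:

    package main

    import "fmt"

    // Eden is a toy contiguous young-generation space.
    type Eden struct {
        buf  []byte
        next int
    }

    // alloc hands out the next n bytes; when the space is exhausted a
    // real runtime would run a minor collection and start over.
    func (e *Eden) alloc(n int) ([]byte, bool) {
        if e.next+n > len(e.buf) {
            return nil, false
        }
        p := e.buf[e.next : e.next+n]
        e.next += n // "bump" the pointer: no search for a free slot
        return p, true
    }

    func main() {
        eden := &Eden{buf: make([]byte, 1<<20)}
        for i := 0; ; i++ {
            if _, ok := eden.alloc(4096); !ok {
                fmt.Println("eden full after", i, "allocations") // 256
                return
            }
        }
    }
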
But such collectors have drawbacks:
- Compatibility: these algorithms move data around in memory and perform extra work (a write barrier) when the program writes to a pointer in certain cases, so the collector must be tightly integrated with the compiler; a sketch of such a barrier follows this list. There are no generational collectors for C++.
- Heap overhead: these collectors copy chunks of memory back and forth between different "spaces". Since there must be room to copy into, they add some heap overhead. They also have to maintain various pointer maps (remembered sets), which increases the overhead further.
- Pause distribution: although many pauses are very short, some still require a full stop-the-world mark/sweep over the entire heap.
- Tuning: generational collectors introduced the notion of a "young generation", or "eden space", and program performance depends on its size.
- Warm-up time: in response to the tuning problem, some collectors adapt the size of the young generation dynamically, based on the behavior of the running program, but then pause times depend on how long the program has been running. In practice this matters mostly for benchmarks.
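
And here is a hedged sketch of the write barrier and remembered set mentioned in the compatibility point above. The types and the map-based remembered set are invented for illustration; real collectors use more compact structures such as card tables:

    package main

    import "fmt"

    // Obj is a toy object tagged with the generation it lives in.
    type Obj struct {
        young bool
        field *Obj
    }

    // rememberedSet records old-generation objects that point into the
    // young generation, so a minor GC can find those references without
    // scanning the whole old generation.
    var rememberedSet = map[*Obj]bool{}

    // writePointer is what the compiler would emit in place of a plain
    // `dst.field = v` store: this is the extra work the program does.
    func writePointer(dst, v *Obj) {
        if !dst.young && v != nil && v.young {
            rememberedSet[dst] = true
        }
        dst.field = v
    }

    func main() {
        old := &Obj{young: false}
        young := &Obj{young: true}
        writePointer(old, young)
        fmt.Println("remembered set size:", len(rememberedSet)) // prints 1
    }
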
Nevertheless, the benefits of generational collection are so great that today this type dominates absolutely. If you can live with the drawbacks, you will almost certainly like these collectors. The algorithms can be extended with all sorts of features; a typical modern collector is concurrent, parallel, compacting, and generational all at once.
Go's concurrent collector
Go is a fairly ordinary imperative language with value types. Its memory access patterns are probably comparable to those of C#, where the generational hypothesis certainly holds, which is why .NET uses a generational collector.
Indeed, Go programs are usually request/response handlers such as HTTP servers. That is, they exhibit strongly generational behavior. The Go creators are thinking about how to exploit this in the future through things like the "request oriented collector". As has already been noted, this is simply a renamed generational collector with a customized tenuring policy.
You can emulate such a collector in other runtimes for request/response handlers. To do this, make sure the young generation is large enough to hold all the garbage generated while processing a request. On a JVM, for example, you could pin the young generation's size, as shown below.
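
A hedged sketch of that setup on HotSpot; the sizes are illustrative and MyServer is a placeholder:

    # Fix the heap at 8 GB and reserve 4 GB for the young generation so
    # that per-request garbage dies in cheap minor collections:
    java -Xms8g -Xmx8g -Xmn4g MyServer
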
But despite all this, the collector Go uses today is not generational. It just runs the good old mark/sweep procedure in the background.
This approach has one advantage: you can get very, very short pauses. But every other metric gets worse. For example:
- Collector throughput: the time needed to clear the heap of garbage grows with the size of the heap. The more memory your program uses, the more slowly memory is freed and the more CPU time the machine spends collecting relative to doing useful work. The only way to avoid this is if your program does not parallelize at all, in which case the spare cores can run the collector without competing with the program.
- Compaction: since there is none, the program can eventually fragment the entire heap (more on this below). Nor do you get the cache benefits of tightly packed allocations.
- Program throughput: the collector has to do a lot of work on every cycle, consuming CPU time that could have gone to the program itself and slowing it down.
- Pause distribution: a collector running concurrently with the program can run into what the Java world calls a concurrent mode failure: the program generates garbage faster than the collector threads can clean it up. In that case the runtime has no choice but to stop the program entirely and wait for the collection cycle to finish. So when the Go authors claim that pauses are very short, the claim holds only when the collector gets enough CPU time to keep up with the program. Moreover, the Go compiler lacks the means to pause threads reliably and quickly, so pause duration depends heavily on the code you are running (for example, base64-decoding a large blob in a single goroutine can cause pauses to grow).
- Heap overhead: given how slowly mark/sweep reclaims the heap, you need plenty of spare space to avoid concurrent mode failure. By default, Go carries 100% heap overhead... that is, it doubles the amount of memory your program needs (the sketch below shows how to observe this and the GC's CPU share on a live program).
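
Several of these effects can be observed on a real Go program. A minimal sketch using the standard runtime package (running with GODEBUG=gctrace=1 prints similar per-cycle data); the allocation loop is artificial and exists only to generate garbage:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // Churn through ~100 MB so the collector runs a few times.
        sink := make([][]byte, 0)
        for i := 0; i < 10000; i++ {
            sink = append(sink, make([]byte, 10240))
            if i%1000 == 0 {
                sink = sink[:0] // drop references, creating garbage
            }
        }
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        fmt.Println("GC cycles:      ", m.NumGC)
        fmt.Println("last pause (ns):", m.PauseNs[(m.NumGC+255)%256])
        fmt.Printf("GC CPU fraction: %.4f\n", m.GCCPUFraction)
    }

GCCPUFraction is exactly the collector-throughput cost discussed above: the share of the program's available CPU time consumed by the GC since startup.
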
Here is an excerpt from one post describing the disadvantages listed above:
Service 1 allocates more memory than Service 2, which is why its stop-the-world pauses are longer. Still, in both services the absolute duration of STW pauses dropped by an order of magnitude. After the switch, we observed ~20% growth in CPU time consumed by the collector in both services.
Here Go's pause times dropped by an order of magnitude, but at the cost of a slower collector. Is that a justified tradeoff, or were pause times already low enough? The author does not say.
There comes a point, though, when throwing more hardware at pause reduction stops making sense. If server pauses shrink from 10 ms to 1 ms, will users notice? What if you had to double your hardware to achieve it?
Go optimizes pause duration at the expense of throughput, and to such an extent that it seems willing to slow your program down by an order of magnitude just to make pauses slightly shorter.
Java comparison
The Java HotSpot virtual machine ships several garbage collection algorithms, selectable on the command line. None of them tries to cut pauses as aggressively as Go, because they aim for balance. To feel the impact of the tradeoffs, you can compare the algorithms against each other by switching collectors. How? With a simple restart of the program, because compilation happens at runtime, so the different barriers needed by different algorithms can be compiled and optimized into the code as required.
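
For example, collector selection looks like this (HotSpot flags as of Java 8; MyApp is a placeholder):

    java -XX:+UseParallelGC MyApp                      # the default throughput collector
    java -XX:+UseConcMarkSweepGC MyApp                 # CMS: shorter pauses, still generational
    java -XX:+UseG1GC -XX:MaxGCPauseMillis=50 MyApp    # G1 with an explicit pause-time goal
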
On modern machines the default is the generational throughput collector. It is built for batch jobs and by default pays no attention to pause duration (a pause-time goal can be set on the command line). This choice of default is, I suspect, one reason many people believe Java's garbage collector is lousy: out of the box, Java tries to make your application run as fast as possible with as little memory overhead as possible, and it does not give a damn about pauses.
If you need shorter pauses, you can switch to the concurrent mark/sweep collector (CMS). This is the closest thing to what Go uses. But it is generational, so it has longer pauses than Go: the young generation is compacted during pauses because objects get moved. CMS has two kinds of pauses: shorter ones of roughly 2-5 ms and longer ones of roughly 20 ms. CMS is adaptive: since it runs concurrently, it must guess when to start (just like Go). But whereas Go asks you to configure heap overhead, CMS adapts itself at runtime, trying to avoid concurrent mode failure. And since the bulk of the heap is handled with ordinary non-compacting mark/sweep, you can still run into problems and slowdowns caused by heap fragmentation.
The newest generational collector in Java is called G1 (garbage first). It is the default starting with Java 9. Its authors tried to make it as universal as possible. It is mostly concurrent, generational, and it compacts the entire heap. It is largely self-tuning, but since it cannot know what you want (like every garbage collector), it lets you adjust the tradeoffs: simply specify the maximum amount of memory you give it and a pause-time goal in milliseconds, and the algorithm tunes everything else to meet your requirements. The default pause goal is about 100 ms, so do not expect shorter pauses unless you ask for them: G1 will prefer application speed.
Pauses are not entirely uniform: most are very short (under 1 ms), but a few longer ones (over 50 ms) accompany heap compaction. G1 scales well; people report using it on terabyte-sized heaps. It also has some nice features, such as deduplicating strings in the heap.
Finally, a new algorithm called Shenandoah has been developed. It has been contributed to OpenJDK, but it will not be part of Java 9 unless you use special Java builds from Red Hat (the project's sponsor). It is designed to deliver consistently short pauses regardless of heap size while still compacting the heap. The costs are higher heap overhead and extra barriers: to move objects while the application runs, it must both read pointers through a barrier and cooperate with the collector. In this respect it resembles Azul's "pauseless" collector.
Conclusion
The purpose of this article is not to talk you into using other languages or tools. Garbage collection is a hard, really hard problem that armies of researchers and engineers have studied for decades. So be suspicious of breakthroughs that nobody else noticed. More likely they are just strange or unusual tradeoffs in disguise, avoided by others for reasons that become apparent later.
But if you want to minimize pause duration at the expense of every other parameter, whatever the cost, then take a look at Go's garbage collector.