📜 ⬆️ ⬇️

Homemade garbage collector for OpenJDK

This is a translation of the article “Do It Yourself (OpenJDK) Garbage Collector” by Alexey Shipilyov, published with the consent of the author. Report any typos and other bugs in PM - we'll fix them.

The process of creating something in a runtime of a language is a fun exercise. At least the creation of the first version! It is a very, very difficult task to build a reliable, high-performance, fault-tolerant subsystem of runtime, the behavior of which can be conveniently monitored and debugged.


It is deceptively simple to make a simple garbage collector, and here you want to do this in this article. Roman Kennke on FOSDEM 2019 gave a talk and a demo called “Writing GC in 20 minutes” using an earlier version of this patch. Despite the fact that the code implemented there demonstrates a lot and is abundantly commented, there is a need for a good high-level description of what is happening - this is how the article appeared.


A basic understanding of the work of garbage collectors will greatly help in understanding what is written here. The article will use specificity and ideas in a specific implementation of HotSpot, but there will not be an introductory course on GC design. Take the GC Handbook and read the first chapters about the very basics of GC, and even more quickly allow you to start an article on Wikipedia .



Content




1. What does GC consist of?


Now, when many different GCs are written, it’s pretty simple to make your own - many elements already written can be (re) used to shift some of the worries about implementation details to proven and tested code.



1.1. Epsilon GC


OpenJDK 11 has a new JEP 318: "Epsilon: A No-Op Garbage Collector (Experimental)" . Its task is to provide a minimal implementation for the case when freeing memory is unnecessary or even prohibited. The JEP discusses in more detail why it might be useful.


From the point of view of implementation, “garbage collector” is a bad name; it would be better to use the term “automatic memory manager” , which is responsible for both allocating and freeing memory. Epsilon GC implements only the “selection”, and the “release” is not engaged at all. Therefore, we can take Epsilon GC and start implementing the “release” algorithms from scratch.



1.1.1. Memory allocation


The most developed part of Epsilon GC is responsible for memory allocation . It serves external requests for allocating memory of arbitrary size and creating a Thread-Local Allocation Buffer (TLAB) of the desired size. The implementation itself tries not to extend TLAB too much, since there will be no memory freeing and no one will return the lost bytes.



1.1.2. Barriers


Some garbage collectors require interaction with the application to maintain GC invariants, forcing the runtime and the application to do so-called barriers when trying to access the heap. This is true for all multithreaded collectors, and for many collectors with generations and stopping the world.


Epsilon does not require barriers, but the runtime and compiler still want to know that barriers do nothing. Handling it every time can be tedious everywhere. Fortunately, starting with OpenJDK 11, there is a new JEP-304: “Garbage Collection Interface” , thanks to which it has become much, much easier to insert barriers. In particular, the barrier set in Epsilon is empty , and all the trivial work - save, load, CAS, arraycopy - can be delegated to implementations of trivial barriers from an already existing superclass. If you are doing a GC, which also does not need barriers, you can simply reuse the code from Epsilon.



1.1.3. Connection to monitoring


The last tedious part of the GC implementation is hooks to a bunch of monitoring mechanisms inside the JVM: MX-bins, diagnostic commands, etc. should work. Epsilon has already done all this for you.



1.2. Rantaym and GC



1.2.1. Root elements


The garbage collector, in general, needs to know that it is in the Java runtime that has links to the heap. These root elements, called GC Roots , can be slots on stream stacks and local variables (including those that are in JIT-compiled code!), Native classes and classloaders, links in JNI, and so on. Attempts to identify these elements can be very difficult and tedious. But in Hotspot, they are all tracked using the appropriate VM subsystems, so you can simply examine how the existing GC implementations work with them. Further in the text we will see it.



1.2.2. Object Traversal


The garbage collector should bypass outbound links in Java objects. This operation is found everywhere, so the common parts of the runtime provide ready-made workaround tools; you don’t have to write anything yourself. Below, there will be a section with a specific implementation, and there you can find, for example, calls obj→oop_iterate .



1.2.3. Moves


The moving garbage collector needs to write down new addresses of the objects being moved somewhere. There are several places where you can write this data about the movements ( forwarding data ).


  1. You can reuse the “marker word” (“mark word”) in the object itself (Serial, Parallel, etc.). After the world stops, all access to the object is controlled, and it is guaranteed that not a single Java stream will be able to see the temporary data that we decided to write to the marker word. You can reuse it to store forwarding data.
  2. You can maintain a separate native motion table ( ZGC , C4, and others). This completely isolates the GC from runtime and the rest of the application, since only the GC knows about the existence of such a table. Competitive collectors usually use just such a scheme - they do not want to suffer with a bunch of unnecessary problems.
  3. You can add one more word to the object ( Shenandoah and others). This combination of the two previous approaches not only gives runtime and the application to work with existing headers without problems, but also saves forwarding data.


1.2.4. Marker data


The garbage collector needs to write marking data somewhere. And again, there are several ways to save them:


  1. You can reuse the marker word in the object itself (Serial, Parallel, etc.). Again, in world stop mode, you can use the bits in the marker word to encode the fact that there is a label. Further, if it is necessary to bypass all living objects, we go along the heap, object after object - this is possible due to the fact that the heap is parsable .
  2. You can maintain a separate structure for storing marking data (G1, Shenandoah, etc.). This is usually done using a separate bitmap that maps every N bytes of the heap to 1 bit of the card. Usually, Java objects are 8 byte aligned , so the map displays every 64 bits from the heap to 1 bit of the card, taking up 1/64 of the heap size in native memory. These overheads pay off well when scanning a heap for the presence of living objects, especially sparse ones: bypassing the map is often much faster than bypassing the heap to be disassembled object by object.
  3. Encode tags in the links themselves (ZGC, C4 and others). This requires coordination with the application, then you need to cut out all these tags from the links or perform some other tricks to maintain correctness. In other words, we need either barriers or some additional work on the part of the GC.


2. General plan


Most likely, the easiest to implement on top of Epsilon is Mark-Compact, in the style of LISP2. The main idea of ​​this GC is described both in Wikipedia and in the GC Handbook (chapter 3.2). The algorithm will be sketched in the implementation section below, but I strongly recommend reading a little of Wikipedia or the GC Handbook to understand what we are going to do.


The algorithm in question is a shifting GC: moving objects move in an crowd at the very beginning of the heap. It has its pros and cons:



For simplicity, the GC implementation will use a full stop of the world (stop-the-world, STW), there will be no generations and multithreading in it. For this case, it makes sense to use a bitmap to store marks and reuse the marker word to store movement data.



3. Implementing the GC core


Reading and understanding the whole implementation may be too complicated for an ignorant person. In this section we will look at it step by step.



3.1. Prologue


The garbage collector usually needs to do a couple of things to prepare for the build. Read the comments, they should speak for themselves:


 { GCTraceTime(Info, gc) time("Step 0: Prologue", NULL); //      .      //   :   ,   ,  // «»   ,      //   ,     . if (!os::commit_memory((char*)_bitmap_region.start(), _bitmap_region.byte_size(), false)) { log_warning(gc)("Could not commit native memory for marking bitmap, GC failed"); return; } //        ,  , //       TLAB-. ensure_parsability(true); //      ,    GC. CodeCache::gc_prologue(); BiasedLocking::preserve_marks(); //        . //       . DerivedPointerTable::clear(); } 

Since we use a bitmap to track objects reachability, we need to clean it up before using it. Or in our case, since we are aiming to never request resources before running the GC cycle, we have to commit the bitmap to memory in advance. This offers several interesting advantages, at least on Linux, where most of the bitmap will point to zero page, especially for sparse heaps.


The threads should release their TLABs and ask the GC for new ones after the build is complete.


Do not confuse TLAB and java.lang.ThreadLocal . From the point of view of the GC, ThreadLocals are ordinary objects, and they will not be compiled by the GC unless specifically required in the Java code.

Some parts of the runtime, especially holding links to the java heap, will break down during garbage collection, so you need to specifically warn them that the GC will start working soon. This will allow the relevant subsystems to get ready and save some of their state before the GC makes its move.



3.2. Marking


Marking in the stop mode of the world becomes quite simple when almost everything has already been done for us. Marking is pretty standard, and most likely, in many GC implementations is the first step.


 { GCTraceTime(Info, gc) time("Step 1: Mark", NULL); //    ,     .  //   ,  ,    //      . EpsilonMarkStack stack; EpsilonScanOopClosure cl(&stack, &_bitmap); //      . process_roots(&cl); stat_reachable_roots = stack.size(); //    ,    . //    ,   , //      . while (!stack.is_empty()) { oop obj = stack.pop(); obj->oop_iterate(&cl); stat_reachable_heap++; } //       . DerivedPointerTable::set_active(false); } 

This works in exactly the same way as for any other graph: you start a detour from the initial set of reachable vertices, go along the outgoing edges and record all the visited vertices. The detour continues until all unvisited vertices end. In GC, “vertices” are objects, and “edges” are links between them.


Technically, we could just recursively walk through the object graph, but this is a bad idea for arbitrary graphs that can have very large diameters. Imagine a linked list of a billion peaks! Therefore, to limit the depth of recursion, we use a labeling stack that records the detected objects.


The initial set of reachable objects is GC roots. Now you should not dwell on the fact that such a process_roots , this will be later. For now, let's just say that it bypasses all the accessible references from the VM side.


The bitmap with marks serves both as a tool for recording the marking front (a set of objects already visited), and at the end - as a repository of the desired result, a set of all achievable objects. The real work takes place in EpsilonScanOopClosure , it is applied to all interesting objects and is iterated over all links of the selected object.


Look, Java knew how to closure before it became fashionable!

 class EpsilonScanOopClosure : public BasicOopIterateClosure { private: EpsilonMarkStack* const _stack; MarkBitMap* const _bitmap; template <class T> void do_oop_work(T* p) { // p -     ,   oop,   //      ,  : T o = RawAccess<>::oop_load(p); if (!CompressedOops::is_null(o)) { oop obj = CompressedOops::decode_not_null(o); //  . ,   .  , //        . //    +, //       . if (!_bitmap->is_marked(obj)) { _bitmap->mark((HeapWord*)obj); _stack->push(obj); } } } }; 

After completing this step, _bitmap contains bits indicating the location of live objects. Thanks to this, it is possible to bypass all living objects, for example:


 //           . //   ,    ( )  ,  //       1/64  . void EpsilonHeap::walk_bitmap(ObjectClosure* cl) { HeapWord* limit = _space->top(); HeapWord* addr = _bitmap.get_next_marked_addr(_space->bottom(), limit); while (addr < limit) { oop obj = oop(addr); assert(_bitmap.is_marked(obj), "sanity"); cl->do_object(obj); addr += 1; if (addr < limit) { addr = _bitmap.get_next_marked_addr(addr, limit); } } } 


3.3. We calculate new addresses


This is also a fairly simple step, and it implements exactly what the algorithm says.



 //    forwarding data (,    ) //   .        . //          . PreservedMarks preserved_marks; //     GC. HeapWord* new_top; { GCTraceTime(Info, gc) time("Step 2: Calculate new locations", NULL); //    ,        //    . ,  - . EpsilonCalcNewLocationObjectClosure cl(_space->bottom(), &preserved_marks); walk_bitmap(&cl); //         . //       ,    //      ,      "" //  . new_top = cl.compact_point(); stat_preserved_marks = preserved_marks.size(); } 

The only thing that catches your eye here is that we decided to store new addresses in the marking word of java objects, and this word may already be occupied by something important, for example, under information about locks. Fortunately, such non-trivial marking words are quite rare, and we can simply store them separately, if it is needed at all: this is what PreservedMarks uses.


EpsilonCalcNewLocationObjectClosure makes real algorithmic work:


 class EpsilonCalcNewLocationObjectClosure : public ObjectClosure { private: HeapWord* _compact_point; PreservedMarks* const _preserved_marks; public: EpsilonCalcNewLocationObjectClosure(HeapWord* start, PreservedMarks* pm) : _compact_point(start), _preserved_marks(pm) {} void do_object(oop obj) { //    :    . //        (      , //    ),      , //     . if ((HeapWord*)obj != _compact_point) { markOop mark = obj->mark_raw(); if (mark->must_be_preserved(obj)) { _preserved_marks->push(obj, mark); } obj->forward_to(oop(_compact_point)); } _compact_point += obj->size(); } HeapWord* compact_point() { return _compact_point; } }; 

forward_to is the most important part, since it stores the “move address” in the marker word of the object. This will be needed in the following steps.



3.4. Fix pointers


Now you need to go through the heap again and rewrite all the links with their new addresses according to the following algorithm:



 { GCTraceTime(Info, gc) time("Step 3: Adjust pointers", NULL); //     _   _,     // « ».      forwarding data, //    .      . EpsilonAdjustPointersObjectClosure cl; walk_bitmap(&cl); //     ,      VM,  //     :      . EpsilonAdjustPointersOopClosure cli; process_roots(&cli); //   ,      , //     . preserved_marks.adjust_during_full_gc(); } 

There are two types of links to the shifted objects: outgoing either from the objects on the heap itself, or from GC roots. Update need both classes of links. Some saved tags also store links to objects, so you need to ask them to update. PreservedMarks knows how to do this, because it expects "forwarding data" in the same place where we saved it, in the marking word of the object.


Closures are divided into two types: some accept objects and bypass their contents, others update these addresses. Here you can make a small optimization of performance: if the object does not move, you can save a couple of records in a heap:


 class EpsilonAdjustPointersOopClosure : public BasicOopIterateClosure { private: template <class T> void do_oop_work(T* p) { // p -     ,   oop. //        ,  : T o = RawAccess<>::oop_load(p); if (!CompressedOops::is_null(o)) { oop obj = CompressedOops::decode_not_null(o); //         . //  ,    . if (obj->is_forwarded()) { oop fwd = obj->forwardee(); assert(fwd != NULL, "just checking"); RawAccess<>::oop_store(p, fwd); } } } }; class EpsilonAdjustPointersObjectClosure : public ObjectClosure { private: EpsilonAdjustPointersOopClosure _cl; public: void do_object(oop obj) { //    ,    : obj->oop_iterate(&_cl); } }; 

After completing this step, we essentially broke the heap: the links point to the “wrong” addresses where the objects are not yet located. Let's fix it!



3.5. Moving objects


Time to move objects to new addresses, in accordance with the algorithm:



EpsilonMoveObjectsObjectClosure around the heaps again and apply EpsilonMoveObjectsObjectClosure closure to all living objects:


 { GCTraceTime(Info, gc) time("Step 4: Move objects", NULL); //       . //          . EpsilonMoveObjectsObjectClosure cl; walk_bitmap(&cl); stat_moved = cl.moved(); //         ,   // «»      . _space->set_top(new_top); } 

Immediately after that, you can drag the heap point of the compression point, making it possible to allocate memory right from this place, as soon as the garbage collection cycle ends.


Notice that in the shifting assembly we can overwrite the contents of existing objects, but since the scan goes in the same direction, the overwritten objects are already copied to the right place.


The old and the new location of the same object may intersect. For example, if you move a 100-byte object to 8 bytes. The copying procedure should work itself out, and the intersecting content should be copied correctly, pay attention to Copy::aligned_*conjoint*_words .

The closure itself will simply move the objects being moved to the new addresses:


 class EpsilonMoveObjectsObjectClosure : public ObjectClosure { public: void do_object(oop obj) { //     ,  .   - , //   -  mark word, //      forwarding data. if (obj->is_forwarded()) { oop fwd = obj->forwardee(); assert(fwd != NULL, "just checking"); Copy::aligned_conjoint_words((HeapWord*)obj, (HeapWord*)fwd, obj->size()); fwd->init_mark_raw(); } } }; 


3.6. Epilogue


The garbage collection is over, the heap is almost consistent again, the last finishing touches are left:


 { GCTraceTime(Info, gc) time("Step 5: Epilogue", NULL); //    . preserved_marks.restore(); //   ,    . DerivedPointerTable::update_pointers(); BiasedLocking::restore_marks(); CodeCache::gc_epilogue(); JvmtiExport::gc_epilogue(); //     . if (!os::uncommit_memory((char*)_bitmap_region.start(), _bitmap_region.byte_size())) { log_warning(gc)("Could not uncommit native memory for marking bitmap"); } //    ,  . //        . if (EpsilonUncommit) { _virtual_space.shrink_by((_space->end() - new_top) * HeapWordSize); _space->set_end((HeapWord*)_virtual_space.high()); } } 

We notify the rest of the runtime that they should run the post-assembly procedures. We restore the special marker words that we saved earlier. Farewell kiss of our marker map - it is no longer needed.


And if you really want to, you can reduce the memory for allocations to the new size, thereby returning the memory to the operating system!



4. Connect GC to VM



4.1. Root Bypass


Remember, you need to bypass the special, reachable links from the VM? You can ask each VM subsystem to bypass the links hidden from other Java objects. An exhaustive list of such root elements in the current Hotspot looks something like this:


 void EpsilonHeap::do_roots(OopClosure* cl) { //   ,        1 . StrongRootsScope scope(1); //         . CLDToOopClosure clds(cl, ClassLoaderData::_claim_none); MarkingCodeBlobClosure blobs(cl, CodeBlobToOopClosure::FixRelocations); //      . //        . { MutexLockerEx lock(CodeCache_lock, Mutex::_no_safepoint_check_flag); CodeCache::blobs_do(&blobs); } { MutexLockerEx lock(ClassLoaderDataGraph_lock); ClassLoaderDataGraph::cld_do(&clds); } Universe::oops_do(cl); Management::oops_do(cl); JvmtiExport::oops_do(cl); JNIHandles::oops_do(cl); WeakProcessor::oops_do(cl); ObjectSynchronizer::oops_do(cl); SystemDictionary::oops_do(cl); Threads::possibly_parallel_oops_do(false, cl, &blobs); } 

, . GC .



4.2.


GC , VM . Hotspot VM_Operation , GC VM- :


 // VM_operation,      class VM_EpsilonCollect: public VM_Operation { private: const GCCause::Cause _cause; EpsilonHeap* const _heap; static size_t _last_used; public: VM_EpsilonCollect(GCCause::Cause cause) : VM_Operation(), _cause(cause), _heap(EpsilonHeap::heap()) {}; VM_Operation::VMOp_Type type() const { return VMOp_EpsilonCollect; } const char* name() const { return "Epsilon Collection"; } virtual bool doit_prologue() { //     ,     . //         GC, //          . //   ,         //  .     , //       1%, ,  , //     . Heap_lock->lock(); size_t used = _heap->used(); size_t capacity = _heap->capacity(); size_t allocated = used > _last_used ? used - _last_used : 0; if (_cause != GCCause::_allocation_failure || allocated > capacity / 100) { return true; } else { Heap_lock->unlock(); return false; } } virtual void doit() { _heap->entry_collect(_cause); } virtual void doit_epilogue() { _last_used = _heap->used(); Heap_lock->unlock(); } }; size_t VM_EpsilonCollect::_last_used = 0; void EpsilonHeap::vmentry_collect(GCCause::Cause cause) { VM_EpsilonCollect vmop(cause); VMThread::execute(&vmop); } 

, GC — , .



4.3.


, GC , , GC , . , allocate_work , GC :


 HeapWord* EpsilonHeap::allocate_or_collect_work(size_t size) { HeapWord* res = allocate_work(size); if (res == NULL && EpsilonSlidingGC) { vmentry_collect(GCCause::_allocation_failure); res = allocate_work(size); } return res; } 

That's all!



5.


OpenJDK.


 $ hg clone https://hg.openjdk.java.net/jdk/jdk/ jdk-jdk $ cd jdk-jdk $ curl https://shipilev.net/jvm/diy-gc/webrev/jdk-jdk-epsilon.changeset | patch -p1 

OpenJDK :


 $ ./configure --with-debug-level=fastdebug $ make images 

:


 $ build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:+EpsilonSlidingGC -version openjdk version "13-internal" 2019-09-17 OpenJDK Runtime Environment (build 13-internal+0-adhoc.shade.jdk-jdk-epsilon) OpenJDK 64-Bit Server VM (build 13-internal+0-adhoc.shade.jdk-jdk-epsilon, mixed mode, sharing 


6.


, GC ? :


  1. . . Hotspot , JVM fastdebug , GC.
  2. . , . , ( ) , .
  3. . , , , . - , .

, , :


 $ CONF=linux-x86_64-server-fastdebug make images run-test TEST=gc/epsilon/ Building targets 'images run-test' in configuration 'linux-x86_64-server-fastdebug' Test selection 'gc/epsilon/', will run: * jtreg:test/hotspot/jtreg/gc/epsilon Running test 'jtreg:test/hotspot/jtreg/gc/epsilon' Passed: gc/epsilon/TestAlwaysPretouch.java Passed: gc/epsilon/TestAlignment.java Passed: gc/epsilon/TestElasticTLAB.java Passed: gc/epsilon/TestEpsilonEnabled.java Passed: gc/epsilon/TestHelloWorld.java Passed: gc/epsilon/TestLogTrace.java Passed: gc/epsilon/TestDieDefault.java Passed: gc/epsilon/TestDieWithOnError.java Passed: gc/epsilon/TestMemoryPools.java Passed: gc/epsilon/TestMaxTLAB.java Passed: gc/epsilon/TestPrintHeapSteps.java Passed: gc/epsilon/TestArraycopyCheckcast.java Passed: gc/epsilon/TestClasses.java Passed: gc/epsilon/TestUpdateCountersSteps.java Passed: gc/epsilon/TestDieWithHeapDump.java Passed: gc/epsilon/TestByteArrays.java Passed: gc/epsilon/TestManyThreads.java Passed: gc/epsilon/TestRefArrays.java Passed: gc/epsilon/TestObjects.java Passed: gc/epsilon/TestElasticTLABDecay.java Passed: gc/epsilon/TestSlidingGC.java Test results: passed: 21 TEST SUCCESS 

? fastdebug . ? - .



7.


- spring-petclinic , Apache Bench GC! , , GC .


: -Xlog:gc -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -XX:+EpsilonSlidingGC :


:


 Heap: 20480M reserved, 20480M (100.00%) committed, 19497M (95.20%) used GC(2) Step 0: Prologue 2.085ms GC(2) Step 1: Mark 51.005ms GC(2) Step 2: Calculate new locations 71.207ms GC(2) Step 3: Adjust pointers 49.671ms GC(2) Step 4: Move objects 22.839ms GC(2) Step 5: Epilogue 1.008ms GC(2) GC Stats: 70561 (8.63%) reachable from roots, 746676 (91.37%) reachable from heap, 91055 (11.14%) moved, 2237 (0.27%) markwords preserved GC(2) Heap: 20480M reserved, 20480M (100.00%) committed, 37056K (0.18%) used GC(2) Lisp2-style Mark-Compact (Allocation Failure) 20479M->36M(20480M) 197.940ms 

200 ? GC! , . , , : ( — , ). - ( ).


, GC . , -Xlog:gc -XX:+UseSerialGC — , , :


 GC(46) Pause Young (Allocation Failure) 575M->39M(1943M) 2.603ms GC(47) Pause Young (Allocation Failure) 575M->39M(1943M) 2.606ms GC(48) Pause Young (Allocation Failure) 575M->39M(1943M) 2.747ms GC(49) Pause Young (Allocation Failure) 575M->39M(1943M) 2.578ms 

, 2 . , , GC . -Xlog:gc -XX:+UseSerialGC , , :


 GC(3) Pause Full (Allocation Failure) 16385M->34M(18432M) 1969.694ms GC(4) Pause Full (Allocation Failure) 16385M->34M(18432M) 2261.405ms GC(5) Pause Full (Allocation Failure) 16385M->34M(18432M) 2327.577ms GC(6) Pause Full (Allocation Failure) 16385M->34M(18432M) 2328.976ms 

, . .



8. ?


. , GC OpenJDK — , , .


:


  1. . , // . . , , « » , , .

    GC, java.lang.ref.Reference.referent — Java-, , , - . FinalReference , .
    ReferenceProcessor / / .
  2. VM. VM, , , . . , , , - , .


  3. . — , GC, . , , , .

    mark-compact GC Full GC fallbacks Shenandoah ( OpenJDK 8) G1 ( OpenJDK 10, JEP 307: «Parallel Full GC for G1» ).

  4. . , «» , , , - . . , .


  5. . , , , . , «» — «» «» , .


  6. - GC Handbook .




9.


? GC — , , , GC.


, - GC . , , GC (, Serial GC Parallel GC), .


. , 5-6 2019, JPoint — Java-. — OpenJDK, GraalVM, Kotlin . .

')

Source: https://habr.com/ru/post/443250/


All Articles