Hello everyone. I'll continue on Phantom. For context it is useful to read the article about persistent RAM, as well as the general article about Phantom in Open Systems. But you can manage without them.
So, we have an OS (or just an environment, it doesn't matter) that provides applications with persistent RAM, and in general a persistent "life." Programs live in a shared address space with managed pointers and an object bytecode machine, do not notice OS restarts and are, on the whole, happy.
Obviously, such an environment needs garbage collection. But which kind?
The specifics of the environment impose several problems.
First, the amount of virtual memory in such an environment is, in theory, huge: terabytes, the entire contents of the disk. After all, we map everything into memory, always.
Second, stop-the-world algorithms are absolutely unacceptable for us. If for an ordinary process a half-second pause may be tolerable, then for a virtual memory that mostly lives on disk it turns into half an hour, if not half a day!
Finally, if a complete garbage collection takes half a day, that alone will probably not satisfy us: it would be great to also have some quick collection process, even a not entirely honest one that misses some garbage; if it manages to quickly return 90% of it, that is already good.
A caveat is needed here. Generally speaking, in a system with a couple of terabytes of virtual memory this is not so critical: even if you skip the half-day reclamation, the garbage that piles up may not amount to much - say, 2-3, maybe 5 gigabytes, and even 50 gigabytes is no great loss, the disk is big.
But most likely this will lead to heavy memory fragmentation: live objects will end up scattered across many far-apart pages, and small patches of relevant information will very likely be interspersed with tons of irrelevant garbage, which greatly increases pressure on RAM.
Ok, so we have two tasks.
- A fast, inexpensive, possibly incomplete (does not collect all the garbage), non-stop algorithm for constant use. It is assumed that the memory it touches is actually resident in RAM and has not been paged out to disk.
- A complete, possibly long and expensive, but also non-stop (!) algorithm for the entire address space. It is assumed that an access to this "memory" will, with 99% probability, go to disk.
The second requirement looks unrealistic, and in essence it is. A complete collection means traversing almost all objects (some algorithms allow optimizations), and if they live on disk, and a terabyte disk at that, the estimated time is hours, if not a full day. What non-stop can we talk about?
An interesting idea comes to the rescue, one that, as often happens, becomes obvious as soon as you read it somewhere: an object that was garbage yesterday will still be garbage tomorrow.
In practice this means that if we have yesterday's snapshot of the entire system memory, garbage collection can be done against that snapshot! And the garbage found in it can be freed in the current version of memory as well!
And an OS with persistent RAM is exactly the kind of OS that is already in the business of making complete "snapshots" of the memory state.
(I must say I really love it when such conceptual coincidences arise: for me they are a sign that the system model is sufficiently complete and orthogonal.)
Thus an amazing situation arises: we have a "picture" of the system memory, and over it we can run garbage collection with even the most banal mark & sweep algorithm.
Let me briefly remind you how it works: we walk all reachable objects by following references and mark each one "not garbage". Then we walk everything, and whatever has no mark is garbage.
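Here is a minimal sketch of that baseline mark & sweep in C. Everything in it (pobject_t, snapshot_first, snapshot_next, free_object) is a name made up for illustration, not real Phantom code; where the mark bits should actually live is exactly the question discussed below.

    /* Minimal mark & sweep sketch over the (snapshot) heap.
     * All types and helpers here are hypothetical. */

    #include <stddef.h>

    typedef struct pobject {
        int              marked;   /* "not garbage" flag                   */
        int              nrefs;    /* number of outgoing references        */
        struct pobject **refs;     /* outgoing references to other objects */
    } pobject_t;

    /* hypothetical iteration over all objects and a destructor */
    extern pobject_t *snapshot_first(void);
    extern pobject_t *snapshot_next(pobject_t *o);
    extern void       free_object(pobject_t *o);

    static void mark(pobject_t *o)
    {
        if (o == NULL || o->marked)
            return;
        o->marked = 1;                       /* reachable => not garbage */
        for (int i = 0; i < o->nrefs; i++)
            mark(o->refs[i]);                /* follow every reference   */
    }

    void collect(pobject_t **roots, int nroots)
    {
        for (int i = 0; i < nroots; i++)     /* phase 1: mark from roots */
            mark(roots[i]);

        pobject_t *o = snapshot_first();     /* phase 2: sweep           */
        while (o != NULL) {
            pobject_t *next = snapshot_next(o);  /* grab next before freeing */
            if (!o->marked)
                free_object(o);              /* no mark => garbage       */
            else
                o->marked = 0;               /* reset for the next cycle */
            o = next;
        }
    }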
However, there is a whole host of optimization problems.
Let's start with the obvious: writing to the snapshot on disk is expensive and dangerous; if a failure occurs we lose the system snapshot, which is unacceptable.
So optimization is needed. Two options suggest themselves.
The first is to make a separate copy of the snapshot for the garbage collection process. Terribly expensive: it requires almost double the amount of disk space and rewrites nearly every page.
The second is not to touch the original at all (which is always good) and keep the marks in a separate, parallel buffer (sorted by object address?), probably in the form of a hashmap or a sorted tree.
Offhand, the choice is obvious. But maybe there is a third option?
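For the second option, a minimal sketch of such a side mark table could look like the following; the chained hash set and all names in it are assumptions made for illustration, and a real implementation would more likely use a compact bitmap or a sorted structure:

    /* Mark bits live outside the snapshot, in a side table keyed by object
     * address, so the snapshot pages are never written. */

    #include <stdint.h>
    #include <stdlib.h>

    #define MARK_BUCKETS (1u << 20)

    typedef struct mark_node {
        uintptr_t         addr;      /* address of a marked (live) object */
        struct mark_node *next;
    } mark_node_t;

    static mark_node_t *mark_table[MARK_BUCKETS];

    static unsigned hash_addr(uintptr_t a)
    {
        return (unsigned)((a >> 4) % MARK_BUCKETS);
    }

    int is_marked(uintptr_t addr)
    {
        for (mark_node_t *n = mark_table[hash_addr(addr)]; n; n = n->next)
            if (n->addr == addr)
                return 1;
        return 0;
    }

    void set_mark(uintptr_t addr)
    {
        if (is_marked(addr))
            return;                             /* already marked */
        mark_node_t *n = malloc(sizeof(*n));
        n->addr = addr;
        n->next = mark_table[hash_addr(addr)];
        mark_table[hash_addr(addr)] = n;
    }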
Another point that suggests itself is generations. Intuitively, it seems that 90% of the data will practically never become garbage: class code, objects holding music and video - all of that is stored for ages and clearly does not need daily checking.
Probably the big collection can also sometimes be done partially: for example, only the last generation is processed once a day, the last two every couple of days, and everything gets shaken out on the weekend. Spring cleaning.
However, one could also ask the main allocator and the fast garbage collector for hints. For example, if in the course of normal work objects from the lowest, conditionally "constant" generation (where classes and huge constant objects live) were modified, that could be a trigger for a complete collection, as sketched below.
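A minimal sketch of such a hint, assuming a hypothetical per-object generation field and a hypothetical write hook (neither is real Phantom API):

    /* If the mutator modifies an object from the lowest, "constant"
     * generation, raise a flag so the big collector gets scheduled. */

    enum { GEN_CONSTANT = 0 };        /* classes, huge constant objects */

    typedef struct object_hdr {
        int generation;               /* which generation the object is in */
        /* ... other header fields ... */
    } object_hdr_t;

    static volatile int full_gc_requested = 0;

    void on_object_write(object_hdr_t *o)
    {
        if (o->generation == GEN_CONSTANT)
            full_gc_requested = 1;    /* hint: time for a complete collection */
    }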
Actually, as of today only the fast algorithm is implemented in Phantom OS, based on reference counting. The complete collector is still waiting for its hero.
The fast one, by the way, is not simple either. It has an unsolvable (don't worry, we almost solved it :) race problem: synchronizing the increment of an object's use count.
Its essence is simple. We read from a field of object A a reference to object B. Since we now hold it (the reference) and want to use it, we naturally increment the reference count inside object B by one.
    object *b_ptr = a[xx];   /* read the reference to B from a field of A */
    b_ptr->ref_count++;      /* ...and only then increment B's counter    */
But, the trouble is, between these two lines a context switch happens, and another thread at that very moment removes this very reference to B from object A. And it was the last one. The reference counter drops to zero and the object is destroyed. And now we go and increment its reference count, but it has already been killed, cleaned, handed out to someone else and contains completely different data.
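Spelled out as two racing thread bodies (a sketch with hypothetical names; real Phantom bytecode operations are more involved), it looks like this:

    /* Hypothetical minimal object layout, just enough to show the race. */
    typedef struct object { int ref_count; } object;

    extern void free_object(object *o);   /* hypothetical destructor */

    object *volatile a_xx;   /* the field of A that holds the reference to B */

    void reader_thread(void)
    {
        object *b_ptr = a_xx;        /* (1) read the reference to B           */
        /* <-- a context switch can happen right here                         */
        b_ptr->ref_count++;          /* (4) increments memory that may have   */
                                     /*     been freed and reused already     */
    }

    void writer_thread(void)
    {
        object *old = a_xx;
        a_xx = NULL;                 /* (2) remove the last reference to B    */
        if (--old->ref_count == 0)   /* (3) the counter hits zero...          */
            free_object(old);        /*     ...and B gets destroyed           */
    }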
Fun? And you cannot put a mutex around every read of an object field: the virtual machine would never crawl out of that mutex, doing nothing but locking and unlocking it. Even a spinlock would slow the code down severalfold.
Unacceptable.
The solution looks like this: actually destroy the object only after all threads of all virtual machines have passed a virtual machine instruction boundary. This guarantees that all pending increments and decrements have completed. The mechanism for this already exists; it is the one used for snapshots.
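A minimal sketch of such deferral, in the spirit of quiescent-state reclamation; all names are made up, and the locking a real implementation needs around the list and counters is omitted for brevity:

    #include <stdlib.h>

    extern void really_free_object(void *obj);   /* hypothetical destructor */

    typedef struct dead_obj {
        void            *obj;
        struct dead_obj *next;
    } dead_obj_t;

    static dead_obj_t *pending = NULL;    /* zero-refcount, not yet freed */
    static int threads_total   = 0;       /* number of VM threads         */
    static int threads_passed  = 0;       /* how many crossed a boundary  */

    /* Called when an object's reference count reaches zero. */
    void refcount_reached_zero(void *obj)
    {
        dead_obj_t *d = malloc(sizeof(*d));
        d->obj  = obj;
        d->next = pending;
        pending = d;                      /* defer, do not free right away */
    }

    /* Called by every VM thread between two bytecode instructions. */
    void thread_passed_instruction_boundary(void)
    {
        if (++threads_passed < threads_total)
            return;

        /* Every thread has finished the instruction it was executing, so
         * no one can still be between "read the reference" and "increment
         * the counter": the queued objects can now really be destroyed.
         * (A real implementation would presumably re-check the counter
         * here, in case a racing increment revived the object.) */
        while (pending != NULL) {
            dead_obj_t *d = pending;
            pending = d->next;
            really_free_object(d->obj);
            free(d);
        }
        threads_passed = 0;
    }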
I'll finish for today, noting that I haven't touched on object compaction (defragmentation) at all, which, generally speaking, is not trivial either. Separately, there is the task of writing an object allocator with a high degree of allocation locality, with arenas for fast objects split per thread, and with the allocator's mutexes split per arena so that no pile-up forms at the allocator's entrance.
Actually, all of this is a very good example of the kind of tasks that arise in the Phantom OS project: nothing impossible, but good engineering solutions are constantly required to get around the problems that come up.