Phantom: large garbage collection

This article is a continuation; the series begins here. For those who have not clicked through, a brief introduction:

We are discussing garbage collection in the Phantom operating system, that is, in its virtual (bytecode) machine environment running in persistent RAM. The size of persistent memory is on the order of the size of the disk: a few terabytes today and, potentially, tens or hundreds of terabytes tomorrow.

Since we are talking about virtual memory, a significant portion of the objects is in any case not in RAM, regardless of the algorithm and overall approach we choose. In other words, the cost of accessing an object is high: it is, generally speaking, a disk operation.

In addition, it should be expected that in a networked environment groups of such virtual machines will exchange proxy objects, that is, there will be a graph of objects spread across many machines on the network, and, of course, across this whole nightmare we will need to be able to collect garbage not only on a single machine but network-wide.

The approach to garbage collection that I have adopted for such an environment is a combination of two collectors.

The first is fast and deterministic, but does not guarantee 100% collection. It is currently implemented as reference counting: an imperfect, but fairly simple and predictable model that, in particular, can never reclaim cyclic structures.
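
As a minimal illustration of this trade-off, here is a sketch of reference counting; the structure and names are invented for the example and are not the Phantom VM's actual object layout. Reclamation is immediate and deterministic, but objects forming a cycle keep each other's counts above zero forever.

    #include <stdlib.h>

    /* Illustrative only - not the real Phantom object header. */
    struct object {
        long           refcount;
        struct object **refs;      /* references held by this object */
        size_t         n_refs;
    };

    static void release(struct object *o)
    {
        if (o == NULL || --o->refcount > 0)
            return;
        /* Count hit zero: reclaim immediately and deterministically.
         * Objects that sit on a reference cycle never get here - that
         * is the garbage the second, tracing collector must pick up.  */
        for (size_t i = 0; i < o->n_refs; i++)
            release(o->refs[i]);
        free(o->refs);
        free(o);
    }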

The second is slow and non-deterministic (its running time is hard to predict), but exact and uncompromising. In a traditional environment such a collector would require stopping the world, but there is a trick: if the collection is performed on a full copy of the system state, then anything found to be garbage in the copy will also be garbage in any later version of that state, no matter what has happened in the meantime. The beauty of the approach is that Phantom already implements persistence through snapshots - full copies of the state of the virtual machine's object memory. That is, an environment in which garbage can be collected calmly and unhurriedly already exists.

At this point the task seems trivial: on an “immobilized patient”, garbage can be collected in the simplest and most straightforward way possible. Right? Apparently not.

Question one: where to store the collector's intermediate state, and how large it may get. We have to assume the worst case - the system has painted itself into a corner and used up all of its disk space. How much space must we reserve for the collector so that it can finish its work?

The second question is the restartability of the algorithm. It would be tempting to implement the collector as an ordinary program under Phantom: it would then live in persistent memory, OS restarts would be invisible to it, its entire state would be preserved, and garbage collection would simply continue after a system restart. But because of the first requirement such an implementation can be dangerous: if memory runs out, a user process could “finish off” the collector and its work would stop. This could be solved with memory allocation quotas, but the current version of the OS does not have them; and in general, while the solution looks very elegant, from a debugging point of view it would most likely be hell.

Therefore, it would be better to base the garbage collector on some simple data structure that is easy to store and process linearly. For example, a singly linked list organized as a task list, from which the algorithm pulls atomic tasks and to which it appends new tasks while working on the one it has pulled out.

In doing so, it would be good to avoid modifying the original snapshot (the copy of system memory), and to make updates to the list itself atomic, so that switching the algorithm off at any point does not lead to failure.
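
A sketch of what such a task list might look like, assuming the list nodes live in storage reserved for the collector; the names (gc_task, append_children_of, persist) are invented for illustration and are not existing Phantom interfaces. The key property is the order of updates: a task is retired only after the tasks it spawns have been appended and made durable, so killing the collector at any moment costs at most one redone task.

    #include <stdint.h>
    #include <stddef.h>

    struct gc_task {                      /* one atomic unit of work        */
        uint64_t        obj_addr;         /* object to be visited           */
        struct gc_task *next;
    };

    struct gc_worklist {
        struct gc_task *head;             /* next task to pull              */
        struct gc_task *tail;             /* where new tasks are appended   */
    };

    /* Hypothetical primitives: enumerate an object's references and
     * durably record a pointer update in the reserved collector area. */
    void append_children_of(struct gc_worklist *wl, uint64_t obj_addr);
    void persist(void *ptr);

    void gc_step(struct gc_worklist *wl)
    {
        struct gc_task *t = wl->head;
        if (t == NULL)
            return;                        /* work list empty - done        */

        append_children_of(wl, t->obj_addr);   /* 1: spawn subtasks          */
        persist(&wl->tail);                    /*    make the append durable */

        wl->head = t->next;                    /* 2: only now retire the task;
                                                *    a crash between 1 and 2
                                                *    just redoes this task   */
        persist(&wl->head);
    }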

In general, a naive implementation suggests itself: a list of not-yet-traversed root objects (roots), a list of already visited objects (visited), and an algorithm that boils down to:

  1. Take an object from roots.
  2. If it is already in visited, skip it.
  3. Otherwise add it to visited and append every object it references to roots.
  4. Repeat until roots is empty; everything that never made it into visited is garbage.
Naturally, this is a very inefficient algorithm, but it can be optimized in fairly trivial ways. For example, like this: we make roots not a queue but a stack, and keep the tail (top) of that stack in memory, so that traversing large numbers of leaf objects proceeds without touching the on-disk part of the stack. The only important point is that objects enter the on-disk copy only as a whole and leave it only after all of their children have been traversed.
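
To make the discussion concrete, here is roughly what the naive version (before any such optimization) looks like; list_pop, list_contains, list_append and load_object are assumed primitives, not real Phantom calls. The linear scan inside list_contains is exactly the weakness discussed next.

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed primitive operations over on-disk lists and object memory. */
    struct list;
    int      list_empty(struct list *l);
    uint64_t list_pop(struct list *l);
    int      list_contains(struct list *l, uint64_t addr);   /* full scan    */
    void     list_append(struct list *l, uint64_t addr);
    struct object { size_t n_refs; uint64_t *refs; };
    struct object *load_object(uint64_t addr);               /* may hit disk */

    void mark_naive(struct list *roots, struct list *visited)
    {
        while (!list_empty(roots)) {
            uint64_t addr = list_pop(roots);

            if (list_contains(visited, addr))  /* O(n): scans all of visited */
                continue;

            list_append(visited, addr);

            struct object *o = load_object(addr);
            for (size_t i = 0; i < o->n_refs; i++)
                list_append(roots, o->refs[i]);
        }
        /* Anything whose address never made it into visited is garbage. */
    }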

The problem is that each step of this algorithm requires scanning the entire visited list, which is potentially the collapse of all hopes - a fiasco.

The trouble is that such a visited list cannot be usefully cached in memory - it always has to be scanned in its entirety. This suggests an obvious idea: make it not a list but a tree keyed by the object's address. Then looking up an object becomes logarithmic and, most importantly, fragments of the tree can be cached, since exhaustive scanning is no longer needed.
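
A sketch of that idea, with the visited set as a tree keyed by object address - a plain unbalanced binary tree here purely for brevity; an on-disk variant would more likely be a B-tree or similar so that whole pages of nodes can be cached. This is an assumption about the shape of the structure, not a description of existing Phantom code.

    #include <stdint.h>
    #include <stdlib.h>

    struct visited_node {
        uint64_t             addr;          /* object address is the key */
        struct visited_node *left, *right;
    };

    /* Returns 1 if addr was already present, 0 if it has just been added.
     * Lookup and insert touch only O(log n) nodes, so only the pages
     * along that path need to be in memory. */
    int visited_insert(struct visited_node **root, uint64_t addr)
    {
        while (*root != NULL) {
            if (addr == (*root)->addr)
                return 1;
            root = (addr < (*root)->addr) ? &(*root)->left : &(*root)->right;
        }
        *root = calloc(1, sizeof **root);
        if (*root == NULL)
            abort();          /* out of memory - handled elsewhere in reality */
        (*root)->addr = addr;
        return 0;
    }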

By the way, if you have ideas about such an algorithm, please write.

A separate question arises if disk I/O fails for some page. Strictly speaking, in such a situation garbage collection is impossible at all. Although, for the given snapshot, the loss of the page (and therefore of the references it contained) makes some objects unreachable - that is, formally garbage - it would be short-sighted simply to destroy them all.

Firstly, it would be good to have some analogue of lost+found for such situations, although it is entirely unclear how to implement it. Secondly, in the currently running system the corresponding page may be perfectly alive. So the fair thing to do is this: check whether the page exists in later snapshots or in memory. If it is in memory, force it into the snapshot even if it has not changed (normally Phantom, of course, does not rewrite unchanged pages), stop the collection and restart it on a later snapshot. If we are out of luck and the page is nowhere to be found, switch on recovery mode: at the end of a normal collection pass, heuristically search the garbage for subtrees of objects of appreciable size, exclude them from the garbage and “hang” them in a special place in the object hierarchy.
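
That decision procedure could look roughly like the sketch below; every name here is a hypothetical placeholder, since nothing of the sort exists in Phantom today.

    #include <stdint.h>

    enum gc_page_action {
        GC_RESTART_ON_LATER_SNAPSHOT,   /* the page is still recoverable  */
        GC_ENTER_RECOVERY_MODE          /* the page is gone for good      */
    };

    /* Hypothetical queries/actions against the pager and snapshot store. */
    int  page_in_memory(uint64_t page_no);
    int  page_in_later_snapshot(uint64_t page_no);
    void force_page_into_snapshot(uint64_t page_no);

    enum gc_page_action handle_unreadable_page(uint64_t page_no)
    {
        if (page_in_memory(page_no)) {
            /* Write it out even though it is clean, then retry the
             * collection on the snapshot that now contains it. */
            force_page_into_snapshot(page_no);
            return GC_RESTART_ON_LATER_SNAPSHOT;
        }
        if (page_in_later_snapshot(page_no))
            return GC_RESTART_ON_LATER_SNAPSHOT;

        /* Nowhere to be found: finish the normal pass, then heuristically
         * pick large orphaned subtrees out of the garbage and hang them
         * under a lost+found-like node instead of destroying them. */
        return GC_ENTER_RECOVERY_MODE;
    }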

What else is important?

In general, Phantom's virtual persistent memory subsystem and its bytecode virtual machine (the object environment) don't know a damn thing about each other. The second lives inside the first. And that's all.

But there is one fairly typical case that needs a link between them. The case looks simple: between two snapshots, a program in Phantom allocates, uses and frees a couple of gigabytes of objects. For example, it renders graphics and creates temporary data in the process. By the time the snapshot starts, this data is no longer needed or relevant. But the memory it occupied has been touched, that is, modified. From the pager's point of view, that is a reason to write such memory into the snapshot. In reality its contents are of no interest to anyone any more and, moreover, ought to be zeroed just to be safe. It would be logical, when freeing a large area of memory, to inform the paging subsystem that this region is not merely not dirty but not needed at all and does not have to be saved - and that when restoring from the snapshot it should be read back as pages of zeros.
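
What such a hint might look like from the object environment's side: vm_pager_forget() is an invented name for the missing interface (the closest POSIX analogue would be madvise() with MADV_FREE or MADV_DONTNEED), and the threshold is arbitrary.

    #include <stddef.h>

    /* Hypothetical pager interface: the pages of [addr, addr+len) are not
     * dirty, not needed, must not be written into the snapshot, and should
     * read back as zeros after a restore. */
    void vm_pager_forget(void *addr, size_t len);

    #define LARGE_FREE_THRESHOLD (1UL << 20)   /* 1 MB - arbitrary cutoff */

    /* Called by the object allocator when a region of object memory is
     * returned; illustrative only. */
    void object_arena_free(void *addr, size_t len)
    {
        /* ... normal bookkeeping of the object allocator ... */

        if (len >= LARGE_FREE_THRESHOLD)
            vm_pager_forget(addr, len);
    }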

Again this seems trivial, but it is not. The previous article discussed the problem of deleting objects whose reference count has reached zero, and the fact that this can only be done after all threads have crossed a virtual machine instruction boundary.

Technically, the easiest way to do this is to perform the deletion right after a snapshot.

Why tie it to a snapshot? Because a snapshot is guaranteed to suspend all threads on a bytecode instruction boundary. Why after, and not during? Because the synchronous part of the snapshot must run in a way that is invisible to the running programs.

But doing it after is too late, because by then the “unnecessary” pages will nevertheless have made it into the snapshot.

As a result, this means that you need to perform a not very obvious chain of operations:

  1. Suspend all threads on an instruction boundary and immediately "release" them
  2. Free the objects with a zero reference count whose deletion was requested before this stop (checking that each count is still zero)
  3. Suspend all threads on an instruction boundary again
  4. Execute the synchronous (in-memory) part of the snapshot
  5. "Release" the stopped threads
  6. Quietly finish the asynchronous part of the snapshot (the I/O)
On top of all this we also need to protect the snapshot on which garbage collection is running from being deleted, along with a mechanism that allows keeping at least three such partially locked snapshots - and perhaps a few more that are kept for backup / time-machine purposes.

At this point let me put down a semicolon and ask a question: what statistics about the state of the object environment would you consider useful to gather during garbage collection? We are traversing all the objects anyway, so some analysis along the way comes almost for free.

Source: https://habr.com/ru/post/331974/

