
Persistent OS: nothing is blocked

This article is a question. I do not have a perfect answer to the problem described here. I do have an answer, but how good it is remains unclear.

The article deals with one of the conceptual problems of Phantom OS, or of any other system that has both persistent and "volatile" components.

To understand the essence of the problem, it helps to read one of the previous articles, the one about persistent RAM.
Brief statement of the problem: in Phantom OS an application program is persistent (it is not restarted on reboot), while the kernel is not (it is restarted on reboot and may change between runs). Because of this, such a system cannot have blocking system calls, at least not in the usual way.

Indeed: if an application program calls into the kernel and we take a snapshot in that state, it is entirely unclear how to restore from such a snapshot. The kernel is not captured in the snapshot; only application memory is saved and restored. It is unclear where in the kernel the program was. It is unclear whether the saved entry point still refers to the right place in the kernel at all. It is unclear which application-level objects the kernel touched or created for itself.

Separately, it is unclear whether a snapshot can be taken in such a state at all: the kernel might be touching objects while we are writing them to disk.

To begin, let us describe the interfaces through which the kernel and the object environment access each other's data.

The object environment has an interface to the kernel in the form of built-in (internal) classes, the equivalent of native methods in Java. These classes are implemented in the kernel as C functions, one per method. Such functions must not block: they must return as soon as possible, and while they run, snapshots are impossible. This is sufficient for simple methods like window.paintLine() or string.concat(), but no more.

A trivial example (source):

    static int si_string_8_substring(struct pvm_object me, struct data_area_4_thread *tc)
    {
        DEBUG_INFO;
        ASSERT_STRING(me);
        struct data_area_4_string *meda = pvm_object_da( me, string );

        int n_param = POP_ISTACK;
        CHECK_PARAM_COUNT(n_param, 2);

        int parmlen = POP_INT();
        int index = POP_INT();

        if( index < 0 || index >= meda->length )
            SYSCALL_THROW_STRING( "string.substring index is out of bounds" );

        int len = meda->length - index;
        if( parmlen < len ) len = parmlen;

        if( len < 0 )
            SYSCALL_THROW_STRING( "string.substring length is negative" );

        SYSCALL_RETURN(pvm_create_string_object_binary( (char *)meda->data + index, len ));
    }


A regular object contains nothing but references, whereas an internal-class object holds an arbitrary data structure that the virtual machine cannot reach through ordinary instructions but that is accessible inside methods written in C. In this example that structure is struct data_area_4_string.

Obviously, such methods can work with kernel data structures if they are given access to them. The reverse is not true: they cannot leave a reference to themselves in the kernel. Or rather they can, but with some caveats.

There are two of them.

First, the garbage collector must know that an object is reachable from the kernel and must not "collect" it, even if all references from the object world are gone.

Second, the kernel must not touch the data of such an object while the key operation of taking a snapshot is in progress. This is nontrivial and, it seems, can only be achieved by globally stopping all threads except the snapshot thread. Fortunately, that takes only a few milliseconds.

As for the garbage collector, here is how it is implemented. The root of the object environment contains, among other objects, the so-called restart list, a simple list of objects. Any internal-class object can be added to it:

    void pvm_add_object_to_restart_list( pvm_object_t o );
    void pvm_remove_object_from_restart_list( pvm_object_t o );


Being on this list (and any object the kernel works with should be on it) serves two purposes. First, it guarantees that there is a reference to the object "on behalf of the kernel": this reference prevents the GC from collecting the object even if all other objects have forgotten about it.

Second, it solves the following problem. Suppose we created a "device" object, worked with it, and it got into a snapshot, after which the system was rebooted by a reset. When the kernel restarts, it must somehow learn about the object and either restore the object's connection with the kernel or inform it that it cannot be revived (for example, because the corresponding device has been removed from the computer).

To do this, after the kernel restarts but before the object environment is resumed, the kernel sets the old restart list aside, creates a new empty one, and walks over all the objects in the old list. For each one it calls a restart function, which must either reconnect the object to the kernel or inform the object that it is dead. When the function returns, the object is dropped from the old restart list. If the restart function reconnected the object to the kernel, it will have put the object on the new restart list. If not, the reference to the object "on behalf of the kernel" disappears. If that was the last reference, the object will be deleted, because nobody needs it any more.

(See root.h)
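
The restart pass described above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the code from root.h: the types kobject and restart_list, the device_present flag, and the function names are all hypothetical stand-ins, and a fixed-size array replaces the real persistent list.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RESTART 16

/* Hypothetical stand-in for an internal-class object tied to kernel state. */
typedef struct kobject {
    int device_present; /* pretend hardware probe: 1 if the device still exists */
    int connected;      /* 1 while the object is bound to the running kernel */
    int dead;           /* 1 once the object has been told it cannot be revived */
} kobject;

/* Hypothetical restart list: the kernel's references into the object world. */
typedef struct restart_list {
    kobject *items[MAX_RESTART];
    size_t count;
} restart_list;

static void restart_list_add(restart_list *l, kobject *o)
{
    assert(l->count < MAX_RESTART);
    l->items[l->count++] = o;
}

/* Per-object restart function: reconnect and re-register, or mark dead. */
static void kobject_restart(restart_list *new_list, kobject *o)
{
    if (o->device_present) {
        o->connected = 1;
        restart_list_add(new_list, o); /* kernel keeps its reference */
    } else {
        o->dead = 1;                   /* kernel's reference is not renewed */
    }
}

/* Runs after kernel start, before the object environment is resumed. */
static void process_restart_list(restart_list *root)
{
    restart_list old = *root; /* set the old list aside */
    root->count = 0;          /* start a new, empty list */

    for (size_t i = 0; i < old.count; i++) {
        kobject *o = old.items[i];
        o->connected = 0;     /* the old kernel-side binding died with the old kernel */
        kobject_restart(root, o);
    }
    /* 'old' goes out of scope here: objects that did not re-register
       are no longer referenced on behalf of the kernel. */
}
```

In this sketch an object whose device has vanished simply does not reappear on the new list; whether it is then collected depends on the remaining references from the object world.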

Good, but not enough. We probably still want to be able to call something like read() from the object environment and wait for the result, without blocking the virtual machine thread in a direct call inside an instruction.

I considered three options for implementation.

  1. Intermediate stop: a blocking system call is split into a pair of calls, one that initiates the operation and one that reads the result. Between them, at an instruction boundary, the virtual machine blocks. If a snapshot and restart happen, the machine resumes from the second instruction and gets an explicit failure.
  2. Callback: when a long operation completes, the object environment receives a callback from the kernel.
  3. Pseudo-completed operation: a blocking call behaves exactly like a blocking call: it enters the kernel and waits there as long as needed. But before doing so, the call pretends that it has already completed: it puts a null reference on the stack, as if the implementation had executed return null;. When the work is finished, it removes this null and replaces it with the actual result.
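
For comparison, the first option (intermediate stop) might look roughly like the sketch below. Everything here is hypothetical: pending_op and the op_* functions are illustrative names, not Phantom APIs. The point is only that the state of the request lives in persistent memory, so after a kernel restart the second call can see that the request was lost and report an explicit failure.

```c
#include <assert.h>

typedef enum { OP_IDLE, OP_PENDING, OP_DONE, OP_FAILED } op_state;

/* Assumed to live in persistent memory, so it survives snapshot and restart. */
typedef struct pending_op {
    op_state state;
    int      result;
} pending_op;

/* Instruction 1: initiate the operation and return immediately. */
static void op_start(pending_op *op)
{
    op->state = OP_PENDING;
    op->result = 0;
}

/* Called by the (volatile) kernel when the I/O completes. */
static void op_complete(pending_op *op, int value)
{
    if (op->state == OP_PENDING) { /* late completions are ignored */
        op->result = value;
        op->state = OP_DONE;
    }
}

/* Called for every known operation after a kernel restart:
   the kernel that was serving the request no longer exists. */
static void op_kernel_restarted(pending_op *op)
{
    if (op->state == OP_PENDING)
        op->state = OP_FAILED;
}

/* Instruction 2: fetch the outcome.
   Returns 1 and stores the value on success, -1 while still pending
   (the VM would keep waiting), 0 on explicit failure or if nothing was started. */
static int op_fetch(pending_op *op, int *out)
{
    switch (op->state) {
    case OP_DONE:
        *out = op->result;
        op->state = OP_IDLE;
        return 1;
    case OP_PENDING:
        return -1;
    default: /* OP_FAILED or OP_IDLE */
        op->state = OP_IDLE;
        return 0;
    }
}
```

The inconvenience is visible even in this toy form: every caller has to issue two calls and handle the "failed because of restart" case itself.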


The last option is the one currently implemented. I found the other two extremely inconvenient to use.

It is imperfect too: it would be better, of course, to restart the request when the kernel restarts rather than fail it. In principle, that can be done as well.

I will explain in more detail why all this is important.

The virtual machine (interpreter) runs in a loop, executing instruction after instruction. When the kernel restarts, the virtual machines restart too: the kernel goes through all .internal.thread objects and starts them, in the state they were in when the kernel came up.

What state is that? It is the state they were in when the snapshot was taken. Obviously, that moment must be such that, figuratively speaking, a longjmp from that point to the entry point of the interpreter function does no fatal harm to the system.

(Where control arrives after restart)

Accordingly, if we block the interpreter thread (or JIT code, it makes no difference), the state must be complete: always at an instruction boundary, with everything that matters to the object environment stored in persistent memory.
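
To make the "instruction boundary" requirement concrete, here is a toy stack machine in C. It is a sketch under loud assumptions: the instruction set, vm_state, vm_step and vm_run are invented for illustration and have nothing to do with the real Phantom interpreter. The only property it demonstrates is that when all interpreter state lives in one structure and is copied only between instructions, execution can be resumed from the copy with the same result.

```c
#include <assert.h>
#include <stddef.h>

enum { INSN_PUSH, INSN_ADD, INSN_MUL, INSN_HALT };

typedef struct insn { int op; int arg; } insn;

/* All interpreter state lives here; think of it as persistent memory. */
typedef struct vm_state {
    int pc;
    int sp;
    int stack[16];
    int halted;
} vm_state;

/* Executes exactly one instruction. Nothing survives in C locals
   between calls, so the state is complete at every boundary. */
static void vm_step(vm_state *vm, const insn *code)
{
    insn i = code[vm->pc++];
    int a, b;

    switch (i.op) {
    case INSN_PUSH:
        vm->stack[vm->sp++] = i.arg;
        break;
    case INSN_ADD:
        b = vm->stack[--vm->sp];
        a = vm->stack[--vm->sp];
        vm->stack[vm->sp++] = a + b;
        break;
    case INSN_MUL:
        b = vm->stack[--vm->sp];
        a = vm->stack[--vm->sp];
        vm->stack[vm->sp++] = a * b;
        break;
    default: /* INSN_HALT */
        vm->halted = 1;
        break;
    }
}

/* Runs until HALT and returns the top of the stack. If 'snap' is not NULL,
   the state is copied into it after 'snap_after' instructions,
   always between two instructions. */
static int vm_run(vm_state *vm, const insn *code, int snap_after, vm_state *snap)
{
    int executed = 0;

    while (!vm->halted) {
        if (snap != NULL && executed == snap_after)
            *snap = *vm; /* instruction boundary: safe point for a snapshot */
        vm_step(vm, code);
        executed++;
    }
    return vm->stack[vm->sp - 1];
}
```

Resuming is then just calling vm_run again on the copied state, which is the toy equivalent of jumping back to the interpreter entry point after a restart.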

(Full code: a function that implements a blocking call)

Here is what is done to achieve this.

To start with, we pretend that the virtual machine instruction has already been executed. That is, we read the parameters from the stack and push an "erroneous" return value, null, onto the stack. If a snapshot is taken and we are then killed, this will be the result of the instruction in the saved state of the virtual machine.

    int n_param = POP_ISTACK;
    CHECK_PARAM_COUNT(n_param, 2); // exception if not 2 parameters

    int nmethod = POP_INT();
    pvm_object_t arg = POP_ARG;

    // push zero to obj stack
    pvm_ostack_push( tc->_ostack, pvm_create_null_object() );


After that we are almost free to do what we want. Almost.

The point is that, to take a snapshot, the virtual memory subsystem stops all virtual machine threads and checks that they have stopped. Whether it is doing so right now or does so later, we tell it that we have, in effect, stopped: we really have stopped executing virtual machine code. And before all this, we tell the interpreter to write all its cached variables back into the objects where they belong (save_fast_acc). At the same time we check whether there is a request to stop the virtual machines altogether (shutdown); if there is, we carry it out.

    pvm_exec_save_fast_acc(tc); // Before snap

    if(phantom_virtual_machine_stop_request)
        hal_exit_kernel_thread();

    hal_mutex_lock( &interlock_mutex );
    phantom_virtual_machine_threads_stopped++;
    phantom_virtual_machine_threads_blocked++;
    hal_cond_broadcast( &phantom_snap_wait_4_vm_enter );
    hal_mutex_unlock( &interlock_mutex );


Now we perform the request itself. When it finishes, we release the variable (the reference to the argument).

    // now do syscall - can block
    pvm_object_t ret = syscall_worker( this, tc, nmethod, arg );

    ref_dec_o( arg );


We then inform the snapshot subsystem that we have finished and want to return to the interpreter. If it objects (a snapshot is in progress), we sleep until it wakes us.

    hal_mutex_lock( &interlock_mutex );

    if(phantom_virtual_machine_snap_request)
        hal_cond_wait( &phantom_vm_wait_4_snap, &interlock_mutex );

    phantom_virtual_machine_threads_stopped--;
    phantom_virtual_machine_threads_blocked--;
    hal_cond_broadcast( &phantom_snap_wait_4_vm_leave );
    hal_mutex_unlock( &interlock_mutex );


Everything is done: we remove the fake return value from the virtual machine stack and push the real one.

    // pop zero from obj stack
    pvm_ostack_pop( tc->_ostack );

    // push ret val to obj stack
    pvm_ostack_push( tc->_ostack, ret );
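
The fragments above can be condensed into a self-contained toy that shows the placeholder trick end to end. This is a sketch with invented names (ostack, blocking_syscall, demo_worker, snap_ctx), not the Phantom implementation: objects are modelled as C strings, the null object as NULL, and the "snapshot" is a plain copy of the stack made while the call is still in progress.

```c
#include <assert.h>
#include <stddef.h>

/* Toy object stack: NULL plays the role of the null object. */
typedef struct ostack {
    const char *slot[8];
    int top;
} ostack;

static void ostack_push(ostack *s, const char *o)
{
    assert(s->top < 8);
    s->slot[s->top++] = o;
}

static const char *ostack_pop(ostack *s)
{
    assert(s->top > 0);
    return s->slot[--s->top];
}

/* Hypothetical worker type: the part of the call that is allowed to block. */
typedef const char *(*worker_fn)(const char *arg, void *ctx);

static void blocking_syscall(ostack *s, worker_fn worker, void *ctx)
{
    const char *arg = ostack_pop(s); /* 1. consume the argument */
    ostack_push(s, NULL);            /* 2. pretend we already returned null */

    /* From here on the saved state looks like a completed instruction,
       so a snapshot may be taken at any moment. */
    const char *ret = worker(arg, ctx); /* 3. may block for a long time */

    (void)ostack_pop(s);             /* 4. drop the placeholder */
    ostack_push(s, ret);             /* 5. publish the real result */
}

/* Demo only: a worker that "takes a snapshot" of the stack mid-call. */
typedef struct snap_ctx {
    const ostack *live;
    ostack saved;
} snap_ctx;

static const char *demo_worker(const char *arg, void *ctx)
{
    snap_ctx *c = ctx;
    c->saved = *c->live; /* what a snapshot taken right now would contain */
    return arg != NULL ? "result" : NULL;
}
```

A thread restored from the copy made inside demo_worker would see a call that returned null, while the thread that ran to completion sees the real result.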


On the whole, this implementation works. But it contains a subtle error. The snapshot subsystem checks not only that all threads have gone to sleep before the snapshot is taken, but also that they have all woken up afterwards. It is easy to see that if some thread is blocked forever, it will also stall snapshots forever (because it never "wakes up").
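
The stall can be reproduced with a deliberately simplified model of the counters. This is an assumption-laden sketch, not Phantom's snapshot code: snap_sync and the four functions are hypothetical, and real condition variables are replaced by plain predicates so that the logic can be followed step by step.

```c
#include <assert.h>

/* Simplified model of the handshake counters. */
typedef struct snap_sync {
    int vm_threads; /* total number of VM threads */
    int stopped;    /* threads that reported "not executing VM code" */
} snap_sync;

/* A thread parks for a snapshot or enters a blocking call. */
static void thread_stop(snap_sync *s)
{
    assert(s->stopped < s->vm_threads);
    s->stopped++;
}

/* A thread goes back to interpreting. */
static void thread_resume(snap_sync *s)
{
    assert(s->stopped > 0);
    s->stopped--;
}

/* Snapshot precondition: nobody is executing VM code. */
static int snapshot_may_start(const snap_sync *s)
{
    return s->stopped == s->vm_threads;
}

/* Snapshot postcondition in this model: everybody is running again. */
static int snapshot_may_finish(const snap_sync *s)
{
    return s->stopped == 0;
}
```

With two threads, one parked for the snapshot and one sitting in a blocking call, snapshot_may_start() holds, but after the parked thread resumes snapshot_may_finish() stays false until the blocked call returns, which may be never.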

An attempt to solve this problem head-on (count the blocked threads and take them into account when counting stopped and running ones) did not succeed: the integrity of the object state was violated.

Perhaps there are other pitfalls in this solution that I do not see yet.

With that I end my allotted speech and go have some tea. :)

Source: https://habr.com/ru/post/282263/

