Implementing fork () without MMU

Hello, reader! A couple of years ago, in an article about vfork (), I promised to tell you about fork () implementation for systems without an MMU, but my hands only reached this point :)

In this article I will tell how we implemented such a strange fork() . I will check the performance on a third-party program - the dash - interpreter, which uses fork() to run applications.

Who cares, I ask under the cat.

A legitimate question may arise - why bother to do some kind of trimmed version of fork() , if vfork() exists for this purpose, which allows you to create processes without using the MMU? In fact, there are a number of applications that do not use the copying of the address space to the fullest, but vfork() not enough for them (remember that when using vfork() in a descendant, even the stack cannot be touched, i.e. return from a function or change local variables).
')
dash - POSIX-compatible lightweight shell - just serves as an example of such a program. If you simplify, when calling third-party programs, dash uses fork() , after which in both processes (parent and child) it modifies some static data, performs some heap operations, then the child process calls the required program using execv() , and the parent calls waitpid() and waiting for the completion of the descendant.

A small educational program for those who do not understand what all these words mean, but for some reason continued to read the article: on UNIX systems, processes are created using the fork() system call. By definition from POSIX , fork() must create an exact copy of the process with the exception of some variables. If successful, the function returns zero to the child process and the number of the child process to the parent (after this, the processes begin to “live their own life”). It turns out that two processes start working in the system with the same variable addresses, the same stack address, and so on. However, the data is not “confused” between processes through the use of virtual memory: a copy of the parent's data is created, and the descendant process refers to other physical addresses. Read more about virtual memory here .

Modern MMUs allow you to set permissions for memory pages; therefore, to create a process, it is sufficient to simply clone the translation table, marking the pages with the flag copy-on-write, and when trying to write to this page, clone the data for real. The absence of the MMU not only makes it impossible to use this optimization, but also calls into question the possibility of implementing fork() in principle. Without hardware support for virtual memory, programs use physical addresses, rather than virtual ones, which means that the process obtained using fork() will inevitably refer to the same data as the parent process.

In a broad sense, the address space of a process is the entire memory mapped for a process. This includes a segment with a program code, a heap, stacks of threads, partly kernel memory, peripheral memory, and so on. In the article, the address space will be understood as a heap, static data (here and further static data refers to data from the .bss and .data sections) and the process stack, that is, data that can be changed in both processes. It is with these data that you will have to deal with in order to avoid confusion between the work of the "forked".

Since this is a fork() implementation in a multitasking OS, the switching of the process context plays an important role. About switching contexts of objects that divide the CPU time, it is written in our article “Organization of multitasking in the OS kernel” . In this case, the address space must also be switched. If in systems with an MMU this is, roughly speaking, a change in the pointer to the translation table, then in the absence of the MMU, it will be more difficult to ensure the correct operation of the processes.

One way is to replace values on the stack, on the heap, and in static memory when switching processes. This, of course, is very slow, but simple enough. Actually, this is the idea for the implementation of our fork() without MMU. We memorize the necessary data for the "forked" processes and when switching, we copy the values into the working address space. For stack, heap, and static data, this is done differently.

Stack

In our OS, memory for stacks of threads is not allocated dynamically, but from a static pool, respectively, some part of .bss contains space for stacks. We are interested only in those stacks, the corresponding flows of which are "forked". Copy processes, as defined by POSIX, should have only one thread (a copy of the one in which the corresponding system call occurred), but fork() can be called from different threads at different times, so you need to keep a list of threads and stacks for each address space. which need to be saved.

A pile

Heap data is stored in a buffer that is allocated from the static page pool.

.bss and .data

Embox (along with all applications) is compiled into one static image, therefore information about the .bss and .data sections is not taken from any ELF file, but is stored in the general section under a unique name (for example, __module_embox__cmd__shell_data_vma ). Information about which data belongs to which application is stored in a special structure, which is set at compile time.

 struct mod_app { char *data; size_t data_sz; char *bss; size_t bss_sz; };

During the execution of the current process, you can find out where its data is stored.

When you run the program directly, you need to copy the corresponding data to a certain area of memory (so that on subsequent program launches the initial values of the variables are correct).

We are going to copy these pieces of memory into the allocated section of system memory.

fork

The above is enough to understand exactly how the fork() call works.
The fork function itself is architecturally dependent. In assembly language, the transfer of registers is implemented as an argument to the fork_body() call, a function that implements the logic of the system call.

Although most readers are more familiar with the x86 command set, I’ll provide an implementation for ARM, as it is much shorter and clearer.

Registers are saved in the pt_regs structure:

 typedef struct pt_regs { int r[13]; /*    */ int lr; /*   */ int sp; /*   */ int psr; /*  , ARM-  */ } pt_regs_t;

Source code of the fork () function

 /*     ,    68  */ sub sp, sp, #68 /*  13       */ stmia sp, {r0 - r12, lr} /*  SP */ str sp, [sp, #56] /*   CPSR    ,   CPSR  r0*/ mrs r0, cpsr; /*  CPSR   */ str r0, [sp, #60]; /*      r0    */ mov r0, sp /*   -   */ b fork_body

As you can see, in fact, the correct SP value is not saved in ptregs, but shifted by 68 bytes (into which we put the pt_regs_t structure). We will take this into account when restoring registers.

The architecture-independent part of the fork () call

 void _NORETURN fork_body(struct pt_regs *ptregs) { struct addr_space *adrspc; struct addr_space *child_adrspc; struct task *parent; pid_t child_pid; struct task *child; assert(ptregs); parent = task_self(); assert(parent); child_pid = task_prepare(""); if (0 > child_pid) { ptregs_retcode_err_jmp(ptregs, -1, child_pid); panic("%s returning", __func__); } adrspc = fork_addr_space_get(parent); if (!adrspc) { adrspc = fork_addr_space_create(NULL); fork_addr_space_set(parent, adrspc); } /* Save the stack of the current thread */ fork_stack_store(adrspc, thread_self()); child = task_table_get(child_pid); child_adrspc = fork_addr_space_create(adrspc); /* Can't use fork_addr_space_store() as we use * different task as data source */ fork_stack_store(child_adrspc, child->tsk_main); fork_heap_store(&child_adrspc->heap_space, task_self()); fork_static_store(&child_adrspc->static_space); memcpy(&child_adrspc->pt_entry, ptregs, sizeof(*ptregs)); sched_lock(); { child = task_table_get(child_pid); task_start(child, fork_child_trampoline, NULL); fork_addr_space_set(child, child_adrspc); thread_stack_set(child->tsk_main, thread_stack_get(thread_self())); thread_stack_set_size(child->tsk_main, thread_stack_get_size(thread_self())); } ptregs_retcode_jmp(ptregs, child_pid); sched_unlock(); panic("%s returning", __func__); }

A call to the ptregs_retcode_jmp() function will result in a return to the parent process. In turn, the child process will use the same call when the process starts.

 static void *fork_child_trampoline(void *arg) { struct addr_space *adrspc; adrspc = fork_addr_space_get(task_self()); fork_stack_restore(adrspc, stack_ptr()); ptregs_retcode_jmp(&adrspc->pt_entry, 0); panic("%s returning", __func__); }

After the child calls execv (), there is no longer any need to support intersecting address spaces, and, accordingly, you do not need to copy anything when switching context.

Health check

Actually, this functionality was quite enough for dash :)

To check in Embox, you can run the template x86 / qemu

 git clone https://github.com/embox/embox.git cd embox make confload-x86/qemu make ./scripts/qemu/auto_qemu

Then we call dash and inside it you can call other commands, for example, ping.

Most likely, “poking” the dash will be able to achieve some exception, do not hesitate to create an issue in our repository :)

Source: https://habr.com/ru/post/347542/

All Articles