We offer you some great New Year reading for programmers :) This article by Alexander Chistyakov (alexclear) was inspired by the theses of Mons Anderson's (codesign) talk at HighLoad++ 2017.

Let's talk about the principles behind asynchronous solutions and consider the classification proposed by Mons Anderson. We may even be able to offer a classification of our own.
To classify the existing solutions, we first need to invent the coordinate axes. From a development engineer's point of view, the “synchronous” and “asynchronous” paradigms are based on abstractions that differ both in ease of use and in “efficiency” (we have yet to define what “efficiency” means).
Careful: hardcore under the cut!
As for ease of use, synchronous programming is considered simpler than asynchronous. First, the synchronous paradigm does not disturb the order of instructions in the algorithm. Second, it is easier to control the scope of variables.
Asynchronous paradigm
To understand what problems the asynchronous paradigm solves, let's look at how a program works at a low level.
OS Scheduler
From the point of view of the operating system, the program's algorithm utilizes computer resources, the main ones being the processor (CPU) and memory. The operating system scheduler is responsible for assigning the process to one of the processor cores.
When the process running on a processor core is changed, a context switch must be performed: all current CPU registers are saved to memory, and the registers of the process or thread scheduled for execution are restored from memory. On modern processors this operation takes about 200 cycles, which is quite expensive.
I/O operations are performed using the DMA mechanism, so the CPU is not involved. Although in Linux a process performing I/O is formally accounted as assigned to the CPU, in fact such a process does not occupy the processor: it ends up in uninterruptible sleep (the D state). While one process waits for I/O, the scheduler assigns another process to the processor.
From the OS scheduler's point of view, OS threads (at least in Linux) are just processes: each of them has its own independent context, and a blocking I/O operation in one thread does not affect the work of the other threads.
Thus, an OS thread can be removed from execution on the processor in two cases:
- the time slice allotted for its execution has expired;
- the thread performed a synchronous I/O operation (one that must wait for its completion).
Java
The classical model of synchronous parallel processing (as in modern Java) maps language-level threads onto operating system threads in the language runtime (virtual machine). Strictly speaking, a runtime and a virtual machine are different concepts, but the distinction is not important here.
If our program serves client connections, then each client will require a separate operating system thread to maintain its connection. In this case, the number of context switches between operating system threads is proportional to the number of client connections (see “efficiency” above).
Suppose we need to serve 10,000 client connections: we would have to spawn 10,000 operating system threads, which is technically possible.
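To make the model concrete, here is a minimal sketch in C of a thread-per-connection echo server using POSIX threads, the same scheme a Java runtime uses under the hood; the port number and the echo logic are arbitrary illustrations, not anything from the talk.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Each client gets its own OS thread; a blocking read in
 * one thread does not stall the others. */
static void *handle_client(void *arg) {
    int fd = *(int *)arg;
    free(arg);
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)  /* blocking read */
        write(fd, buf, n);                       /* echo it back  */
    close(fd);
    return NULL;
}

int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);            /* arbitrary example port */
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 128);
    for (;;) {
        int *fd = malloc(sizeof *fd);
        *fd = accept(srv, NULL, NULL);      /* one thread per client */
        pthread_t t;
        pthread_create(&t, NULL, handle_client, fd);
        pthread_detach(t);
    }
}
```

Serving 10,000 clients this way means 10,000 such threads, each with its own stack and its own share of context switches.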
Perl
A multithreading model like Java's does not work in most modern language interpreters (Python, Ruby; threads in Ruby are not operating system threads at all) because of the GIL. Standard threads in Perl are implemented on the same principle as in Java: a Perl thread is an OS thread, but each thread is executed by its own copy of the interpreter (see “efficiency” again: this is much less efficient than threads in Java).
Suppose we need to serve the same 10,000 connections: now we would have to spawn 10,000 separate copies of the Perl interpreter, which is technically impossible.
N:1
To avoid context switches during mass I/O for a large number of connected clients, cooperative multitasking is emulated directly at the level of the runtime or the language interpreter instead of using OS threads.
In this case, the “threads” exist only within the interpreter process, and the OS scheduler knows nothing about them. Therefore, the interpreter itself must switch between such threads, and it does this in two cases: either the thread explicitly executes a yield command (or something similar) that transfers control to another thread, or the thread performs an I/O operation. Such a multithreading model is called “N:1”: N interpreter-level threads run on a single OS kernel thread.
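To illustrate the yield half of this mechanism, here is a toy N:1 sketch in C using the POSIX ucontext(3) API: two “green threads” inside one OS thread that pass control to each other explicitly. Real interpreters do this inside the VM, but the switching mechanics are the same.

```c
#include <stdio.h>
#include <ucontext.h>

/* Two coroutines in one OS thread; control moves only when
 * a coroutine explicitly yields via swapcontext(). */
static ucontext_t main_ctx, ctx_a, ctx_b;

static void task_a(void) {
    for (int i = 0; i < 3; i++) {
        printf("A step %d\n", i);
        swapcontext(&ctx_a, &ctx_b);    /* explicit yield to B */
    }
}

static void task_b(void) {
    for (int i = 0; i < 3; i++) {
        printf("B step %d\n", i);
        swapcontext(&ctx_b, &ctx_a);    /* explicit yield to A */
    }
}

int main(void) {
    char stack_a[64 * 1024], stack_b[64 * 1024];

    getcontext(&ctx_a);
    ctx_a.uc_stack.ss_sp = stack_a;
    ctx_a.uc_stack.ss_size = sizeof stack_a;
    ctx_a.uc_link = &main_ctx;          /* return here when A finishes */
    makecontext(&ctx_a, task_a, 0);

    getcontext(&ctx_b);
    ctx_b.uc_stack.ss_sp = stack_b;
    ctx_b.uc_stack.ss_size = sizeof stack_b;
    ctx_b.uc_link = &main_ctx;
    makecontext(&ctx_b, task_b, 0);

    swapcontext(&main_ctx, &ctx_a);     /* start the first coroutine */
    return 0;
}
```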
However, if the I/O operation is synchronous, the OS-level thread will fall into the D state and be removed from execution on the processor until the I/O operation completes. As a result, all N threads running inside this OS thread are blocked until the I/O in one of them finishes (see “efficiency”).
Callbacks
Fortunately, the kernel has an asynchronous (with some reservations) I/O mechanism: the calling OS-level thread does not wait for the I/O operation to finish but continues execution, and when the operation completes, a callback registered by the user is invoked.
To use such I/O, it is sufficient to put the socket into non-blocking mode using the fcntl(2) system call.
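A minimal sketch of that call: read the current descriptor flags and add O_NONBLOCK. After this, read(2) on the socket returns immediately with EAGAIN/EWOULDBLOCK instead of putting the thread to sleep.

```c
#include <fcntl.h>
#include <unistd.h>

/* Switch a socket (or any file descriptor) to non-blocking mode. */
static int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);   /* read current flags */
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```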
Imagine that we need to serve 10,000 connections: to do this, we would have to issue a read on each of 10,000 open sockets.
To increase efficiency, a mechanism was invented that combines open file descriptors into a common data structure in the kernel and performs asynchronous I/O operations on the whole group of sockets at once. Initially, the select(2) system call was that mechanism. The problem with select is that when an event occurs, you have to loop over all registered file descriptors to determine which of them triggered the event: the algorithmic complexity of this scan is proportional to the number of open file descriptors.
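Here is a sketch of that linear scan in C: every descriptor is re-registered before each call, and after select returns, every descriptor must be tested with FD_ISSET to find the ready ones.

```c
#include <sys/select.h>
#include <unistd.h>

/* Wait for readability on nfds sockets, then scan them all:
 * the second loop is the O(n) cost described above. */
static void serve_with_select(int *fds, int nfds) {
    char buf[4096];
    for (;;) {
        fd_set readable;
        FD_ZERO(&readable);
        int maxfd = -1;
        for (int i = 0; i < nfds; i++) {        /* register every fd */
            FD_SET(fds[i], &readable);
            if (fds[i] > maxfd) maxfd = fds[i];
        }
        select(maxfd + 1, &readable, NULL, NULL, NULL);
        for (int i = 0; i < nfds; i++)          /* O(n) scan per wakeup */
            if (FD_ISSET(fds[i], &readable))
                read(fds[i], buf, sizeof buf);  /* ready: won't block */
    }
}
```

On top of the O(n) scan, select is limited to FD_SETSIZE descriptors (typically 1024), so our 10,000 sockets would not even fit into an fd_set.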
To guarantee constant algorithmic complexity of finding the ready sockets, the kqueue (FreeBSD) and epoll(7) (Linux) mechanisms were implemented.
When using epoll, the OS execution thread is busy registering and deleting open file descriptors, preparing asynchronous I/O calls, and processing triggered callbacks. If your program does not use yield, it becomes critical to avoid CPU-intensive calculations between I/O operations, since they disrupt the fair distribution of processor time between the interpreter-level (or runtime-level) threads.
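A minimal sketch of such a loop in C (the struct conn context and the callback slot are illustrative assumptions): descriptors are registered once with epoll_ctl, and each epoll_wait wakeup returns only the descriptors that are actually ready, together with a user pointer carrying the per-connection context.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Per-connection context, attached to the descriptor at
 * registration time and handed back with every event. */
struct conn {
    int fd;
    /* ... parser state, buffers, callback pointers ... */
};

/* Registration is done once per descriptor, not on every wait.
 * epfd is obtained from epoll_create1(0). */
static void register_conn(int epfd, struct conn *c) {
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.ptr = c;                         /* our context travels along */
    epoll_ctl(epfd, EPOLL_CTL_ADD, c->fd, &ev);
}

static void event_loop(int epfd) {
    struct epoll_event events[64];
    char buf[4096];
    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {        /* only the ready fds */
            struct conn *c = events[i].data.ptr;
            read(c->fd, buf, sizeof buf);    /* non-blocking read */
            /* ... invoke the callback associated with c ... */
        }
    }
}
```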
Golang and Node.JS
We have just described the mechanism behind the Golang runtime. The only difference is that multithreading in Golang is not N:1 but N:M, where M is the number of processor cores. The Golang runtime can switch goroutines automatically not only at the moment of I/O but also at the moment of calling other functions (while an infinite loop without such calls utilizes 100% of the processor time in the corresponding OS thread and will never be preempted by the runtime).
The Node.JS interpreter is also built around epoll (more precisely, around an event loop in the style of nginx), but it uses the N:1 model and therefore does not scale beyond a single core.
In some cases, a scheduler similar to the Golang runtime scheduler is implemented as a library or a transpiler (for example, Coro in Perl, or async/await in JS via Babel), which makes it possible to use coroutines in languages that lack support for them at the interpreter level.
Attempt to classify
Based on the above, I would suggest the following classification of multithreading schemes:
- the classical implementation of runtime threads via OS threads;
- an implementation of coroutines in the N:1 or N:M model;
- low-level work with asynchronous I/O by manually registering callbacks and writing the corresponding spaghetti (do not forget to keep a hashmap for the contexts somewhere).
Mons's classification around HTTP request processing
Now to Mons's classification. As I understand it, it is built around the task of processing HTTP requests and uses the classical terminology of the Apache web server.
Apparently, a single process server is simply a synchronous server that can process only one request at a time.
A forking server is a server that spawns a separate process for each request being processed (see “efficiency”; in Linux, fork(2) uses the CoW mechanism, and without it this would be even worse).
A preforking server is a classic of the Apache world: worker processes are created in advance, in a given quantity, and the processing is still synchronous.
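To make the two models concrete, here is a sketch of a preforking echo server in C: the workers are forked ahead of time and all block in accept(2) on the same listening socket, while a forking server would instead call fork(2) after each accept. The worker count and port are arbitrary illustrations.

```c
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/wait.h>

#define WORKERS 4   /* created in advance, in a given quantity */

static void worker(int srv) {
    char buf[4096];
    for (;;) {
        int fd = accept(srv, NULL, NULL);   /* workers share the socket */
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0)  /* still synchronous */
            write(fd, buf, n);
        close(fd);
    }
}

int main(void) {
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                /* arbitrary example port */
    bind(srv, (struct sockaddr *)&addr, sizeof addr);
    listen(srv, 128);
    for (int i = 0; i < WORKERS; i++)
        if (fork() == 0) {                      /* prefork: fork first, */
            worker(srv);                        /* then accept          */
            _exit(0);
        }
    /* A forking server would instead fork() after accept(),
     * one child process per request. */
    for (;;) wait(NULL);                        /* master just reaps */
}
```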
We said above that callbacks are worse than coroutines. To feel the difference yourself, write the same code first with callbacks and then with coroutines, or just study the nginx source code. Why people would want to write callback-based code more than once in their lives, I do not know; they would do better to sign up for mountaineering or skydiving.
An async prefork is apparently an implementation of the N:M mechanism with M worker processes running.
I do not know what an async + worker is, because in the Apache world, as far as I remember, a worker differs from a prefork in that it spawns worker threads instead of worker processes (from the OS's point of view there is no difference; there is a difference from the point of view of shared state, and mutable shared state is the reason you first get depressed and then get fired).
What is multithreaded async? According to my classification (it is not really mine; lazy and sinful, I did not invent anything myself), this is again N:M; I do not know why there are three names for the same thing.
We never did define what “efficiency” is. It turned out we did not need to.
P.S. By the way, the talk was not delivered back then (it was not ready, although we wanted it very much), but we hope to hear it at HighLoad++ Junior this spring, where we will continue our discussion :)