Many articles, including ones on Habr, mention and even describe various ways of building the architecture of network services (daemons). However, few of their authors have real experience creating and optimizing daemons that handle tens of thousands of simultaneous connections and/or gigabit traffic.
Since most authors do not bother even to look into the documentation, such articles are usually based on rumors and retellings of rumors. These rumors roam the network, end up on Wikipedia, Habrahabr and other respected resources, and produce opuses like "You must be joking, Mr. Dahl, or why Node.js" (the author's punctuation is preserved): basically correct in essence, but riddled with inaccuracies, containing a number of factual errors, and depicting the subject from some incomprehensible angle.
It was difficult for me to pass by an article full of phrases like "efficient polling implementations today exist only in *nix systems" (as if poll() exists anywhere other than some *nix). This post began as a comment explaining to the esteemed inikulin the errors in his article. While writing it, it turned out to be easier to present the subject from the very beginning, which is what I am doing here as a separate post.
There are no sensational revelations or secret tricks in this essay; it simply describes the advantages and disadvantages of the different approaches, written by someone who has tested how it all works in practice on different operating systems.
For those who want the details - welcome under the cut.
1. Requirements for a network daemon
First you need to understand exactly what network services should do and what, in general, is the problem.
Any daemon must accept and process network connections. Since the TCP/IP protocol stack grew out of UNIX, and that OS professes the dogma "everything is a file", network connections are files of a special type that can be opened, closed, read and written with the standard OS functions for working with files. Note the modal verb "can": combined with the word "theoretically", it describes reality very accurately.
So, first of all, any daemon calls the system functions socket(), then bind(), then listen(), and in the end obtains a file of the special type "listening socket". The parameters of these functions and the daemon's further actions depend heavily on the transport protocol used (TCP, UDP, ICMP, RDP...), although in most OSs only the first two are really available. In this article, as an example, we consider the most popular protocol, TCP.
Although a listening socket is a file, all that can ever happen to it is a periodically occurring event of the kind "incoming connection request". The daemon can accept such a connection with the accept() function, which creates a new file, this time of the type "open TCP/IP network socket". The daemon is then expected to read the request from this connection, process it, and send the result back.
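A minimal sketch of the sequence just described, assuming IPv4 and TCP; error handling is reduced to perror()/exit() for brevity, and the port number 8080 is arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int listener = socket(AF_INET, SOCK_STREAM, 0);    /* TCP listening socket */
    if (listener < 0) { perror("socket"); exit(1); }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);          /* any local interface */
    addr.sin_port = htons(8080);                       /* arbitrary port */

    if (bind(listener, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }
    if (listen(listener, 128) < 0) {                   /* 128 = backlog length */
        perror("listen"); exit(1);
    }

    for (;;) {
        /* accept() creates a new file of type "open TCP/IP network socket" */
        int conn = accept(listener, NULL, NULL);
        if (conn < 0) { perror("accept"); continue; }
        /* ... read the request, process it, write the response ... */
        close(conn);
    }
}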
At the same time, a network socket is already a more or less normal file: although it was created in a not quite standard way, you can at least try to read from it and write to it. But there are significant differences from regular files located on a file system:
* All events happen truly asynchronously and take an unknown amount of time. Any operation can, in the worst case, take tens of minutes. Any operation at all.
* Connections, unlike files, can close "by themselves" at any, most unexpected, moment.
* The OS does not always report a closed connection; "dead" sockets can hang around for half an hour.
* Connections on the client and on the server are closed at different times. If the client tries to create a new connection and "resend" its data, the data may be duplicated, and if the client is written incorrectly, it may be lost. The server may also end up with several open connections from the same client.
* Data is treated as a stream of bytes and can literally arrive in portions of 1 byte. Therefore it cannot be read as, say, UTF-8 strings: a multi-byte character may arrive split across portions (see the sketch after this list).
* There are no buffers on the network other than those provided by the daemon itself. Therefore, writing even 1 byte to a socket can block the daemon for tens of minutes (see above). And since server memory is not infinite, the daemon must be able to limit the rate at which it generates results.
* Any error can happen anywhere, and the daemon must handle all of them correctly.
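As an illustration of the byte-stream point, here is a sketch of accumulating data until a complete newline-terminated message arrives. It deliberately issues one read() per byte to show the worst case; real code would read into a larger buffer and keep leftover bytes between calls:

#include <errno.h>
#include <unistd.h>

/* Returns the length of the received line, 0 on EOF, -1 on error.
 * If the line is longer than cap bytes, returns the first cap bytes. */
ssize_t read_line(int fd, char *buf, size_t cap)
{
    size_t got = 0;
    while (got < cap) {
        ssize_t n = read(fd, buf + got, 1);   /* worst case: 1 byte at a time */
        if (n == 0)  return 0;                /* peer closed the connection  */
        if (n < 0) {
            if (errno == EINTR) continue;     /* interrupted by a signal     */
            return -1;                        /* real error                  */
        }
        got += (size_t)n;
        if (buf[got - 1] == '\n') break;      /* full message accumulated    */
    }
    return (ssize_t)got;
}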
If you process all open connections in a naive loop, one "hung" connection will block all the others. Yes, for tens of minutes. Hence the various options for organizing the interaction between the daemon's modules. See the picture:
Disclaimer: the figure shows pseudocode that does not correspond to any real language. Many important system calls and all error-handling code are omitted for clarity.
2. Multi-process architecture
The simplest way to keep connections from affecting each other is to launch a separate process for each one (that is, a separate copy of your program). The disadvantage of this method is obvious: launching a separate process is a very resource-intensive operation. Yet most articles never explain why exactly this method is used in Apache, of all things.
The thing is that in all operating systems the process is the unit of accounting for system resources: memory, open files, access rights, quotas and so on. If you are writing a remote-access daemon for the operating system, such as a shell or FTP server, you simply must start a separate process on behalf of each logged-in user in order to account for file permissions correctly. Likewise, on a shared-hosting server, hundreds of different users' sites run simultaneously on the same "physical" port, and Apache needs processes precisely so that the sites of some hosting users cannot get at the data of others. Using processes does not actually hurt Apache's performance much:

The chart shows the number of requests per second served from a static file, depending on the Linux kernel version. More is better.
Test bed: Core i7 970, 3Gb DDR3, Nvidia GTX 460, 64GB OCZ Vertex SSD.
Source: Phoronix. May your daemons serve 17k files per second too.
Even if the users of your daemon are not registered in the operating system, but the service is subject to increased security requirements, allocating a separate process per user is a very reasonable architectural decision. It prevents malicious users from obtaining, or blocking access to, other users' data even if they find a bug in your daemon that allows reading someone else's data or simply crashes the handling process.
Finally, each process has its own address space, so different processes do not interfere with each other's use of memory. Why this is an advantage is explained in the next section.
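A sketch of the classic fork-per-connection scheme: the parent process only accepts connections, and each child serves exactly one client in its own address space, possibly under that user's credentials. handle_client() here is a hypothetical request handler, not a real API:

#include <signal.h>
#include <unistd.h>
#include <sys/socket.h>

extern void handle_client(int fd);       /* hypothetical request handler */

void serve_forever(int listener)
{
    signal(SIGCHLD, SIG_IGN);            /* let the kernel reap children   */
    for (;;) {
        int conn = accept(listener, NULL, NULL);
        if (conn < 0) continue;
        pid_t pid = fork();              /* expensive, but fully isolated  */
        if (pid == 0) {                  /* child                          */
            close(listener);             /* the child does not accept      */
            /* optionally: setuid() to the logged-in user here             */
            handle_client(conn);
            close(conn);
            _exit(0);
        }
        close(conn);                     /* parent keeps only the listener */
    }
}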
3. Multi-threaded architecture
Threads are lightweight "processes" that share memory, system resources and access rights but have separate stacks. This means threads share dynamic, global and static variables but have their own local ones. On multiprocessor and/or multi-core systems, different threads of one process can run physically simultaneously.
A multi-threaded architecture is like a multi-process one in which security and stability have been traded for performance and lower memory consumption.
The main advantage of a multi-threaded architecture, after performance, is the sequential, synchronous structure of the connection-handling algorithm. The algorithm looks, and executes, exactly as shown in the illustration in the first part. First data is read from the socket, taking as long as that takes; then it is processed, again for as long as the processing requires; then the results are sent to the client. Moreover, if you start pushing results out too quickly, the thread is automatically blocked in the write() function. The processing algorithm is simple and clear, at least at the top level. This is a very, very big plus.
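A sketch of this thread-per-connection scheme, under the same assumptions as the fork example above (handle_client() is again a hypothetical blocking handler):

#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>

extern void handle_client(int fd);       /* hypothetical request handler */

static void *connection_thread(void *arg)
{
    int conn = (int)(intptr_t)arg;
    handle_client(conn);                 /* blocking read -> process -> write */
    close(conn);
    return NULL;
}

void serve_forever(int listener)
{
    for (;;) {
        int conn = accept(listener, NULL, NULL);
        if (conn < 0) continue;
        pthread_t tid;
        if (pthread_create(&tid, NULL, connection_thread,
                           (void *)(intptr_t)conn) != 0) {
            close(conn);                 /* could not spawn a thread */
            continue;
        }
        pthread_detach(tid);             /* nobody will join it */
    }
}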
For a relatively small number of simultaneous connections, a multi-threaded architecture is an excellent choice. But if there are really many connections, say ten thousand, switching between threads starts to take too much time. But even this is not the main drawback of a multi-threaded architecture.
And the main thing is that the threads are not independent and can (and will) block each other. To understand how this happens, consider an example.
Suppose we need to calculate the value of an expression
a = b + c;
where a, b and c are global variables.
In a normal, single-threaded situation, the compiler will generate something like this machine code:
a = b; // MOV A, B
a += c; // ADD A, C
In a multi-threaded version this code cannot be used. Another thread may change the value of b between the first and second instructions, and we will end up with a wrong value of a. If somewhere else the code assumes that a always equals b + c, a very hard-to-reproduce intermittent bug will appear.
Therefore, in a multithreaded version, a code like this is used:
lock a;
lock b;
lock c;
a = b;
a += c;
unlock c;
unlock b;
unlock a;
where lock and unlock are atomic operations that lock and unlock a variable. They are designed so that if the variable is already locked by another thread, lock() waits until it is released.
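For comparison, here is how the same assignment is typically written with a real POSIX mutex. A single mutex guarding all three variables sidesteps the lock-ordering problem described just below:

#include <pthread.h>

static pthread_mutex_t abc_lock = PTHREAD_MUTEX_INITIALIZER;
static int a, b, c;                     /* shared between threads */

void update_a(void)
{
    pthread_mutex_lock(&abc_lock);      /* blocks if another thread holds it */
    a = b + c;                          /* reads b and c consistently        */
    pthread_mutex_unlock(&abc_lock);
}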
Thus, if two threads simultaneously start executing a = b + c and b = c + a, they can block each other forever. This situation is called a deadlock; finding and resolving deadlocks is a whole separate "sore subject" of parallel programming. But even without deadlocks, threads that do not release their locks quickly can stall each other for long stretches of time.
Moreover, atomic operations are physically implemented by exclusively locking the memory bus and the RAM itself. Working directly with memory instead of the cache is slow in itself, and on top of that it invalidates (flushes) the corresponding cache lines of all the cores of all the other processors in the server. So even in the best case, with no lock contention, each atomic operation takes quite a long time and degrades the performance of the other threads.
But wait: since connections in a daemon are almost independent, where would shared variables even come from?
Here is where:
* the shared queue of new connections;
* the shared queue of requests to a database or similar resources;
* the shared queue of memory-allocation requests (yes, malloc() and operator new can take locks);
* the shared log file and shared statistics-gathering objects.
These are only the most obvious.
In some cases shared variables can be avoided. For example, locks on the queue of new connections can be dropped if one thread is given the role of a "dispatcher" that distributes tasks in some clever way in advance. Sometimes special "lock-free" data structures can be applied. But in general, the deadlock problem in a multi-threaded architecture remains unsolved.
4. Non-blocking architecture
Ideally, the number of threads in an application should be of the same order of magnitude as the number of processor cores. One of the mechanisms for achieving this is non-blocking I/O.
Non-blocking I/O is simply a file-access mode that can be set on most modern operating systems. In the normal, "blocking" mode, the read function reads as many bytes from the file as the programmer ordered, and while the read is in progress it "puts to sleep" the thread that called it. In non-blocking mode, the same read function reads not from the file but from its kernel buffer, takes however many bytes are already there, and returns immediately, without putting any thread to sleep or blocking anything. If the buffer was empty, non-blocking read() reads nothing, sets the system error code to EWOULDBLOCK and returns -1 immediately. But it is still an ordinary synchronous call to an ordinary synchronous function.
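A sketch of switching a descriptor into this mode with fcntl() and handling the EWOULDBLOCK case (the -1/-2 return convention is purely an invention of this sketch):

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

void set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Returns: number of bytes read; 0 on EOF; -1 if no data is available yet;
 * -2 on a real error. */
ssize_t try_read(int fd, char *buf, size_t cap)
{
    ssize_t n = read(fd, buf, cap);       /* whatever the kernel buffer holds */
    if (n >= 0) return n;                 /* data, or 0 == EOF                */
    if (errno == EWOULDBLOCK || errno == EAGAIN)
        return -1;                        /* buffer empty right now           */
    return -2;                            /* real error                       */
}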
Some of the confusion, in particular in the English-language Wikipedia, where non-blocking synchronous I/O is called "asynchronous", was apparently caused by not-very-curious apologists of Linux. For quite a long time, right up to kernels 2.6.22-2.6.29, this OS simply had no asynchronous I/O functions at all (and even now the set is incomplete; in particular, there is no asynchronous fcntl), and some programmers who wrote only for this OS mistakenly called non-blocking synchronous functions "asynchronous", which can be traced through a number of old Linux manuals.
Asynchronous I / O is discussed in detail in the next section, and here we focus on the use of non-blocking read and write functions.
Under real conditions, 95% of non-blocking read() calls would return nothing but EWOULDBLOCK. To avoid these "idle" calls into the OS kernel, there is the select() function, which lets you ask the operating system to pick, from your list of connections, those that can already be read from and/or written to. Some *nix OSs have a variant of this function called poll().
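A sketch of such a select() loop over non-blocking sockets: the OS tells us which descriptors are readable, so we never make "idle" read() calls. Note that select() is limited to FD_SETSIZE descriptors (typically 1024), one of the limitations mentioned below:

#include <sys/select.h>
#include <unistd.h>

void poll_connections(const int *conns, int nconns)
{
    fd_set readable;
    FD_ZERO(&readable);
    int maxfd = -1;
    for (int i = 0; i < nconns; i++) {
        FD_SET(conns[i], &readable);
        if (conns[i] > maxfd) maxfd = conns[i];
    }

    struct timeval timeout = { 1, 0 };             /* wait at most 1 second */
    int ready = select(maxfd + 1, &readable, NULL, NULL, &timeout);
    if (ready <= 0) return;                        /* timeout or error      */

    for (int i = 0; i < nconns; i++)
        if (FD_ISSET(conns[i], &readable)) {
            char buf[4096];
            /* keep O_NONBLOCK set anyway: spurious wakeups are possible */
            ssize_t n = read(conns[i], buf, sizeof(buf));
            (void)n;  /* hand the data over to the processing queue */
        }
}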
A bit more about poll(): this function appeared as a requirement of the next version of the POSIX standard. In Linux, poll() was first implemented as a C standard-library function (libc >= 5.4.28) in the form of a wrapper over the ordinary select(), and only some time later "moved" into the kernel. Windows, for example, still has no normal poll(); since Vista there has been a certain palliative meant to simplify application migration, likewise implemented as a wrapper around select() in C.
I cannot resist sharing a graph showing where all these innovations lead. It plots the time to pump 10 GB of data through the loopback interface, depending on the kernel version. Less is better. Same source, same test bed.

In any case, although select() has certain limitations (in particular, on the number of files in one request), using this function together with non-blocking I/O mode is a way to hand all the work over to the operating system and simply process your data. In most cases the non-blocking I/O thread will consume only a couple of percent of one core's computing power, while the operating system's internal kernel threads, which do all the real work, will "eat" ten times more.
Let's go back to reducing the number of threads.
So, in the daemon there are a large number of objects of the "connection" class, and there is a set of operations that need to be applied to each connection object, and in the necessary order.
In a multi-threaded architecture, a separate thread is created for each connection object, and operations are naturally performed in blocking mode in the correct sequence.
In a non-blocking I/O architecture, a thread is created per operation and is applied sequentially to different objects. This somewhat resembles SIMD instructions like MMX and SSE: one instruction is applied to several objects at once. To maintain the required order of operations (that is, first compute the result, then send it), job queues are created between the threads in the process's shared memory. The queues are usually built on ring buffers, which in this case can be implemented in a "lock-free" manner.
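A sketch of such a lock-free job queue for the common single-producer/single-consumer case (one reader thread feeding one processing thread), using C11 atomics; the capacity must be a power of two:

#include <stdatomic.h>
#include <stdbool.h>

#define QCAP 1024                       /* power of two */

struct spsc_queue {
    void *slots[QCAP];
    _Atomic unsigned head;              /* advanced only by the consumer */
    _Atomic unsigned tail;              /* advanced only by the producer */
};

bool spsc_push(struct spsc_queue *q, void *job)    /* producer side */
{
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QCAP) return false;         /* queue full */
    q->slots[tail % QCAP] = job;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

void *spsc_pop(struct spsc_queue *q)               /* consumer side */
{
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail) return NULL;                 /* queue empty */
    void *job = q->slots[head % QCAP];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return job;
}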
In a real network service, between reading the request and sending the result there will be a fairly complex, branching processing algorithm, possibly involving calls to an application server, a DBMS or other "heavy" operations, plus a full set of conditionals, loops and error handling at every step. Splitting all of that into an unknown number of simultaneously running threads, and doing it so that the load on the processor cores comes out roughly equal, is the highest level of developer skill, demanding virtuosity in every aspect of systems programming. In most cases people do something much simpler: they wrap everything between read() and write() into a separate thread and run N copies of that thread, where N is the number of cores. And then they invent crutches against the parallel threads deadlocking, competing for resources, "killing" the DBMS, and so on.
5. Asynchronous I / O
If you do not understand the difference between asynchronous and synchronous functions, then for simplicity you can assume that an asynchronous function runs in parallel with, and simultaneously to, the program that called it, for example on a neighboring core. On the one hand, the calling program does not need to wait for the computation to finish and can do something useful. On the other hand, when the results of the asynchronous function are ready, the calling program must somehow be notified. How exactly this notification happens varies greatly between operating systems.
Historically, Windows 2000 was one of the first OSs that support asynchronous I / O.
A typical use case was this: a single-threaded application (there were no multi-core processors back then) loads, say, a large file for tens of seconds. Instead of the frozen interface and the hourglass cursor you would get with a synchronous read(), in the asynchronous version the main program thread does not "hang", and you can draw a nice progress bar showing the loading process, plus a "cancel" button.
To make the progress bar possible, a special OVERLAPPED structure is passed into the Windows asynchronous I/O functions, in which the OS records the current number of bytes transferred. The programmer can read the contents of this structure at any convenient moment: in the main message loop, on a timer, and so on. At the end of the operation, its final result is recorded in the same structure (total number of bytes transferred, error code if any, etc.).
In addition to this structure, you can pass your own callback functions to the asynchronous I / O functions that take a pointer to OVERLAPPED, which will be called by the operating system at the end of the operation.
A real, honest asynchronous invocation of a callback, interrupting the program at any arbitrary point, would be indistinguishable from launching a second thread of execution on the same core. Accordingly, you would have to either write the callbacks extremely carefully or apply the full set of "multi-threaded" rules about locking access to shared data, which, you will agree, is very strange in a single-threaded application. To avoid such potential errors in single-threaded applications, Windows places pending callbacks in a queue, and the programmer must explicitly mark the places in the program where it may be interrupted to run them (the WaitFor*Object family of functions in their alertable variants).
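A sketch of such a call, assuming the file handle was opened with FILE_FLAG_OVERLAPPED; the completion routine fires only inside an alertable wait (here SleepEx), i.e. only where the program explicitly allows it:

#include <windows.h>
#include <stdio.h>

static char buf[4096];

static VOID CALLBACK on_read_done(DWORD err, DWORD bytes, LPOVERLAPPED ov)
{
    printf("read finished: error=%lu, bytes=%lu\n", err, bytes);
}

void async_read(HANDLE file)
{
    OVERLAPPED ov = {0};                 /* offset 0, no event */
    if (!ReadFileEx(file, buf, sizeof(buf), &ov, on_read_done)) {
        printf("ReadFileEx failed: %lu\n", GetLastError());
        return;
    }
    /* ... do something useful here, then let queued callbacks run: */
    SleepEx(INFINITE, TRUE);             /* TRUE = alertable wait */
}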
The asynchronous I/O scheme described above is native to the Windows NT kernel, i.e. all other I/O modes are implemented through it in one way or another. Its full name is IOCP (Input/Output Completion Port). It is believed that this scheme theoretically lets you squeeze the maximum performance out of the hardware. Any daemon intended for serious work under Windows should be built on top of IOCP. For details, see the introduction to IOCP in MSDN.
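A sketch of the IOCP worker loop: a completion port (created beforehand with CreateIoCompletionPort() and associated with each socket handle) collects the results of all overlapped operations, and N worker threads, one per core, pull completions from it:

#include <windows.h>

DWORD WINAPI iocp_worker(LPVOID arg)
{
    HANDLE iocp = (HANDLE)arg;           /* created by CreateIoCompletionPort */
    for (;;) {
        DWORD bytes;
        ULONG_PTR key;                   /* per-connection value we attached  */
        LPOVERLAPPED ov;
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE)) {
            if (ov == NULL) break;       /* the port itself was closed        */
            continue;                    /* the operation behind ov failed    */
        }
        /* advance the connection's state machine using key, ov and bytes */
    }
    return 0;
}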
In Linux, instead of the normal OVERLAPPED structure there is a weaker analogue, struct aiocb, which lets you detect only the fact that an operation has completed, not its current progress. Instead of user callbacks, the kernel uses UNIX signals (yes, the kind delivered by kill). The signals arrive completely asynchronously, with all the ensuing consequences; but if you do not feel like a guru at writing reentrant functions, you can create a file of a special type (signalfd) and read information about incoming signals from it with the ordinary synchronous I/O functions, including non-blocking ones. For details, see man aio.h.
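A sketch of a POSIX AIO read (on older glibc you may need to link with -lrt); the busy-wait at the end stands in for the signal or signalfd notification described above:

#include <aio.h>
#include <errno.h>
#include <string.h>

static char buf[4096];

int start_async_read(int fd)
{
    static struct aiocb cb;              /* must outlive the operation */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) return -1;    /* could not queue the request */

    /* ... do something useful here ... */

    while (aio_error(&cb) == EINPROGRESS)
        ;                                /* polling; real code would use
                                            signals or signalfd instead */
    return (int)aio_return(&cb);         /* bytes read, or -1 on error */
}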
The use of asynchronous I/O imposes no restrictions on the daemon's architecture; in theory it can be anything at all. In practice, though, several worker threads are usually created (matching the number of processor cores), and the serviced connections are distributed evenly between them. For each connection, a finite state machine (FSM) is designed and programmed; incoming events (callback invocations and/or errors) move this automaton from one state to another.
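A minimal sketch of such a per-connection state machine; the state names and the on_io_event() entry point are inventions of this example:

enum conn_state { READING_REQUEST, PROCESSING, WRITING_REPLY, CLOSED };

struct connection {
    int fd;
    enum conn_state state;
    /* buffers, offsets, parser state ... */
};

/* Called whenever an asynchronous operation on this connection completes. */
void on_io_event(struct connection *c, int error)
{
    if (error) { c->state = CLOSED; return; }
    switch (c->state) {
    case READING_REQUEST:                /* the whole request has arrived */
        c->state = PROCESSING;
        /* hand the request to a worker / application server */
        break;
    case PROCESSING:                     /* the result is ready */
        c->state = WRITING_REPLY;
        /* start an asynchronous write of the reply */
        break;
    case WRITING_REPLY:                  /* the reply has been sent */
        c->state = CLOSED;
        break;
    case CLOSED:
        break;
    }
}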
Summary
As we have seen, each method has its advantages, disadvantages and areas of application. If you need security, use processes; if speed under high load matters, non-blocking I/O; if development speed and code clarity matter, a multi-threaded architecture will do. Asynchronous I/O is the primary method on Windows. In any case, do not try to write the I/O handling code yourself. The network is full of free, ready-made libraries for every architecture and operating system, polished over decades almost to a shine. Almost - because in your particular case you will still have to tweak, file down and adapt something to your conditions. The Internet is a complicated place; there are no universal solutions for every occasion.
Be that as it may, daemons do not live by I/O alone, and far trickier complications arise during request processing. But that is a topic for another article, if anyone is interested.
Links
1. The C10K problem (English), thanks for the tip to o_O_Tync
2. The libev documentation, with a tasty section describing the various mass I/O mechanisms (English), thanks for the tip to saterenko
3. The FAQ of the ru.unix.prog echo conference
4. Introduction to the unix library libaio (English)
5. Testing the performance of kernels from 2.6.12 to 2.6.37 (English)