
Poorly Documented Linux Features

Having caught her breath, she said:
"How long I have slept!"
Once, when I first met Unix, I was fascinated by the logical harmony and completeness of the system. For a few years after that I fiercely studied the kernel internals and system calls, reading everything I could get my hands on. Little by little the passion faded, there were more pressing matters, and then, from a certain point on, I began to discover one feature after another that I had not known about before. The process is natural, but too often such incidents have one thing in common: the absence of an authoritative source of documentation. Often the answer comes in the form of the third-highest-voted comment on Stack Overflow; often you have to put two or three sources together to get the answer to exactly the question you asked. I want to collect here a small set of such poorly documented features. None of them is brand new, some are not even that recent, but for each of them I once killed several hours, and often I still do not know of a systematic description.

All examples are related to Linux, although many of them are valid for other *nix systems. I simply took the most actively developed OS as a basis, and besides, it is the one I have in front of me, where I can quickly check the code in question.

Please note that the title says "poorly documented" rather than "obscure", so if you know of links to coherent documentation on any of these, I will gladly add them from the comments.

Does freed memory return to the OS?


This question, asked by a colleague I respect, served as the trigger for this publication. For half an hour afterwards I dragged him through the mud and showered him with unflattering epithets, explaining what the classics taught us long ago: memory in Unix is allocated through the sbrk() system call, which simply raises the upper limit of available addresses; it is usually requested in large chunks; it is of course technically possible to lower the limit and give memory back to the OS for other processes, but it is far too expensive for the allocator to keep track of every used and unused fragment, so returning memory is not provided for by design. This classic mechanism works fine in the vast majority of cases; the exception is a server that sits quietly idle for hours or months, suddenly requests a lot of pages to handle some event, and then quietly falls asleep again (but in that case the swap helps out). Then, my ego satisfied, I went, as an honest person, to confirm my opinion on the Internet and was surprised to find that Linux, starting from 2.4, can use both sbrk() and mmap() to allocate memory, depending on the requested size, and that memory allocated via mmap() is fully returned to the OS after free()/delete. After such a blow I had only two things left to do: humbly apologize and find out what exactly that mysterious threshold is equal to. Since I found no information, I had to measure it by hand. It turned out that on my system (3.13.0) it is only about 120 kilobytes. A few lines of code for those who want to try it themselves are here.
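Since that link may not be at hand, below is a minimal sketch of the same idea (my own reconstruction, not the author's code): it watches the process's VmRSS around an allocation, once below and once above the presumed threshold. For reference, glibc's mallopt(3) documents the default M_MMAP_THRESHOLD as 128 KiB, which is in the same ballpark as the measured figure.

// Build with: g++ -O0 probe_rss.cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Read the resident set size of this process from /proc/self/status, in kB.
static long vm_rss_kb() {
    std::FILE* f = std::fopen("/proc/self/status", "r");
    char line[256];
    long kb = -1;
    while (f && std::fgets(line, sizeof line, f)) {
        if (std::sscanf(line, "VmRSS: %ld kB", &kb) == 1) break;
    }
    if (f) std::fclose(f);
    return kb;
}

static void probe(std::size_t bytes) {
    long before = vm_rss_kb();
    char* p = static_cast<char*>(std::malloc(bytes));
    std::memset(p, 1, bytes);            // touch the pages so they are really mapped
    long held = vm_rss_kb();
    std::free(p);
    long after = vm_rss_kb();
    std::printf("%9zu bytes: VmRSS %ld -> %ld -> %ld kB\n", bytes, before, held, after);
}

int main() {
    probe(64 * 1024);        // below the threshold: RSS usually stays up after free()
    probe(4 * 1024 * 1024);  // well above it: the mmap()ed block is unmapped, RSS drops back
}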

What is the minimum interval a process/thread can sleep?


The same Maurice Bach taught us: the process scheduler in the kernel is activated by any interrupt; having received control, the scheduler walks the list of sleeping processes and moves those that have woken up (received the requested data from a file or socket, had their sleep() interval expire, and so on) to the "ready to run" list, and then returns from the interrupt into the current process. On a system-timer interrupt, which used to happen once every 100 ms and later, as CPUs got faster, every 10 ms, the scheduler additionally puts the current process at the end of the "ready to run" list and starts the first process from the head of that list. So if I called sleep(0), or fell asleep for an instant for any other reason so that my process was moved off the CPU and back onto the "ready to run" list, it had no chance of running again within 10 ms, even if it was the only process in the system. In principle the kernel can be rebuilt with a smaller tick interval, but that causes unjustifiably high CPU overhead, so it is not an option. For many years this well-known limitation poisoned the life of developers of fast-response systems, and it is exactly what strongly stimulated the development of real-time systems and non-blocking (lock-free) algorithms.

And then one day I repeated this experiment (I was actually interested in subtler things, like the probability distribution) and suddenly saw that the process wakes up after sleep(0) in about 40 µs, 250 times faster. The same holds after yield(), std::mutex::lock() and all other blocking calls. What is going on?!
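Here is roughly how such a measurement can be reproduced (a minimal sketch, not the author's original code): it times sched_yield() in a tight loop. On an otherwise idle core this mostly shows the bare cost of entering the scheduler; with competing runnable threads it shows the actual reschedule latency discussed above.

// Build with: g++ yield_latency.cpp
#include <cstdio>
#include <ctime>
#include <sched.h>

static long long now_ns() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main() {
    const int iters = 100000;
    long long total = 0, worst = 0;
    for (int i = 0; i < iters; ++i) {
        long long t0 = now_ns();
        sched_yield();                   // voluntarily give up the CPU
        long long dt = now_ns() - t0;
        total += dt;
        if (dt > worst) worst = dt;
    }
    std::printf("sched_yield(): avg %lld ns, worst %lld ns\n", total / iters, worst);
}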

The search quickly led to the Completely Fair Scheduler, introduced in 2.6.23, but for a long time I could not understand exactly how this mechanism leads to such a fast switch. As I finally figured out, the difference lies precisely in the algorithm of the default scheduling class, the one under which all processes run unless told otherwise. In contrast to the classic implementation, here every running process/thread has a dynamic priority, so the priority of the currently running process gradually decreases relative to the others waiting to execute. This lets the scheduler decide to start another process immediately, without waiting for the end of a fixed interval, and the algorithm for picking the next process is now O(1), much simpler, and can be run more often.

This change has surprisingly far-reaching consequences: in effect the gap between real-time and conventional systems has almost disappeared, and the resulting delay of about 40 microseconds really is small enough for most applications. The same can be said about non-blocking algorithms: classic blocking data structures built on mutexes have become very competitive.

And what about all those scheduler classes (scheduling policies)?


This topic is more or less described, so I will not repeat it here; nevertheless, open the one and the other authoritative book at the corresponding pages and compare them with each other. In places they repeat each other almost word for word, and both show some discrepancies with what man -s2 sched_setscheduler says. A telling symptom, that.

So let's just play with the code instead. I create several threads with different priorities, park them all on a mutex and then wake them all at once. I naturally expect them to wake up in strict accordance with their priority:

iBolit# ./sche -d0 -i0 -b0 -f1 -r2 -f3 -i0 -i0 -i0 -d0
6 SCHED_FIFO[3]
5 SCHED_RR[2]
4 SCHED_FIFO[1]
1 SCHED_OTHER[0]
2 SCHED_IDLE[0]
3 SCHED_BATCH[0]
7 SCHED_IDLE[0]
8 SCHED_IDLE[0]
9 SCHED_IDLE[0]
10 SCHED_OTHER[0]

The number at the beginning of each line is the order in which the thread was created. As you can see, the two real-time classes SCHED_FIFO and SCHED_RR always take precedence over the three regular classes SCHED_OTHER, SCHED_BATCH and SCHED_IDLE, and among themselves they are ranked strictly by priority, which is exactly what was required. But the fact that, for example, all three regular classes are equal at the start is not mentioned anywhere at all: even SCHED_IDLE, which is heavily restricted compared to the default SCHED_OTHER, runs ahead of it if it happens to be first in the queue on the mutex. Well, at least in general everything works, but
in Solaris there is a gaping hole in this very spot
Several years ago I ran this test on Solaris and found that thread priorities are completely ignored: threads wake up in an entirely arbitrary order. I contacted Sun technical support about it, but got a surprisingly unintelligible and empty answer (before that they had cooperated with us quite willingly). Two weeks later Sun was gone. I sincerely hope it was not my request that caused it.

For those who want to play with priorities and classes, the source code is here as well.
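For reference, a minimal sketch (not that linked test program itself) of how one such thread is created with an explicit policy through the pthreads API. SCHED_FIFO and SCHED_RR normally require root or an appropriate RLIMIT_RTPRIO, and the program has to be built with -pthread.

// Build with: g++ -pthread fifo_thread.cpp
#include <cstdio>
#include <pthread.h>
#include <sched.h>

static void* worker(void*) {
    int policy = 0;
    sched_param sp{};
    pthread_getschedparam(pthread_self(), &policy, &sp);
    std::printf("policy %d, priority %d\n", policy, sp.sched_priority);
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED); // do not just inherit SCHED_OTHER
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);              // one of the two real-time classes
    sched_param sp{};
    sp.sched_priority = 3;                                       // 1..99 for SCHED_FIFO/SCHED_RR
    pthread_attr_setschedparam(&attr, &sp);

    pthread_t t;
    int rc = pthread_create(&t, &attr, worker, nullptr);
    if (rc != 0)
        std::fprintf(stderr, "pthread_create failed: %d (real-time classes need root or RLIMIT_RTPRIO)\n", rc);
    else
        pthread_join(t, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}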

Delayed TCP packets


If the previous examples can be counted as pleasant surprises, this one can hardly be called pleasant.
The story began several years ago, when we suddenly discovered that one of our servers, which sends clients a continuous stream of data, was experiencing periodic delays of 40 milliseconds. It did not happen often, but we could not afford that luxury, so a ritual dance with a sniffer and subsequent analysis followed. Attention: in online discussions this problem is usually attributed to the Nagle algorithm. Incorrectly: according to our results, the problem arises on Linux from the interaction of delayed ACK and slow start. Let's turn to another classic, Richard Stevens, to refresh our memory.
Delayed ACK is an algorithm that postpones sending the ACK for a received packet by several tens of milliseconds, in the expectation that a response packet will be sent shortly and the ACK can be piggybacked on it, with the obvious goal of reducing the amount of empty datagrams on the network. This mechanism operates in interactive TCP sessions, and in 1994, when TCP/IP Illustrated came out, it was already a standard part of the TCP/IP stack. What is important for what follows is that the delay can be cut short, in particular, by the arrival of the next data packet, in which case a cumulative ACK for both segments is sent immediately.
Slow start is an equally old algorithm designed to protect intermediate routers from an overly aggressive sender. At the beginning of a session the sender may send only one packet and must wait for an ACK from the receiver, after which it may send two, four and so on, until other control mechanisms take over. This mechanism obviously operates on bulk traffic and, importantly, it kicks in at the start of the session and after every forced retransmission of a lost datagram.
TCP sessions can be divided into two large classes: interactive (such as telnet) and bulk (such as ftp). It is easy to see that the requirements on the traffic-control algorithms in these two cases are often opposite; in particular, "delay the ACK" and "wait for the ACK" obviously contradict each other. In a steady TCP session the condition mentioned above saves the day: the arrival of the next packet cuts the delay short and the ACK for both segments is sent without waiting for an accompanying data packet. However, if one of the packets suddenly gets lost, the sending side immediately enters slow start: it sends a single datagram and waits for a response; the receiving side receives a single datagram and delays the ACK; since no data is being sent in reply, the whole exchange hangs for 40 ms. Voilà.
The effect occurs specifically on Linux-to-Linux TCP connections; I have not seen it with other systems, so apparently it is something in their implementation. And how do you fight it? Well, in principle Linux offers the (non-standard) TCP_QUICKACK option, which disables delayed ACK, but the option is not sticky: it gets switched off automatically, so you have to re-arm the flag before every read()/write(). There are also the /proc/sys/net/ipv4 knobs, in particular /proc/sys/net/ipv4/tcp_low_latency, but whether it actually does what I suspect it should do is unknown. Besides, such a switch applies to every TCP connection on the machine, which is not good.
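For illustration, a minimal sketch of the TCP_QUICKACK workaround just mentioned (Linux-specific; the socket setup around it is assumed, not shown):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Re-arm TCP_QUICKACK before every read, because the kernel is free to clear
// the option again at any moment; 'sock' is assumed to be a connected TCP socket.
ssize_t read_with_quickack(int sock, void* buf, size_t len) {
    int one = 1;
    setsockopt(sock, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof one);
    return read(sock, buf, len);
}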
What are the suggestions?

From the mists of time


And finally, the very first incident of this kind in the history of Linux, just to complete the picture.
From the very beginning Linux had a non-standard system call, clone(). It works like fork(), that is, it creates a copy of the current process, but the address space remains shared. It is not hard to guess what it was invented for, and indeed this elegant solution immediately pushed Linux into the front ranks of operating systems for multithreading. However, as always, there is one nuance...

The thing is that when the process is cloned, all its file descriptors, including sockets, are cloned too. The previously well-worn scheme went like this: a socket is opened and handed to other threads, everyone cooperates in sending and receiving data, at some point one of the threads decides to close the socket, all the others immediately see that the socket is closed, and the other end of the connection (in the case of TCP) sees it too. What happens now? If one of the threads decides to close its socket, the other threads know nothing about it, since they are actually separate processes with their own copies of this descriptor, and they keep working. Moreover, the other end of the connection also considers the connection open. It is a thing of the past, but at the time this innovation broke the mental model of many network programmers, and quite a bit of code had to be rewritten for Linux.
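To make the point concrete, here is a minimal sketch on a modern kernel (my own illustration, not the historical code in question): the child created by clone() shares memory with the parent, but without CLONE_FILES it gets its own copy of the descriptor table, so its close() does not affect the parent's descriptor.

// Build with: g++ clone_fds.cpp  (g++ defines _GNU_SOURCE, needed for clone())
#include <cerrno>
#include <csignal>
#include <cstdio>
#include <cstring>
#include <sched.h>
#include <sys/wait.h>
#include <unistd.h>

static int fds[2];                     // a pipe standing in for the shared socket

static int child(void*) {
    close(fds[1]);                     // the clone closes "its" descriptor...
    return 0;
}

int main() {
    pipe(fds);
    static char stack[64 * 1024];
    // CLONE_VM shares the address space (the original appeal of clone()), but
    // without CLONE_FILES each side keeps its own copy of the descriptor table.
    pid_t pid = clone(child, stack + sizeof stack, CLONE_VM | SIGCHLD, nullptr);
    waitpid(pid, nullptr, 0);
    // ...yet the parent's copy of the write end is still perfectly usable.
    ssize_t n = write(fds[1], "x", 1);
    std::printf("write() after the clone's close(): %zd (%s)\n",
                n, n == 1 ? "still open here" : std::strerror(errno));
    return 0;
}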

Literature


  1. Maurice J. Bach. The Design of the UNIX Operating System.
  2. Robert Love. Linux Kernel Development.
  3. Daniel P. Bovet, Marco Cesati. Understanding the Linux Kernel.
  4. W. Richard Stevens. TCP/IP Illustrated, Volume 1: The Protocols.
  5. W. Richard Stevens. UNIX Network Programming.
  6. W. Richard Stevens. Advanced Programming in the UNIX Environment.
  7. Uresh Vahalia. UNIX Internals: The New Frontiers.

Your link on the topics covered could have been here.




And I really do wonder how long I have been asleep after all, and how far behind the times I am. Let me add a small poll.

Source: https://habr.com/ru/post/253811/

