
Debugging Your OS: A Lesson in Memory Allocation


It all started, as many investigations do, with a bug report.

The title of the report was simple enough: "With an HTTPS connection, iter_content works slowly with large chunks." That title immediately set off a siren in my head, for two reasons. First, it is hard to pin down what "slowly" means here: How slow? How large is "large"? Second, if the described effect were really severe, we would already know about it. The iter_content method has been in Requests for a long time, and if it were significantly slow in a common usage mode, that information would not have passed us by.

I skimmed the report. The author provided few details, but wrote this: "This leads to 100% CPU load and reduces network throughput below 1 MB/s." I seized on that phrase, because it could not possibly be true. Simple downloads with minimal processing are never that slow!

However, every bug report deserves investigation before it is refuted. After talking with the author of the report, I was able to reconstruct a script that reproduces the bug: if you use Requests with PyOpenSSL and run the following code, the CPU is fully loaded and network throughput drops to a trickle:
    import requests

    https = requests.get(
        "https://az792536.vo.msecnd.net/vms/VMBuild_20161102/VirtualBox/MSEdge/MSEdge.Win10_preview.VirtualBox.zip",
        stream=True,
    )
    for content in https.iter_content(100 * 2 ** 20):  # 100 MB chunks
        pass

This is a wonderful, reproducible script, and it points squarely at the Requests stack. No user-supplied code runs here: everything is part of the Requests library or one of its dependencies, so there is no chance the reporter simply wrote slow code. Truly fantastic. Even better, it uses a public URL, which meant I could run the script myself! And when I did, I hit the bug. Every single time.

Here was another lovely detail:

At 10 MB chunks, there is no increase in CPU load and no effect on throughput. At 1 GB, the CPU is loaded at 100%, just as at 100 MB, but throughput falls below 100 KB/s, versus 1 MB/s at 100 MB.

This is a very interesting observation: it hints that the literal value of the chunk size affects the workload. Considering that this happens only with PyOpenSSL, and that the stack spends most of its time in the line below, the problem becomes clearer:

    File "/home/user/.local/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1299, in recv
      buf = _ffi.new("char[]", bufsiz)

Investigation showed that CFFI's default behavior for FFI.new is to return zeroed memory. This meant overhead that grew linearly with the size of the allocation: bigger buffers took longer to zero. Hence the bad behavior with large allocations. We used CFFI's facility for disabling the zeroing of these buffers, and the problem went away¹. Problem solved, right?

Wrong.

The real bug


Joking aside, the change really did fix the problem. But a few days later I was asked a very perceptive question: why was the memory being actively zeroed at all? To make sense of the question, let's digress and talk about memory allocation in POSIX systems.

malloc and calloc and vmalloc, oh my!


Most programmers know the standard way to request memory from the operating system: the malloc function from the standard C library (you can read your OS's documentation for it by typing man 3 malloc). It takes one argument, the number of bytes of memory to allocate. The standard C library allocates that memory using one of several techniques, but one way or another it returns a pointer to a chunk of memory at least as large as the amount you requested.

By default, malloc returns uninitialized memory. That is, the standard C library allocates a chunk and hands it to your program immediately, without touching the data already in it. This means that, with malloc, your program can be handed a buffer it has previously written data into. This is a common source of bugs in memory-unsafe languages such as C. In general, reading uninitialized memory is very risky.
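
To make the danger concrete, here is a minimal C sketch (whether the allocator actually reuses the chunk is not guaranteed, which is precisely the point):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        char *first = malloc(64);
        if (first == NULL) return 1;
        strcpy(first, "secret data");
        free(first);

        /* The allocator may hand the very same chunk back to us. */
        char *second = malloc(64);
        if (second == first) {
            /* Reading uninitialized memory: the old contents may still be
               here. Formally this is undefined behavior, which is exactly
               why it is so dangerous. */
            printf("reused chunk contains: %s\n", second);
        }
        free(second);
        return 0;
    }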

However, malloc has a friend documented on the same manual page: calloc. Its main difference is that it takes two arguments: a count and a size. With malloc, you ask the standard C library: "Please allocate at least n bytes for me." With calloc, you ask: "Please allocate enough memory for n objects of m bytes each." The original intent of calloc was to safely allocate heap memory for arrays of objects².
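
As a sketch of the difference between the two calls (the point_t struct and the element count are illustrative):

    #include <stdlib.h>

    typedef struct {
        double x, y;
    } point_t;

    int main(void) {
        size_t n = 1000;

        /* With malloc you do the size arithmetic yourself; the
           multiplication can silently overflow (see note 2), and the
           contents are uninitialized. */
        point_t *a = malloc(n * sizeof(point_t));

        /* With calloc the library does the multiplication, checks it
           for overflow, and returns memory filled with zero bytes. */
        point_t *b = calloc(n, sizeof(point_t));

        free(a);
        free(b);
        return 0;
    }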

But calloc has a side effect tied to that original array-allocating purpose, one mentioned very modestly in the manual:

The allocated memory is filled with zero bytes.

This goes hand in hand with calloc's purpose. For example, if you are allocating an array of values, it is often very useful for the array to start out in a default state. Some modern memory-safe languages have even made this the standard behavior when creating arrays and structures. Say, when you initialize a struct in Go, all of its members default to their so-called "zero" values, which are "the values everything would have if memory were set to zero." You can think of this as a promise that all Go structures are allocated with calloc³.

So malloc returns uninitialized memory, while calloc returns initialized memory. Given that, and given calloc's stricter promises, the operating system can optimize it. Indeed, many modern operating systems do.

Using calloc


Of course, the simplest way to implement calloc is to write something like:

    void *calloc(size_t count, size_t size) {
        assert(!multiplication_would_overflow(count, size));

        size_t allocation_size = count * size;
        void *allocation = malloc(allocation_size);
        memset(allocation, 0, allocation_size);
        return allocation;
    }

The cost of this function grows roughly linearly with the size of the allocation: the more bytes, the more expensive it is to zero them all. Most operating systems today ship standard C libraries with optimized paths for memset (typically using specialized vector instructions that zero many bytes in a single instruction). But the cost is still linear.
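
A quick way to see the linear cost is a timing sketch like this one (the sizes and timing method are illustrative; the first memset of each buffer also pays for the page faults that back it):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        for (size_t mb = 64; mb <= 512; mb *= 2) {
            size_t size = mb * 1024 * 1024;
            char *buf = malloc(size);
            if (buf == NULL) return 1;

            clock_t start = clock();
            memset(buf, 0, size);
            double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

            /* Expect the time to roughly double as the size doubles. */
            printf("%4zu MB: %.3f s\n", mb, secs);
            free(buf);
        }
        return 0;
    }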

For large allocations, operating systems use a different trick, based on virtual memory.

Virtual memory


We won't dissect the entire structure and operation of virtual memory here, but I highly recommend reading up on it (the topic is fascinating!). In short: virtual memory is a lie the OS kernel tells processes about the memory available to them. Each running process has its own view of memory that belongs to it and it alone. That view is indirectly "mapped" onto physical memory.

This lets the OS pull off all kinds of clever tricks. The most common is the memory-mapped file: a file that the OS presents to the program as if it were memory. Memory-mapped files are used both to page the contents of memory out to disk and to map files into memory. In the latter case, the program asks the OS: "Please allocate n bytes of memory backed by this file on disk, so that when I write to that memory the writes go to the file, and when I read from it the data comes from the file."
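
On a POSIX system this request is made with mmap. A minimal sketch, assuming a file named data.bin exists:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);  /* hypothetical input file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Ask the OS to present the file's contents as ordinary memory. */
        char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* This read is served from the file, via a page fault. */
        printf("first byte: 0x%02x\n", (unsigned char)data[0]);

        munmap(data, (size_t)st.st_size);
        close(fd);
        return 0;
    }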

At the kernel level it works like this: when the process first touches such memory, the CPU notices that the memory does not actually exist, pauses the process, and raises a "page fault." The kernel then loads the actual data into memory so the application can read it. The process is unpaused and magically finds the data in the right place. From the process's point of view, it all happened instantly, with no pause at all.

The same mechanism can be used for other subtle tricks. One of them is "free" allocation of very large amounts of memory. Or, more precisely, making the cost of an allocation proportional to how much of the memory is actually used, rather than to how much is allocated.

Historically, many programs that needed sizable chunks of memory at runtime would create one large buffer at startup and then parcel it out internally over the program's lifetime. They were written for environments without virtual memory: they had to grab memory immediately or risk not getting it later. With virtual memory, that behavior is no longer necessary: each program can allocate as much memory as it needs without tearing a piece out of anyone else's mouth⁴.

To avoid those high startup costs, operating systems began to lie to applications. On most operating systems, if you try to allocate more than 128 KB in a single call, the standard C library asks the OS directly for brand-new virtual memory pages covering the requested amount. The key point: this allocation costs almost nothing. The OS doesn't actually do anything at that moment: it only reconfigures the virtual memory layout. So the cost of the malloc call is tiny.
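
As a rough sketch of what the standard C library does under the hood for such an allocation (assuming a POSIX system; on some platforms MAP_ANONYMOUS is spelled MAP_ANON, and the real library logic is more involved):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t size = 256 * 1024 * 1024;  /* well above the ~128 KB threshold */

        /* Ask the kernel for brand-new anonymous pages. The kernel only
           updates its virtual memory bookkeeping here; no physical pages
           are assigned until the memory is actually touched. */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("got %zu bytes at %p, none of them backed yet\n", size, p);
        munmap(p, size);
        return 0;
    }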

At this point no memory has actually been "given" to the process, and as soon as the application tries to use it, a page fault occurs. The OS steps in, finds an appropriate page, and places it where the process is looking, just as with a page fault on a memory-mapped file. The only difference is that the memory is backed by physical memory rather than by a file.

As a result, if you call malloc(1024 * 1024 * 1024) to allocate 1 GB of memory, it returns almost instantly, because no memory has actually been given to the process. Programs can thus "allocate" many gigabytes in a blink, even though actually providing that memory would take far longer.
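
A small sketch makes the distinction visible: reserving a gigabyte returns almost immediately, while touching every page is what costs real time (the timings are illustrative, not a benchmark):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t size = (size_t)1024 * 1024 * 1024;  /* 1 GB */

        clock_t start = clock();
        char *buf = malloc(size);
        printf("malloc:      %.4f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);
        if (buf == NULL) return 1;

        start = clock();
        /* Write one byte per 4 KB page so the OS must actually back it. */
        for (size_t i = 0; i < size; i += 4096)
            buf[i] = 1;
        printf("first touch: %.4f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

        free(buf);
        return 0;
    }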

Even more surprising, the same optimization works for calloc. The OS can map a brand-new page to the so-called "zero page": a read-only page of memory from which only zeros can be read. The mapping is copy-on-write: when your process tries to write to this new memory, the kernel intervenes, copies the zeros to a fresh page, and then allows the write.

Thanks to this OS trick, calloc can do the same thing malloc does for large allocations: request new virtual memory pages, which cost nothing until they are used. This optimization means calloc(1024 * 1024 * 1024, 1) costs the same as a malloc of the same size, even though calloc also promises zero-filled memory. Clever!

Back to our bug


So: if CFFI was using calloc, why was the memory being zeroed at all?

First of all: calloc was not always being used. But I suspected I could reproduce the slowdown with calloc directly, so I dashed off another program:

    #include <stdlib.h>

    #define ALLOCATION_SIZE (100 * 1024 * 1024)

    int main(int argc, char *argv[]) {
        for (int i = 0; i < 10000; i++) {
            void *temp = calloc(ALLOCATION_SIZE, 1);
            free(temp);
        }
        return 0;
    }

A very simple C program that allocates and frees 100 MB with calloc ten thousand times, then exits. There are two possibilities⁵:

  1. calloc can use the virtual memory trick described above. In that case the program should be fast: the allocated memory is never actually used, never paged in, and the pages never become "dirty." The OS lies to us about the allocation, we never call its bluff, and everything hums along.
  2. calloc can call malloc and zero the memory manually with memset. That should be very, very slow: in total it would zero a terabyte of memory (ten thousand iterations of 100 MB each), which is hard work.

The allocation here vastly exceeds the usual OS threshold for the first behavior, so that is what you would expect. And Linux does exactly that: compile the code with GCC and run it, and it finishes extremely quickly, generates only a few page faults, and puts almost no pressure on memory. But run the same program on macOS and it runs extremely long: it took almost eight minutes on my machine.

Stranger still, if you increase ALLOCATION_SIZE (say, to 1000 * 1024 * 1024), the same program on macOS runs almost instantly! What the hell?

What is going on here?

In-depth analysis


macOS ships a sample utility (see man 1 sample) that can tell you a lot about a running process by periodically recording its state. For our code, sample gives the following:

    Sampling process 57844 for 10 seconds with 1 millisecond of run time between samples
    Sampling completed, processing symbols...
    Sample analysis of process 57844 written to file /tmp/a.out_2016-12-05_153352_8Lp9.sample.txt

    Analysis of sampling a.out (pid 57844) every 1 millisecond
    Process:         a.out [57844]
    Path:            /Users/cory/tmp/a.out
    Load Address:    0x10a279000
    Identifier:      a.out
    Version:         0
    Code Type:       X86-64
    Parent Process:  zsh [1021]

    Date/Time:       2016-12-05 15:33:52.123 +0000
    Launch Time:     2016-12-05 15:33:42.352 +0000
    OS Version:      Mac OS X 10.12.2 (16C53a)
    Report Version:  7
    Analysis Tool:   /usr/bin/sample
    ----

    Call graph:
        3668 Thread_7796221   DispatchQueue_1: com.apple.main-thread  (serial)
          3668 start  (in libdyld.dylib) + 1  [0x7fffca829255]
            3444 main  (in a.out) + 61  [0x10a279f5d]
            + 3444 calloc  (in libsystem_malloc.dylib) + 30  [0x7fffca9addd7]
            +   3444 malloc_zone_calloc  (in libsystem_malloc.dylib) + 87  [0x7fffca9ad496]
            +     3444 szone_malloc_should_clear  (in libsystem_malloc.dylib) + 365  [0x7fffca9ab4a7]
            +       3227 large_malloc  (in libsystem_malloc.dylib) + 989  [0x7fffca9afe47]
            +       ! 3227 _platform_bzero$VARIANT$Haswell  (in libsystem_platform.dylib) + 41  [0x7fffcaa3abc9]
            +       217 large_malloc  (in libsystem_malloc.dylib) + 961  [0x7fffca9afe2b]
            +         217 madvise  (in libsystem_kernel.dylib) + 10  [0x7fffca958f32]
            221 main  (in a.out) + 74  [0x10a279f6a]
            + 217 free_large  (in libsystem_malloc.dylib) + 538  [0x7fffca9b0481]
            + ! 217 madvise  (in libsystem_kernel.dylib) + 10  [0x7fffca958f32]
            + 4 free_large  (in libsystem_malloc.dylib) + 119  [0x7fffca9b02de]
            +   4 madvise  (in libsystem_kernel.dylib) + 10  [0x7fffca958f32]
            3 main  (in a.out) + 61  [0x10a279f5d]

    Total number in stack (recursive counted multiple, when >=5):

    Sort by top of stack, same collapsed (when >= 5):
            _platform_bzero$VARIANT$Haswell  (in libsystem_platform.dylib)    3227
            madvise  (in libsystem_kernel.dylib)    438

Here we can clearly see that the bulk of the time is spent in _platform_bzero$VARIANT$Haswell. That method is used to zero buffers. So macOS really is zeroing them. Why?

Some time after each release, Apple open-sources most of the core components of its OS. Since the program clearly spends its time in libsystem_malloc, I went to opensource.apple.com, downloaded the libmalloc-116 archive with the source code I needed, and started digging.

It turns out all the magic happens in large_malloc: the branch taken for allocations larger than 127 KB, the one that uses the virtual memory trick. So why is our program slow?

It seems Apple got a little too clever. Inside large_malloc, a pile of code hides behind a #define constant, CONFIG_LARGE_CACHE. Essentially, that code maintains a "free list" of large memory pages allocated to the program. If macOS frees a contiguous buffer between 127 KB and LARGE_CACHE_SIZE_ENTRY_LIMIT (approximately 125 MB), libsystem_malloc keeps its pages around and tries to reuse them for another large allocation. This spares it from asking the Darwin kernel for pages, saving a context switch and a system call: in principle, a non-trivial saving.

However, this caching is a problem for calloc, which must hand out zeroed bytes. If macOS finds a reusable page and the request came through calloc, it zeroes the memory. All of it. Every single time.

There is a rationale for this: zeroed pages are a limited resource, especially on modest hardware (looking at you, Apple Watch). So if a page can be reused, that can be a real saving.

But the page cache throws away all the advantages of using calloc to provide zeroed memory pages. It would not be so bad if it were done only for dirty pages: if no one has written to a page, it probably does not need zeroing. But macOS does it unconditionally. This means that even if you calloc, free, and calloc again without ever touching the memory, the second calloc receives the pages from the first allocation, pages that were never backed by physical memory. The OS therefore has to page in all that memory just to zero it, even though it is already zero. This is exactly what the virtual-memory-based allocator for large allocations is supposed to avoid: memory that was never used ends up being used, courtesy of the free-page list.
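
Here is a minimal sketch of that exact pattern, if you want to try it yourself (on the macOS of that era the second call is the slow one; on Linux both calls are effectively free):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        size_t size = 100 * 1024 * 1024;  /* 100 MB: under the ~125 MB cache limit */

        void *first = calloc(size, 1);  /* fresh pages: effectively free */
        if (first == NULL) return 1;
        free(first);                    /* the pages land on the large cache */

        clock_t start = clock();
        void *second = calloc(size, 1); /* reused pages: zeroed unconditionally */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        printf("second calloc took %.4f s\n", secs);
        free(second);
        return 0;
    }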

As a result, on macOS the cost of calloc grows linearly with the allocation size all the way up to 125 MB, even though other operating systems show O(1) behavior starting around 127 KB. Past 125 MB the allocations are too large to be cached, so the speed magically recovers.

I never expected to find this kind of bug by way of a Python program, and it left me with questions. For instance: how many CPU cycles are wasted zeroing memory that is already zeroed? How many context switches are spent forcing applications to page in memory they never used (and never will) just so the OS can pointlessly zero it?

To me, all of this confirms the old adage: all abstractions are leaky. You cannot forget about it just because you program in Python. Your program runs on a machine that has memory and all kinds of tricks for managing it. Sooner or later those abstractions will leak, and when they do, you may find yourself chasing the cause all the way down the stack.

The bug has been reported to Apple as Radar 29508271.

Notes


  1. Was it actually safe to disable the zeroing? CFFI zeroes new buffers by default so that uninitialized memory is never exposed to Python code, which is a sensible safety default. In this particular case, the char buffer is handed straight to OpenSSL, which writes the received data into it and reports how many bytes it wrote; Requests then reads back only that many bytes. So the uninitialized memory is (a) immediately overwritten by OpenSSL and (b) never exposed to the caller, which makes disabling the zeroing for this buffer safe.
  2. Note the word "safely." C code commonly allocates arrays like this: type *array = malloc(number_of_elements * size_of_element). This is dangerous: the multiplication of number_of_elements by size_of_element can overflow, yielding an allocation far smaller than intended. calloc does this multiplication itself and checks for overflow, failing the allocation rather than returning a too-small buffer.
  3. Strictly speaking, this is an "as if" promise: there is no guarantee that the Go runtime actually uses calloc under the hood.
  4. Within limits, of course: physical memory and swap remain finite, so the lie can eventually be called.
  5. Strictly speaking, there are more than two possibilities, but for our purposes only these two matter!

Source: https://habr.com/ru/post/317476/

