
The story of realloc (and laziness)

Simple macro


It all started with a simple macro: (approximate code)
#define ADD_BYTE(C) do {                    \
    if (offset == capa) {                   \
        if (capa < 16) {                    \
            capa = 16;                      \
        } else {                            \
            capa <<= 1;                     \
        }                                   \
        buffer = realloc(buffer, capa);     \
        assert(buffer != NULL);             \
    }                                       \
    buffer[offset++] = (C);                 \
} while(0)


For those who are not familiar with the C programming language: this simple macro appends the byte C to a dynamically allocated buffer (buffer) whose capacity, in bytes, is capa. The next write position is tracked by offset. Whenever the buffer is full, its capacity is doubled (starting from a minimum of 16 bytes).
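For context, here is a minimal sketch of how such a macro is typically used; the surrounding variable names (buffer, capa, offset) are the ones the macro assumes, and the ADD_BYTE definition above is assumed to be in scope.

#include <assert.h>
#include <stdlib.h>

/* (the ADD_BYTE macro shown above is assumed to be defined here) */

int main(void) {
    char *buffer = NULL;    /* dynamically grown buffer   */
    size_t capa = 0;        /* current capacity, in bytes */
    size_t offset = 0;      /* next write position        */

    const char *msg = "hello, realloc";
    for (size_t i = 0; msg[i] != '\0'; i++) {
        ADD_BYTE(msg[i]);   /* grows the buffer as needed */
    }
    ADD_BYTE('\0');

    /* buffer now holds a NUL-terminated copy of msg */
    free(buffer);
    return 0;
}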

Appending bytes to a dynamic buffer is one of the most common operations in almost any program (for building strings, arrays, and so on).
But how efficient is this reallocation strategy? If you look at the problem from the point of view of complexity, assuming realloc() itself is O(N), you quickly see that appending one byte has an amortized complexity of O(1), which is quite good.
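To make the amortized O(1) claim concrete, here is a tiny standalone simulation (my own sketch, not part of the original program) that counts how many bytes a copying realloc() would move in total while appending one byte at a time with the doubling strategy above:

#include <stdio.h>

int main(void) {
    unsigned long long n = 1ULL << 30;   /* append 1 GiB, byte by byte */
    unsigned long long capa = 0, copied = 0;

    for (unsigned long long offset = 0; offset < n; offset++) {
        if (offset == capa) {
            copied += capa;              /* a growth copies the old contents */
            capa = (capa < 16) ? 16 : capa << 1;
        }
    }
    printf("appended %llu bytes, copied %llu bytes (%.2f copied per append)\n",
           n, copied, (double)copied / (double)n);
    return 0;
}

The geometric series 16 + 32 + ... sums to less than 2N, so the program reports roughly one copied byte per appended byte: a constant amortized cost.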

But what is the worst-case cost, when the buffer actually has to be reallocated?
Anonymous: Are you sure this is a good strategy? You will run into serious performance problems when growing a really large array, say a gigabyte. Just imagine the consequences, especially if the buffer has to be swapped back in from disk.

Me: Hmm... Honestly, I have never thought about it... But I am sure it is not that bad. The system should handle this just fine.

Anonymous: And it seems to me that a linked structure would be a better option, even with the same exponential growth strategy.

Me : No, it's a waste of time.

Anonymous: Proof?

Oh, is that so?


Well, I had to justify my position. Like all programmers, I am very lazy. More precisely: "As befits a programmer, I am very lazy." Laziness is a great incentive to become smarter and more efficient: you are too lazy to perform repetitive or routine tasks, so you need something more intelligent, something as fast as possible. Yes, that is what is sometimes called laziness. But in my opinion it is nothing but efficiency.

A linked list of buffer storage blocks is a somewhat cumbersome solution. I am not saying it is impossible. "Impossible is not a French word," as Napoleon used to say (though he ended badly). The solution is possible, but it is cumbersome: copying a sub-array or saving it to disk would take extra effort, we would probably have to maintain an index of pointers to the beginning of each block, take a base-2 logarithm to find which block an offset lands in, and lots of other tedium... Phew, I am tired already...

Something tells me that my laziness will be very welcome here. After all, there is no need to worry about small blocks at all, and the system's virtual memory will take care of the large ones.

Let's see.

By the way, what is realloc ()?


It is a standard POSIX function implemented in the C library. On Linux, that is libc.so, which also contains realloc's siblings malloc and free:

nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -E "T (malloc|realloc|free)$"
000000000007e450 T free
000000000007e030 T malloc
000000000007e4e0 T realloc


For the genuinely curious: "T" means "text symbol" (the capital letter shows that the symbol is public and visible); a program's text segment is its code, the data segment holds initialized data (variables), and the bss segment holds uninitialized data (variables) - or rather, variables whose initial value is zero.
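As a quick illustration (my own toy example, not from the original text), here is where typical C definitions end up:

int bss_variable;            /* uninitialized: lands in the bss segment   */
int data_variable = 42;      /* initialized: lands in the data segment    */

int main(void) {             /* the code itself lives in the text segment */
    return bss_variable + data_variable;
}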

A memory allocator is used to allocate, reallocate, and free memory (thanks, Captain Obvious!). There are many such allocators (one of the best known is the buddy allocator).

We can implement our own very naive allocator using the great and terrible sbrk call, which simply adds empty space at the end of the data segment:

#include <sys/types.h>
#include <unistd.h>
#include <string.h>

void *malloc(size_t size) {
    return sbrk(size);
}

void free(void *ptr) {
    /* who cares? */
}

/* Note: kept simple on purpose (i.e. not really correct) */
void *realloc(void *ptr, size_t size) {
    void *nptr = malloc(size);
    if (nptr == NULL) {
        return NULL;
    }
    // "old_size" is left as an exercise for the reader :)
    // (and no, we should not free the original block on failure)
    memcpy(nptr, ptr, old_size);
    free(ptr);
    return nptr;
}


[Note: as a Hacker News reader rightly pointed out, you need the value of old_size here, which could for example be stored at the beginning of the block when it is allocated. And no, we should not free the original block when realloc fails :)]
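For illustration only, here is one way the toy allocator could remember old_size: stash the size in a small header just before the block. This is my own sketch of that idea (alignment and realloc(NULL, ...) are deliberately ignored), not how glibc actually does it.

#include <string.h>
#include <unistd.h>

/* Toy variant: prepend a size header so realloc() knows how much to copy. */
void *malloc(size_t size) {
    size_t *base = sbrk(size + sizeof(size_t));
    if (base == (void *) -1) {
        return NULL;
    }
    *base = size;               /* remember the user-visible size */
    return base + 1;            /* hand out the memory after the header */
}

void free(void *ptr) {
    /* still a no-op in this toy */
    (void) ptr;
}

void *realloc(void *ptr, size_t size) {
    size_t old_size = ((size_t *) ptr)[-1];
    void *nptr = malloc(size);
    if (nptr == NULL) {
        return NULL;            /* the original block stays valid */
    }
    memcpy(nptr, ptr, old_size < size ? old_size : size);
    free(ptr);
    return nptr;
}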

Of course, a real allocator is a bit more involved: it needs non-trivial data structures to limit the number of memcpy calls.

To understand what happens with large blocks, we will have to take a closer look at glibc on Linux.

Meet Glibc


Just download the latest version of glibc and explore the source tree.
There is an interesting directory there: malloc. Find the file malloc.c in it and open it in an editor.

/*
  The main properties of the algorithms are:
  * For large (>= 512 bytes) requests, it is a pure best-fit allocator,
    with ties normally decided via FIFO (i.e. least recently used).
  * For small (<= 64 bytes by default) requests, it is a caching
    allocator, that maintains pools of quickly recycled chunks.
  * In between, and for combinations of large and small requests, it does
    the best it can trying to meet both goals at once.
  * For very large requests (>= 128KB by default), it relies on system
    memory mapping facilities, if supported.

  For a longer but slightly out of date high-level description of the
  algorithm, see http://gee.cs.oswego.edu/dl/html/malloc.html
*/


We are most interested in this part: "For very large requests (>= 128 KB by default), it relies on system memory mapping facilities, if supported."

The 128 KB threshold is configured by the mallopt () function:

M_MMAP_THRESHOLD
    For allocations greater than or equal to the limit specified (in bytes)
    by M_MMAP_THRESHOLD that cannot be satisfied from the free list, the
    memory-allocation functions employ mmap(2) instead of increasing the
    program break using sbrk(2).
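If you want to experiment with that threshold, mallopt() lets you change it at run time. A minimal sketch (the 1 MiB value is arbitrary, and the comments describe glibc's documented behavior rather than a guarantee):

#include <malloc.h>
#include <stdlib.h>
#include <stdio.h>

int main(void) {
    /* Ask glibc to use mmap() only for allocations of 1 MiB and above. */
    if (mallopt(M_MMAP_THRESHOLD, 1024 * 1024) != 1) {
        fprintf(stderr, "mallopt failed\n");
        return 1;
    }
    void *small = malloc(256 * 1024);       /* below the threshold: heap (brk)  */
    void *large = malloc(4 * 1024 * 1024);  /* above the threshold: uses mmap() */
    free(small);
    free(large);
    return 0;
}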


So, as I said, the virtual memory service of the system will take care of large blocks .

In essence, this means that:
- a large malloc() (128 KB or more) is served by mmap(), that is, by the virtual memory subsystem;
- a realloc() of such a block can therefore be handled by mremap().
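A quick way to see this for yourself (a sketch of mine; the exact syscalls and thresholds vary with the glibc version and malloc tunables): compile a trivial program and watch it under strace.

#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Well above the default 128 KB threshold: should be served by mmap(). */
    char *p = malloc(1024 * 1024);
    if (p == NULL)
        return 1;
    memset(p, 'x', 1024 * 1024);

    /* Growing the block should show up as mremap(), not as a byte copy. */
    char *q = realloc(p, 64 * 1024 * 1024);
    if (q == NULL) {
        free(p);
        return 1;
    }
    free(q);
    return 0;
}

Running it as strace -e trace=mmap,mremap,brk ./a.out should show the large allocation arriving via mmap() and the growth handled by mremap().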

All right, let's dig a little deeper into mremap.

Here is what man mremap tells us:

mremap() uses the Linux page table scheme. mremap() changes the mapping
between virtual addresses and memory pages. This can be used to implement
a very efficient realloc(3).


Aha, a very efficient realloc. Just how efficient?
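Before digging into glibc and the kernel, here is what calling mremap(2) directly looks like - a minimal, Linux-specific sketch of mine (it needs _GNU_SOURCE), just to make the interface concrete:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t old_len = 1 << 20, new_len = 1 << 24;

    /* Anonymous mapping, the kind a big malloc() would hand us. */
    void *p = mmap(NULL, old_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Grow it; the kernel may move the mapping instead of copying the data. */
    void *q = mremap(p, old_len, new_len, MREMAP_MAYMOVE);
    if (q == MAP_FAILED)
        return 1;
    printf("old=%p new=%p (moved: %s)\n", p, q, p == q ? "no" : "yes");

    munmap(q, new_len);
    return 0;
}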

First, mmap, munmap and mremap are declared in glibc. In fact, these are only entry points:
nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep -E "(mmap|munmap|mremap)$"
00000000000e4350 W mmap
00000000000e9080 W mremap
00000000000e4380 W munmap


Note that in this case the default entry points are "weak" symbols, which means they can be overridden by someone else at dynamic link time. For example:

ldd /tmp/sample
        linux-vdso.so.1 (0x00007584a8aaa000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007584a86e6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007584a8aac000)


... the symbols may be overridden by linux-vdso.so.1, a magic library mapped into every Linux program that speeds up some calls, including system calls.
In any case, our glibc symbols are just thin wrappers around kernel system calls (syscalls), whether they come from glibc or from the vdso (see the default implementation: sysdeps/unix/sysv/linux/mmap64.c). For example:

void *
__mmap64 (void *addr, size_t len, int prot, int flags, int fd, off64_t offset)
{
  // (sanity checks omitted here)
  void *result;
  result = (void *) INLINE_SYSCALL (mmap2, 6, addr, len, prot, flags, fd,
                                    (off_t) (offset >> page_shift));
  return result;
}


So our original question is now about the Linux kernel, not about glibc.

Introducing the kernel


Download the latest version of the kernel, and let's take a quick look at how mremap works.

Looking in the manual ( The Linux Kernel Howto ), you will find a very interesting directory:

The mm directory contains all the memory management code. Architecture-specific memory management code lives in the arch/*/mm/ directories, for example arch/i386/mm/fault.c.


Great, just what we need!
Here is an interesting file: mm/mremap.c. It contains the entry point of the mremap system call. Here it is:

/*
 * Expand (or shrink) an existing mapping, potentially moving it at the
 * same time (controlled by the MREMAP_MAYMOVE flag and available VM space)
 *
 * MREMAP_FIXED option added 5-Dec-1999 by Benjamin LaHaise
 * This option implies MREMAP_MAYMOVE.
 */
SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
		unsigned long, new_len, unsigned long, flags,
		unsigned long, new_addr)
{


This is where we land in the kernel when the corresponding system call is made from user code (through glibc or the corresponding vdso wrapper).

Reading the code of this function, you will see various argument checks and the handling of trivial cases (for example, shrinking a memory block: you just free the pages at the end of the block).

Then the kernel tries to expand the mapped area in place (just as the allocator in the C library would do with sbrk). If the expansion succeeds, the function returns there.

All these trivial cases are handled in O(1) (even counting the cost of entering the kernel - a cost that is lower now that interrupts are no longer used for system calls, but that still exists).

And what's the worst case ?
	/*
	 * We weren't able to just expand or shrink the area,
	 * we need to create a new one and move it..
	 */
	ret = -ENOMEM;
	if (flags & MREMAP_MAYMOVE) {
		unsigned long map_flags = 0;
		if (vma->vm_flags & VM_MAYSHARE)
			map_flags |= MAP_SHARED;

		new_addr = get_unmapped_area(vma->vm_file, 0, new_len,
					vma->vm_pgoff +
					((addr - vma->vm_start) >> PAGE_SHIFT),
					map_flags);
		if (new_addr & ~PAGE_MASK) {
			ret = new_addr;
			goto out;
		}

		map_flags = vma->vm_flags;
		ret = move_vma(vma, addr, old_len, new_len, new_addr);
		if (!(ret & ~PAGE_MASK)) {
			track_exec_limit(current->mm, addr, addr + old_len, 0UL);
			track_exec_limit(current->mm, new_addr, new_addr + new_len,
					 map_flags);
		}
	}


At first glance, the kernel does the same thing we did in our naive allocator:
- it gets a new, unmapped area (get_unmapped_area);
- then it moves the old region there (move_vma).

But let's take a closer look.

Here is the call to move_vma:
static unsigned long move_vma(struct vm_area_struct *vma,
		unsigned long old_addr, unsigned long old_len,
		unsigned long new_len, unsigned long new_addr)
{
	...
	new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT);
	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
			   &need_rmap_locks);
	if (!new_vma)
		return -ENOMEM;

	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
				     need_rmap_locks);


There are two interesting calls here:
- copy_vma()
- move_page_tables()

The copy_vma function, defined in mm/mmap.c, moves a structure of type vm_area_struct - the internal kernel structure describing a memory area.

Its definition can be found in include/linux/mm_types.h. This small structure contains all the information about the region: the start and end addresses, the file on disk (if the area is used to map a file), and so on.
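To give a feel for it, here is a heavily abridged, approximate sketch of the kind of fields it holds (see include/linux/mm_types.h for the real definition):

struct vm_area_struct {
    unsigned long vm_start;       /* first address of the area              */
    unsigned long vm_end;         /* first address beyond the area          */
    unsigned long vm_flags;       /* access rights and other properties     */
    struct file  *vm_file;        /* backing file, if this area maps a file */
    unsigned long vm_pgoff;       /* offset into that file, in pages        */
    /* ... plus links into the kernel's VMA lists and trees, etc. ... */
};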

So this part of the move is quite cheap: O(1).

And what does the move_page_tables function do?

The most interesting part seems to be this loop:

...
	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
		...
		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
			  new_vma, new_pmd, new_addr, need_rmap_locks);
		...


This code is a bit involved (and there are no comments), but it is mostly basic arithmetic and bookkeeping.

The move_ptes function contains the innermost loop of the lowest level:

...
	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;
		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		...
		set_pte_at(mm, new_addr, new_pte, pte);
	}
...


So what happens in this loop?! We simply move the page table entries (PTEs) corresponding to the memory region to another location. In essence, a PTE is just an integer associated with each page.

So here is what we have: the kernel did not touch a single byte of the data in our allocated block; to move the whole region, it only had to tweak a few bits.
Formally, the complexity is still O(N), but:
- instead of copying every byte, we move one page table entry per page (4 KB of data, or 2 MB with huge pages);
- and what gets moved are small integers, not the memory contents themselves.

So it is still O(N), but with a huge difference in the constant.

Oh, and by the way, did you know that O(N) is misleading in many cases?

In our case, the maximum value of N is 2^48 (the maximum size of the virtual address space). Most computers deal with only a few gigabytes of memory - at most 64 GB for most architectures, that is, 2^36 bytes. A huge page is 2^21 bytes (2 MB), so move_ptes has to handle at most 2^36 / 2^21 = 2^15 entries (yes, that is only 32,768 operations).

So even in the worst case the cost is always extremely low - which, for some reason, I never doubted from the very beginning.

I also recommend reading the book Understanding the Linux Virtual Memory Manager by Mel Gorman.

"Niasilil mnogabukaf"? Do not worry. Laziness and realloc are our everything.

Source: https://habr.com/ru/post/238739/

