📜 ⬆️ ⬇️

Experiments with malloc

image

As is known, in modern x86 (_64) and ARM architectures, the virtual memory of the process is linear and continuous, for, fortunately, the times of char near* and int huge* have passed. Virtual memory is divided into pages, the typical size of which is 4 KiB, and by default they are not mapped to physical memory, so working with them will not work. To see the current mapped address intervals for a process, on Linux, look at / proc / <pid> / maps, on OS X, vmmap <pid>. Each address interval has three types of protection: from execution, from writing and from reading. As you can see, the very first interval, starting with the load address (corresponding to the .text segment of the ELF in Linux, __TEXT and Mach-O in OS X), is readable and executable - very logical. You can also see that the stack is essentially no different from other intervals, and you can quickly calculate its size by subtracting the initial address from the end address. Pages are displayed using mmap / munmap , and protection is changed using mprotect . There are also brk / sbrk , deprecated ancient remnants of the past, which change the size of a single interval of "data" and are emulated by mmap in modern systems.

All malloc POSIX implementations in one way or another run into the functions listed above. Compared with the naive selection and release of pages, rounding the required size in a big way, malloc has many advantages:
')

Enough theory! We will feel malloc in practice. We will conduct three experiments. Work will be possible on POSIX-compatible OSes, in particular, work on Linux and OS X was tested.

Null from malloc


Let's start with the banal. If you override the function from libc (as well as any other library) in your code, then the linker will not mind if libc is dynamically connected (and by default it is), and will not swear at the double definition. For example, such code:

 #include <stdio.h> #include <stdlib.h> void* malloc(size_t size) { puts("malloc"); return NULL; } int main() { return (int)malloc(100500); } 

It will print "malloc" and have a zero return code ( echo $? ). However, let's check what happens if we call a function in the depths of which malloc is called, for example asprintf.

 // program.c #include <stdio.h> #include <stddef.h> void* malloc(size_t size) { puts("malloc"); return NULL; } int main() { char *s = NULL; asprintf(&s, "%d", 0); printf("%p\n", s); return 0; } 

And here it will greatly depend on the linker. If it is ld / linux then it will print

 malloc (nil) 

For malloc will be called ours, redefined, and the glibc implementation of printf does not use malloc. The redefinition came about because malloc in glibc is declared weak (__attribute __ ((weak))), and our default definition is strong (ELF specific). But with dyld / OS X, the behavior is different:

 0x7fc1eb403230 

In fact, malloc on the poppy is not overridden! The point should be in multi-level dyld namespaces. Well, well ...

 DYLD_FORCE_FLAT_NAMESPACE=1 ./program Segmentation fault: 11 

Hryav Baba buddy! The matter, apparently, did not even reach int main (). What is the reason?

 lldb lldb ./program (lldb) target create "./program" Current executable set to './program' (x86_64). (lldb) env DYLD_FORCE_FLAT_NAMESPACE=1 (lldb) r Process 11956 launched: './program' (x86_64) Process 11956 stopped * thread #1: tid = 0x12e214, 0x00007fff9ebb9dcb libsystem_kernel.dylib`ioctl + 67, stop reason = EXC_BAD_ACCESS (code=2, address=0x7fff5f3ffff8) frame #0: 0x00007fff9ebb9dcb libsystem_kernel.dylib`ioctl + 67 libsystem_kernel.dylib`ioctl: -> 0x7fff9ebb9dcb <+67>: movq %rcx, -0xb8(%rbp) 0x7fff9ebb9dd2 <+74>: movq %rdx, -0xc0(%rbp) 0x7fff9ebb9dd9 <+81>: leaq -0xd0(%rbp), %rax 0x7fff9ebb9de0 <+88>: movq %rax, -0x10(%rbp) (lldb) bt * thread #1: tid = 0x12e214, 0x00007fff9ebb9dcb libsystem_kernel.dylib`ioctl + 67, stop reason = EXC_BAD_ACCESS (code=2, address=0x7fff5f3ffff8) * frame #0: 0x00007fff9ebb9dcb libsystem_kernel.dylib`ioctl + 67 frame #1: 0x00007fff9a20f2c8 libsystem_c.dylib`isatty + 43 frame #2: 0x00007fff9a222ac6 libsystem_c.dylib`__smakebuf + 60 frame #3: 0x00007fff9a237b4a libsystem_c.dylib`__swsetup + 155 frame #4: 0x00007fff9a221d52 libsystem_c.dylib`__sfvwrite + 73 frame #5: 0x00007fff9a2264c9 libsystem_c.dylib`puts + 144 frame #6: 0x0000000100000f0b program`malloc(size=4096) + 27 at program.c:6 frame #7: 0x00007fff9a222af6 libsystem_c.dylib`__smakebuf + 108 frame #8: 0x00007fff9a237b4a libsystem_c.dylib`__swsetup + 155 ... frame #130931: 0x0000000100000f0b program`malloc(size=4096) + 27 at program.c:6 frame #130932: 0x00007fff9a222af6 libsystem_c.dylib`__smakebuf + 108 frame #130933: 0x00007fff9a237b4a libsystem_c.dylib`__swsetup + 155 frame #130934: 0x00007fff9a221d52 libsystem_c.dylib`__sfvwrite + 73 frame #130935: 0x00007fff9a2264c9 libsystem_c.dylib`puts + 144 frame #130936: 0x0000000100000f0b program`malloc(size=8) + 27 at program.c:6 frame #130937: 0x00007fff5fc1d22e dyld`operator new(unsigned long) + 30 frame #130938: 0x00007fff5fc095a5 dyld`std::__1::vector >::insert(std::__1::__wrap_iter, char const* (* const&)(dyld_image_states, unsigned int, dyld_image_info const*)) + 343 frame #130939: 0x00007fff5fc04507 dyld`dyld::registerImageStateBatchChangeHandler(dyld_image_states, char const* (*)(dyld_image_states, unsigned int, dyld_image_info const*)) + 147 frame #130940: 0x00007fff8bb8089e libdyld.dylib`dyld_register_image_state_change_handler + 76 frame #130941: 0x00007fff8bb8065f libdyld.dylib`_dyld_initializer + 47 frame #130942: 0x00007fff982829fd libSystem.B.dylib`libSystem_initializer + 116 frame #130943: 0x00007fff5fc12feb dyld`ImageLoaderMachO::doModInitFunctions(ImageLoader::LinkContext const&) + 265 frame #130944: 0x00007fff5fc13164 dyld`ImageLoaderMachO::doInitialization(ImageLoader::LinkContext const&) + 40 frame #130945: 0x00007fff5fc0f79d dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 305 frame #130946: 0x00007fff5fc0f732 dyld`ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 198 frame #130947: 0x00007fff5fc0f623 dyld`ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) + 127 frame #130948: 0x00007fff5fc0f893 dyld`ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&) + 75 frame #130949: 0x00007fff5fc020f1 dyld`dyld::initializeMainExecutable() + 208 frame #130950: 0x00007fff5fc05e5d dyld`dyld::_main(macho_header const*, unsigned long, int, char const**, char const**, char const**, unsigned long*) + 3793 frame #130951: 0x00007fff5fc01276 dyld`dyldbootstrap::start(macho_header const*, int, char const**, long, macho_header const*, unsigned long*) + 512 frame #130952: 0x00007fff5fc01036 dyld`_dyld_start + 54 

What kind of frame? 130952 Yes, it turns out we have Stack Overflow! And we also learned a few curious things: dyld is written in C ++, and puts for some reason allocates memory thus malloc, creating recursion. I want to believe that he does this only once when initializing the stdout buffer. Well, we have to replace it:

 #include <stdio.h> #include <stddef.h> #include <unistd.h> void* malloc(size_t size) { write(STDOUT_FILENO, "malloc\n", 7); return NULL; } int main() { char *s = NULL; asprintf(&s, "%d", 0); printf("%p\n", s); return 0; } 

Run and see:

 malloc malloc malloc Segmentation fault: 11 

Stack fragment:

 * thread #1: tid = 0x1309af, 0x00007fff5fc249ce dyld`_platform_bzero + 94, stop reason = EXC_BAD_ACCESS (code=1, address=0x8) * frame #0: 0x00007fff5fc249ce dyld`_platform_bzero + 94 frame #1: 0x00007fff5fc14045 dyld`calloc + 52 frame #2: 0x00007fff5fc0ce14 dyld`__cxa_get_globals + 100 frame #3: 0x00007fff5fc1ce7f dyld`__cxa_throw + 25 frame #4: 0x00007fff5fc1d267 dyld`operator new(unsigned long) + 87 

So, it is clear that, unlike GNU / Linux, whose ld is made on static allocation, a lot is heavily used when running an application in OS X. You can also see that throwing an exception about a failed allocation of memory by the operator new calls calloc, which, as we remember, is the combination malloc + zero-fill (bzero). The calloc implementation got a bold one and did not check the null pointer. With this knowledge, now I wondered what would happen if OS X truly runs out of memory, “to the last byte”. Obviously, the correct and logical solution would be to preallocate memory for std :: bad_alloc.

Ok, Google, how do we still redefine malloc under OS X so that nothing falls? We'll have to plunge into the implementation details. Malloc on Macs allocates memory in zones. Initially, there is only one zone, by default, and that is what vmmap will show at the end of the output. Each zone stores pointers to malloc, free and realloc, which allows you to flexibly customize memory management. You can take the default zone and replace the pointer with malloc in it:

 #include <stdio.h> #include <stddef.h> #include <unistd.h> #include <malloc/malloc.h> #include <sys/mman.h> void* zone_malloc(struct _malloc_zone_t *zone, size_t size) { write(STDOUT_FILENO, "malloc\n", 7); return NULL; } int main() { malloc_zone_t* zone = malloc_default_zone(); mprotect(zone, sizeof(*zone), PROT_READ | PROT_WRITE); zone->malloc = zone_malloc; mprotect(zone, sizeof(*zone), PROT_READ); char *s = NULL; asprintf(&s, "%d", 0); printf("%p\n", s); return 0; } 

Pay attention to mprotect. Initially, malloc_default_zone returns a pointer to a memory area that is write-protected. You can easily verify this by running the program without mprotect and examining the crash in the debugger and vmmap. Such protection is obtained from playful hands ... Back to PROT_READ, strictly speaking, protection could not be changed, added for the sake of order. What will be printed:

 malloc malloc 0x0 

We see that printf used malloc, but then found the strength to do without dynamic memory and still printed out a null pointer.

By the way, about the zones. Malloc in glibc uses a similar approach, called obstacks. On the one hand, there are many functions for working with them, on the other hand, there is no possibility to use different memory allocation algorithms in different obstacles.

Output: dyld, the OS X loader, is written in C ++ and working with a bunch in programs on this system begins long before int main (). C ld on Linux this does not happen and there are no calls to the heap.

Ineffective malloc


Now let's set a new goal for ourselves: create a dynamic library in which we implement our version of malloc.

 // hack_malloc.c #define _GNU_SOURCE #include <stdio.h> #include <stddef.h> #include <unistd.h> #include <sys/mman.h> void* malloc(size_t size) { write(STDOUT_FILENO, "malloc... ", 10); size += sizeof(size_t); int page_size = getpagesize(); int rem = size % page_size; if (rem > 0) { size += page_size - rem; } void* addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { write(STDOUT_FILENO, "fail\n", 5); return NULL; } write(STDOUT_FILENO, "ok\n", 3); *(size_t*)addr = size; return (size_t*)addr + 1; } void free (void *ptr) { write(STDOUT_FILENO, "free... ", 8); size_t* real_ptr = (size_t*)ptr - 1; if (!munmap(real_ptr, *real_ptr)) { write(STDOUT_FILENO, "ok\n", 3); } else { write(STDOUT_FILENO, "fail\n", 5); } } 

Here the simplest approach is implemented when we allocate memory by pages. It is necessary to store the size at the beginning of the page in order to have something to transfer to unmap. MAP_ANONYMOUS in mmap flags means that we are not mapping the actual file into memory, but physical memory (usually, mmap maps the files to memory, this gives acceleration in some operations). In the case of files, MAP_PRIVATE would create an individual copy at write (copy-on-write), but for us, in essence, does nothing, just the documentation requires the presence of either MAP_PRIVATE or MAP_SHARED. By the way, this code also works great with MAP_SHARED.

We will check by example:

 // test.c #include <stdio.h> #include <stdlib.h> int main() { printf("start\n"); void* mem = malloc(100); printf("malloc() -> %p\n", mem); *(int*)mem = 0; free(mem); printf("end\n"); return 0; } 

We will collect so:

 #Linux gcc -shared -o libhackmalloc.so -fPIC -std=c99 -O2 hack_malloc.c gcc test.c -std=c99 -L. -Wl,-rpath,. -lhackmalloc -O2 -o test # Mac OS X clang -dynamiclib -undefined suppress -flat_namespace -std=c99 -fPIC -O2 hack_malloc.c -o libhackmalloc.dylib clang test.c -std=c99 -L. -lhackmalloc -O2 -o test 

At startup, we will see:

 ./test start malloc... ok malloc() -> 0x10935b008 free... ok end 

The output for OS X and Linux is identical. In the case of OS X, we recall the dyld namespaces and make them flat, like the Windows 10 interface.

 DYLD_FORCE_FLAT_NAMESPACE=1 ./test malloc... ok malloc... ok free... ok malloc... ok malloc... ok free... fail free... fail free... fail malloc... ok malloc... ok free... fail malloc... ok 70  free... fail 17  malloc... ok free... ok malloc... ok malloc... ok malloc... ok free... fail malloc... ok start malloc... ok malloc() -> 0x1035d9008 free... ok end 

The program has worked, and already well. What is surprising is the apparent discrepancy between the number of calls to malloc and free before int main (). Also free failed many times. Those who are interested can run the test in the debugger, set the breakpoint on free and learn about the dark life of dyld a lot of new things, and we will move on.

Conclusion: it is quite possible to write the implementation of malloc "in the forehead" in 30 lines.

Spying on malloc


Let's try to use DLL injection technique to inject our malloc into other people's programs. I do not want to write my own effective implementation of the heap, although there are many interesting algorithms, for example Buddy . It would be possible to take any of the ready-made implementations , but we will apply the trick with RTLD_NEXT and refer to the system malloc. Consider this code:

 // trace_malloc.c #define _GNU_SOURCE #include <dlfcn.h> #include <fcntl.h> #include <stdio.h> #include <unistd.h> int fd = 0; void* (*__malloc)(size_t) = NULL; void* malloc(size_t size) { if (!__malloc) { __malloc = (void*(*)(size_t)) dlsym(RTLD_NEXT, "malloc"); } if (!fd) { fd = open("malloc.log", O_WRONLY | O_CREAT | O_TRUNC, 0666); } /* ... */ write(fd, record, sprintf(record, "%ld.%06ld\t%zu\n", sec, mcsec, size)); return __malloc(size); } 

A link to the full version of the code will be given at the end of the article. A half of it was eaten by the cross-platform implementation of clock_gettime, and the second by reference to clock_gettime, so I was forced to shorten it a bit. All the beauty in one single line:

 __malloc = (void*(*)(size_t)) dlsym(RTLD_NEXT, "malloc"); 

In it, we load the “previous” malloc. Usually dlsym is used to pull functions from loaded dynamic libraries, but we use magic RTLD_NEXT as a library descriptor. Strictly speaking, it is not in POSIX, but in fact it is supported by many linkers. So we get a pointer to true malloc, save it, and call it later, returning its result. Along the way, log all calls.

We build the same way as hack_malloc.c, use on Linux like this:

 LD_PRELOAD=/path/to/libtracemalloc.so program 

The path must be absolute, otherwise the magic will not happen. LD_PRELOAD is a special environment variable that forcibly loads these libraries before the main ones with which the program is compiled. Thus, you can override arbitrary functions or solve temporary problems with running incorrectly linked programs (that is the lib * .so: not found message).

Ls, for example, creates about 2 KB of log. And whoami will drop with the message undefined symbol: dlsym, because dlsym is defined in libdl.so, which some load and some do not. And it makes no sense to build libtracemalloc with -ldl, since LD_PRELOAD will not load the dependencies of the injected libraries. We'll have to do something like this:

 LD_PRELOAD=/usr/lib/.../libdl.so:/path/to/libtracemalloc.so whoami 

And we will see a kilobyte log of memory allocation even in the case of such an elementary utility.

Ok, and what about OS X? Dyld supports the DYLD_INSERT_LIBRARIES environment variable, which does the same thing. We try:

 DYLD_INSERT_LIBRARIES=/path/to/libtracemalloc.dylib ls 

... does not work, remembering the namespace:

 DYLD_INSERT_LIBRARIES=/path/to/libtracemalloc.dylib DYLD_FORCE_FLAT_NAMESPACE=1 ls 

... and again, bummer. Already interesting! It turns out that the case in the protection of system programs System Integrity Protection . This mechanism does not allow changing files, injecting code, debugging paths like / System, / usr, etc. using extended file attributes. Fortunately, / usr / local has been pardoned.

 lldb /bin/ls (lldb) target create "/bin/ls" Current executable set to '/bin/ls' (x86_64). (lldb) r error: process exited with status -1 (cannot attach to process due to System Integrity Protection) 

SIP can be disabled, but we will do it easier - we will copy the programs we are interested in to our directory:

 cp $(which ls) . DYLD_INSERT_LIBRARIES=/path/to/libtracemalloc.dylib DYLD_FORCE_FLAT_NAMESPACE=1 ./ls 

This is already working. In conclusion, we prove the well-known thesis on two types of memory allocation. Create a malloc log and build a histogram of size distribution using an elementary IPython code:

 %pylab import pandas, seaborn log = pandas.DataFrame.from_csv("malloc.log", sep='\t') log[log < 100].hist() log[log < 100].count() / len(log) 

A typical picture will be visible on the size histogram (Y is the number of calls, X is the size, the program is clang):



I deliberately cut the tail on 100 bytes, since larger allocations are so rare that they are not visible on the histogram. So, 98% of all memory allocations on the heap are less than 100 bytes, which means that a good malloc should serve at least two separate domains: for large objects and for all others.

Note that in order to analyze your program, you can not mess around with the self-assembling library like the one described above, but take the ready one. For example, tcmalloc allows you to profile a bunch and do many other useful things.

The code from the article is available on github . Next time we will take a real, large program, collect a log of memory allocations during its work and try to make predictions based on the LSTM model of the recursive neural network.

Source: https://habr.com/ru/post/281497/


All Articles