
Restricting the memory available to a program

I decided to tackle the problem of sorting a million integers with 1 MB of available memory. But first I had to figure out how to limit the amount of memory available to a program. Here is what I came up with.



Process virtual memory



Before diving into the different ways of limiting memory, you need to know how a process's virtual memory is organized. The best article on this topic is “Anatomy of a Program in Memory”.



After reading that article, I can see two ways to limit memory: shrink the virtual address space or shrink the heap.


First, shrinking the address space. It is fairly simple but not entirely correct: we cannot shrink the whole address space to 1 MB, because there would not be enough room for the kernel and the libraries.



Second, shrinking the heap. This is not so easy and hardly anyone does it, because it can only be done by fiddling with the linker. But for our task it would be the more correct option.



I will also consider other methods, such as tracking memory usage through intercepting library and system calls, and changing the program environment through emulation and sandboxing.



For testing, we will use a small program called big_alloc, which allocates and then frees about 100 MiB.



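    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <stdbool.h>

    // 1000 allocations of 100 KiB each = 100 000 KiB, roughly 100 MiB
    #define NALLOCS 1000
    #define ALLOC_SIZE (1024 * 100) // 100 KiB

    int main(int argc, const char *argv[])
    {
        int i = 0;
        int **pp;
        bool failed = false;

        // calloc zeroes the table, so the cleanup loop below stays safe
        // even if we stop early and some slots were never filled.
        pp = calloc(NALLOCS, sizeof(int *));
        if (!pp)
        {
            perror("calloc");
            return 1;
        }

        for (i = 0; i < NALLOCS; i++)
        {
            pp[i] = malloc(ALLOC_SIZE);
            if (!pp[i])
            {
                perror("malloc");
                printf("Failed after %d allocations\n", i);
                failed = true;
                break;
            }
            // Touch some bytes so the pages are really allocated
            // and not left as untouched copy-on-write mappings.
            memset(pp[i], 0xA, 100);
            printf("pp[%d] = %p\n", i, (void *)pp[i]);
        }

        if (!failed)
            printf("Successfully allocated %d bytes\n", NALLOCS * ALLOC_SIZE);

        for (i = 0; i < NALLOCS; i++)
            free(pp[i]); // free(NULL) is a no-op

        free(pp);
        return 0;
    }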




All the source code is on GitHub.



ulimit



ulimit is the first thing an old Unix hacker recalls when memory needs to be limited. It is a bash builtin that lets you limit a program's resources. Under the hood it is an interface to setrlimit.



We can set a limit on the amount of memory for the program.



 $ ulimit -m 1024 




Checking:



    $ ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 7802
    max locked memory       (kbytes, -l) 64
    max memory size         (kbytes, -m) 1024
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 1024
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited




We set the limit to 1024 KB, that is 1 MiB. But if we run the program, it completes without errors. Despite the 1024 KB limit, top shows the program taking up as much as 4872 KB.



The reason is that Linux does not enforce this limit, and the man page says as much:



    ulimit [-HSTabcdefilmnpqrstuvx [limit]]
        ...
        -m     The maximum resident set size (many systems do not honor this limit)


There is also the ulimit -d option, which should work, but it does not either, because of mmap (see the linker section).
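Since ulimit is only a front end to setrlimit, the same experiment can be reproduced from C. The sketch below is my own illustration, not code from the article's repository: it assumes that bash's `ulimit -m` maps to the RLIMIT_RSS resource, and the helper names set_rss_limit_kib and alloc_and_touch are hypothetical. On Linux the call succeeds, yet an allocation well above the limit still goes through.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Set the soft RLIMIT_RSS limit in KiB, roughly what `ulimit -S -m` does.
 * Hypothetical helper name. Returns 0 on success, -1 on error. */
int set_rss_limit_kib(rlim_t kib)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_RSS, &rl) != 0)
        return -1;

    rl.rlim_cur = kib * 1024;
    /* The soft limit may never exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
        rl.rlim_cur = rl.rlim_max;

    return setrlimit(RLIMIT_RSS, &rl);
}

/* Allocate `bytes`, write to every byte, and free the block again.
 * Returns 1 if the allocation and the writes succeeded, 0 otherwise. */
int alloc_and_touch(size_t bytes)
{
    assert(bytes > 0);

    char *p = malloc(bytes);
    if (p == NULL)
        return 0;

    memset(p, 0xA, bytes);
    int ok = (p[bytes - 1] == 0xA);
    free(p);
    return ok;
}
```

Calling set_rss_limit_kib(1024) and then alloc_and_touch(8 * 1024 * 1024) is expected to succeed on Linux, which matches what top showed for big_alloc: the limit is recorded but not honored.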



QEMU



QEMU is great for manipulating a program's environment. It has the -R option for limiting the virtual address space. But the limit cannot be set too low, otherwise libc and the kernel will not fit.



Look:



    $ qemu-i386 -R 1048576 ./big_alloc
    big_alloc: error while loading shared libraries: libc.so.6: failed to map segment from shared object: Cannot allocate memory


Here -R 1048576 leaves 1 MiB for the virtual address space.



So you need to give it something like 20 MB. Here it is:



    $ qemu-i386 -R 20M ./big_alloc
    malloc: Cannot allocate memory
    Failed after 100 allocations


Stops after 100 iterations (10 MB).



Overall, QEMU is so far the leader among the methods of limiting memory; you just need to play around with the -R value.



Container



Another option is to run the program in a container and limit resources. To do this, you can:





In any case, the resources are limited by the Linux subsystem called cgroups. You can work with cgroups directly, but I recommend going through LXC. I would have liked to use Docker, but it only works on 64-bit machines.



LXC stands for LinuX Containers. It is a set of userspace tools and libraries for managing kernel features and creating containers: isolated, secure environments for applications or for an entire system.



The kernel features are as follows:





Documentation can be found on the project site or on the author's blog.



To run an application in a container, you must give lxc-execute a config file that specifies all the container settings. You can start with the examples in /usr/share/doc/lxc/examples. The man page recommends starting with lxc-macvlan.conf. Let's begin:



    # cp /usr/share/doc/lxc/examples/lxc-macvlan.conf lxc-my.conf
    # lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
    Successfully allocated 102400000 bytes




Works!



Now let's limit memory with cgroups. LXC lets you configure the memory subsystem of the container's cgroup by setting memory limits. The parameters are described in the Red Hat documentation. I found two:





What I added to lxc-my.conf:



    lxc.cgroup.memory.limit_in_bytes = 2M
    lxc.cgroup.memory.memsw.limit_in_bytes = 2M


Run:



    # lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
    #


Silence. Apparently, there is too little memory. Let's try running it from a shell:



    # lxc-execute -n foo -f ./lxc-my.conf /bin/bash
    #


bash did not start. Let's try /bin/sh:



    # lxc-execute -n foo -f ./lxc-my.conf -l DEBUG -o log /bin/sh
    sh-4.2# ./dev/big_alloc/big_alloc
    Killed


And in dmesg you can follow the process's glorious death:



    [15447.035569] big_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
    ...
    [15447.035779] Task in /lxc/foo
    [15447.035785]  killed as a result of limit of
    [15447.035789] /lxc/foo
    [15447.035795] memory: usage 3072kB, limit 3072kB, failcnt 127
    [15447.035800] memory+swap: usage 3072kB, limit 3072kB, failcnt 0
    [15447.035805] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
    [15447.035808] Memory cgroup stats for /lxc/foo: cache:32KB rss:3040KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:1588KB active_anon:1448KB inactive_file:16KB active_file:16KB unevictable:0KB
    [15447.035836] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
    [15447.035963] [ 9225]     0  9225      942      308      10        0             0 init.lxc
    [15447.035971] [ 9228]     0  9228      833      698       6        0             0 sh
    [15447.035978] [ 9252]     0  9252    16106      843      36        0             0 big_alloc
    [15447.035983] Memory cgroup out of memory: Kill process 9252 (big_alloc) score 1110 or sacrifice child
    [15447.035990] Killed process 9252 (big_alloc) total-vm:64424kB, anon-rss:2396kB, file-rss:976kB


Although we did not get big_alloc's own message about the failed malloc and the amount of memory available, it seems to me that we have successfully limited memory with containers. Let's stop here for now.



Linker



Let's try to change the binary image so that the available heap space is limited. Linking is the last step in building a program. It is done by the linker, guided by its script. The script describes the program's sections in memory, along with all sorts of attributes and other things.



An example linker script:



    ENTRY(main)
    SECTIONS
    {
      . = 0x10000;
      .text : { *(.text) }
      . = 0x8000000;
      .data : { *(.data) }
      .bss : { *(.bss) }
    }


The dot denotes the current location. For example, the .text section starts at 0x10000, and then, starting at 0x8000000, come the next two sections: .data and .bss. The entry point is main.



This is all fine, but it will not work for real programs. The main function you write in a C program is not actually the first one called. A lot of initialization and cleanup happens around it. That code lives in the C runtime (crt) and is spread across the crt*.o object files in /usr/lib.



The details can be seen by running gcc -v. First it invokes cc1 to produce assembly, translates that into an object file with as, and finally links everything into an ELF file with collect2. collect2 is a wrapper around ld. It takes our object file and five additional object files to create the final binary image:



    /usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crt1.o
    /usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crti.o
    /usr/lib/gcc/i686-redhat-linux/4.8.3/crtbegin.o
    /tmp/ccEZwSgF.o    <- our object file
    /usr/lib/gcc/i686-redhat-linux/4.8.3/crtend.o
    /usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crtn.o


All of this is quite complicated, so instead of writing my own script I will edit the default linker script. You can get it by passing -Wl,-verbose to gcc:



 gcc big_alloc.c -o big_alloc -Wl,-verbose 


Now let's think about how to change it. First, let's see how the binary is laid out by default: compile it and look up the address of the .data section. Here is the output of objdump -h big_alloc:



    Sections:
    Idx Name          Size      VMA       LMA       File off  Algn
    ...
     12 .text         000002e4  080483e0  080483e0  000003e0  2**4
                      CONTENTS, ALLOC, LOAD, READONLY, CODE
    ...
     23 .data         00000004  0804a028  0804a028  00001028  2**2
                      CONTENTS, ALLOC, LOAD, DATA
     24 .bss          00000004  0804a02c  0804a02c  0000102c  2**2
                      ALLOC


The .text, .data and .bss sections are located at around 128 MiB.



Let's see where the stack is with gdb:



    [restrict-memory]$ gdb big_alloc
    ...
    Reading symbols from big_alloc...done.
    (gdb) break main
    Breakpoint 1 at 0x80484fa: file big_alloc.c, line 12.
    (gdb) r
    Starting program: /home/avd/dev/restrict-memory/big_alloc

    Breakpoint 1, main (argc=1, argv=0xbffff164) at big_alloc.c:12
    12          int i = 0;
    Missing separate debuginfos, use: debuginfo-install glibc-2.18-16.fc20.i686
    (gdb) info registers
    eax            0x1          1
    ecx            0x9a8fc98f   -1701852785
    edx            0xbffff0f4   -1073745676
    ebx            0x42427000   1111650304
    esp            0xbffff0a0   0xbffff0a0
    ebp            0xbffff0c8   0xbffff0c8
    esi            0x0          0
    edi            0x0          0
    eip            0x80484fa    0x80484fa <main+10>
    eflags         0x286        [ PF SF IF ]
    cs             0x73         115
    ss             0x7b         123
    ds             0x7b         123
    es             0x7b         123
    fs             0x0          0
    gs             0x33         51


esp points to 0xbffff0a0, which is just under 3 GiB. So we have a heap of about 2.9 GiB.



In the real world the top of the stack is randomized; you can see that, for example, in the output of:



 # cat /proc/self/maps 


As we know, the heap grows from the end of .data towards the stack. What if we move the .data section as high as possible?



Let's place the data segment 2 MiB below the stack. We take the top of the stack and subtract 2 MiB:



0xbffff0a0 - 0x200000 = 0xbfdff0a0



Shift all sections starting with .data to this address:



    . = 0xbfdff0a0;
    .data :
    {
      *(.data .data.* .gnu.linkonce.d.*)
      SORT(CONSTRUCTORS)
    }
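The address arithmetic used in this section can be double-checked with a few lines of C. This is only a sketch of the calculation, not part of the article's repository: the constants are copied from the gdb and objdump output above, and the helper names heap_gap and below_stack are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses taken from the gdb and objdump output (32-bit process). */
#define STACK_TOP 0xbffff0a0u /* value of esp at the breakpoint in main */
#define BSS_START 0x0804a02cu /* VMA of .bss in the default layout */
#define BSS_SIZE  0x00000004u /* size of .bss */
#define MIB       (1024u * 1024u)

/* Number of bytes between the end of .bss and the stack pointer. */
uint32_t heap_gap(uint32_t bss_start, uint32_t bss_size, uint32_t stack_top)
{
    assert(stack_top > bss_start + bss_size);
    return stack_top - (bss_start + bss_size);
}

/* Address that lies `distance` bytes below the stack pointer. */
uint32_t below_stack(uint32_t stack_top, uint32_t distance)
{
    assert(distance <= stack_top);
    return stack_top - distance;
}
```

With these values, heap_gap(BSS_START, BSS_SIZE, STACK_TOP) is 0xb7fb5070 bytes, or 2943 whole MiB, which agrees with the estimate of about 2.9 GiB, and below_stack(STACK_TOP, 2 * MIB) is 0xbfdff0a0, the address used in the modified script.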


Compile:



 $ gcc big_alloc.c -o big_alloc -Wl,-T hack.lst 


The -Wl,-T hack.lst options make the linker use hack.lst as its linker script.



Look at the section headers:



    Sections:
    Idx Name          Size      VMA       LMA       File off  Algn
    ...
     23 .data         00000004  bfdff0a0  bfdff0a0  000010a0  2**2
                      CONTENTS, ALLOC, LOAD, DATA
     24 .bss          00000004  bfdff0a4  bfdff0a4  000010a4  2**2
                      ALLOC


And still the memory gets allocated. How? When I looked at the pointers returned by malloc, I saw that allocation starts somewhere after the end of the section. The pointers start at addresses like 0xbf8b7000, keep increasing for a while, and then jump back to lower addresses like 0xb5e76000. It looks like the heap is growing downward.



If you think about it, there is nothing strange here. I checked the glibc source and found that when brk fails, mmap is used instead. So glibc asks the kernel to map some pages, the kernel sees that the process has plenty of holes in its virtual address space and maps the pages into one of them, and glibc then returns a pointer from that region.



Running big_alloc under strace confirmed the theory. Look at the normal binary:



    brk(0) = 0x8135000
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77df000
    mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77c7000
    mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
    mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
    mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77c6000
    mprotect(0x42425000, 8192, PROT_READ) = 0
    mprotect(0x8049000, 4096, PROT_READ) = 0
    mprotect(0x42269000, 4096, PROT_READ) = 0
    munmap(0xb77c7000, 95800) = 0
    brk(0) = 0x8135000
    brk(0x8156000) = 0x8156000
    brk(0) = 0x8156000
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77de000
    brk(0) = 0x8156000
    brk(0x8188000) = 0x8188000
    brk(0) = 0x8188000
    brk(0x81ba000) = 0x81ba000
    brk(0) = 0x81ba000
    brk(0x81ec000) = 0x81ec000
    ...
    brk(0) = 0x9c19000
    brk(0x9c4b000) = 0x9c4b000
    brk(0) = 0x9c4b000
    brk(0x9c7d000) = 0x9c7d000
    brk(0) = 0x9c7d000
    brk(0x9caf000) = 0x9caf000
    ...
    brk(0) = 0xe29c000
    brk(0xe2ce000) = 0xe2ce000
    brk(0) = 0xe2ce000
    brk(0xe300000) = 0xe300000
    brk(0) = 0xe300000
    brk(0) = 0xe300000
    brk(0x8156000) = 0x8156000
    brk(0) = 0x8156000
    +++ exited with 0 +++


And now on the modified:



    brk(0) = 0xbf896000
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778f000
    mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7777000
    mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
    mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
    mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7776000
    mprotect(0x42425000, 8192, PROT_READ) = 0
    mprotect(0x8049000, 4096, PROT_READ) = 0
    mprotect(0x42269000, 4096, PROT_READ) = 0
    munmap(0xb7777000, 95800) = 0
    brk(0) = 0xbf896000
    brk(0xbf8b7000) = 0xbf8b7000
    brk(0) = 0xbf8b7000
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778e000
    brk(0) = 0xbf8b7000
    brk(0xbf8e9000) = 0xbf8e9000
    brk(0) = 0xbf8e9000
    brk(0xbf91b000) = 0xbf91b000
    brk(0) = 0xbf91b000
    brk(0xbf94d000) = 0xbf94d000
    brk(0) = 0xbf94d000
    brk(0xbf97f000) = 0xbf97f000
    ...
    brk(0) = 0xbff8e000
    brk(0xbffc0000) = 0xbffc0000
    brk(0) = 0xbffc0000
    brk(0xbfff2000) = 0xbffc0000
    mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7676000
    brk(0) = 0xbffc0000
    brk(0xbfffa000) = 0xbffc0000
    mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7576000
    brk(0) = 0xbffc0000
    brk(0xbfffa000) = 0xbffc0000
    mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7476000
    brk(0) = 0xbffc0000
    brk(0xbfffa000) = 0xbffc0000
    mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7376000
    ...
    brk(0) = 0xbffc0000
    brk(0xbfffa000) = 0xbffc0000
    mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1c76000
    brk(0) = 0xbffc0000
    brk(0xbfffa000) = 0xbffc0000
    mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1b76000
    brk(0) = 0xbffc0000
    brk(0xbfffa000) = 0xbffc0000
    mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1a76000
    brk(0) = 0xbffc0000
    brk(0) = 0xbffc0000
    brk(0) = 0xbffc0000
    ...
    brk(0) = 0xbffc0000
    brk(0) = 0xbffc0000
    brk(0) = 0xbffc0000
    +++ exited with 0 +++


Moving the .data section up toward the stack in order to shrink the heap space is pointless, because the kernel will simply map pages into free space elsewhere.
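The brk-then-mmap fallback seen in the strace output can be imitated in a few lines. The following is a toy sketch under my own assumptions, not glibc's actual code: grab_memory and the source enum are hypothetical names, and a real allocator does far more bookkeeping. It first tries to move the program break with sbrk and, only if that fails, asks the kernel for an anonymous mapping.

```c
#define _DEFAULT_SOURCE /* for sbrk and MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Where the memory came from (hypothetical names for this sketch). */
enum source { FROM_NONE, FROM_BRK, FROM_MMAP };

/* Try to extend the heap via the program break first; if the kernel
 * refuses, fall back to an anonymous private mapping placed wherever
 * the kernel finds a free hole in the address space. */
void *grab_memory(size_t bytes, enum source *src)
{
    assert(src != NULL);
    assert(bytes > 0 && bytes <= (size_t)INTPTR_MAX);

    void *p = sbrk((intptr_t)bytes);
    if (p != (void *)-1) {
        *src = FROM_BRK;
        return p;
    }

    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) {
        *src = FROM_MMAP;
        return p;
    }

    *src = FROM_NONE;
    return NULL;
}
```

In an ordinary process the first branch normally wins; in the relinked big_alloc the break hits the stack region after a few steps, so every later request would take the mmap branch, which is why squeezing the brk area alone does not cap the heap.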



Sandbox



Another way to limit a program's memory is sandboxing. Unlike emulation, we do not emulate anything; we simply track and control certain aspects of the program's behavior. It is commonly used in security research, where malware is isolated and analyzed so that it cannot harm your system.



Trick with LD_PRELOAD


LD_PRELOAD is a special environment variable that makes the dynamic linker give priority to the preloaded libraries, even over libc. By the way, some malware uses this trick too.



I wrote a simple sandbox that intercepts malloc/free calls, keeps track of memory usage, and returns ENOMEM once the limit is reached.



To do this, I built a shared library with my own wrappers around malloc/free that increase a counter by the requested size on malloc and decrease it on free. The library is preloaded via LD_PRELOAD.



My implementation of malloc is:



    void *malloc(size_t size)
    {
        void *p = NULL;

        if (libc_malloc == NULL)
            save_libc_malloc();

        if (mem_allocated <= MEM_THRESHOLD)
        {
            p = libc_malloc(size);
        }
        else
        {
            errno = ENOMEM;
            return NULL;
        }

        if (!no_hook)
        {
            no_hook = 1;
            account(p, size);
            no_hook = 0;
        }

        return p;
    }


libc_malloc is a pointer to the original malloc from libc. no_hook is a thread-local flag. It is used to allow calling malloc inside the hooks while avoiding recursive calls.



malloc is called implicitly inside the account function by the uthash library. Why a hash table? Because free receives only a pointer, and inside free there is no way to know how much memory was allocated for it. So we keep a table with pointers as keys and allocation sizes as values. This is what I do in malloc:



    struct malloc_item *item, *out;

    item = malloc(sizeof(*item));
    item->p = ptr;
    item->size = size;

    HASH_ADD_PTR(HT, p, item);

    mem_allocated += size;
    fprintf(stderr, "Alloc: %p -> %zu\n", ptr, size);


mem_allocated is the static variable that malloc compares against the limit.



Now when you call free, the following happens:



    struct malloc_item *found;

    HASH_FIND_PTR(HT, &ptr, found);
    if (found)
    {
        mem_allocated -= found->size;
        fprintf(stderr, "Free: %p -> %zu\n", found->p, found->size);
        HASH_DEL(HT, found);
        free(found);
    }
    else
    {
        fprintf(stderr, "Freeing unaccounted allocation %p\n", ptr);
    }


Yes, just reduce mem_allocated.



And the coolest thing is that it works.



    [restrict-memory]$ LD_PRELOAD=./libmemrestrict.so ./big_alloc
    pp[0] = 0x25ac210
    pp[1] = 0x25c5270
    pp[2] = 0x25de2d0
    pp[3] = 0x25f7330
    pp[4] = 0x2610390
    pp[5] = 0x26293f0
    pp[6] = 0x2642450
    pp[7] = 0x265b4b0
    pp[8] = 0x2674510
    pp[9] = 0x268d570
    pp[10] = 0x26a65d0
    pp[11] = 0x26bf630
    pp[12] = 0x26d8690
    pp[13] = 0x26f16f0
    pp[14] = 0x270a750
    pp[15] = 0x27237b0
    pp[16] = 0x273c810
    pp[17] = 0x2755870
    pp[18] = 0x276e8d0
    pp[19] = 0x2787930
    pp[20] = 0x27a0990
    malloc: Cannot allocate memory
    Failed after 21 allocations


The full library code is on GitHub.



It turns out that LD_PRELOAD is a great way to limit memory.
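The accounting idea behind the library, a table mapping each pointer to its size so that free knows how much to subtract, can be shown without uthash. The sketch below is a simplified stand-in and not the library's code: a small fixed array with linear search replaces the hash table, and the names SLOTS, account_alloc, account_free and accounted_bytes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 1024 /* hypothetical capacity of the tracking table */

struct slot {
    void  *p;    /* tracked pointer, NULL when the slot is free */
    size_t size; /* number of bytes recorded for that pointer */
};

static struct slot table[SLOTS];
static size_t mem_allocated; /* running total, as in the article */

/* Record an allocation. Returns 0 on success, -1 if the table is full. */
int account_alloc(void *p, size_t size)
{
    assert(p != NULL);
    for (size_t i = 0; i < SLOTS; i++) {
        if (table[i].p == NULL) {
            table[i].p = p;
            table[i].size = size;
            mem_allocated += size;
            return 0;
        }
    }
    return -1;
}

/* Forget an allocation. Returns the size that was recorded for it,
 * or 0 if the pointer is NULL or was never accounted. */
size_t account_free(void *p)
{
    if (p == NULL)
        return 0;
    for (size_t i = 0; i < SLOTS; i++) {
        if (table[i].p == p) {
            size_t size = table[i].size;
            table[i].p = NULL;
            table[i].size = 0;
            mem_allocated -= size;
            return size;
        }
    }
    return 0;
}

/* Total number of bytes currently accounted. */
size_t accounted_bytes(void)
{
    return mem_allocated;
}
```

A malloc wrapper would compare accounted_bytes() against its threshold before calling the real allocator. The linear scan keeps the sketch short; a hash table such as uthash makes the lookup in free constant-time, which matters once thousands of blocks are live.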



ptrace


ptrace is another way to build a sandbox. It is a system call that lets you control the execution of another process. It is built into various POSIX-like operating systems.



It is the foundation of tracers such as strace and ltrace, of nearly every sandboxing tool (systrace, sydbox, mbox), and of debuggers, including gdb.



I made my tool with ptrace. It tracks the calls to brk and measures the distance between the initial break value and the new one, which is set by the next call to brk.



The program forks, so there are two processes. The parent is the tracer and the child is the tracee. In the child, I call ptrace(PTRACE_TRACEME) and then execv. In the parent, I use ptrace(PTRACE_SYSCALL) to stop at each syscall and pick out the child's calls to brk, and then another ptrace(PTRACE_SYSCALL) to get the value brk returned.
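The fork, PTRACE_TRACEME, execv and PTRACE_SYSCALL sequence described above can be sketched as a small tracer skeleton. This is an illustration under my own assumptions, not the article's tool: trace_syscalls is a hypothetical name, and the skeleton only counts ptrace stops instead of reading registers, which keeps it independent of the ABI. It assumes Linux and an environment where ptrace is permitted.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] under ptrace and count SIGTRAP stops (the stop after exec
 * plus one stop at every syscall entry and exit). Returns the child's
 * exit status, or -1 on error or abnormal termination. */
int trace_syscalls(char *const argv[], long *stops)
{
    assert(argv != NULL && argv[0] != NULL && stops != NULL);
    *stops = 0;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become the tracee, then replace the process image. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execv(argv[0], argv);
        _exit(127); /* reached only if execv failed */
    }

    /* Parent: the tracer. The first stop arrives right after execv. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    while (WIFSTOPPED(status)) {
        int sig = WSTOPSIG(status);
        if (sig == SIGTRAP) {
            (*stops)++; /* a real tool would inspect registers here */
            sig = 0;    /* do not deliver the trap to the child */
        }
        /* Resume until the next syscall boundary, forwarding any
         * other signal the child was stopped with. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)sig) < 0)
            return -1;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
    }

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, tracing `/bin/sh -c "exit 0"` should return 0 with a stop count well above one. A tool like the one described here would, at each stop, fetch the registers, check whether the syscall is brk, and rewrite the return value once the threshold is exceeded.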



When the break grows beyond the specified threshold, I set -ENOMEM as brk's return value. The return value lives in the eax register, so I simply overwrite it with ptrace(PTRACE_SETREGS). Here is the tastiest part:



    // Get the value returned by brk
    if (!syscall_trace(pid, &state))
    {
        dbg("brk return: 0x%08X, brk_start 0x%08X\n", state.eax, brk_start);

        if (brk_start) // We have start of brk
        {
            diff = state.eax - brk_start;

            // If the child has exceeded the threshold,
            // replace the brk return value with -ENOMEM
            if (diff > THRESHOLD || threshold)
            {
                dbg("THRESHOLD!\n");
                threshold = true;
                state.eax = -ENOMEM;
                ptrace(PTRACE_SETREGS, pid, 0, &state);
            }
            else
            {
                dbg("diff 0x%08X\n", diff);
            }
        }
        else
        {
            dbg("Assigning 0x%08X to brk_start\n", state.eax);
            brk_start = state.eax;
        }
    }


I also intercept calls to mmap/mmap2, since libc is smart enough to fall back to them when brk runs into trouble. So once the threshold has been exceeded and I see a call to mmap, I fail it with ENOMEM.



Works!



    [restrict-memory]$ ./ptrace-restrict ./big_alloc
    pp[0] = 0x8958fb0
    pp[1] = 0x8971fb8
    pp[2] = 0x898afc0
    pp[3] = 0x89a3fc8
    pp[4] = 0x89bcfd0
    pp[5] = 0x89d5fd8
    pp[6] = 0x89eefe0
    pp[7] = 0x8a07fe8
    pp[8] = 0x8a20ff0
    pp[9] = 0x8a39ff8
    pp[10] = 0x8a53000
    pp[11] = 0x8a6c008
    pp[12] = 0x8a85010
    pp[13] = 0x8a9e018
    pp[14] = 0x8ab7020
    pp[15] = 0x8ad0028
    pp[16] = 0x8ae9030
    pp[17] = 0x8b02038
    pp[18] = 0x8b1b040
    pp[19] = 0x8b34048
    pp[20] = 0x8b4d050
    malloc: Cannot allocate memory
    Failed after 21 allocations


But I do not like it. It is tied to the ABI: on a 64-bit machine you have to use rax instead of eax, so you need either a separate version, or #ifdef, or the -m32 option. And most likely it will not work on other POSIX-like systems, which may have a different ABI.



Other ways



What else can you try (these options were rejected for various reasons):





Links



Source: https://habr.com/ru/post/266083/


