
How I reduced the Docker image by 98.8% with fanotify

I offer readers of Habrahabr a translation of the post "How I shrunk a Docker image by 98.8% - featuring fanotify".

A few weeks ago, I gave an internal Docker talk. During the presentation, one of the administrators asked a seemingly simple question: "Is there something like a weight-loss program for Docker images?"

You can find some quite reasonable approaches to this problem on the Internet, like deleting cache directories and temporary files, or trimming redundant packages, if not the whole image. But if you think about it, do we really need a fully working Linux system? Which files do we actually need in a given image? For a Go binary I found a radical and quite effective answer: compile it statically, with almost no external dependencies. The final image is 6.12 MB.

Yeah! But is there a chance to do something similar with any other application?
It turns out a similar approach may work in general. The idea is simple: profile the image at runtime, one way or another, to determine which files were accessed / opened / ..., then delete all the files that were never touched. Hmm, sounds promising; let's write a PoC for this idea.

Initial data



/bin/ls is a good example: simple enough to test the idea without pitfalls, yet not entirely trivial, since it uses dynamic linking.

Now that we have a goal, let's pick the tool. The main idea is to monitor file access events, be it stat or open. There are a couple of good candidates. We could use inotify, but it requires one watch per file, which would eventually leave us with an enormous pile of watches. We could use LD_PRELOAD, but, first, using it brings me no joy personally; second, it does not intercept system calls directly; and third, it does not work for statically linked applications (did someone say Go?). A solution that works even for statically linked applications would be ptrace, tracing system calls in real time. It has its own setup subtleties, but it would be a reliable and flexible solution. A lesser-known option is the fanotify system call, and, as the title of this article has already given away, that is what we will use.

fanotify was originally created as a "decent" mechanism for anti-virus vendors to intercept file system events, potentially on an entire mount point at a time. Sounds familiar? While it can be used to deny access, it can also simply perform non-blocking monitoring of file accesses, potentially dropping events if the kernel queue overflows. In that case, a special message is generated to notify the user-space listener about the loss. This is exactly what we need: unobtrusive, covers an entire mount point at a time, and easy to set up (well, assuming you manage to find the documentation, of course...). This may seem like a small thing, but it is really important, as I learned later.

It is very easy to use.


  1. Initialize fanotify in FAN_CLASS_NOTIF mode using the fanotify_init system call:

    // Open ``fan`` fd for fanotify notifications. Messages will embed a
    // filedescriptor on accessed file. Expect it to be read-only
    fan = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

  2. Subscribe to FAN_ACCESS and FAN_OPEN events on the "/" mount point using the fanotify_mark system call with FAN_MARK_MOUNT:

    // Watch open/access events on root mountpoint
    fanotify_mark(
        fan,
        FAN_MARK_ADD | FAN_MARK_MOUNT, // Add mountpoint mark to fan
        FAN_ACCESS | FAN_OPEN,         // Report open and access events, non blocking
        -1, "/"                        // Watch root mountpoint (-1 is ignored for FAN_MARK_MOUNT type calls)
    );

  3. Read messages from the file descriptor returned by fanotify_init and iterate over them using FAN_EVENT_NEXT:

    // Read pending events from ``fan`` into ``buf``
    buflen = read(fan, buf, sizeof(buf));
    // Position cursor on first message
    metadata = (struct fanotify_event_metadata*)&buf;
    // Loop until we reached the last event
    while(FAN_EVENT_OK(metadata, buflen)) {
        // Do something interesting with the notification
        // ``metadata->fd`` will contain a valid, RO fd to accessed file.
        // Close opened fd, otherwise we'll quickly exhaust the fd pool.
        close(metadata->fd);
        // Move to next event in buffer
        metadata = FAN_EVENT_NEXT(metadata, buflen);
    }

Finally, we print the full path of each file that was accessed and add queue-overflow detection. For our purposes this should be quite enough (most of the error handling is omitted for readability).

 #include <fcntl.h>
 #include <limits.h>
 #include <stdio.h>
 #include <sys/fanotify.h>
 #include <unistd.h>

 int main(int argc, char** argv)
 {
     int fan;
     char buf[4096];
     char fdpath[32];
     char path[PATH_MAX + 1];
     ssize_t buflen, linklen;
     struct fanotify_event_metadata *metadata;

     // Init fanotify structure
     fan = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

     // Watch open/access events on root mountpoint
     fanotify_mark(
         fan,
         FAN_MARK_ADD | FAN_MARK_MOUNT,
         FAN_ACCESS | FAN_OPEN,
         -1, "/"
     );

     while(1) {
         buflen = read(fan, buf, sizeof(buf));
         metadata = (struct fanotify_event_metadata*)&buf;
         while(FAN_EVENT_OK(metadata, buflen)) {
             if (metadata->mask & FAN_Q_OVERFLOW) {
                 printf("Queue overflow!\n");
                 // Overflow events carry no valid fd; skip to the next one
                 metadata = FAN_EVENT_NEXT(metadata, buflen);
                 continue;
             }
             // Resolve path, using automatically opened fd
             sprintf(fdpath, "/proc/self/fd/%d", metadata->fd);
             linklen = readlink(fdpath, path, sizeof(path) - 1);
             path[linklen] = '\0';
             printf("%s\n", path);
             close(metadata->fd);
             metadata = FAN_EVENT_NEXT(metadata, buflen);
         }
     }
 }

Build it:

 gcc main.c --static -o fanotify-profiler 

Roughly speaking, we now have a tool for monitoring all file accesses on the active "/" mount point in real time. Great.

What's next? Let's create an Ubuntu container, start our monitor, and run /bin/ls. fanotify requires the CAP_SYS_ADMIN capability, which is basically a "catch-all" root capability. Still, it is better than running with --privileged.

 # Run image
 docker run --name profiler_ls \
     --volume $PWD:/src \
     --cap-add SYS_ADMIN \
     -it ubuntu /src/fanotify-profiler

 # Run the command to profile, from another shell
 docker exec -it profiler_ls ls

 # Interrupt the running image
 docker kill profiler_ls # You know, the "dynamite"

The profiler's output:

 /etc/passwd
 /etc/group
 /etc/passwd
 /etc/group
 /bin/ls
 /bin/ls
 /bin/ls
 /lib/x86_64-linux-gnu/ld-2.19.so
 /lib/x86_64-linux-gnu/ld-2.19.so
 /etc/ld.so.cache
 /lib/x86_64-linux-gnu/libselinux.so.1
 /lib/x86_64-linux-gnu/libacl.so.1.1.0
 /lib/x86_64-linux-gnu/libc-2.19.so
 /lib/x86_64-linux-gnu/libc-2.19.so
 /lib/x86_64-linux-gnu/libpcre.so.3.13.1
 /lib/x86_64-linux-gnu/libdl-2.19.so
 /lib/x86_64-linux-gnu/libdl-2.19.so
 /lib/x86_64-linux-gnu/libattr.so.1.1.0

Perfect! It worked. Now we know exactly what is needed to run /bin/ls. So we just copy all of it into a FROM scratch Docker image, and we're done.

Except it doesn't quite work out... But let's not get ahead of ourselves; one step at a time.

 # Export base docker image
 mkdir ubuntu_base
 docker export profiler_ls | sudo tar -x -C ubuntu_base

 # Create new image
 mkdir ubuntu_lean

 # Get the linker (trust me)
 sudo mkdir -p ubuntu_lean/lib64
 sudo cp -a ubuntu_base/lib64/ld-linux-x86-64.so.2 ubuntu_lean/lib64/

 # Copy the files
 sudo mkdir -p ubuntu_lean/etc
 sudo mkdir -p ubuntu_lean/bin
 sudo mkdir -p ubuntu_lean/lib/x86_64-linux-gnu/
 sudo cp -a ubuntu_base/bin/ls ubuntu_lean/bin/ls
 sudo cp -a ubuntu_base/etc/group ubuntu_lean/etc/group
 sudo cp -a ubuntu_base/etc/passwd ubuntu_lean/etc/passwd
 sudo cp -a ubuntu_base/etc/ld.so.cache ubuntu_lean/etc/ld.so.cache
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/ld-2.19.so ubuntu_lean/lib/x86_64-linux-gnu/ld-2.19.so
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libselinux.so.1 ubuntu_lean/lib/x86_64-linux-gnu/libselinux.so.1
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libacl.so.1.1.0 ubuntu_lean/lib/x86_64-linux-gnu/libacl.so.1.1.0
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libc-2.19.so ubuntu_lean/lib/x86_64-linux-gnu/libc-2.19.so
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libpcre.so.3.13.1 ubuntu_lean/lib/x86_64-linux-gnu/libpcre.so.3.13.1
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libdl-2.19.so ubuntu_lean/lib/x86_64-linux-gnu/libdl-2.19.so
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libattr.so.1.1.0 ubuntu_lean/lib/x86_64-linux-gnu/libattr.so.1.1.0

 # Import it back to Docker
 cd ubuntu_lean
 sudo tar -c . | docker import - ubuntu_lean

Run our image:

 docker run --rm -it ubuntu_lean /bin/ls 

As a result, we obtain:

 # If you did not trust me with the linker (as it was already loaded when
 # the profiler started, it does not show in the output)
 no such file or directory
 FATA[0000] Error response from daemon: Cannot start container f318adb174a9e381500431370a245275196a2948828919205524edc107626d78: no such file or directory

 # Otherwise
 /bin/ls: error while loading shared libraries: libacl.so.1: cannot open

Yeah. So what went wrong? Remember, I mentioned that this system call was originally designed for antivirus software? A real-time antivirus has to detect access to a file, scan it, and make a decision based on the result. What matters there is the content of the file; in particular, file-system race conditions must be handled at all costs. This is why fanotify hands out file descriptors instead of the paths that were accessed. The physical file path is recovered by probing /proc/self/fd/[fd]. As a consequence, it cannot tell which symbolic link was accessed, only the file it points to.

To make this work, we need to find all links to the files reported by fanotify and recreate them in the trimmed-down image as well. The find command will help us with this.

 # Find all files referring to a given one
 find -L -samefile "./lib/x86_64-linux-gnu/libacl.so.1.1.0" 2>/dev/null

 # If you want to exclude the target itself from the results
 find -L -samefile "./lib/x86_64-linux-gnu/libacl.so.1.1.0" -a ! -path "./lib/x86_64-linux-gnu/libacl.so.1.1.0" 2>/dev/null

This can be easily automated by looping:

 for f in $(cd ubuntu_lean; find); do
     (
         cd ubuntu_base
         find -L -samefile "$f" -a ! -path "$f"
     ) 2>/dev/null
 done

This gives us the list of missing symbolic links. All of them are libraries:

 ./lib/x86_64-linux-gnu/libc.so.6
 ./lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
 ./lib/x86_64-linux-gnu/libattr.so.1
 ./lib/x86_64-linux-gnu/libdl.so.2
 ./lib/x86_64-linux-gnu/libpcre.so.3
 ./lib/x86_64-linux-gnu/libacl.so.1

Now let's copy them from the original image and re-create the resulting image.

 # Copy the links
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libc.so.6 ubuntu_lean/lib/x86_64-linux-gnu/libc.so.6
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 ubuntu_lean/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libdl.so.2 ubuntu_lean/lib/x86_64-linux-gnu/libdl.so.2
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libpcre.so.3 ubuntu_lean/lib/x86_64-linux-gnu/libpcre.so.3
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libacl.so.1 ubuntu_lean/lib/x86_64-linux-gnu/libacl.so.1
 sudo cp -a ubuntu_base/lib/x86_64-linux-gnu/libattr.so.1 ubuntu_lean/lib/x86_64-linux-gnu/libattr.so.1

 # Import it back to Docker
 cd ubuntu_lean
 docker rmi -f ubuntu_lean
 sudo tar -c . | docker import - ubuntu_lean

Important note: this method is limited. For example, it will not find links to links, nor absolute links. The latter would require at least a chroot, or running the search from inside the source image, provided find or an alternative is present in it.

Run the resulting image:

 docker run --rm -it ubuntu_lean /bin/ls 

Now everything works:
 bin dev etc lib lib64 proc sys 

The result


ubuntu : 209MB
ubuntu_lean : 2.5MB

As a result, we got an image 83.5 times smaller. That is a 98.8% reduction.

Afterword


Like all profiling-based methods, it can only tell you what is actually done/used in a given scenario. For example, try running /bin/ls -l in the final image and see for yourself.
Spoiler for the lazy
It does not work. Well, it runs, but not as expected.

The profiling technique itself is not flawless. It tells you which file was opened, but not how it was opened. This is a problem for symbolic links, particularly cross-filesystem (read: cross-volume) ones. With fanotify we lose the original symbolic link and break the application.

If I had to build such a "shrinker" ready for production use, I would most likely go with ptrace.

Notes


  1. I admit it: mostly I just wanted to experiment with system calls. Docker images were a good excuse;
  2. Actually, the queue-size restriction could be bypassed by passing FAN_UNLIMITED_QUEUE to fanotify_init, provided the calling process has at least CAP_SYS_ADMIN;
  3. It is also 2.4 times smaller than the 6.13 MB image I mentioned at the beginning of this article, but the comparison is not fair.

Source: https://habr.com/ru/post/259021/

