* The link to the library is at the end of the article. The article itself outlines the mechanisms implemented in the library in medium detail. The macOS implementation is not yet complete, but it differs little from the Linux one; this article mainly describes the Linux implementation.
Browsing GitHub one Saturday afternoon, I came across a library that implements on-the-fly reloading of C++ code on Windows. I moved off Windows a few years ago without a shred of regret, and nowadays all my programming happens either on Linux (at home) or macOS (at work). A little googling showed that the approach from that library is quite popular, and MSVC uses the same technique for the "Edit and Continue" feature in Visual Studio. The only problem was that I could not find a single implementation for non-Windows platforms (did I not look hard enough?). I asked the author of the library above whether he would port it to other platforms; the answer was negative.
I will say right away that I was only interested in an approach that would not require changing existing project code (unlike, for example, RCCPP or cr, where all potentially reloadable code must live in a separate dynamically loaded library).
"How so?" - I thought, and began to smoke incense.
I mostly do gamedev. Most of my working time is spent writing game logic and tweaking visuals. In addition, I use imgui for auxiliary tools. My code cycle, as you have probably guessed, is Write -> Compile -> Run -> Repeat. Everything happens pretty quickly (incremental builds, ccache, and so on). The problem is that this cycle has to be repeated very often. For example, say I am writing a new game mechanic, a jump; a nice, controllable jump:
1. Wrote a draft impulse-based implementation, built, launched. Noticed that I accidentally apply the impulse every frame instead of just once.
2. Fixed, built, launched; fine now. But the magnitude of the impulse should be bigger.
3. Fixed, built, launched, works. But it somehow feels wrong. Should try a force-based approach.
4. Wrote a draft force-based implementation, built, launched, works. Just need to also change the instantaneous velocity at the moment of the jump.
...
10. Fixed, built, launched, works. But it's still not right. Probably need to try an implementation based on changing gravityScale.
...
20. Great, looks good! Now expose all the parameters in the editor for the game designers, test, and commit.
...
30. The jump is done.
And at each iteration you need to rebuild the code and then, in the running application, get to the place where you can jump. That alone usually takes at least 10 seconds. And what if I can only jump in open areas that I still have to reach? And what if I need to be able to jump onto blocks N units high? Then I also have to put together a test scene, which needs debugging of its own and takes extra time. For iterations like these, hot code reloading would be ideal. Of course, it is not a panacea, it is far from suitable for everything, and sometimes even after a reload you need to recreate part of the game world, which must be taken into account. But for many tasks it is useful and can save a lot of time and concentration.
This was the minimum set of requirements the implementation had to satisfy. Looking ahead, a few things were implemented beyond that.
Until that point I had been very far from this problem domain, so I had to gather and absorb information from scratch.
At a high level, the mechanism consists of several pieces, which are described one by one below.
Let's start with the most interesting thing - the mechanism for reloading functions.
There are three more or less popular ways of replacing functions at (or almost at) runtime. The classic example: write your own strcpy and make the application pick up your version instead of the library one at startup. The first two options are obviously not suitable, since they work only with exported functions, and we do not want to mark all the functions of our application with special attributes. Therefore, function hooking is our option!
In short, hooking works like this: the beginning of the old function is overwritten with a jump to the new function, so every call to the old function ends up executing the new code.

MSVC has the /hotpatch and /FUNCTIONPADMIN flags. The first writes 2 bytes at the beginning of each function that do nothing, so that they can later be overwritten with a "short jump". The second leaves empty space before the body of each function in the form of nop instructions for a "long jump" to the desired place; this way, in 2 jumps you can get from the old function to the new one. You can read more about how this is implemented in Windows and MSVC, for example, here.

Unfortunately, there is nothing similar in clang and gcc (at least on Linux and macOS). In fact, this is not a big problem: we will write directly over the beginning of the old function. In doing so, we risk trouble if our application is multi-threaded. If normally, in a multi-threaded environment, we restrict access to data from one thread while another thread modifies it, here we need to prevent one thread from executing code while another thread modifies that code. I have not figured out how to do this, so the implementation will behave unpredictably in a multi-threaded environment.
There is one subtle point. On a 32-bit system, 5 bytes are enough for us to jump anywhere. On a 64-bit system, if we do not want to clobber registers, we need 14 bytes. The catch is that 14 bytes is quite a lot at machine-code scale, and if the code contains a stub function with an empty body, it will likely be shorter than 14 bytes. I do not know the whole truth, but I spent some time in the disassembler while thinking, writing, and debugging this code, and I noticed that all functions are aligned on a 16-byte boundary (in a debug build without optimizations; I am not sure about optimized code). That means there are at least 16 bytes between the beginnings of any two functions, which is enough for us to hook them. Superficial googling led here, but I do not know for sure whether I just got lucky or all compilers do this today. In any case, if in doubt, simply declare a couple of variables at the beginning of a stub function so that it becomes large enough.
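As an illustration of what "writing on top of the old function" can look like on x86-64 Linux, here is a minimal sketch of my own (not the library's actual code): make the code page writable with mprotect and overwrite the first 14 bytes of the function with the absolute indirect jump discussed above (the bytes FF 25 00 00 00 00, i.e. `jmp [rip+0]`, followed by the 8-byte target address).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <sys/mman.h>
#include <unistd.h>

// Overwrite the first 14 bytes of `oldFunc` with `jmp [rip+0]` plus an
// 8-byte absolute address, so every call to oldFunc lands in newFunc.
bool installHook(void* oldFunc, void* newFunc)
{
    const long pageSize = sysconf(_SC_PAGESIZE);
    auto start = reinterpret_cast<uintptr_t>(oldFunc);
    void* page = reinterpret_cast<void*>(start & ~uintptr_t(pageSize - 1));
    // Unprotect two pages in case the 14-byte patch straddles a page boundary.
    if (mprotect(page, size_t(pageSize) * 2, PROT_READ | PROT_WRITE | PROT_EXEC) != 0) {
        return false;
    }
    unsigned char patch[14] = {0xFF, 0x25, 0x00, 0x00, 0x00, 0x00}; // jmp [rip+0]
    auto target = reinterpret_cast<uint64_t>(newFunc);
    std::memcpy(patch + 6, &target, sizeof(target));
    std::memcpy(oldFunc, patch, sizeof(patch));
    return true;
}

// Two demo functions to hook (noinline so the call actually goes through
// the patched code).
__attribute__((noinline)) int oldVersion(int v) { return v * 2; }
__attribute__((noinline)) int newVersion(int v) { return v * 3; }
```

After installing the hook, a call to oldVersion executes newVersion instead. A real implementation must also worry about threads currently executing the patched function, as noted above, and about instruction-cache invalidation on non-x86 architectures.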
So, we have the first piece: a mechanism for redirecting calls from the old version of a function to the new one.
Now we need to somehow get the addresses of all (not only exported) functions of our program or of an arbitrary dynamic library. This can be done quite simply through the system API, as long as symbols have not been stripped from your application. On Linux this is the API from elf.h and link.h, on macOS from loader.h and nlist.h.

On Linux, using dl_iterate_phdr we walk over all loaded libraries and, in fact, the program itself. From the .symtab section we retrieve all the information about the symbols: the name, the type, the index of the section the symbol lives in, and its size; we also compute the symbol's "real" address from its virtual address and the library's load address.

There is one subtlety. When loading an ELF file, the system does not load the .symtab section (correct me if I am wrong), and the .dynsym section does not suit us, because through it we cannot get symbols with STV_INTERNAL and STV_HIDDEN visibility. Simply put, we would not see functions like this:
```cpp
// some_file.cpp
namespace
{
int someUsefulFunction(int value) // <-----
{
    return value * 2;
}
}
```
and such variables:
```cpp
// some_file.cpp
void someDefaultFunction()
{
    static int someVariable = 0; // <-----
    ...
}
```
So at this step we work not with the program image that dl_iterate_phdr gives us, but with the file loaded from disk and parsed with some ELF parser (or with the bare API). That way we will not miss anything. On macOS the procedure is similar, only the function names in the system API differ.
After that we filter the symbols and keep only:

- symbols of type STT_FUNC located in the .text section that have a non-zero size; this filter passes only the functions whose code is actually contained in this program or library;
- symbols of type STT_OBJECT located in the .bss section.

To reload the code, we also need to know where to get the source files and how to compile them.
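To make the filtering step above concrete, here is a simplified sketch of my own using the raw structures from elf.h: load the ELF file from disk, find the .symtab section and its string table, and keep the function symbols with non-zero size. (64-bit ELF only, minimal error handling; the .text section-index check is omitted for brevity, only the type and size are tested.)

```cpp
#include <elf.h>
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read a whole file into memory.
static std::vector<char> readFile(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Collect the names of all STT_FUNC symbols with non-zero size from .symtab.
std::vector<std::string> collectFunctions(const char* path)
{
    std::vector<std::string> result;
    std::vector<char> buf = readFile(path);
    if (buf.size() < sizeof(Elf64_Ehdr)
        || std::memcmp(buf.data(), ELFMAG, SELFMAG) != 0) {
        return result; // not an ELF file we can handle
    }
    auto* ehdr = reinterpret_cast<const Elf64_Ehdr*>(buf.data());
    auto* shdrs = reinterpret_cast<const Elf64_Shdr*>(buf.data() + ehdr->e_shoff);
    for (int i = 0; i < ehdr->e_shnum; i++) {
        if (shdrs[i].sh_type != SHT_SYMTAB)
            continue; // .dynsym is SHT_DYNSYM and lacks hidden symbols
        auto* syms = reinterpret_cast<const Elf64_Sym*>(buf.data() + shdrs[i].sh_offset);
        size_t count = shdrs[i].sh_size / sizeof(Elf64_Sym);
        // sh_link is the index of the associated string table (.strtab).
        const char* strtab = buf.data() + shdrs[shdrs[i].sh_link].sh_offset;
        for (size_t j = 0; j < count; j++) {
            if (ELF64_ST_TYPE(syms[j].st_info) == STT_FUNC && syms[j].st_size > 0)
                result.push_back(strtab + syms[j].st_name);
        }
    }
    return result;
}
```

Running this on an unstripped binary (e.g. /proc/self/exe) yields the full list of defined functions, including internal and hidden ones.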
In the first implementation I read this information from the .debug_info section, which contains debug information in the DWARF format. For each translation unit (TU), to have its compile command recorded inside the DWARF, you have to pass the -grecord-gcc-switches flag when compiling. I parsed the DWARF with the libdwarf library, which ships together with libelf. Besides the compile command, DWARF can also provide information about the dependencies of our TUs on other files. But I abandoned this implementation for several reasons:
10 seconds at application startup just for this parsing was too much. After some deliberation, I rewrote the parsing logic from DWARF to compile_commands.json. This file can be generated by simply adding set(CMAKE_EXPORT_COMPILE_COMMANDS ON) to your CMakeLists.txt. This way we get all the information we need.
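For reference, compile_commands.json (the Clang JSON compilation database format) is simply an array of entries, one per TU, like this (paths here are illustrative):

```json
[
  {
    "directory": "/home/user/project/build",
    "command": "/usr/bin/c++ -I/home/user/project/include -g -std=c++14 -o CMakeFiles/app.dir/src/main.cpp.o -c /home/user/project/src/main.cpp",
    "file": "/home/user/project/src/main.cpp"
  }
]
```

The command field is exactly the compile command we need in order to rebuild the TU later.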
Since we abandoned DWARF, we need another way to track dependencies between files. Parsing the files by hand to find includes is something you really do not want to do, and who knows more about dependencies than the compiler itself? Clang and gcc have options that generate so-called depfiles almost for free. The make and ninja build systems use these files to track dependencies between files. Depfiles have a very simple format:
```make
CMakeFiles/lib_efsw.dir/libs/efsw/src/efsw/DirectorySnapshot.cpp.o: \
 /home/ddovod/_private/_projects/jet/live/libs/efsw/src/efsw/base.hpp \
 /home/ddovod/_private/_projects/jet/live/libs/efsw/src/efsw/sophist.h \
 /home/ddovod/_private/_projects/jet/live/libs/efsw/include/efsw/efsw.hpp \
 /usr/bin/../lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/c++/7.3.0/string \
 /usr/bin/../lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/x86_64-linux-gnu/c++/7.3.0/bits/c++config.h \
 /usr/bin/../lib/gcc/x86_64-linux-gnu/7.3.0/../../../../include/x86_64-linux-gnu/c++/7.3.0/bits/os_defines.h \
 ...
```
The compiler puts these files next to the object file of each TU; we parse them and put them into a hash map. In total, parsing compile_commands.json plus the depfiles for the same 500 TUs takes a little over 1 second. For all of this to work, we need to globally add the -MD flag to the compile options of all project files.
There is one subtlety related to ninja. This build system generates depfiles for its own needs regardless of the -MD flag. But after they are generated, it translates them into its own binary format and deletes the original files. Therefore, when running ninja, you must pass the -d keepdepfile flag. Also, for reasons unknown to me, with make (and the -MD option) the file is named some_file.cpp.d, while with ninja it is called some_file.cpp.od. So you have to check for both variants.
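The parsing itself is trivial; here is a rough sketch of my own (not the library's code) that handles the backslash line continuations but ignores rarer depfile features such as escaped spaces in paths:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Parse a depfile of the form "target.o: dep1 dep2 \<newline> dep3 ...".
// Returns the list of dependency paths.
std::vector<std::string> parseDepfile(const std::string& content)
{
    // Replace each "\<newline>" line continuation with a space.
    std::string cleaned;
    for (size_t i = 0; i < content.size(); ++i) {
        if (content[i] == '\\' && i + 1 < content.size() && content[i + 1] == '\n') {
            ++i; // skip the backslash and the newline
            cleaned += ' ';
        } else {
            cleaned += content[i];
        }
    }
    // Everything after the first ':' is a whitespace-separated dependency list.
    std::vector<std::string> deps;
    size_t colon = cleaned.find(':');
    if (colon == std::string::npos)
        return deps;
    std::istringstream iss(cleaned.substr(colon + 1));
    std::string token;
    while (iss >> token)
        deps.push_back(token);
    return deps;
}
```

Each parsed dependency list then goes into the hash map keyed by the TU, as described above.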
Suppose we have code like this (a very synthetic example):
```cpp
// Singleton.hpp
class Singleton
{
public:
    static Singleton& instance();
};

int veryUsefulFunction(int value);

// Singleton.cpp
Singleton& Singleton::instance()
{
    static Singleton ins;
    return ins;
}

int veryUsefulFunction(int value)
{
    return value * 2;
}
```
We want to change the veryUsefulFunction function to:

```cpp
int veryUsefulFunction(int value)
{
    return value * 3;
}
```
When the code is reloaded, the dynamic library with the new code will contain, besides veryUsefulFunction, also the static variable static Singleton ins; and the method Singleton::instance. As a result, the program will start calling the new versions of both functions. But static ins in this library has not been initialized yet, so the first access to it will call the Singleton constructor, even though the program already has a live instance. We certainly do not want this. Therefore, the implementation transfers the values of all such variables that it finds in the compiled dynamic library from the old code into this new dynamic library, together with their guard variables.
There is one subtle and, in general, unsolvable problem.
Suppose we have a class:
```cpp
class SomeClass
{
public:
    void calledEachUpdate()
    {
        m_someVar1++;
    }

private:
    int m_someVar1 = 0;
};
```
The calledEachUpdate method is called 60 times per second. We change it by adding a new field:
```cpp
class SomeClass
{
public:
    void calledEachUpdate()
    {
        m_someVar1++;
        m_someVar2++;
    }

private:
    int m_someVar1 = 0;
    int m_someVar2 = 0;
};
```
If an instance of this class lives in dynamic memory or on the stack, the application will most likely crash after the code is reloaded. The allocated instance contains only the m_someVar1 variable, but after the reload the calledEachUpdate method will also try to modify m_someVar2, changing memory that does not actually belong to this instance, which leads to unpredictable consequences. In this case the logic of transferring state is left to the programmer, who must somehow save the object's state and delete the object before the code is reloaded, then create a new object and restore the state after the reload. For this, the library provides events in the form of the onCodePreLoad and onCodePostLoad delegate methods, which the application can handle.
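For illustration, here is how such a handler might look. The method names onCodePreLoad and onCodePostLoad are taken from the text above, but the listener type, its signatures, and everything else in this sketch are my assumptions, not the library's actual API:

```cpp
#include <memory>

// Hypothetical listener; only the two method names come from the article,
// the rest is an illustrative assumption.
struct MyReloadListener
{
    struct Player { int health = 100; };

    void onCodePreLoad()
    {
        // The layout of Player may change in the new code: save its state
        // and destroy the instance before the reload.
        savedHealth = player ? player->health : 0;
        player.reset();
    }

    void onCodePostLoad()
    {
        // Recreate the instance with the (possibly new) layout
        // and restore the saved state.
        player = std::make_unique<Player>();
        player->health = savedHealth;
    }

    std::unique_ptr<Player> player = std::make_unique<Player>();
    int savedHealth = 0;
};
```

The key point is that no object whose layout may have changed survives across the reload; only plain serialized state does.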
I do not know how (or whether it is possible at all) to resolve this situation in the general case; I will keep thinking about it. For now this case works "more or less okay" only for static variables, using the following logic:
```cpp
void* oldVarPtr = ...;
void* newVarPtr = ...;
size_t oldVarSize = ...;
size_t newVarSize = ...;
memcpy(newVarPtr, oldVarPtr, std::min(oldVarSize, newVarSize));
```
This is not quite correct, but it is the best I have come up with. As a result, the code will behave unpredictably if the set or layout of fields in data structures is changed at runtime. The same applies to polymorphic types.
How it all works together:

1. On startup, the library looks for compile_commands.json in the application directory and, recursively, in its parent directories, and extracts all the necessary information about the TUs.
2. It watches the source files; when one of them changes, it recompiles the file in the background using the command from compile_commands.json.
3. When a reload is triggered (in my case Ctrl+r is assigned to it), the library waits for the compilation processes to finish and links all the new object files into a dynamic library.
4. It loads this dynamic library with the dlopen function, transfers the static variables, and redirects the old functions to the new ones.

It works very well, especially when you know what is under the hood and what to expect, at least at a high level.
Personally, I was very surprised by the absence of such a solution for Linux. Is nobody really interested in this?

I will be glad to hear any criticism, thanks!
Source: https://habr.com/ru/post/435260/