OutOfLine - a memory placement pattern for high-performance C++ applications

While working at Headlands Technologies, I was lucky enough to write several utilities that simplify writing high-performance C++ code. This article gives a general overview of one of them, OutOfLine.

Let's start with an illustrative example. Suppose you have a system that deals with a large number of file system objects. These may be ordinary files, named UNIX sockets, or pipes. For some reason, you open a lot of file descriptors at startup, then work with them intensively, and at the end close the descriptors and remove the links to the files (translator's note: this refers to the unlink function).

The initial (simplified) version may look like this:

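    #include <fcntl.h>   // open, O_RDWR
    #include <unistd.h>  // close, unlink
    #include <string>

    class UnlinkingFD {
      std::string path;  // needed only at destruction

     public:
      int fd;  // used constantly

      UnlinkingFD(const std::string& p) : path(p) {
        fd = open(p.c_str(), O_RDWR, 0);
      }
      ~UnlinkingFD() {
        close(fd);
        unlink(path.c_str());
      }
      UnlinkingFD(const UnlinkingFD&) = delete;
    };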

This is a good, logical design. It relies on RAII to close the descriptor and remove the link automatically. You can create a large array of such objects, work with them, and when the array goes out of existence, the objects clean up after themselves.

But what about performance? Suppose fd is used very often, and path only when an object is destroyed. The array now consists of 40-byte objects, of which usually only 4 bytes are used. That means more cache misses, because the hot code has to skip over 90% of the data.

A common solution to this problem is to switch from an array of structures to a structure of arrays. This provides the desired performance, but at the cost of giving up RAII. Is there an option that combines the advantages of both approaches?
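For comparison, a structure-of-arrays layout might look like the sketch below. The name `FDTable` and its methods are hypothetical and exist only for illustration. The hot descriptors sit contiguously, but the two vectors have to be kept in sync by hand, which is the RAII convenience being given up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical structure-of-arrays layout: element i of each vector
// together describes one descriptor.
struct FDTable {
  std::vector<int> fds;            // hot: scanned on the critical path
  std::vector<std::string> paths;  // cold: only needed at cleanup

  // Adds one entry to both arrays and returns its index.
  std::size_t add(int fd, std::string path) {
    fds.push_back(fd);
    paths.push_back(std::move(path));
    return fds.size() - 1;
  }

  // Removing an entry must touch both arrays; no per-object
  // destructor does this for us.
  void remove(std::size_t i) {
    assert(i < fds.size());
    fds.erase(fds.begin() + static_cast<std::ptrdiff_t>(i));
    paths.erase(paths.begin() + static_cast<std::ptrdiff_t>(i));
  }
};
```

Any code that forgets to update one of the two vectors silently corrupts the pairing, which is exactly the kind of bookkeeping a per-object destructor normally hides.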

A simple compromise is to replace the std::string, which is 32 bytes in size, with a std::unique_ptr<std::string>, which is only 8 bytes. This shrinks our object from 40 bytes to 16, which is a big improvement. But this solution still loses to the multiple-array approach.
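The sizes quoted above depend on the platform and the standard library, so it is worth checking them on your own toolchain. A minimal sketch, assuming two hypothetical structs (`InlinePathFD` and `PointerPathFD`) that mirror the data members of the two variants:

```cpp
#include <memory>
#include <string>

// Hypothetical variants of the same object, used only to compare sizes.
struct InlinePathFD {
  std::string path;  // cold data stored inline
  int fd;            // hot data
};
struct PointerPathFD {
  std::unique_ptr<std::string> path;  // cold data behind a pointer
  int fd;                             // hot data
};

// Expected to hold on mainstream 64-bit toolchains; this is an assumption
// about the platform, not a guarantee of the C++ standard.
static_assert(sizeof(PointerPathFD) < sizeof(InlinePathFD),
              "pointer variant should be smaller");

// Even the smaller variant still carries a pointer (plus padding)
// next to every hot fd.
static_assert(sizeof(PointerPathFD) > sizeof(int),
              "hot field is not alone in the object");
```

On a typical 64-bit Linux build these come out as 16 and 40 bytes, matching the figures in the text, but other standard libraries use different std::string sizes.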

OutOfLine is a tool that lets you move rarely used (cold) fields completely outside an object without abandoning RAII. OutOfLine is used as a CRTP base class, so the first template argument must be the derived class. The second argument is the type of the rarely used (cold) data associated with the frequently used (main) object.

    struct UnlinkingFD : private OutOfLine<UnlinkingFD, std::string> {
      int fd;

      UnlinkingFD(const std::string& p) : OutOfLine<UnlinkingFD, std::string>(p) {
        fd = open(p.c_str(), O_RDWR, 0);
      }
      ~UnlinkingFD();
      UnlinkingFD(const UnlinkingFD&) = delete;
    };

So what is this class like?

 template <class FastData, class ColdData> class OutOfLine {

The basic idea of the implementation is to use a global associative container that maps pointers to main objects to pointers to the objects holding their cold data.

  inline static std::map<OutOfLine const*, std::unique_ptr<ColdData>> global_map_;

OutOfLine can be used with any type of cold data, an instance of which is created and associated with the main object automatically.

  template <class... TArgs>
  explicit OutOfLine(TArgs&&... args) {
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }

Deleting the main object automatically deletes the associated cold object:

  ~OutOfLine() { global_map_.erase(this); }

When the main object is moved (move constructor or move assignment operator), the corresponding cold object is automatically re-associated with the new main object. As a result, you must not access the cold data of a moved-from object.

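  // `other` is an lvalue inside the constructor, so std::move is required
  // to select the move assignment operator (copy assignment is deleted).
  explicit OutOfLine(OutOfLine&& other) { *this = std::move(other); }
  OutOfLine& operator=(OutOfLine&& other) {
    global_map_[this] = std::move(global_map_[&other]);
    return *this;
  }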

In this example implementation, OutOfLine is made non-copyable for simplicity. If necessary, copy operations are easy to add: they just need to create and link a copy of the cold object.

 OutOfLine(OutOfLine const&) = delete; OutOfLine& operator=(OutOfLine const&) = delete;
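As a rough illustration of what such copy operations could look like, here is a minimal sketch. `CopyableOutOfLine`, `Counter`, and `copy_demo` are hypothetical names, and the class is trimmed to the members relevant to copying (it takes the cold value directly instead of forwarding constructor arguments). It is one possible design under those assumptions, not the author's implementation.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copyable variant, trimmed to the members that matter here.
template <class FastData, class ColdData>
class CopyableOutOfLine {
  inline static std::map<CopyableOutOfLine const*, std::unique_ptr<ColdData>> global_map_;

 public:
  explicit CopyableOutOfLine(ColdData cold) {
    global_map_[this] = std::make_unique<ColdData>(std::move(cold));
  }
  ~CopyableOutOfLine() { global_map_.erase(this); }

  // Copy construction: create a new cold object and link it to *this.
  CopyableOutOfLine(CopyableOutOfLine const& other) {
    global_map_[this] = std::make_unique<ColdData>(other.cold());
  }
  // Copy assignment: overwrite our cold object with a copy of the other one.
  CopyableOutOfLine& operator=(CopyableOutOfLine const& other) {
    if (this != &other) cold() = other.cold();
    return *this;
  }

  ColdData& cold() { return *global_map_.at(this); }
  ColdData const& cold() const { return *global_map_.at(this); }
};

// Hypothetical user type: a hot counter with a cold label.
struct Counter : CopyableOutOfLine<Counter, std::string> {
  int value = 0;
  explicit Counter(std::string label)
      : CopyableOutOfLine<Counter, std::string>(std::move(label)) {}
};

inline void copy_demo() {
  Counter a("first");
  a.value = 7;
  Counter b = a;        // copies the hot field and the cold label
  b.cold() = "second";  // changing the copy leaves the original alone
  assert(b.value == 7);
  assert(a.cold() == "first");
  assert(b.cold() == "second");
}
```

Each copy owns its own cold object, so the two map entries are independent and are erased separately when their main objects are destroyed.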

Now, for this to be really useful, we need access to the cold data. A class that inherits from OutOfLine gets const and non-const cold() methods:

  ColdData& cold() noexcept { return *global_map_[this]; }
  ColdData const& cold() const noexcept { return *global_map_[this]; }

They return the appropriate type of cold data reference.

That's almost all. This version of UnlinkingFD will be 4 bytes in size, will provide cache-friendly access to the fd field and retain the advantages of RAII. All work related to the life cycle of an object is fully automated. When the primary frequently used object moves, rarely used cold data moves with it. When the main object is deleted, the corresponding cold object is also deleted.
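To see the pieces working together, here is a condensed sketch that compiles on its own. The class body restates the fragments shown so far (with std::move in the move constructor). `Connection`, `peer()`, and `move_demo` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

template <class FastData, class ColdData>
class OutOfLine {
  inline static std::map<OutOfLine const*, std::unique_ptr<ColdData>> global_map_;

 public:
  template <class... TArgs>
  explicit OutOfLine(TArgs&&... args) {
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }
  ~OutOfLine() { global_map_.erase(this); }

  explicit OutOfLine(OutOfLine&& other) { *this = std::move(other); }
  OutOfLine& operator=(OutOfLine&& other) {
    global_map_[this] = std::move(global_map_[&other]);
    return *this;
  }
  OutOfLine(OutOfLine const&) = delete;
  OutOfLine& operator=(OutOfLine const&) = delete;

  ColdData& cold() noexcept { return *global_map_[this]; }
  ColdData const& cold() const noexcept { return *global_map_[this]; }
};

// Hypothetical user type: `id` is hot, the peer name is cold.
struct Connection : private OutOfLine<Connection, std::string> {
  int id;
  Connection(int i, std::string peer)
      : OutOfLine<Connection, std::string>(std::move(peer)), id(i) {}
  std::string const& peer() const { return cold(); }
};

inline void move_demo() {
  Connection a(7, "alice");
  Connection b = std::move(a);  // the cold data is re-linked to b
  assert(b.id == 7);
  assert(b.peer() == "alice");
  // a.peer() must not be called any more: a is moved-from.
}
```

Because the base class has no non-static data members, compilers that apply the empty base optimization (all mainstream ones do for this case) make `sizeof(Connection)` equal to `sizeof(int)`.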

Sometimes, however, your data conspires to complicate your life, and you face a situation in which the main data must be created first, for example because it is needed to construct the cold data. The objects then have to be created in the reverse of the order that OutOfLine offers. For such cases we need an escape hatch to control the order of initialization and deinitialization.

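  struct TwoPhaseInit {};
  OutOfLine(TwoPhaseInit) {}  // creates no cold data and no map entry

  template <class... TArgs>
  void init_cold_data(TArgs&&... args) {
    // operator[] instead of find(): the TwoPhaseInit constructor adds no
    // entry, so find(this) would return end() here.
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }
  void release_cold_data() { global_map_[this].reset(); }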

This is another OutOfLine constructor that derived classes can use. It accepts a tag of type TwoPhaseInit. If you create an OutOfLine this way, the cold data is not initialized and the object remains half-constructed. To complete the two-phase construction, call the init_cold_data method, passing the arguments needed to create an object of type ColdData. Remember that you cannot call .cold() on an object whose cold data has not yet been initialized. Similarly, the cold data can be destroyed early, before the ~OutOfLine destructor runs, by calling release_cold_data.

 }; // end of class OutOfLine
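A sketch of how a derived class might use the two-phase interface is shown below. `TwoPhaseOutOfLine` is a hypothetical, trimmed-down stand-in for OutOfLine that keeps only the two-phase members. `Session`, `has_cold_data`, and `two_phase_demo` are likewise assumptions made for this example.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical trimmed-down variant holding only the two-phase members.
template <class FastData, class ColdData>
class TwoPhaseOutOfLine {
  inline static std::map<TwoPhaseOutOfLine const*, std::unique_ptr<ColdData>> global_map_;

 public:
  struct TwoPhaseInit {};
  explicit TwoPhaseOutOfLine(TwoPhaseInit) {}  // no cold data yet
  ~TwoPhaseOutOfLine() { global_map_.erase(this); }
  TwoPhaseOutOfLine(TwoPhaseOutOfLine const&) = delete;
  TwoPhaseOutOfLine& operator=(TwoPhaseOutOfLine const&) = delete;

  template <class... TArgs>
  void init_cold_data(TArgs&&... args) {
    global_map_[this] = std::make_unique<ColdData>(std::forward<TArgs>(args)...);
  }
  void release_cold_data() { global_map_[this].reset(); }

  // Hypothetical helper (not in the article) to check the current phase.
  bool has_cold_data() const {
    auto it = global_map_.find(this);
    return it != global_map_.end() && it->second != nullptr;
  }
  ColdData& cold() { return *global_map_[this]; }
};

// Hypothetical user type whose cold data is derived from its hot data.
struct Session : private TwoPhaseOutOfLine<Session, std::string> {
  using Base = TwoPhaseOutOfLine<Session, std::string>;
  int id;

  explicit Session(int i) : Base(TwoPhaseInit{}), id(i) {
    // id is initialized by now, so the cold data can depend on it.
    init_cold_data("session-" + std::to_string(id));
  }
  ~Session() {
    // Destroy the cold data first, while id is still alive.
    release_cold_data();
  }
  using Base::cold;
  using Base::has_cold_data;
};

inline void two_phase_demo() {
  Session s(42);
  assert(s.has_cold_data());
  assert(s.cold() == "session-42");
}
```

The constructor body and destructor body of the derived class are the two places where the order is under your control: both run while the hot members exist.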

Now that's it. So what do these 29 lines of code give us? They represent another possible tradeoff between performance and ease of use. In cases when you have an object, some of whose members are used much more often than others, OutOfLine can be an easy-to-use way to optimize the cache, at the cost of significantly slowing down access to rarely used data.

We were able to apply this technique in several places. Quite often, heavily used working data needs to be supplemented with metadata that is only needed when finishing work, or in rare or unexpected situations. Whether it is information about the user who established a connection, the trading terminal an order came from, or a handle to the hardware accelerator processing exchange data, OutOfLine keeps the cache clean while you are on the critical path.

I prepared a test so that you can see and appreciate the difference.

Scenario	Time (ns)
Cold data in the main object (initial version)	34684547
Cold data completely deleted (best scenario)	2938327
Using OutOfLine	2947645

I got roughly a 10x speedup when using OutOfLine. Obviously, this test was designed to show off OutOfLine's potential, but it also shows how significantly cache optimization can affect performance, and that OutOfLine lets you obtain this optimization. Keeping the cache free of rarely used data can also bring a broad, hard-to-measure improvement to the rest of the code. As always with optimization, trust measurements over assumptions. Nonetheless, I hope OutOfLine will be a useful addition to your toolbox of utilities.
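If you want to run a comparison of your own, a measurement along the following lines is one place to start. This is not the author's benchmark. `Fat`, `Slim`, `timed_sum`, `Result`, and `run_comparison` are hypothetical names, and the sketch only compares summing a hot field stored next to cold data against summing it when stored alone.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical layouts: hot field next to cold data vs. hot field alone.
struct Fat {
  std::string path;
  int fd;
};
struct Slim {
  int fd;
};

// Sums the hot field over the whole array; returns elapsed nanoseconds.
template <class T>
std::int64_t timed_sum(std::vector<T> const& items, long long& sum_out) {
  auto start = std::chrono::steady_clock::now();
  long long sum = 0;
  for (auto const& item : items) sum += item.fd;
  auto stop = std::chrono::steady_clock::now();
  sum_out = sum;
  return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

struct Result {
  long long fat_sum;
  long long slim_sum;
  std::int64_t fat_ns;
  std::int64_t slim_ns;
};

inline Result run_comparison(std::size_t n) {
  std::vector<Fat> fat(n);
  std::vector<Slim> slim(n);
  for (std::size_t i = 0; i < n; ++i) {
    fat[i].fd = static_cast<int>(i % 100);
    slim[i].fd = static_cast<int>(i % 100);
  }
  Result r{};
  r.fat_ns = timed_sum(fat, r.fat_sum);
  r.slim_ns = timed_sum(slim, r.slim_sum);
  assert(r.fat_sum == r.slim_sum);  // both layouts must compute the same result
  return r;
}
```

Timings from a single run like this are noisy. Build with optimizations enabled, use an array larger than your cache, and repeat the measurement several times before drawing conclusions.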

Note from the translator

The code given in the article serves to demonstrate the idea and is not representative of the production code.

Source: https://habr.com/ru/post/421475/
