
Store 300 million objects in the CLR process

The Stumbling Block: GC


All managed languages such as Java or C# share one major drawback: unconditional automatic memory management. At first glance this looks like a pure advantage. Remember how we struggled with dangling pointers, unable to figure out where that precious 10KB per hour was leaking, forcing our favorite server to be restarted once a day? Indeed, Java and C# (and their kin) solve the problem in 99% of cases.

True enough, but there is one problem: how do you deal with a very large number of objects? Even in .NET there is no magic: the CLR must scan a huge set of objects and their mutual references. This problem is partially solved by generations. Since most objects die young, we can release them quickly and therefore do not have to traverse every object on the heap each time.
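For illustration, here is a minimal sketch of generational promotion using the standard GC API (the output comments assume typical behavior; exact promotion timing depends on GC configuration):

 using System;

 class GenerationDemo
 {
     static void Main()
     {
         var survivor = new byte[256]; // a small object we keep a reference to

         Console.WriteLine(GC.GetGeneration(survivor)); // 0 - freshly allocated

         GC.Collect(); // the object survives a collection and is promoted
         Console.WriteLine(GC.GetGeneration(survivor)); // 1

         GC.Collect(); // survives again
         Console.WriteLine(GC.GetGeneration(survivor)); // 2 - long-lived: from now on only a full GC scans it
     }
 }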

But the problem remains whenever objects must live for a long time. A cache, for example, must hold millions of objects, especially given the growing amounts of RAM on a typical modern server. Potentially a cache could keep hundreds of millions of business objects (say, a Person with a dozen fields) on a machine with 64GB of memory.
In practice, however, this cannot be done. As soon as we add the first 10 million objects and they "age" out of the young generations into generation 2, the next full GC scan "freezes" the process for 8-12 seconds, and this pause is unavoidable: that figure is with background server GC already enabled, and it counts the "stop-the-world" time alone. The server application simply "dies" for 10 seconds, and the moment of this "clinical death" is nearly impossible to predict.
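The pause is easy to reproduce with a sketch like this one (Person is an illustrative stand-in type; the absolute numbers depend on hardware and GC mode):

 using System;
 using System.Diagnostics;

 class Person { public string Name; public int Age; /* ...a dozen fields... */ }

 class GcPauseDemo
 {
     static void Main()
     {
         // long-lived objects that will all end up in generation 2
         var cache = new Person[10_000_000];
         for (int i = 0; i < cache.Length; i++)
             cache[i] = new Person { Name = "Person" + i, Age = i % 100 };

         var sw = Stopwatch.StartNew();
         GC.Collect(2, GCCollectionMode.Forced, blocking: true); // induce a full, blocking GC
         sw.Stop();

         Console.WriteLine($"Full GC over {cache.Length:N0} objects took {sw.ElapsedMilliseconds} ms");
     }
 }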
So what do we do? Just never store many objects for a long time?

Why


Because in my specific task I NEED to store a lot of objects for a long time. For example, I keep a network of 200 million streets and their interconnections. After loading it from a flat file, my application must compute probability coefficients, which takes time, so I do it right as the data is loaded from disk into memory. After that I need an object graph that is already precomputed and ready "for labor and defense". In short, I need to keep about 48GB of data resident for several weeks while serving hundreds of requests per second.

Here is another task: caching social data that accumulates to hundreds of millions of objects over 2-3 weeks, while serving tens of thousands of read requests per second.

How


So we decided to build our own memory manager, and we called it "Pile" (as in a heap of stuff). There is no way around the "crippling" managed memory model: unmanaged memory saves nothing, because access to it requires checks and conversions that "kill" the speed and complicate the whole design. Neither .NET nor Java can work "natively" with chunks of memory that do not live on the managed heap.

What did we do? Our memory manager is absolutely 100% managed code. We dynamically allocate byte arrays, which we call segments. Within a segment, a pointer is a regular int. And this gives us PilePointer:
 /// <summary>
 /// Represents a pointer to the pile object (object stored in a pile).
 /// The reference may be local or distributed, in which case the NodeID is >= 0.
 /// Distributed pointers are very useful for organizing piles of objects distributed among many servers,
 /// for example for "Big Memory" implementations or large neural networks where nodes may inter-connect between servers.
 /// The CLR reference to the IPile is not a part of this struct for performance and practicality reasons,
 /// as it is highly unlikely that there will be more than one instance of a pile in a process; however,
 /// should more than 1 pile be allocated, then this pointer would need to be wrapped in some other structure
 /// along with the source IPile reference
 /// </summary>
 public struct PilePointer : IEquatable<PilePointer>
 {
   /// <summary>
   /// Distributed Node ID. The local pile sets this to -1 rendering this pointer as !DistributedValid
   /// </summary>
   public readonly int NodeID;

   /// <summary>
   /// Segment # within pile
   /// </summary>
   public readonly int Segment;

   /// <summary>
   /// Address within the segment
   /// </summary>
   public readonly int Address;

   // ...
 }
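To make the Segment/Address scheme concrete, here is a deliberately simplified bump-allocator sketch of the idea (not the actual NFX implementation, which also has to deal with freed chunks, thread safety, and chunk headers):

 using System;
 using System.Collections.Generic;

 // Toy allocator: a list of byte[] segments, plus a bump pointer into the current one.
 class ToyPile
 {
     const int SegmentSize = 256 * 1024 * 1024; // e.g. 256MB per segment

     private readonly List<byte[]> _segments = new List<byte[]>();
     private int _address; // next free offset within the current segment

     public (int Segment, int Address) Allocate(byte[] payload)
     {
         if (_segments.Count == 0 || _address + payload.Length > SegmentSize)
         {
             _segments.Add(new byte[SegmentSize]); // grow the pile by a whole segment
             _address = 0;
         }
         int seg = _segments.Count - 1;
         int addr = _address;
         Buffer.BlockCopy(payload, 0, _segments[seg], addr, payload.Length);
         _address += payload.Length;
         return (seg, addr); // effectively a local PilePointer (NodeID = -1)
     }
 }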

Note the NodeID; more on it below. A PilePointer is obtained like this:

 var obj = new MyBusinessType();
 var pilePointer = Pile.Put(obj);
 // ... sometime later, possibly in another part of the code ...
 var originalObj = Pile.Get(pilePointer);

We get back a copy of the original object that we loaded into the Pile with Put(), or a PileAccessViolation if the pointer is invalid.

 Pile.Delete(pilePointer) 

frees the chunk of memory; accordingly, an attempt to read that memory again will cause a PileAccessViolation.
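Putting the calls together, the whole lifecycle looks roughly like this (a sketch reusing the Pile and PileAccessViolation names from the snippets above; Person is an illustrative type):

 var person = new Person { Name = "Ann", Age = 33 };

 var ptr = Pile.Put(person);       // serialize into a segment, get back a 12-byte pointer
 var copy = (Person)Pile.Get(ptr); // deserialize a fresh copy of the object

 Pile.Delete(ptr);                 // free the chunk

 try
 {
     Pile.Get(ptr);                // the pointer no longer refers to live data
 }
 catch (PileAccessViolation)       // exception name as used in this article
 {
     // expected: that memory was freed above
 }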

Question: how is this done, and what exactly do we store in the byte[]? We cannot keep CLR objects with real references there, because those would confuse the GC. We need the opposite: to store the data in our own format with all managed references stripped out. That way the data is stored, but the GC does not know these are objects and never visits them. This is done via serialization. Of course, this does not mean the built-in .NET serializers (such as BinaryFormatter), but our own serializers in NFX.
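The core idea can be shown with a hand-rolled sketch (this is not the actual Slim wire format): once the fields are flattened into bytes, the GC sees one opaque byte[] instead of a reference-bearing object:

 using System;
 using System.IO;
 using System.Text;

 static class ToySerializer
 {
     // Flatten the fields into bytes: no CLR references remain,
     // so there is nothing for the GC to trace inside the segment.
     public static byte[] Write(string name, int age)
     {
         using (var ms = new MemoryStream())
         using (var w = new BinaryWriter(ms, Encoding.UTF8))
         {
             w.Write(name); // length-prefixed UTF-8 string
             w.Write(age);
             return ms.ToArray();
         }
     }

     public static (string Name, int Age) Read(byte[] data)
     {
         using (var ms = new MemoryStream(data))
         using (var r = new BinaryReader(ms, Encoding.UTF8))
         {
             return (r.ReadString(), r.ReadInt32());
         }
     }
 }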

PilePointer.NodeID lets you "spread" data across distributed "heaps", since it identifies a node within a distributed pile cohort.
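With NodeID in place, resolving a pointer in a distributed setup reduces to picking the right pile first. A hypothetical routing helper (assuming an IPile that exposes the Get() shown above):

 // NodeID < 0 marks the pointer as local; otherwise it names a node in the cohort.
 object Resolve(PilePointer ptr, IPile localPile, Func<int, IPile> pileForNode)
 {
     var pile = ptr.NodeID < 0 ? localPile : pileForNode(ptr.NodeID);
     return pile.Get(ptr);
 }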

And now the main question: why is all this worth anything if serialization is used "under the hood"? Isn't that slow?

Speed


In practice it works like this: an object under 300 bytes, flattened into a byte[] with NFX Slim serialization, takes on average 10-25% less space than the native CLR object in memory; for large objects the difference tends to zero. Why? Because NFX.Serialization.Slim.SlimSerializer uses UTF-8 for strings plus variable-length integer encoding, and does not need the 12+ bytes of the CLR object header.

As a result, the serializer's speed becomes the stumbling block, and SlimSerializer delivers phenomenal speed. On a single Intel I7 Sandy Bridge core at 3GHz we turn 440 thousand PilePointers into objects per second; each object in this test has 20 filled fields and occupies 208 bytes of memory. Inserting objects into the Pile runs at 405 thousand per second on one core. This speed is achieved through dynamic compilation of expression trees for the types being serialized into the pile segments.

On average SlimSerializer is 5 times faster than BinaryFormatter, and for many simple types the factor reaches 10. Space-wise, SlimSerializer packs data into 1/4 to 1/10 of what BinaryFormatter produces. And most importantly: SlimSerializer DOES NOT REQUIRE any special markup of the fields of the objects we work with, i.e. you can store anything except delegates.
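Much of the space saving comes from the variable-length integers. Here is a sketch of the classic 7-bit scheme (the same idea as BinaryWriter's 7-bit encoded ints; the actual Slim encoding may differ in detail):

 using System.Collections.Generic;

 static class VarInt
 {
     // Each byte carries 7 payload bits; the high bit means "more bytes follow".
     public static byte[] Write(uint value)
     {
         var buf = new List<byte>();
         while (value >= 0x80)
         {
             buf.Add((byte)(value | 0x80)); // low 7 bits + continuation flag
             value >>= 7;
         }
         buf.Add((byte)value);
         return buf.ToArray();
     }
 }

 // VarInt.Write(42).Length        == 1   (a raw int always takes 4 bytes)
 // VarInt.Write(300).Length       == 2
 // VarInt.Write(1_000_000).Length == 3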

The multithreaded insertion test consistently sustains over 1 million transactions per second on a Core i7 at 3GHz.
And now the most important thing: with 300,000,000 objects allocated in our process, a full GC takes less than 30 milliseconds.
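A multithreaded insert benchmark in this spirit can be sketched with Parallel.For (illustrative only; it assumes a thread-safe pile instance with the Put() shown earlier and the Person type from above):

 using System;
 using System.Diagnostics;
 using System.Threading.Tasks;

 static void MeasureInserts(IPile pile)
 {
     const int N = 1_000_000;
     var sw = Stopwatch.StartNew();

     Parallel.For(0, N, i =>
     {
         pile.Put(new Person { Name = "Person" + i, Age = i % 100 });
     });

     sw.Stop();
     Console.WriteLine($"{N / sw.Elapsed.TotalSeconds:N0} inserts/sec");
 }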



Results


The NFX.ApplicationModel.Pile technology avoids the unpredictable pauses caused by the garbage collector while keeping hundreds of millions of objects resident in memory for long periods (weeks), and it provides access speeds faster than out-of-process solutions (such as MemCache, Redis, et al.).

Pile is based on a dedicated memory manager that allocates large byte[] segments and hands out memory to the application. An object immersed in the Pile is identified by the PilePointer structure, which occupies 12 bytes; this enables efficient object graphs in which objects reference one another.
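Because the pointer is a plain 12-byte struct, graph nodes can hold PilePointers instead of CLR references, so the GC never traces the graph itself. An illustrative node type (StreetNode is made up for this sketch):

 // Neighbors are addressed by PilePointer (12 bytes per edge) rather than by
 // CLR references; each neighbor is materialized via Pile.Get() only when needed.
 class StreetNode
 {
     public string Name;
     public double Probability;       // the precomputed coefficient from the example above
     public PilePointer[] Neighbors;
 }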

Get the code:

NFX GitHub

Source: https://habr.com/ru/post/257091/

