
How we got rid of GC pauses with our own Java off-heap storage solution

Hi, Habr!

Some systems simply cannot provide adequate response times without caching data. And sooner or later they run into the problem that the amount of data they would like to cache keeps growing. If your system is written in Java, this leads to inevitable GC pauses. Odnoklassniki once faced this problem too. We did not want to limit the size of the cached data, but we understood that the GC simply would not let us have the heap we needed. At the same time, we wanted to keep writing in Java. In this post we describe how we solved this problem, with all the advantages and disadvantages of our approach, as well as our experience operating it. We hope the approach will interest anyone who has to deal with GC pauses.



Justification of the chosen solution


In the review article about the architecture of the project, we already mentioned that we use our own framework for caching, as well as for storing some data. It is written entirely in Java, but the data is stored off-heap. You probably have a question: why did we need such a non-standard off-heap approach, and why did we not use an off-the-shelf solution?
Obviously, the most efficient storage (in particular, a cache) is one that always lives in RAM and is accessible to the application process without any network interaction. For applications written in Java, it also makes sense to implement the storage in Java, since the integration problem disappears and any programmer working on the system can understand the intricacies of its work by looking at the source code. But with a large heap almost any Java program starts to suffer from GC pauses, and with a total data size of several gigabytes and up the system becomes unusable (unless the entire heap is just one big array of bytes). Tuning the GC is not a trivial task: if your heap already holds dozens of gigabytes, it is almost impossible to configure the GC so that the pauses are acceptable for an application the user interacts with. And as large amounts of RAM become more and more affordable, the Java off-heap cache approach becomes more and more popular.

Once upon a time this approach could really be considered non-standard. But look: more and more products now appear and develop in this direction. About a year and a half ago the first solution of this kind appeared on the market: BigMemory from Terracotta. Half a year ago Hazelcast announced that a product called Elastic Memory had entered alpha testing. Unfortunately, it is still in beta and, like BigMemory, it will be a paid product. About half a year ago Cassandra started using an off-heap row cache, but that is a cache for internal purposes which is not so easy to extract and reuse on its own. There is no strong open-source product in this area, although it is worth noting that in the fall the DirectMemory project entered the Apache incubator; when to expect a finished product, and when it will prove itself in large projects, remains an open question. We started using our own solution more than four years ago, when, as you can see, there was simply nothing ready-made available.

Implementation


A Java application can talk to off-heap memory in several ways. At Odnoklassniki we chose an approach based on the sun.misc.Unsafe class, a private API of HotSpot and now OpenJDK. This class lets you work with memory directly without resorting to JNI, which keeps our solution cross-platform: we can run the same binaries on the production system and on a developer's computer, whether the developer runs Windows, Ubuntu or Mac. To avoid writing a complicated memory management algorithm, such as the one the DirectMemory project decided to build, we allocate or free memory for each object that we put into the storage or remove from it. Let the operating system deal with memory management; it is very good at it. Because we need to allocate and free memory so frequently, we do not use the standard java.nio.ByteBuffer class, which is part of the JDK, has an open API and also allows working with off-heap memory. The problem with ByteBuffer in our case is that it creates additional garbage and does not let you free memory directly, doing so instead via phantom references. The latter means that after allocating a lot of memory outside the heap, it cannot be freed until the GC runs, even if the ByteBuffer objects that allocated this memory are no longer in use. There is the -XX:MaxDirectMemorySize= flag to mitigate this, which triggers garbage collection when the allocated off-heap memory reaches the specified value, but a large number of phantom references can still affect the GC negatively. This behavior of ByteBuffer is most likely due to the fact that it was designed for allocating large chunks of memory and reusing them, while our usage pattern is the opposite. Of course, rejecting ByteBuffer forced us to do some things ourselves, such as taking care of byte order.
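The core of the approach can be sketched in a few lines. Below is a minimal illustration of working with off-heap memory through sun.misc.Unsafe (the instance is obtained via reflection because the constructor is private; on recent JDKs this may produce access warnings). This is only a sketch of the technique, not the framework's actual allocation code:

```java
import sun.misc.Unsafe;
import java.lang.reflect.Field;

public class OffHeapDemo {
    // Obtain the Unsafe singleton via reflection, since its constructor is private
    private static Unsafe getUnsafe() throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        return (Unsafe) f.get(null);
    }

    public static void main(String[] args) throws Exception {
        Unsafe unsafe = getUnsafe();
        // Allocate 16 bytes outside the Java heap; the GC never sees this memory
        long address = unsafe.allocateMemory(16);
        try {
            unsafe.putLong(address, 42L);            // write a long at the start
            unsafe.putLong(address + 8, 123456789L); // and another 8 bytes further
            System.out.println(unsafe.getLong(address));
            System.out.println(unsafe.getLong(address + 8));
        } finally {
            // Off-heap memory must be released explicitly: there is no GC for it
            unsafe.freeMemory(address);
        }
    }
}
```

Note the explicit freeMemory call: unlike ByteBuffer, nothing here waits for a garbage collection before returning the memory to the operating system.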

An inquisitive reader may ask: are we not ashamed to rely on private HotSpot machinery (sun.misc.Unsafe), given that it can change or disappear at any moment? Not at all. A new version of HotSpot will not appear on our production servers unexpectedly, so we will be prepared for any change. Besides, everything we use from Unsafe is also used by the already mentioned ByteBuffer. That is, if something changes in the next version, it will be enough for us to look at the source code and make the same changes in our framework, or rather in the single class responsible for allocating and freeing memory and for saving data.

Capabilities of our framework


There is nothing difficult about serializing an object into a byte array and placing it outside the heap. What else does the framework do for us at Odnoklassniki?

Granular updates. To put an object into our storage, you must describe its structure, specifying how each field is saved. Our framework can save objects consisting of any primitives, arrays, collections and composite objects. Each field of the object is matched with an algorithm from our library that describes how to save and update it; a new algorithm can be created, or an existing one extended, if the need suddenly arises. This declarative approach allows point updates of individual fields, instead of reading the whole object, deserializing it, updating it, serializing it and saving it back. You can also create different views of the objects in the storage: if some service does not need all the data of a stored object, it can request and read only part of it. This also helps a lot with filtering, for example when you want to fetch objects by a list of identifiers but only those that meet some criterion.
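The benefit of point updates can be illustrated with a toy fixed-offset record layout. The field names and offsets below are invented for the example, and a heap buffer is used only for brevity, whereas the real storage keeps the bytes off-heap:

```java
import java.nio.ByteBuffer;

// A hypothetical "profile" record with fixed field offsets: a point update
// touches only the bytes of one field, never the whole serialized object.
public class PointUpdateDemo {
    static final int OFF_ID = 0;       // long, 8 bytes
    static final int OFF_AGE = 8;      // int, 4 bytes
    static final int OFF_FRIENDS = 12; // int, 4 bytes
    static final int RECORD_SIZE = 16;

    public static void main(String[] args) {
        ByteBuffer record = ByteBuffer.allocate(RECORD_SIZE);
        record.putLong(OFF_ID, 1001L);
        record.putInt(OFF_AGE, 30);
        record.putInt(OFF_FRIENDS, 250);

        // Point update: bump the friend counter in place, no full re-serialization
        record.putInt(OFF_FRIENDS, record.getInt(OFF_FRIENDS) + 1);

        // Partial read: a service interested only in age reads just 4 bytes
        System.out.println(record.getInt(OFF_AGE));
        System.out.println(record.getInt(OFF_FRIENDS));
    }
}
```

With variable-length fields the bookkeeping gets more involved, but the principle stays the same: the declarative field description tells the framework where each field's bytes live.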

Read-through. Some of our storages offload the database, i.e. they are used as caches. For that case we have a read-through option: if the data the client requests is not in the cache, we go to the database and read it. There are a few tricks here. For example, if two identical requests arrive at the same time, we go to the database only once, and the second request simply waits for the result of the first; this helps a lot during a cold start. Or, if a client keeps requesting data that is not in the cache, then after a certain number of attempts we stop querying the database for that data.
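The request-coalescing trick can be sketched with a map of futures: concurrent requests for the same missing key share a single database load. All names below are illustrative, not our framework's actual API, and the "database" is a stand-in function:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Minimal read-through sketch: a miss triggers exactly one load per key,
// and identical concurrent requests wait on the same future.
public class ReadThroughCache<K, V> {
    private final ConcurrentHashMap<K, CompletableFuture<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // goes to the database on a miss

    public ReadThroughCache(Function<K, V> loader) { this.loader = loader; }

    public V get(K key) {
        // computeIfAbsent guarantees the loader is scheduled once per key,
        // even when two identical requests arrive at the same time
        return cache.computeIfAbsent(key,
                k -> CompletableFuture.supplyAsync(() -> loader.apply(k))).join();
    }

    public static void main(String[] args) {
        AtomicInteger dbCalls = new AtomicInteger();
        ReadThroughCache<Long, String> cache = new ReadThroughCache<>(id -> {
            dbCalls.incrementAndGet(); // count trips to the "database"
            return "user-" + id;
        });
        // Two identical requests: the second just reuses the first's result
        System.out.println(cache.get(1L));
        System.out.println(cache.get(1L));
        System.out.println(dbCalls.get()); // only one database call was made
    }
}
```

The negative-lookup limit mentioned above would be a small addition: count misses per key and stop calling the loader after a threshold.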

Synchronization with the database. When the storage runs on top of a database, we run a background process that loads all the changes that, for whatever reason, reached the database bypassing the cache.

Snapshotting. Our framework can also dump the entire storage to disk, so that when the server restarts, all stored data is restored. When the storage is used as a cache to offload the database, snapshotting avoids a cold start, during which the cache is empty and all requests go straight to the database. The synchronization mechanism described above then lets us apply, on top of the last snapshot, any data that arrived in the database while the storage was restarting.
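The snapshot idea itself is simple: write every record's bytes out on shutdown and read them back on start. A rough sketch with a length-prefixed dump format (the format and names here are invented for the example; the real framework uses its own):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Dump serialized records to a snapshot file, then restore them as if
// the server had just restarted with a warm cache.
public class SnapshotDemo {
    public static void main(String[] args) throws IOException {
        Path snapshot = Files.createTempFile("storage", ".snap");
        byte[][] records = { "alice".getBytes(), "bob".getBytes() };

        // Write: record count, then each record length-prefixed
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(snapshot))) {
            out.writeInt(records.length);
            for (byte[] r : records) { out.writeInt(r.length); out.write(r); }
        }

        // Restore: read the snapshot back into memory on startup
        try (DataInputStream in = new DataInputStream(Files.newInputStream(snapshot))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                byte[] r = new byte[in.readInt()];
                in.readFully(r);
                System.out.println(new String(r));
            }
        }
        Files.delete(snapshot);
    }
}
```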

Client library. If a storage grows so large that it no longer fits on our commodity hardware, we distribute it across several machines and shard by key. We do this not because of growing GC pauses but precisely because of hardware limitations. Of course, all the charm of an in-process cache is lost and network interaction appears, but there is nothing to be done; on the other hand, the familiar code, written entirely in our beloved Java, remains. In addition, some logic can easily be moved to these separate boxes, since their CPUs are mostly idle anyway. Our client library also allows updating or fetching only the required fields of stored objects, which significantly reduces traffic. For fault tolerance and scalability we use duplication, so that at least two hosts are responsible for each shard: read requests are spread round-robin, while updates go to all nodes.
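The routing logic on the client side might look roughly like this (the host names and the simple modulo sharding scheme are illustrative only, not how our client library actually picks hosts):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the routing described above: shard by key, read from the
// shard's replicas round-robin, write to every replica of the shard.
public class ShardRouter {
    private final List<List<String>> shards; // each shard has >= 2 replicas
    private final AtomicInteger counter = new AtomicInteger();

    public ShardRouter(List<List<String>> shards) { this.shards = shards; }

    private int shardOf(long key) {
        // floorMod keeps the shard index non-negative for any key
        return (int) Math.floorMod(key, (long) shards.size());
    }

    public String hostForRead(long key) {
        List<String> replicas = shards.get(shardOf(key));
        // Round-robin over the shard's replicas spreads the read load
        return replicas.get(counter.getAndIncrement() % replicas.size());
    }

    public List<String> hostsForWrite(long key) {
        // Updates go to all replicas of the shard
        return shards.get(shardOf(key));
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(List.of(
                List.of("cache1a", "cache1b"),
                List.of("cache2a", "cache2b")));
        System.out.println(router.hostForRead(10L));   // first replica of shard 0
        System.out.println(router.hostForRead(10L));   // then the second, round-robin
        System.out.println(router.hostsForWrite(11L)); // all replicas of shard 1
    }
}
```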

Disadvantages and forced compromises


Of course, our framework has drawbacks. For example, because the keys are still stored on the heap, at some point the GC may begin to affect the performance of the application. We try not to use keys larger than a Long, so even at our data volumes we are quite satisfied with the current characteristics.

As already mentioned, unlike some existing solutions, we allocate memory for each object separately. This inevitably contributes to memory fragmentation, but it let us do without a complicated memory management algorithm.

To support efficient granular updates, we have to allocate memory separately for each variable-length field (such as a string or a collection). Thus, creating a single object sometimes takes several low-level calls.

Use inside Odnoklassniki


Using the framework described above, we store user profiles, group profiles, photo metadata, information about "class" marks (likes), statistics of user actions towards their friends and groups, fragments of the news feed and a few other things. For example, the distributed statistics storage used to compute user weights for the news feed contains about 300 GB of data. The maximum size of storage data in memory on a single box reaches 90 GB. The load on a two-processor machine with four cores per processor is about 20K requests per second at peak.

It is worth noting that our storage is by no means a substitute for an RDBMS or a NoSQL solution; most often we use it simply as a cache in front of a database. Perhaps in some places we could switch to NoSQL instead of the existing cache-plus-RDBMS bundle, but that would require a serious revision of the business logic, which may rely on the transactionality provided by the RDBMS.

Thus, the described framework lets us write all our code in Java, store off-heap as much data as a task requires, avoid headaches with GC tuning, and quickly implement storage or caching of new data.

We would also be very interested to hear about your experience solving the problem of GC pauses on large heaps.

Source: https://habr.com/ru/post/139185/
