In-memory-data-grid. Scalable Data Warehouse

Recently, interest in cloud architectures is growing every day, as this is one of the most effective ways to scale an application without much effort, and the bottleneck of any high-loaded project is a data warehouse, in particular, a relational database. To combat the shortcomings of traditional databases, 2 approaches are mainly used:

1) Caching query results

advantages: high speed data access
cons: requires a compromise between the relevance of data and access speed, because the data in the cache may become outdated, and deleting old data from the cache and then caching new ones are additional delays and system load

2) NoSQL solutions

Pros: good horizontal scalability, domain data model coincides with the data storage model
Minuses: low speed of obtaining results in the case of using a disk, it is almost impossible to ensure the operation of internal corporate software, which is focused on working with a specific relational database.

Today I want to introduce you to this type of data storage, which combines the advantages of both approaches and at the same time has several advantages over the above-mentioned solutions: In-memory-data-grid (IMDG) .

This approach very quickly gained wide acceptance among experts in the field of designing cloud platforms, as well as any systems that have the need for practical unlimited scaling of data storage systems. Many well-known companies have launched systems of this type on the market:

Oracle Coherence - Java / C / .NET
VMWare Gemfire - Java
GigaSpaces - Java / C / .NET
JBoss (RedHat) Infinispan - Java
Terracota - Java

Since I'm going to talk about solutions for Java, the IMDG cluster nodes will be JVMs, but this article will be interesting to those who are not related to Java, because firstly, some of the popular solutions have support for several languages, and second, even IMDG in Java can be used for quick access to data through the REST API.

So what is an in-memory-data-grid?

This is a cluster key-value storage that is designed for high-load projects with large amounts of data and increased requirements for scalability, speed and reliability.
The main parts of IMDG are caches (in GemFire this is called region).
The cache in IMDG is a distributed associative array (that is, the cache implements the Java Map interface), which provides fast concurrent access to data from any node of the cluster.

The cache also allows the processing of this data distributed, i.e. Any data can be modified from any node of the cluster, and it is not necessary to get the data from the cache, change it, and then put it back.
')
Almost all IMDG caches support transactions.
The data in the caches is stored in serialized form (that is, as an array of bytes).

1. Speed

All data is in the RAM of the cluster, due to which access time is significantly reduced.
Because all data is serialized, then the time to get any object from the cache = (time to move an object to a specific cluster node) + (time to deserialize).
In case the requested object is on the same node on which the request was made, then (time of receipt) = (time for deserialization). And here we see that access to data could be free at all if the object were not to be deserialized, for which the concept of near-cache was introduced into the concept of IMDG.

Near-cache is a local cache of objects for quick access, all objects in it are stored ready to use. If the near-cache for this cache is configured, then the objects will automatically get there when they first get a request for these objects.

Because near-cache may eventually grow to large sizes, as a result of which the memory may run out, the following possibilities are provided to limit the growth in the number of cached objects:

expiration - the lifetime of the object in the cache
eviction - deleting an object from the cache
limit on the number of stored objects

If desired (as well as insufficient memory), the data can be stored in a file or in a database.

2. Reliability

Data in the cluster is stored in partitions (parts), and these partitions are evenly distributed across the cluster, and each partition is replicated to a number of nodes (depending on the cluster configuration and the requirements for data storage reliability). The hit of an object in one or another partition is uniquely determined by some hash function.
Since when working under high load, the output of a separate cluster node (or several nodes at once, if they were virtual machines inside one iron server) is not improbable, to ensure data integrity, the cache configuration indicates the number of nodes that the cluster should lose survive without pain. This indicator determines the number of copies of each partition.

Those. if we indicate that the loss of 2 nodes should not lead to data loss, then each partition will be stored on 3 different nodes of the cluster, and if 2 nodes fall, the data will remain intact. If more than one node remains in the cluster, then 3 copies of all data will be created again, and the cluster will be ready for new troubles.

3. Scalability

The composition of the cluster (the number of nodes) can change without stopping the work of the entire cluster, and the grid itself works without monitoring the correct functioning of the cluster and the consistency and availability of data. Those. with increasing load or volume of data, you can simply pick up a few more configured nodes that automatically join the cluster, and the data inside the cluster is rebalanced to evenly distribute the data among the nodes, while the amount of data being transferred will be minimal so as not to create an extra load on the network .

4. Relevance of data

When using IMDG you always get the actual data, because when executing put, a notification is sent to all nodes of the cluster that objects with such keys received a new value. Each node updates its partitions containing these keys and deletes old values from its near-cache.

5. Reducing the load on the database

IMDG can be used not only as a standalone repository, but also as a system node that removes the load from a difficult-to-scale relational database.

For reading and writing from / to the database, for each cache in the configuration, a Loader is specified, which will be responsible for reading / writing objects in the database.
There are several options for organizing data access:

during the application launch, drain all necessary data from the database to the grid (the so-called preloading). The rise time of the application increases, memory consumption also increases, but the speed of work increases
while the application is running, pull up the necessary data on customer requests (read-through). It is executed automatically using the Loader object for the given cache. The recovery time of the application is small, the initial memory cost is also, but the additional time spent on processing requests that cause read-through

Variants of writing to the database when the corresponding data changes:

each put operation in the cache is automatically written to the database using the Loader (the so-called write-behind). Suitable only for systems whose main load is caused by reading.
data waiting to be written to the database is accumulated, and then one request is made to write to the database. A signal to the execution of such a request may be a certain amount of data waiting to be recorded, or a timeout. Suitable for write-intensive systems, but more difficult to implement

In the case of using IMDG as a node that assumes all the read / write / distributed processing load, we continue to have up-to-date data in the database, a low load on the database itself and, very importantly, corporate applications that use the database to collect statistics, reporting, etc. continue to work as before.

Conclusion

In-memory-data-grid is a relatively young, but well-proven technology, the development of which is carried out by many large vendors. It combines the advantages of NoSQL and caching systems, eliminates some of their significant shortcomings and allows you to raise system performance to a new level. If this article seemed interesting to you, then I will be glad to tell you next time about a specific solution from the IMDG family, as well as to address the issues of building and using indexes, serialization mechanisms and interaction with other platforms in these systems.

UPD: next article

Source: https://habr.com/ru/post/126580/

All Articles