We needed a way to give the machine a memory so that it could, in Turing terminology, quickly bury the data and just as quickly dig it out.
Neal Stephenson, Cryptonomicon
The photo of a magnetic core memory module from an IBM 1401 mainframe, used as the background of this image, reminds us of the times when computers were large and memory was expensive. Today, as the post below shows, everything has changed... IMDG, grids, In-Memory Data Grids: the systems that became the topic of this post go by many names. And although the name is perfectly accurate, and grids as a tool are becoming more and more popular, many still confuse them with distributed cache systems or with NoSQL databases, or even believe that "if you put MySQL on a RAM drive, you get almost an IMDG."
Not so long ago, the approach of accumulating information first and processing it afterwards seemed logical, and the emerging query languages for information repositories looked like an excellent solution: each stage of working with information was clearly delineated and well controlled. But times are changing, and today business increasingly wants to process not "yesterday's" information but current information, literally "processing online", and over rather large volumes of it. And here, whether we like it or not, we are forced to look for new tools.
The growing number of users of online services, the massive proliferation of social networks, the emergence and development of the Internet of Things, the growth of services in banking and telecom - all of this has affected the IT industry in a predictable way: the number of transactions has grown dramatically, while the latency requirements for each of them have only tightened, and no miraculous out-of-the-box solutions have appeared for a long time. In other words, the demand of the times sounded like "do what you want, but do it faster, otherwise the business won't pay!"
At first, the cloud successfully helped cope with the growing number of requests to web applications: the idea of scaling computing capacity in the cloud required, of course, support and accounting in the application code, but at least it became clear that the application would not be limited by CPU and memory hardware. Storing the processed information, however, turned out to be more complicated, because the usual storage options (for example, a classical SQL DBMS) in turn gradually became the bottleneck of the entire system. An application whose compute nodes (no matter how many of them autoscaling can bring up) are all waiting for a response from the SQL server is, frankly, a heartbreaking sight!
Of course, a good architect will try to build a working system even in such a stalemate (hence the caches on every data flow, the machinery for invalidating them, NoSQL servers added alongside or replacing their SQL predecessors, and the application logic rewritten to account for all of the above; whatever you say, there is plenty of work for the whole team!), but, you must admit, one would prefer a solution that is "a bit simpler".
This is where IMDG solutions came onto the scene, combining caching, NoSQL approaches, and a distributed approach (often distributed even across a WAN, not only a LAN) to building a system. In essence, they are clustered data stores, and the data model can be object-based or key-value. These stores have a couple of frankly "magical" properties: the ability to reliably store data in a completely unreliable medium - RAM - and the ability to process data directly in memory, without first saving it to disk and reading it back again.
The grid storage model is based on the shared-nothing principle and implies distributing data across multiple physically separate servers. All servers in an IMDG are equal, and all data is stored in their RAM. At the same time, the grid design is fault-tolerant, with automatic detection of failed servers, and, if necessary (for example, to change the amount of RAM in the grid), servers can be added or removed without disrupting the grid as a whole.
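As a rough illustration of the shared-nothing idea - a minimal sketch in Java, not any particular vendor's implementation - a key is hashed to one of a fixed number of partitions, and partitions are assigned to the nodes that are currently alive, so adding or removing a server only moves some of the partitions (real grids also keep backup copies of each partition on other nodes):

```java
import java.util.List;

// Simplified illustration of shared-nothing partitioning in an IMDG.
// Real grids use consistent/rendezvous hashing and replicate partitions;
// this sketch only shows the routing idea.
public class PartitionRouter {
    private static final int PARTITIONS = 1024;  // fixed partition count (an assumption)
    private final List<String> liveNodes;        // e.g. ["node-1", "node-2", "node-3"]

    public PartitionRouter(List<String> liveNodes) {
        this.liveNodes = liveNodes;
    }

    // Every key deterministically belongs to one partition.
    int partitionOf(Object key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    // Partitions are spread across the nodes that are currently alive;
    // when a node joins or leaves, only the affected partitions migrate.
    String primaryNodeFor(Object key) {
        return liveNodes.get(partitionOf(key) % liveNodes.size());
    }

    public static void main(String[] args) {
        PartitionRouter router = new PartitionRouter(List.of("node-1", "node-2", "node-3"));
        System.out.println(router.primaryNodeFor("customer:42")); // prints the owning node
    }
}
```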
The properties already described should be enough to attract the attention of architects of large systems to the idea of IMDG. In effect, we get distributed storage of data that is processed in the same distributed way - and all this at the price of adding RAM, whose unit cost, to our great relief, has been falling lately. Conceptually, grid solutions have no limit on the amount of RAM in the cluster, so the problem of data volumes outgrowing the storage technology is much less acute with them.
Choosing a suitable grid solution for a specific project is not an easy task. Partly because grids themselves are quite complex and their internals are interesting: like the proverbial cat, "you need to know how to cook them" (meaning that both launching a cluster and integrating it into your project will take time). Partly because, in their evolution, IMDG products keep borrowing useful features from one another, and you need to look carefully at where a piece of functionality is implemented in a more interesting way and fits the needs of your project better.
We discussed the topic of In-Memory Data Grids with several extremely interesting people. Interesting, among other things, because each of them has experience with, and an understanding of, how a well-functioning system should be designed and implemented. So, meet:
Andrey Ershov (Dino Systems)
- An interesting technological question: the choice of a grid solution at the system design stage. What would you advise: to count right away on the wide range of features offered by some solution, or to follow the path of gradual complication - to start with something simpler and, as the project grows, explore the possibility of using more functional solutions?

The principles of operation of all IMDGs are similar, but the API may differ. There is even JSR-107, which unifies working with an IMDG regardless of the vendor. In that case you can move from one solution to another absolutely painlessly. But there are several "buts".
First, this JSR describes only basic things, like "put into the cache" / "read from the cache", entry processors, and continuous queries. Often this functionality is not enough, and then you have to use vendor-specific features.
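To make these "basic things" concrete, here is a minimal sketch of the JSR-107 (JCache) API that was just mentioned: a put, a get, and an entry processor that runs next to the data. The cache name and types are made up for the example, and some JCache provider (Hazelcast, Ignite, Coherence, ...) is assumed to be on the classpath.

```java
import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;
import javax.cache.processor.EntryProcessor;

public class JCacheBasics {
    public static void main(String[] args) {
        // Picks up whatever JCache provider is on the classpath.
        CacheManager manager = Caching.getCachingProvider().getCacheManager();

        MutableConfiguration<String, Long> cfg =
                new MutableConfiguration<String, Long>().setTypes(String.class, Long.class);
        Cache<String, Long> visits = manager.createCache("pageVisits", cfg); // hypothetical cache name

        // "Put into the cache" / "read from the cache".
        visits.put("/index.html", 1L);
        Long count = visits.get("/index.html");

        // Entry processor: the increment runs where the entry lives,
        // instead of reading the value to the client and writing it back.
        Long updated = visits.invoke("/index.html",
                (EntryProcessor<String, Long, Long>) (entry, arguments) -> {
                    long next = (entry.exists() ? entry.getValue() : 0L) + 1;
                    entry.setValue(next);
                    return next;
                });

        System.out.println(count + " -> " + updated);
    }
}
```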
Secondly, you need to run functional and load tests against the chosen solution, because different IMDGs have different parameters: some default values are tuned to ensure maximum performance, while many failure scenarios may be handled incorrectly.
Then again, switching to another solution may mean having to learn the configuration of another vendor's system.
My recommendation is to immediately select a vendor, but first use the community version, and then pay for the enterprise version with technical support and additional features.
Frankly, choosing a vendor is not easy, especially if you are dealing with an IMDG for the first time and do not fully know the scenarios for using it in your system. A good approach in this case is to build a PoC of your system with each of the IMDG solutions.
- It so happens that almost every IMDG solution positions itself as the best for users (if not for all of them, then for many). Is it possible to speak unambiguously about the best and the worst here?

I have extensive experience only with GridGain (Ignite), and I also have an idea of the features of Coherence and Hazelcast. I can say that all three solutions provide plenty of functionality and have a lot in common. If you need a key-value store with the ability to react to changes in the grid, you can safely take any of them. But if you plan to run complex queries over the data, or you need transaction support (with an adjustable isolation level), saving data to disk, replicating data between data centers, correct behavior under network partitions, or complex data distribution across nodes, then it makes sense to study the documentation of each solution and choose the one that fits your task.
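As an author's note on the vendor-specific functionality mentioned here (transactions with a configurable isolation level), below is a rough sketch of what a transactional update could look like with Apache Ignite; the cache name and values are invented, and the cache must be created in TRANSACTIONAL atomicity mode for the transaction to apply:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class IgniteTxSketch {
    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // The cache must be TRANSACTIONAL, otherwise operations are only atomic per key.
            CacheConfiguration<Integer, Double> cfg = new CacheConfiguration<>("accounts"); // hypothetical cache
            cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
            IgniteCache<Integer, Double> accounts = ignite.getOrCreateCache(cfg);

            accounts.put(1, 100.0);
            accounts.put(2, 0.0);

            // Transfer between two keys: either both updates apply or neither does.
            try (Transaction tx = ignite.transactions().txStart(
                    TransactionConcurrency.PESSIMISTIC, TransactionIsolation.REPEATABLE_READ)) {
                double from = accounts.get(1);
                accounts.put(1, from - 10.0);
                accounts.put(2, accounts.get(2) + 10.0);
                tx.commit();
            }
        }
    }
}
```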
- If you had a magic wand capable of creating one, but very good, data storage system - what system would you order it to make?

Of course, as a programmer, I want the system to always be available, regardless of server crashes and network problems, while behaving in the most expected way.
In general, when it comes to expectations about how a database or IMDG works, people talk about consistency and freshness. If the system deals with transactions, transaction isolation levels are added to that (for background I recommend the CRDT talk from CodeFreeze; here is a video of that talk - author's note). The higher the level of consistency and isolation, the more predictably the system behaves and the simpler it is to program against. The strongest guarantee is strict serializability (strong-1SR): a combination of the serializability isolation level for transactions and linearizability as the consistency level. A system that provides strict serializability is a dream.
Is such a system possible? We are, of course, interested in distributed systems. Here we recall the CAP theorem and say that in the CP (CA) case such a system is possible and even exists - Google Spanner. However, only Google can afford such a system, because of the very high infrastructure requirements. And even then you have to put up with delays when replicating data between continents.
I would like to have a system that guarantees strict consistency in the AP case, that is, when network problems are possible, and also when a fast response time is needed. There are no such systems, and there cannot be, as shown, for example, in Peter Bailis's PhD thesis.
Returning to the topic of choosing an IMDG: think about what guarantees a particular solution gives, and whether the vendor's promises violate the theoretical limits of distributed systems. Can you rely on the IMDG if a port on a switch burns out, or a long garbage collection pause hits the JVM, or a tractor rips up the cable between your data centers?
Vladimir Ozerov (GridGain)
- GridGain has free and paid versions. Tell me, from a technical point of view, is the difference between them only in the availability of technical support and a few features (such as WAN replication), while otherwise the open-source Apache Ignite and the "paid" GridGain are the same? Or are we dealing here with the "Fedora vs. RHEL" model, where features get their "battle testing" in the free version and later land in the paid distribution in a more stable form?

There are three versions of the product. GridGain Professional is the Apache Ignite codebase plus support and prompt fixes and enhancements (hotfixes). This is a critical point for business, since in open source nobody owes you anything and there is nobody to ask.
In addition to Professional, we offer GridGain Enterprise and GridGain Ultimate - products based on GridGain Professional with extended functionality, such as WAN replication, security, rolling upgrades, snapshots, etc.
Hotfixes go straight into the Apache Ignite master branch, but are released earlier as part of GridGain Professional. So paid users receive them immediately, while free users either wait for the next Apache Ignite release or build it themselves from master, at their own risk.
We do not use free users as a testing ground :-)
- Apparently, Apache Ignite (or, if you prefer, GridGain) positions itself as a broader product than other IMDG systems. Is this really a reason for pride, or just a marketing move to look better than the others?

It's simple: our strategy consists of three points. First, we work on any platform and with any frameworks. Our arsenal includes rich support for Java, C# and C++, and dozens of integrations - from Spring/Hibernate to Hadoop/Spark - and we keep strengthening this area. Second, SQL as the key way to access data. This is a different weight category compared to any key-value API; we are actively developing our own SQL engine and JDBC/ODBC drivers for it. Third, persistence. We have learned to work with both memory and disk. Apache Ignite is now a distributed, horizontally scalable DBMS that can store more data than your RAM allows.
This shows that we have really gone far beyond the classic term "IMDG". At the same time, Apache Ignite is still a fast and convenient grid; one does not preclude the other. As for who is "broader" or "narrower" - that is for users to judge :-)
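An author's note to illustrate the SQL access path Vladimir describes: a minimal sketch using Ignite's JDBC thin driver, assuming a recent Ignite version with the thin driver available and a node already running on localhost; the table and data are invented for the example.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IgniteSqlSketch {
    public static void main(String[] args) throws Exception {
        // Register the thin driver explicitly (harmless if it self-registers).
        Class.forName("org.apache.ignite.IgniteJdbcThinDriver");

        // Connect to a running Ignite node via the JDBC thin driver.
        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1")) {
            try (PreparedStatement st = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS city (id BIGINT PRIMARY KEY, name VARCHAR)")) {
                st.executeUpdate();
            }
            try (PreparedStatement st = conn.prepareStatement(
                    "INSERT INTO city (id, name) VALUES (?, ?)")) {
                st.setLong(1, 1L);
                st.setString(2, "St. Petersburg");
                st.executeUpdate();
            }
            // A regular SQL query, executed against data that lives in the grid's memory.
            try (PreparedStatement st = conn.prepareStatement("SELECT name FROM city WHERE id = ?")) {
                st.setLong(1, 1L);
                try (ResultSet rs = st.executeQuery()) {
                    while (rs.next())
                        System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```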
- Projects that require an IMDG are almost always non-standard, and the solutions, both architectural and technical, end up being developed specifically for the project. In your opinion, should a company using Ignite/GridGain in its projects have competence at the level of understanding how the chosen grid solution works internally? Does the company need a dedicated IMDG DBA?

A very good question. We really do run into a lack of skills in applying and administering our system, since it is too different from the classical DBMS everyone is used to. Right now this largely falls on the shoulders of our solution architects. But we are working on this from many sides: our focus is documentation, training, and building partnerships with system integrators. All of this helps spread knowledge and experience. I think that within a few years such skills will become fairly widespread.
Victor Gamov (Confluent; formerly Hazelcast)
- It so happens that almost every IMDG solution positions itself as the best for users (if not for all of them, then for many). Is it possible to speak unambiguously about the best and the worst here?

I would talk about use cases rather than "best/worst". You have to start from the internal architecture of the project. Each IMDG solution develops in its own direction, and comparing them without answering the question "best for what?" is wrong.
Grids, unlike distributed-cache solutions, can perform computations directly on their nodes - a niche feature, but very convenient for certain types of tasks (a sketch of this follows after the answer - author's note). However, sometimes an IMDG is used only as a cache, and then it is of course hard to compare different solutions, since the task is not the one grids are aimed at. By the way, once developers have gotten comfortable with the grid, they most often start using the other features of the IMDG they chose.
In any case, you need to talk about grids from a developer's standpoint, and to start, firstly, from the task this developer is solving and, secondly, from their previous experience and the stack used in the project.
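The sketch promised above: running a computation on the node that owns the data, here with a Hazelcast 3.x-style executor API (the map name, key, and task are hypothetical; the same idea exists in other grids, for example Ignite's affinity-aware compute).

```java
import java.io.Serializable;
import java.util.concurrent.Callable;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.HazelcastInstanceAware;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.IMap;

public class ComputeOnOwnerSketch {

    // The task is sent to the node that owns the key, so the value is read locally there.
    static class LengthOfValue implements Callable<Integer>, Serializable, HazelcastInstanceAware {
        private final String key;
        private transient HazelcastInstance hz;

        LengthOfValue(String key) { this.key = key; }

        @Override
        public void setHazelcastInstance(HazelcastInstance hz) { this.hz = hz; }

        @Override
        public Integer call() {
            IMap<String, String> map = hz.getMap("documents"); // hypothetical map name
            String value = map.get(key);                       // local read on the owner node
            return value == null ? 0 : value.length();
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        hz.<String, String>getMap("documents").put("doc-1", "grids can compute where the data lives");

        IExecutorService executor = hz.getExecutorService("default");
        Integer length = executor.submitToKeyOwner(new LengthOfValue("doc-1"), "doc-1").get();
        System.out.println(length); // computed on the node that stores "doc-1"

        hz.shutdown();
    }
}
```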
- There are many "purely IMDG" solutions on the market and practically only one (Ignite/GridGain) that positions itself as something more. Is it fair to say that there is a niche for every solution, or that someone has pushed ahead technologically while the others simply haven't mastered the extra features?

I recall a phrase from the time of Microsoft's rapid growth: "Let's concentrate on everything!"
I think a large number of features in a fairly niche product is, in many respects, a marketing matter. Each IMDG is interesting in its own way, and each is at its own point of development, so the pace of adding features differs everywhere.
An IMDG is strong in three areas: data storage, computation over that data, and distributed communication. The remaining features are added as the need arises. I think GridGain implements some interesting features partly in response to customer requests (for example, in the framework of its cooperation with SberTech), while other IMDG solutions don't see an immediate need to spread themselves thin by expanding their product's functionality.
On the other hand, the benefits of such a narrowly focused system as an IMDG go first of all to developers, not end users. And it is the developers who decide how to take advantage of the capabilities the IMDG offers and how to change the application logic to use them. (A link to Victor's talk in English - author's note.)
- For a long time, the concept of IMDG seemed detached from reality: storing all the data in memory was expensive and somehow irrational, and applications worked through interfaces optimized for selecting and processing a carefully chosen slice of the data. Today IMDG is increasingly presented as a kind of "silver bullet". Can we expect this technology to stay on the market for a long time, that it is not just another fashionable trend that will one day be replaced by something else?

I think so. In my view, the cost of memory will keep falling, the cost of servers will also go down, and network speeds will grow. If your task requires reliable storage of large amounts of data with fast access to it, an IMDG will remain an excellent choice for a long time. Such tasks will always exist, so grids as a tool will stay in demand.
Here I will note that IMDG fits perfectly into the stream-processing scheme. If we have a stream of data and want to compute statistics over it in real time, then with an IMDG the data is processed immediately - without being stored on, say, disk; right in memory, but with a guarantee of data integrity even when nodes in the cluster fail - and can immediately be used further, for example displayed on a dashboard. Note that there is no need to first load the data into some database (onto disk), then process it, write it back to the same disk, and only then read it from disk for the "outside world": we keep the data in memory the whole time and save considerably on shuttling it "back and forth" through long-term storage we don't need. (A sketch of this pattern appears after the answer - author's note.)
Will IMDG be a silver bullet? Hardly: as I said, it is a niche solution, but grids will definitely remain in demand.
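The stream-processing sketch promised above: a continuous query in Apache Ignite whose listener fires as new entries arrive, so statistics can be updated in memory in real time; the cache name and the simulated data stream are invented for the example.

```java
import javax.cache.Cache;
import javax.cache.event.CacheEntryUpdatedListener;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

public class StreamingStatsSketch {
    public static void main(String[] args) throws Exception {
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Long, Double> measurements = ignite.getOrCreateCache("measurements"); // hypothetical cache

            // Continuous query: the local listener fires as new entries arrive,
            // so statistics can be updated in real time without touching disk.
            ContinuousQuery<Long, Double> qry = new ContinuousQuery<>();
            qry.setLocalListener((CacheEntryUpdatedListener<Long, Double>) events ->
                    events.forEach(e -> System.out.println("new value: " + e.getValue())));

            // Keep the cursor open while we want to receive updates.
            try (QueryCursor<Cache.Entry<Long, Double>> cursor = measurements.query(qry)) {
                // Simulated stream of incoming data.
                for (long i = 0; i < 5; i++)
                    measurements.put(i, Math.random());

                Thread.sleep(1000); // give the listener time to print
            }
        }
    }
}
```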
You can discuss this topic for a long time: argue, agree, start holy wars, and sometimes even come to blows... And while the first three options can be practiced right in the comments, for the fourth (though the first three work there too) you need to go somewhere in person. Where? You can go to the Joker 2017 Java conference (November 3-4, St. Petersburg) or the DevOops 2017 DevOps conference (October 20, St. Petersburg) - there will be someone there to talk storage with.