This article continues a short series on in-memory data grids (IMDG).
The first article introduced the concept of an IMDG without concrete examples or implementation details. Today we dig a little deeper.
Modes of IMDG
The modes of operation do not differ fundamentally between IMDG products, so the following applies to the IMDG concept as a whole rather than to any individual solution.
1. Local mode
In this mode, an IMDG cluster consists of a single node. It is used mainly for debugging.
2. Replicated mode
In replicated mode, the complete data set is replicated to each node in the cluster.
Pros:
- Fast data access, because all data is available on every cluster node
- Losing a node does not trigger a rebalancing of the cluster, since no data has to be redistributed
Cons:
- The mode is applicable only if the entire working data set fits in the memory of a single node
- Increased memory consumption
- A PUT must update the object on every node, so extra work is needed to keep the data in the cluster consistent, which makes PUT slower in this mode
- Limited scalability, since each PUT has to be replicated to every node
To keep each PUT from taking so long, you can perform it asynchronously (many IMDGs offer this option), but then you alone are responsible for the consistency of the data. For that reason, I would not use this mode in write-intensive systems.
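The trade-off between a synchronous and an asynchronous PUT can be illustrated with a toy model. This is only a sketch: the ReplicatedCache class, its methods, and the in-process "nodes" are hypothetical and do not correspond to the API of any real IMDG.

```python
import queue
import threading


class ReplicatedCache:
    """Toy replicated cache: every simulated node holds a full copy of the data."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]  # one dict per node
        self._pending = queue.Queue()                 # replication backlog
        worker = threading.Thread(target=self._replicate_forever, daemon=True)
        worker.start()

    def put_sync(self, key, value):
        # Returns only after every node has the new value.
        for node in self.nodes:
            node[key] = value

    def put_async(self, key, value, origin=0):
        # Update the originating node, queue the rest, and return at once.
        self.nodes[origin][key] = value
        self._pending.put((key, value, origin))

    def _replicate_forever(self):
        # Background worker that applies queued updates to the other nodes.
        while True:
            key, value, origin = self._pending.get()
            for i, node in enumerate(self.nodes):
                if i != origin:
                    node[key] = value
            self._pending.task_done()

    def flush(self):
        # Block until the replication backlog is empty.
        self._pending.join()

    def get(self, key, node=0):
        return self.nodes[node].get(key)
```

After put_async returns, a read from another node may still see the old value until the backlog is drained (flush in this sketch). That window is exactly the consistency gap the application has to handle itself.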
3. Distributed mode
The most interesting and most widely used IMDG mode, and the one in which all the strengths of the concept show.

The description of this mode was the basis of the previous article.
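As a reminder of the idea, here is a minimal sketch of a distributed cache in which each key lives on exactly one node. It assumes plain hash partitioning; the names owner_of and DistributedCache are hypothetical, and real products add partition tables, backup copies, and rebalancing on top of this.

```python
import hashlib


def owner_of(key, node_count):
    """Map a key to the index of the node that stores it (stable across runs)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % node_count


class DistributedCache:
    """Toy distributed cache: each entry is stored on a single simulated node."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def put(self, key, value):
        self.nodes[owner_of(key, len(self.nodes))][key] = value

    def get(self, key):
        return self.nodes[owner_of(key, len(self.nodes))].get(key)
```

Unlike the replicated mode, a PUT here touches one node only, and the total capacity grows with the number of nodes.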
Indices
To search for data in an IMDG, an inverted index is used.
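For readers unfamiliar with the term, an inverted index maps an attribute value to the set of cache keys whose objects carry that value. A minimal sketch follows; the InvertedIndex class and its extractor argument are hypothetical names chosen for illustration, not part of any product's API.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps an attribute value to the set of cache keys that carry it."""

    def __init__(self, extractor):
        self.extractor = extractor        # function: cached object -> attribute value
        self.postings = defaultdict(set)  # attribute value -> set of keys

    def add(self, key, obj):
        self.postings[self.extractor(obj)].add(key)

    def remove(self, key, obj):
        value = self.extractor(obj)
        self.postings[value].discard(key)
        if not self.postings[value]:
            del self.postings[value]      # drop empty posting sets

    def lookup(self, value):
        # Return a copy so callers cannot mutate the index.
        return set(self.postings.get(value, ()))
```

A lookup is then a single dictionary access instead of a scan over all cached objects.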
1. Oracle Coherence
Indexes are represented by objects that implement the MapIndex interface.
Currently (Oracle Coherence 3.7), two index implementations are available:
- SimpleMapIndex: an ordinary inverted index
- ConditionalIndex: extends SimpleMapIndex and additionally lets you specify a condition under which a cache object is indexed
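The idea behind a conditional index can be sketched in a few lines. This is a language-neutral illustration in Python, not the Coherence API: the class name ConditionalInvertedIndex and its extractor and condition arguments are hypothetical.

```python
from collections import defaultdict


class ConditionalInvertedIndex:
    """Inverted index that only indexes objects matching a condition."""

    def __init__(self, extractor, condition):
        self.extractor = extractor        # cached object -> attribute value
        self.condition = condition        # cached object -> bool
        self.postings = defaultdict(set)  # attribute value -> set of keys

    def add(self, key, obj):
        # Objects that fail the condition never enter the index.
        if self.condition(obj):
            self.postings[self.extractor(obj)].add(key)

    def lookup(self, value):
        return set(self.postings.get(value, ()))
```

Filtering at indexing time keeps the index small when queries only ever concern a subset of the cached objects.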
The index is distributed, i.e. each cluster node indexes only the data it holds. When a query is executed against the whole cluster, each node computes its part of the overall result independently; these parts are then sent to the node that issued the query, where they are merged into a single result.
If a query needs several indexes at once, a result set is first built for each index, and these sets are then intersected to obtain the final result. The intersection is not free, so before issuing a query that touches several indexes, consider whether it will end up intersecting huge sets of keys.
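The two steps described above (per-node evaluation with set intersection, then merging on the calling node) can be sketched as follows. Node and cluster_query are hypothetical names, the nodes are plain in-process objects, and intersecting the smallest set first is a design choice of this sketch rather than documented Coherence behavior.

```python
from collections import defaultdict


class Node:
    """One cluster node: holds its own entries and indexes only those."""

    def __init__(self):
        self.data = {}
        # field name -> attribute value -> set of keys
        self.indexes = defaultdict(lambda: defaultdict(set))

    def put(self, key, obj):
        self.data[key] = obj
        for field, value in obj.items():
            self.indexes[field][value].add(key)

    def query(self, criteria):
        # Build one key set per index, then intersect them, smallest first.
        key_sets = [self.indexes[f].get(v, set()) for f, v in criteria.items()]
        key_sets.sort(key=len)
        result = set(key_sets[0]) if key_sets else set(self.data)
        for keys in key_sets[1:]:
            result &= keys
        return result


def cluster_query(nodes, criteria):
    # Scatter the query to every node, then gather the partial answers.
    result = set()
    for node in nodes:
        result |= node.query(criteria)
    return result
```

The cost of `result &= keys` grows with the size of the sets involved, which is why a query over several low-selectivity indexes can be expensive.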
Pros:
- The distributed index parallelizes the search
- You can control which objects are indexed
- Index updates are incremental, i.e. adding an object to the index does not require a full rebuild
Cons:
- Intersecting several large sets can take longer than searching non-indexed data
- Search is slower than with bitset-based indexes
2. JBoss Infinispan
Here, Apache Lucene (an open-source full-text search engine) and Hibernate Search (which is itself built on Lucene) are used to search the caches.
This choice has significant drawbacks, but there are also advantages.
Pros:
- The result of an index query is a bitset, so intersecting such sets (when a query searches several indexes) is a cheap operation
- Full-text search support
- Fast search (because it is performed by applying a binary mask)
Cons:
- The index is not distributed; it is stored in an Apache Lucene Directory (or rather, in an adapted implementation of it)
- Only one thread at a time can update the index
- The index can be updated only through a full rebuild
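To show why intersecting bitsets is cheap, here is a toy bitset index that uses Python integers as bit masks over document ids. It illustrates the principle only and says nothing about how Lucene stores its data internally; BitsetIndex and ids_from_mask are hypothetical names.

```python
class BitsetIndex:
    """Each attribute value maps to an int used as a bit mask over document ids."""

    def __init__(self):
        self.masks = {}  # attribute value -> int bit mask

    def add(self, doc_id, value):
        # Set bit number doc_id in the mask of this value.
        self.masks[value] = self.masks.get(value, 0) | (1 << doc_id)

    def lookup(self, value):
        return self.masks.get(value, 0)


def ids_from_mask(mask):
    """Expand a bit mask into the sorted list of document ids it contains."""
    ids, doc_id = [], 0
    while mask:
        if mask & 1:
            ids.append(doc_id)
        mask >>= 1
        doc_id += 1
    return ids
```

A query over two indexes then reduces to a single bitwise AND, for example `city.lookup("Oslo") & role.lookup("dev")`, instead of an element-by-element set intersection.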
3. VMware GemFire
Two types of index are used when indexing data: Primary Key Index and Functional Index.
The difference between the two is that a Primary Key Index lets you test an indexed attribute only for equality with a constant, while a Functional Index also lets you perform comparisons. For example, you can select objects whose field satisfies someField > 10.
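The difference in supported operations can be illustrated with two toy structures: a hash-based index that answers only equality lookups, and a sorted index that also answers range comparisons. Mapping GemFire's two index types onto these structures is an assumption of this sketch; EqualityIndex and RangeIndex are hypothetical names.

```python
import bisect


class EqualityIndex:
    """Hash-based index: supports only 'attribute == constant'."""

    def __init__(self):
        self.by_value = {}  # attribute value -> set of keys

    def add(self, key, value):
        self.by_value.setdefault(value, set()).add(key)

    def equal_to(self, value):
        return set(self.by_value.get(value, ()))


class RangeIndex:
    """Sorted index: supports comparisons such as 'attribute > constant'."""

    def __init__(self):
        self.values = []  # attribute values, kept sorted
        self.keys = []    # cache keys, parallel to self.values

    def add(self, key, value):
        pos = bisect.bisect_right(self.values, value)
        self.values.insert(pos, value)
        self.keys.insert(pos, key)

    def greater_than(self, bound):
        # Everything to the right of the last value <= bound qualifies.
        start = bisect.bisect_right(self.values, bound)
        return set(self.keys[start:])
```

With this sketch, the condition someField > 10 corresponds to `greater_than(10)` on the sorted index, which the hash-based index has no way to answer without scanning every value.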
Index updates can be performed synchronously (which favors consistency) or asynchronously (which favors update speed).
In general, the pros and cons are the same as those of Oracle Coherence.
4. Hazelcast
Hazelcast does not divide indexes into types, but they work on the same principle as in Oracle Coherence, so there is no point in describing them separately.
Locks
If your application allows several threads to write to the same object, a locking mechanism is usually used to ensure data integrity. This mechanism works reliably as long as you stay within a single machine. But what if your data is distributed across a cluster?
For this case, IMDG solutions provide distributed lock implementations.
A distributed lock is a lock that is available on all nodes of the cluster and has the same state on all of them. That is, a situation in which two threads on different nodes hold the same lock at the same time is impossible.
Distributed locking ensures synchronization of data access in a cluster.
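The key property (one owner at a time, visible identically from every node) can be modelled with a single lock table shared by all nodes. This is a deliberately simplified sketch: LockCoordinator is a hypothetical name, the central table is an assumption made for brevity, and real IMDGs implement locks with their own cluster protocols and must also handle the failure of the node that holds a lock.

```python
import threading


class LockCoordinator:
    """Single source of truth for lock ownership, shared by all simulated nodes."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the table itself
        self._owners = {}               # lock name -> owner id

    def try_lock(self, name, owner):
        # Succeeds only if nobody in the cluster holds the lock.
        with self._guard:
            if name in self._owners:
                return False
            self._owners[name] = owner
            return True

    def unlock(self, name, owner):
        # Only the current owner may release the lock.
        with self._guard:
            if self._owners.get(name) != owner:
                raise RuntimeError("lock is not held by this owner")
            del self._owners[name]

    def owner_of(self, name):
        with self._guard:
            return self._owners.get(name)
```

Here an owner is identified by a (node, thread) pair, so a thread on node-2 asking for a lock held by a thread on node-1 is refused until the lock is released.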
Conclusion
In the next article I will try to cover the results of comparing different IMDG and NoSQL solutions, but, as you can imagine, this will take some time, so do not expect the article before mid-September. I invite everyone to join the discussion of the results :)