At the end of July, Apache Ignite 2.1 was released. Apache Ignite is a free distributed HTAP platform (HTAP: Hybrid Transactional and Analytical Processing, i.e. systems that can handle both transactional and analytical loads) for storing data in RAM and on disk, as well as for real-time computing. Ignite is written in Java and integrates tightly with .NET and C++.
Version 2.1 is rich in meaningful, practically applicable features that build on the foundation laid in Apache Ignite 2.0. With Apache Ignite 2.1, you get the Apache Ignite Persistent Data Store, a distributed disk storage with SQL support; the first distributed machine learning algorithms; new DDL capabilities; and, in addition, greatly improved support for the .NET and C++ platforms.
Persistent Data Store brings Apache Ignite into a new segment: it is no longer just an in-memory data grid but a full-featured distributed, scalable HTAP database that can reliably store primary data, with SQL support and real-time information processing.
Apache Ignite PDS - Persistent Data Store
Persistent Data Store is a new platform component that makes it possible to store data not only in RAM but also on disk.
We invested a great deal of engineering effort in PDS because of the limitations of classic third-party DBMSs when they are used together with systems like Apache Ignite:
- The DBMS is a bottleneck for writes. With in-memory computing and horizontal scaling, significantly higher read rates can be achieved, but writes remain a bottleneck, because the platform has to duplicate every write operation in the DBMS, which almost always scales much worse, if it scales at all. And the need for fast writes can sometimes exceed the need for fast reads. For example, one of our clients issues only a few dozen read requests per second (with very low latency required for random reads), yet tens of thousands of write operations per second that must be saved transactionally for later access.
- The DBMS is a single point of failure. Apache Ignite clusters can consist of tens or hundreds of nodes that back each other up, ensuring the safety of client data and operations, whereas in most cases the DBMS has much more limited redundancy capabilities. A DBMS failure in such a setup costs far more, and is more stressful for the company, than the failure of one or several cluster nodes.
- SQL cannot run simultaneously over data in Ignite and in the DBMS. You can run ANSI SQL-99 queries over data stored in the memory of an Apache Ignite cluster, but you cannot run such queries correctly if even part of the data resides in a third-party DBMS. The reason: Apache Ignite does not control that DBMS and does not know which data is in it and which is not. The pessimistic assumption that not all data is in RAM would require duplicating every query in the DBMS, which would significantly reduce performance (not to mention the need to integrate with a variety of SQL and NoSQL solutions, each with its own peculiarities and limitations, which is practically an impossible task).
- The need for warm-up. To use Apache Ignite, you must first load the necessary data from the DBMS into the cluster, which can take an enormous amount of time for large data sets: imagine how long it takes to load, say, 4 terabytes of data into a cluster! Apache Ignite cannot do this on demand without additional user-written logic, for the same reasons as in the previous point.
Implementing PDS allowed us to build a distributed, horizontally scalable storage, limited in size primarily by the available physical disk arrays, which solves the problems described above.
- PDS scales along with the cluster. Each Apache Ignite node now stores its portion of the data not only in memory but also on disk, which gives near-linear scalability of write performance.
- PDS supports Apache Ignite failover mechanisms. For efficient distribution and backup of data, you can use the same redundancy factor and collocation function. The proprietary GridGain module also lets you make online backup copies of the entire disk space; these can be used for subsequent disaster recovery (DR), or deployed on another cluster, including a smaller one, for non-real-time analytics that might otherwise load the main cluster too heavily.
- PDS allows SQL queries over data both in memory and on disk. Since Apache Ignite is tightly integrated with this storage, the platform has complete information about where the data is located and can run cross-cutting SQL queries (see the sketch below). This lets users treat Apache Ignite as a Data Lake: a large archive of information resides on disk, and queries over it can be executed when needed.
- PDS enables a "lazy start": data is loaded into memory gradually, as needed. The cluster can start almost instantly, cold, and as data is accessed it is gradually pulled into RAM. This significantly speeds up disaster recovery (DR RTO), reduces business risks, and also simplifies and cheapens testing and development procedures by saving warm-up time.
At the same time, PDS delivers higher performance than third-party products because it avoids unnecessary network round-trips.
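To make the cross-cutting SQL point concrete, here is a minimal sketch; the "persons" cache and the Person fields are hypothetical (chosen to match the DDL example later in this article), and ignite stands for an already started node:

import java.util.List;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.SqlFieldsQuery;

IgniteCache<?, ?> cache = ignite.cache("persons"); // hypothetical cache name

// Ignite tracks which pages are in RAM and which exist only in PDS on disk,
// so a single query transparently covers both.
List<List<?>> rows = cache.query(
    new SqlFieldsQuery("SELECT name, company FROM Person WHERE age > ?").setArgs(40)
).getAll();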
Technical details
The interaction between memory and disk builds on the technologies implemented in Apache Ignite 2.0. The available storage space is divided into pages, each of which can hold zero or more objects, or a fragment of an object (when the object is larger than a page). A page can reside in RAM with a duplicate on disk, or on disk only.
By default, a page is created and kept in memory, but when the available space is exhausted, pages are evicted to disk. Later, when data from such a page is requested, it is brought back into memory.
Indexes are kept on separate pages, and with well-constructed queries the search should traverse the index data held in memory, avoiding a full scan and the wholesale loading of the entire data set from disk.
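Page size and the amount of RAM available for pages are configured separately from PDS. A minimal sketch, assuming the memory API of Apache Ignite 2.1 (MemoryConfiguration and MemoryPolicyConfiguration; the policy name and sizes are illustrative):

import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.configuration.MemoryConfiguration;
import org.apache.ignite.configuration.MemoryPolicyConfiguration;

MemoryConfiguration memCfg = new MemoryConfiguration();
memCfg.setPageSize(4 * 1024); // page size, in bytes

MemoryPolicyConfiguration plcCfg = new MemoryPolicyConfiguration();
plcCfg.setName("default_mem_plc"); // illustrative policy name
plcCfg.setMaxSize(4L * 1024 * 1024 * 1024); // beyond ~4 GB of pages, eviction to disk kicks in

memCfg.setMemoryPolicies(plcCfg);
memCfg.setDefaultMemoryPolicyName("default_mem_plc");

IgniteConfiguration cfg = new IgniteConfiguration();
cfg.setMemoryConfiguration(memCfg);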
Persistent Data Store is very easy to enable:
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration"> <property name="persistentStoreConfiguration"> <bean class="org.apache.ignite.configuration.PersistentStoreConfiguration"/> </property> </bean>
or
igniteConfiguration.setPersistentStoreConfiguration(new PersistentStoreConfiguration());
Important! This changes the cluster startup procedure. After all nodes have started, the cluster must be activated, for example in the web console, or by calling
ignite.active(true)
in code. This is necessary to make sure that all the nodes holding the persisted data have joined the topology and are ready to serve their part of the data set. Otherwise, the cluster would assume that some of the primary or backup copies of the data are missing and would start rebalancing (redistributing copies of data between nodes to restore the required redundancy level and even load distribution), which is not the desired behavior.
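A minimal startup sketch (the configuration file path is illustrative; in 2.1 the activation call is Ignite.active(boolean)):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

// Start a node whose XML config enables PersistentStoreConfiguration.
Ignite ignite = Ignition.start("config/ignite-pds.xml"); // illustrative path

// Once all nodes that own persisted data have joined the topology:
ignite.active(true);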
Summarizing, Persistent Data Store brings Apache Ignite into a new segment, letting users employ Ignite not only as a data processing system but also as a horizontally scalable primary storage layer.
Machine learning algorithms
Apache Ignite 2.1 introduces the first practically applicable distributed machine learning algorithms, built on Apache Ignite 2.0 technologies: k-means clustering and multiple linear regression using the least squares method.
They are quite simple to use, and they are only the first steps: future versions will add new algorithms that can be applied for a variety of purposes.
Application example (Java):

KMeansLocalClusterer clusterer = new KMeansLocalClusterer(new EuclideanDistance(), 1, 1L);

double[] v1 = new double[] {1959, 325100};
double[] v2 = new double[] {1960, 373200};

DenseLocalOnHeapMatrix points = new DenseLocalOnHeapMatrix(new double[][] {v1, v2});

KMeansModel mdl = clusterer.cluster(points, 1);

Integer clusterIdx = mdl.predict(new DenseLocalOnHeapVector(new double[] {20.0, 30.0}));
The data distribution model can be controlled by choosing different implementations of matrices and vectors. For example, DenseLocalOnHeapMatrix stores its data only on the local node, in the managed JVM heap.
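For instance, the same points from the clustering example can be manipulated directly through the vector API, staying on the local JVM heap. A minimal sketch, assuming the Vector interface of the 2.1 ML module exposes element-wise plus and times:

import org.apache.ignite.ml.math.Vector;
import org.apache.ignite.ml.math.impls.vector.DenseLocalOnHeapVector;

Vector a = new DenseLocalOnHeapVector(new double[] {1959, 325100});
Vector b = new DenseLocalOnHeapVector(new double[] {1960, 373200});

// Element-wise mean of the two points: a handwritten one-cluster centroid.
Vector centroid = a.plus(b).times(0.5);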
DDL
Apache Ignite 2.1 extends DDL support. Apache Ignite 2.0 introduced work with indexes; the new version adds the ability to work with tables: CREATE TABLE and DROP TABLE.
CREATE TABLE IF NOT EXISTS Person (
    age int,
    id int,
    name varchar,
    company varchar,
    PRIMARY KEY (name, id)
) WITH "template=replicated,backups=5,affinitykey=id";

DROP TABLE Person;
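From Java, these statements can be issued through the regular SQL query API. A minimal sketch, assuming DDL is accepted by SqlFieldsQuery in 2.1 (the cache used to run the query is arbitrary; "dummy" is a placeholder name, and ignite is an already started node):

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.SqlFieldsQuery;

IgniteCache<?, ?> cache = ignite.getOrCreateCache("dummy"); // any cache can issue DDL

cache.query(new SqlFieldsQuery(
    "CREATE TABLE IF NOT EXISTS Person (" +
    "  age int, id int, name varchar, company varchar," +
    "  PRIMARY KEY (name, id))" +
    " WITH \"template=replicated,backups=5,affinitykey=id\"")).getAll();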
Tables are created in the PUBLIC schema, and a new cache is created for each table. Tables have additional settings that can be specified in the WITH clause:
- TEMPLATE - base the table's cache on the specified cache template; the special values PARTITIONED and REPLICATED select the default settings with the corresponding cache distribution mode.
- BACKUPS - the number of backup copies of the data, i.e. the redundancy factor.
- ATOMICITY - takes the values ATOMIC or TRANSACTIONAL, i.e. disables or enables transactions on the cache.
- CACHEGROUP - the group to which the cache belongs.
- AFFINITYKEY - the name of a column (it must be part of the PRIMARY KEY!) to be used as the affinity key for correct data collocation.
In future releases, we plan to add the ability to declare columns NOT NULL, to specify default values, an AUTO INCREMENT parameter, and so on, and we also want to expand the vocabulary of DDL expressions.
Cache groups
In Apache Ignite 2.1, it became possible to use cache groups: caches that share common storage and partitions while still storing different kinds of values. This approach can significantly reduce the overhead of maintaining a large number of caches, both in memory footprint and in the processor resources needed to maintain and synchronize cache metadata.
At the same time, caches in a group share a common physical cache and, accordingly, such components as the affinity function, node filter, cache mode, partition loss policy, and memory policy.
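A minimal configuration sketch (cache and group names are illustrative; setGroupName is the 2.1 API for assigning a cache to a group, and ignite is an already started node):

import org.apache.ignite.IgniteCache;
import org.apache.ignite.configuration.CacheConfiguration;

CacheConfiguration<Integer, String> personsCfg = new CacheConfiguration<>("persons");
personsCfg.setGroupName("group1"); // all caches in "group1" share storage and partitions

CacheConfiguration<Integer, String> companiesCfg = new CacheConfiguration<>("companies");
companiesCfg.setGroupName("group1");

IgniteCache<Integer, String> persons = ignite.getOrCreateCache(personsCfg);
IgniteCache<Integer, String> companies = ignite.getOrCreateCache(companiesCfg);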
And much more