
How are the stars counted? Using InterSystems Caché eXtreme in Gaia

I would like to thank 0leo, morisson and adaptun for their help in preparing the article.

Astrologer tools



As you know, the Gaia telescope will be launched tomorrow, December 19, 2013. Many have already read the article about the mission, but few know which technology the European Space Agency developers chose to process and store Gaia data. In 2011 the candidates under consideration were IBM DB2, PostgreSQL, Hadoop, Cassandra, and Caché (more precisely, Caché eXtreme Event Persistence; see, for example, Astrostatistics and Data Mining, ed. Luis Manuel Sarro, Laurent Eyer and William O'Mullane, pp. 111-112).

Fragment of the book

Almost every student has probably heard of the first four "players" in the industry. But what is Caché XEP?

Java technology in Caché


If you look at the Java API stack provided by InterSystems, you will see something like the following:


Handling complex events


The name of the technology hints, quite transparently, at the task of complex event processing (XEP reads like eXtreme complex Event Processing). This problem is being solved everywhere:


What are the current requirements for technologies that help solve the problem? The requirements are simple and few:



If you look at the IT market, you can immediately name the following interesting implementations:


To be fair, Caché eXtreme itself is not a CEP / XEP implementation: XEP stands not for eXtreme Event Processing but merely for eXtreme Event Persistence. However, we see no reason why the much-discussed CEP paradigm could not be implemented within the InterSystems technology stack with the help of a product such as Ensemble. In addition, many implementations achieve similar goals by using Caché eXtreme together with Esper or NEsper.

Back to the matter at hand


The high-level architecture of Caché eXtreme is quite simple:



Here, the Globals API provides fast, low-level access to globals. The functionality of the Globals API is also available as a free NoSQL database, GlobalsDB.

Caché XDO provides fast “dynamic” data access that does not require an object model on the client side. Its closest equivalent in the Java world is Reflection.
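To make the Reflection analogy concrete, here is a minimal sketch in plain Java. It uses only `java.lang.reflect` and is not the XDO API; the `SampleQuote` class, its field values, and the `DynamicAccessDemo` helper are hypothetical names invented for this illustration. The idea is that the reading code knows nothing about the object's class at compile time and discovers its fields by name at run time.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical data class with made-up values; it exists only for this illustration.
class SampleQuote {
    String ticker = "GAZP";
    double close = 137.5;
    long volume = 1200L;
}

class DynamicAccessDemo {
    // Reads all instance fields declared by the object's class, by name,
    // without any compile-time knowledge of that class.
    static Map<String, Object> readFields(Object target) {
        Map<String, Object> values = new LinkedHashMap<String, Object>();
        for (Field field : target.getClass().getDeclaredFields()) {
            // Skip static and compiler-generated fields: they are not object state.
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            field.setAccessible(true);
            try {
                values.put(field.getName(), field.get(target));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("Cannot read field " + field.getName(), e);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Prints the field names and values discovered at run time.
        System.out.println(readFields(new SampleQuote()));
    }
}
```

The caller receives a name-to-value map, which is roughly the kind of "model-free" view of data that dynamic access implies.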

Finally, the module with a name not quite euphonious to the Russian ear, Caché XEP, is also built on the Globals API and provides fast object and quasi-relational access to data. It is object access in the sense that the API client does not need to worry about object-relational mapping: a model at the level of Caché classes (or a database schema, from the relational point of view) is created automatically in the image and likeness of the Java object model, even with complex multi-level inheritance. It is quasi-relational in the sense that you can run SQL queries (more precisely, queries in a subset of SQL) directly in the context of the eXtreme connection; moreover, indexes and transactions are supported on the set of “events” loaded into the database. Of course, all loaded data is immediately available via JDBC through a relational view (with the full power of ANSI SQL plus the SQL extensions specific to the Caché dialect), but the access speed will be quite different. To sum up once more, we have:



This approach has advantages both over comparable relational solutions (higher access speed) and over various NoSQL solutions (immediate relational-style access to the data).
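The idea of deriving a schema "in the image and likeness" of a Java object model can be sketched in a few lines. This is only an illustration of the principle, not how Caché XEP actually generates classes: the `BaseEvent` and `TradeEvent` types and the `SchemaSketch` helper are hypothetical, and the "column" strings are a made-up textual format.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-level event hierarchy, used only for this illustration.
class BaseEvent {
    long timestamp;
}

class TradeEvent extends BaseEvent {
    String ticker;
    double price;
}

class SchemaSketch {
    // Flattens a class hierarchy into "name type" column descriptions,
    // listing inherited fields before the fields of subclasses.
    static List<String> columnsOf(Class<?> type) {
        // Collect the inheritance chain, root class first.
        List<Class<?>> chain = new ArrayList<Class<?>>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            chain.add(0, c);
        }
        List<String> columns = new ArrayList<String>();
        for (Class<?> c : chain) {
            for (Field field : c.getDeclaredFields()) {
                // Static and compiler-generated fields are not part of the persistent state.
                if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                    continue;
                }
                columns.add(field.getName() + " " + field.getType().getSimpleName());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // Prints the flattened column list for the subclass, inherited fields included.
        System.out.println(columnsOf(TradeEvent.class));
    }
}
```

The point of the sketch is that the client writes only ordinary Java classes; everything a relational view needs (column names and types, including inherited ones) can be computed from them.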

In addition, an eXtreme connection can be of two types. The first uses JNI but requires the Caché server to be available locally: unlike a type 2 JDBC connection, it does not support transferring data over the network. The second is an ordinary TCP connection, where data is transferred using a standard type 4 JDBC driver.

The only “fine-tuning” the JNI variant needs is setting up the environment:



For the TCP variant, it is enough to increase the JVM stack and heap sizes ( -Xss2m -Xmx768m ).
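A quick way to confirm that such flags actually reached the JVM is to inspect them from inside the process. The sketch below uses only standard `java.lang.management` calls; the class name `JvmSettingsCheck` and its helper methods are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmSettingsCheck {
    // Maximum heap the JVM will try to use, in megabytes (reflects -Xmx when it is set).
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Returns true if any JVM argument starts with the given prefix, e.g. "-Xss".
    static boolean hasFlag(List<String> jvmArgs, String prefix) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to the main method.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("Stack size flag set: " + hasFlag(jvmArgs, "-Xss"));
        System.out.println("Heap size flag set: " + hasFlag(jvmArgs, "-Xmx"));
        System.out.println("Max heap, MB: " + maxHeapMegabytes());
    }
}
```

Run it with the same options as the application, for example `java -Xss2m -Xmx768m JvmSettingsCheck`. The reported maximum heap is usually close to, but not necessarily exactly, the `-Xmx` value.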

Some practice


The authors wanted to see how Caché eXtreme handles writing a continuous stream of data compared to popular data processing technologies. As a data source we took historical quotes from the Finam holding site, which can be downloaded in CSV format (the authors are grateful to the creators of such a wonderful resource).

The resource itself


Since finding the right link on the site is next to impossible (we did not manage it the second time or any time after that), we share it here.
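Turning one CSV line of quotes into an event object might look like the sketch below. The article does not describe the file layout, so the nine-column order assumed here (ticker, period, date, time, open, high, low, close, volume) is an assumption that depends on the export settings, and the `QuoteBar` class is a hypothetical name.

```java
// Hypothetical event type for one row of a quotes file.
// ASSUMPTION: columns are TICKER,PER,DATE,TIME,OPEN,HIGH,LOW,CLOSE,VOL, comma-separated.
class QuoteBar {
    final String ticker;
    final String date;   // kept as text, e.g. "20131218"
    final String time;   // kept as text, e.g. "100100"
    final double open;
    final double high;
    final double low;
    final double close;
    final long volume;

    QuoteBar(String ticker, String date, String time,
             double open, double high, double low, double close, long volume) {
        this.ticker = ticker;
        this.date = date;
        this.time = time;
        this.open = open;
        this.high = high;
        this.low = low;
        this.close = close;
        this.volume = volume;
    }

    // Parses a single data line; the header line, if any, must be skipped by the caller.
    static QuoteBar parse(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 9) {
            throw new IllegalArgumentException("Expected 9 columns, got " + parts.length);
        }
        // parts[1] is the period column, which this sketch ignores.
        return new QuoteBar(parts[0], parts[2], parts[3],
                Double.parseDouble(parts[4]), Double.parseDouble(parts[5]),
                Double.parseDouble(parts[6]), Double.parseDouble(parts[7]),
                Long.parseLong(parts[8]));
    }

    public static void main(String[] args) {
        // A made-up sample line, not real market data.
        QuoteBar bar = parse("GAZP,1,20131218,100100,137.50,137.80,137.40,137.65,12000");
        System.out.println(bar.ticker + " closed at " + bar.close);
    }
}
```

A flat class of primitives and strings like this is also the shape of object that maps most directly onto a single table.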

The result is a rather naive test that completely ignores the rules for writing Java microbenchmarks. In particular, we never got around to plugging in JMH. Our partial excuse is that we are measuring not the speed of JIT-generated code but the speed at which non-JVM code (with the exception of Apache Derby) can write to disk. Alas, we also ignored the question of whether the hard disk used in the tests honored the fsync() syscall.

So, the participants in the race were:



Let us say at once that, because the tests are so approximate, we see no point in giving exact figures: the error is large, and the article only aims to show the general trend. For the same reason, and because of our own inability to tune the GC, we do not give the exact JDK version or garbage collector settings: the server JVM of 6u45 and 7u40 with -Xmx2048m -Xss128m showed similar performance on Linux and Mac OS X. Each test saved about a million events; the test for each individual database was preceded by several (up to 10) warm-up runs. As for the Caché settings, the routine cache was increased to 256 MB and the 8 KB database cache to 1024 MB.
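The measurement scheme described here (several warm-up runs followed by one timed run, with the result expressed in events per second) can be sketched as follows. This is not the authors' test code: the `EventSink` interface and the `ThroughputSketch` class are hypothetical, and the in-memory counter in `main` merely stands in for a real database.

```java
class ThroughputSketch {
    // Hypothetical abstraction over "something we can write events to".
    interface EventSink {
        void store(long eventId);
    }

    // Performs the warm-up runs, then times one more run and returns events per second.
    static double measure(EventSink sink, int warmUpRuns, int eventsPerRun) {
        for (int run = 0; run < warmUpRuns; run++) {
            writeAll(sink, eventsPerRun);
        }
        long start = System.nanoTime();
        writeAll(sink, eventsPerRun);
        // Guard against a zero interval on very fast sinks or coarse clocks.
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return eventsPerRun * 1e9 / elapsedNanos;
    }

    private static void writeAll(EventSink sink, int count) {
        for (int i = 0; i < count; i++) {
            sink.store(i);
        }
    }

    public static void main(String[] args) {
        // Stand-in sink that only counts events; a real test would write to a database.
        final long[] stored = new long[1];
        EventSink countingSink = new EventSink() {
            public void store(long eventId) {
                stored[0]++;
            }
        };
        double eventsPerSecond = measure(countingSink, 10, 1000000);
        System.out.printf("%.0f events/sec (%d events written in total)%n",
                eventsPerSecond, stored[0]);
    }
}
```

Like the test described in the article, this is a naive harness: a single timed run with no statistics. A tool such as JMH would be the more rigorous choice.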

The results are as follows:



Derby and the other relational DBMSs write at 1000 to 1500 events/sec. Caché in JDBC mode is faster (6000 to 7000 eps), but this speed comes at a price: the default transaction isolation level, as mentioned above, is READ_UNCOMMITTED. Caché eXtreme gives 45,000-50,000 eps in pure-Java mode and more than 80,000 eps when talking to a local Caché instance via JNI. Finally, if you accept some risk and disable the transaction log (for the current process only), the write speed over a JNI connection can be pushed to 100,000 eps, as it was on our test machine.

Anyone who wants more precise numbers, would like to make the tests more rigorous, or was somewhat stung by these results is invited to read the source code. To build and run it, you will need JDK 1.6+, Git, Maven (including the Maven Install Plugin to create local artifacts for Caché JDBC and Caché eXtreme), and finally Caché itself. We recommend ordering a temporary license key; it is free. We also invite universities to join the InterSystems Campus program and receive official Caché distributions with an academic license. In any case, InterSystems consultants are ready to help with load testing in your project.

Comments are welcome.

Chmbls


Source: https://habr.com/ru/post/194814/

