I would like to thank 0leo, morisson and adaptun for their help in preparing the article.
The stargazer's tools

As you know, tomorrow, December 19, 2013, the Gaia telescope will be launched. Many have already read the article about the mission, but few know which technology the developers at the European Space Agency chose to process and store Gaia data. In 2011, the candidates considered were IBM DB2, PostgreSQL, Hadoop, Cassandra, and Caché (more precisely, Caché eXtreme Event Persistence; see, for example, Astrostatistics and Data Mining, ed. Luis Manuel Sarro, Laurent Eyer and William O'Mullane, pp. 111-112).
Almost every student has probably heard of the first four "players" in the industry. But what is Caché XEP?
Java technology in Caché
If we look at the Java API stack provided by InterSystems, we see something like the following:
- Caché Object Binding, a technology that transparently projects the object-oriented representation of data into Java. In Caché terms, the generated Java proxy classes are called projections. This approach is the easiest to use, since it preserves the “natural” relationships between classes in the object model, but it is also rather slow: quite a lot of service (meta) information describing the object model is sent over the wire.
- JDBC and various add-ons (Jalapeño, Hibernate, JPA). There is probably nothing new to say here, except that Caché supports two transaction isolation levels, READ_UNCOMMITTED and READ_COMMITTED, and works in READ_UNCOMMITTED mode by default.
- The Caché eXtreme family (also available in editions for .NET and Node.js). This approach is characterized by direct access to the low-level data representation (the so-called globals, the quanta of information in the Caché world), which provides very high speed.
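For the JDBC route, the isolation level mentioned above can be inspected and raised through the standard java.sql API. Below is a minimal sketch that assumes an already opened connection. The class and method names (IsolationLevels, name, ensureReadCommitted) are ours and purely illustrative, and we have not checked how any particular driver reacts to the call.

```java
import java.sql.Connection;
import java.sql.SQLException;

// Illustrative helper (not part of any Caché API): works with the isolation
// level of an already opened JDBC connection using java.sql only.
final class IsolationLevels {

    // Translates a java.sql.Connection isolation constant into a readable name.
    static String name(int level) {
        switch (level) {
            case Connection.TRANSACTION_NONE:             return "NONE";
            case Connection.TRANSACTION_READ_UNCOMMITTED: return "READ_UNCOMMITTED";
            case Connection.TRANSACTION_READ_COMMITTED:   return "READ_COMMITTED";
            case Connection.TRANSACTION_REPEATABLE_READ:  return "REPEATABLE_READ";
            case Connection.TRANSACTION_SERIALIZABLE:     return "SERIALIZABLE";
            default:                                      return "UNKNOWN(" + level + ")";
        }
    }

    // Switches the connection to READ_COMMITTED if the driver reports support
    // for it; returns the level that is in effect afterwards.
    static int ensureReadCommitted(Connection connection) throws SQLException {
        int wanted = Connection.TRANSACTION_READ_COMMITTED;
        if (connection.getTransactionIsolation() != wanted
                && connection.getMetaData().supportsTransactionIsolationLevel(wanted)) {
            connection.setTransactionIsolation(wanted);
        }
        return connection.getTransactionIsolation();
    }

    public static void main(String[] args) {
        // Prints the name of the level that the text above calls the default.
        System.out.println(name(Connection.TRANSACTION_READ_UNCOMMITTED));
    }
}
```

Calling ensureReadCommitted(connection) right after opening the connection and logging name(...) of the result makes it visible which level a test actually ran with.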
Handling complex events
The name of the technology itself hints, quite visibly, at a connection with the task of complex event processing (XEP reads like eXtreme complex Event Processing). This problem is being solved everywhere:
- in trading systems, including algorithmic trading systems;
- in bank security systems;
- in systems that collect and analyze statistics in real time (road traffic analysis, weather forecasting, social network monitoring).
What are the current requirements for technologies that help solve the problem? The requirements are simple and few:
- real-time processing (possibly including persistence) of at least 1000 eps (events per second); in practice we often deal with tens of thousands of transactions per second, so the products described belong to the class of XTP systems ([1], [2]);
- recognition of correlations between the events being processed (as in the classic example: a bell ringing plus a man in black leading a woman in white by the hand most likely means that a wedding is in progress);
- pattern matching, i.e. filtering (again, in real time);
- naturally, XML processing (how could we do without it?);
- support for executing business rules;
- handling of events with a complex structure (a large number of fields);
- archival storage of the event history for at least the last 24 hours (at 1000 eps that is 86.4 million events, i.e. on the order of 100M);
- finally, fault tolerance.
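To make the correlation requirement a bit more tangible, here is a toy sketch of the wedding example: a sliding time window that reports a match once all three signals have been seen within it. Everything here (WeddingDetector, Signal, Event, the window length) is our own illustration and has nothing to do with the API of any CEP product.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumSet;
import java.util.Set;

// Toy correlation rule: report a "wedding" when all three signals were seen
// within one sliding time window. All names are illustrative.
final class WeddingDetector {

    enum Signal { BELL, MAN_IN_BLACK, WOMAN_IN_WHITE, OTHER }

    // One observed event: what was seen and when (in milliseconds).
    static final class Event {
        final Signal signal;
        final long timestampMillis;

        Event(Signal signal, long timestampMillis) {
            this.signal = signal;
            this.timestampMillis = timestampMillis;
        }
    }

    private static final Set<Signal> REQUIRED =
            EnumSet.of(Signal.BELL, Signal.MAN_IN_BLACK, Signal.WOMAN_IN_WHITE);

    private final long windowMillis;
    private final Deque<Event> window = new ArrayDeque<Event>();

    WeddingDetector(long windowMillis) {
        if (windowMillis < 0) {
            throw new IllegalArgumentException("windowMillis must not be negative");
        }
        this.windowMillis = windowMillis;
    }

    // Feeds one event (timestamps must be non-decreasing) and returns true
    // if the window now contains every required signal.
    boolean onEvent(Event event) {
        window.addLast(event);
        // Evict events that have fallen out of the sliding window.
        while (event.timestampMillis - window.peekFirst().timestampMillis > windowMillis) {
            window.removeFirst();
        }
        Set<Signal> seen = EnumSet.noneOf(Signal.class);
        for (Event e : window) {
            seen.add(e.signal);
        }
        return seen.containsAll(REQUIRED);
    }

    public static void main(String[] args) {
        WeddingDetector detector = new WeddingDetector(1000);
        detector.onEvent(new Event(Signal.BELL, 0));
        detector.onEvent(new Event(Signal.MAN_IN_BLACK, 400));
        System.out.println(detector.onEvent(new Event(Signal.WOMAN_IN_WHITE, 900)));
    }
}
```

A real CEP engine would express such a rule declaratively and index the window instead of rescanning it on every event; the rescan is kept here only for brevity.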
A look at the IT market immediately turns up the following interesting implementations:
- IBM WebSphere Business Events;
- Sybase ESP (you can even download and “touch” a test version of the product; the distribution is about 1000 MB);
- Software AG Apama CEP Platform;
- TIBCO BusinessEvents;
- TIBCO StreamBase (thanks to several successful acquisitions, TIBCO now has two competing products).
To be fair, it should be noted that Caché eXtreme itself is not a CEP/XEP implementation: XEP stands not for eXtreme Event Processing but merely for eXtreme Event Persistence. However, we see no reason why the well-known CEP paradigm could not be implemented within the InterSystems technology stack with the help of a product such as Ensemble. In addition, many implementations achieve similar goals by using Caché eXtreme together with Esper or NEsper.
Back to our subject
The high-level architecture of Caché eXtreme is quite simple:
Here, the Globals API provides fast, low-level access to globals. The functionality of the Globals API is also available as a free NoSQL database, GlobalsDB.
Caché XDO provides fast “dynamic” data access that does not require an object model on the client side. The closest equivalent in the Java world is Reflection.
Finally, the module whose name is not quite euphonious to the Russian ear, Caché XEP, is also based on the Globals API and provides fast object and quasi-relational access to data. It is object access in the sense that the API client does not need to worry about object-relational mapping: a model at the Caché class level (or a database schema, if you switch to the relational view) is created automatically in the image and likeness of the Java object model, even in the case of complex multi-level inheritance. It is quasi-relational in the sense that you can execute SQL queries (more precisely, queries that use a subset of SQL) directly from the context of the eXtreme connection; moreover, indexes and transactions are also supported on the set of “events” loaded into the database. Of course, all loaded data is immediately available via JDBC through the relational view (with all the power of ANSI SQL plus the SQL extensions specific to the Caché dialect), but the access speed will be quite different. To sum up once more, we have:
- “schema” import (Caché classes are created automatically), including import of Java class hierarchies;
- instant relational data access - you can work with Caché classes as with tables;
- index and transaction support with Caché eXtreme ;
- support for simple SQL queries using Caché eXtreme ;
- support for arbitrary SQL queries through the underlying eXtreme JDBC connection.
This approach has advantages both over comparable relational solutions (higher access speed) and over various NoSQL solutions (immediate relational-style access to the data).
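To illustrate what “schema” import from a Java class hierarchy means in principle, here is a sketch in plain Java reflection. It does not use the XEP API and makes no claim about how Caché does this internally. The classes Event and Quote, the helper names and the Java-to-SQL type mapping are all our assumptions.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: derives a flat, table-like schema from a Java class
// hierarchy via reflection. This is NOT the XEP implementation.
final class SchemaSketch {

    // Hypothetical event classes with one level of inheritance.
    static class Event { long epochMillis; }

    static class Quote extends Event {
        String ticker;
        double price;
        transient int cacheHint; // transient fields are skipped
    }

    // Arbitrary, deliberately tiny Java-to-SQL type mapping.
    static String sqlType(Class<?> javaType) {
        if (javaType == int.class || javaType == Integer.class) return "INTEGER";
        if (javaType == long.class || javaType == Long.class) return "BIGINT";
        if (javaType == double.class || javaType == Double.class) return "DOUBLE";
        if (javaType == String.class) return "VARCHAR(255)";
        throw new IllegalArgumentException("No mapping for " + javaType.getName());
    }

    // Collects persistent fields, superclass fields first.
    static List<String> columns(Class<?> type) {
        LinkedList<Class<?>> hierarchy = new LinkedList<Class<?>>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            hierarchy.addFirst(c);
        }
        List<String> result = new ArrayList<String>();
        for (Class<?> c : hierarchy) {
            // Reflection does not guarantee field order, so sort by name per class.
            Map<String, String> sorted = new TreeMap<String, String>();
            for (Field field : c.getDeclaredFields()) {
                int modifiers = field.getModifiers();
                if (Modifier.isStatic(modifiers) || Modifier.isTransient(modifiers)
                        || field.isSynthetic()) {
                    continue;
                }
                sorted.put(field.getName(), sqlType(field.getType()));
            }
            for (Map.Entry<String, String> column : sorted.entrySet()) {
                result.add(column.getKey() + " " + column.getValue());
            }
        }
        return result;
    }

    // Renders the derived columns as a CREATE TABLE statement.
    static String createTable(Class<?> type) {
        StringBuilder sql = new StringBuilder("CREATE TABLE " + type.getSimpleName() + " (");
        List<String> cols = columns(type);
        for (int i = 0; i < cols.size(); i++) {
            if (i > 0) sql.append(", ");
            sql.append(cols.get(i));
        }
        return sql.append(")").toString();
    }

    public static void main(String[] args) {
        // prints: CREATE TABLE Quote (epochMillis BIGINT, price DOUBLE, ticker VARCHAR(255))
        System.out.println(createTable(Quote.class));
    }
}
```

The point of the sketch is only that the inherited field ends up in the same “table” as the fields of the subclass, without the client writing any mapping by hand.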
In addition, an eXtreme connection can be of two kinds: a connection that uses JNI (it requires the Caché server to be available locally; unlike a type 2 JDBC connection, it does not support data transmission over the network), and a regular TCP connection, where data is transmitted by a standard type 4 JDBC driver.
The “subtlety of tuning” of the JNI version consists solely in the need to set up the environment:
- the GLOBALS_HOME variable must point to the directory containing the Caché installation, and LD_LIBRARY_PATH (DYLD_LIBRARY_PATH for Mac OS X or PATH for Windows) must contain ${GLOBALS_HOME}/bin.
For the TCP version, it is sufficient to increase the JVM stack and heap sizes (-Xss2m -Xmx768m).
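The environment requirements of the JNI version are easy to get wrong, so a small pre-flight check can save time. The sketch below only reads the variables named above; the class JniEnvCheck and its methods are hypothetical helpers of ours, not something shipped with Caché.

```java
import java.io.File;
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative pre-flight check for the environment of a JNI connection.
final class JniEnvCheck {

    // Name of the library search path variable for a given os.name value.
    static String libraryPathVariable(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.startsWith("mac")) return "DYLD_LIBRARY_PATH";
        if (os.startsWith("windows")) return "PATH";
        return "LD_LIBRARY_PATH";
    }

    // True if searchPath (entries split by separator) contains <globalsHome>/bin.
    static boolean containsBinDir(String globalsHome, String searchPath, String separator) {
        if (globalsHome == null || searchPath == null) {
            return false;
        }
        File expected = new File(globalsHome, "bin");
        for (String entry : searchPath.split(Pattern.quote(separator))) {
            if (!entry.isEmpty() && new File(entry).equals(expected)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String home = System.getenv("GLOBALS_HOME");
        String variable = libraryPathVariable(System.getProperty("os.name"));
        boolean ok = containsBinDir(home, System.getenv(variable), File.pathSeparator);
        System.out.println("GLOBALS_HOME=" + home + "; " + variable
                + (ok ? " contains" : " does not contain") + " ${GLOBALS_HOME}/bin");
    }
}
```

The comparison is purely textual (no symlink resolution), which is enough to catch a forgotten or mistyped entry.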
Some practice
We were interested in how Caché eXtreme behaves when recording a continuous stream of data, compared with popular data processing technologies. As the data source we took historical quotes from the website of the Finam holding, which can be downloaded in CSV format (the authors of the article are grateful to the creators of such a wonderful resource).
Since finding the right link on the site is hardly realistic (we did not manage it on the second and subsequent attempts), we share it here.
The result was a rather naive test that completely ignores the rules for writing microbenchmarks in Java. In particular, we never got around to plugging in JMH. Our partial excuse is that we measure not the speed of JIT-generated code but the speed at which code external to the JVM (with the exception of Apache Derby) can write to disk. Alas, we also ignored the question of whether the hard disk used in the tests honored the fsync() syscall.
So, the participants in the race were:
- Apache Derby 10.9 (some well-known commercial relational DBMSs showed similar performance)
- InterSystems Caché 2013.1 (JDBC)
- InterSystems Caché 2013.1 (eXtreme)
Let us say at once that, because the tests are approximate, we see no point in giving exact figures: the error is large enough, and the purpose of the article is only to demonstrate the general trend. For the same reasons, and because of our inability to tune the GC, we do not specify the exact JDK version or the garbage collector settings: the server VMs of 6u45 and 7u40 with -Xmx2048m -Xss128m showed similar performance on Linux and Mac OS X. In each test, about a million events were saved; the test for each database was preceded by several (up to 10) warm-up runs. As for the Caché settings, the routine cache was increased to 256 MB and the 8 KB database cache to 1024 MB.
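For readers who want to reproduce the general shape of such a measurement, here is a sketch of an equally naive harness: a few untimed warm-up runs followed by one timed run, with the result expressed in events per second. This is not the code of our test. The names ThroughputSketch and EventSink are made up for the illustration, and the in-memory sink in main stands in for a real database writer.

```java
// Naive throughput harness in the spirit of the test described above
// (not the authors' code): warm-up runs first, then one timed run.
final class ThroughputSketch {

    // Destination for events; a real test would write to a database here.
    interface EventSink {
        void store(long eventId);
    }

    // Events per second for a given event count and elapsed time in nanoseconds.
    static double eventsPerSecond(long events, long elapsedNanos) {
        if (elapsedNanos <= 0) {
            throw new IllegalArgumentException("elapsedNanos must be positive");
        }
        return events * 1e9 / elapsedNanos;
    }

    // Runs warmupRuns untimed passes, then returns the eps of one timed pass.
    static double measure(EventSink sink, int warmupRuns, long eventsPerRun) {
        for (int run = 0; run < warmupRuns; run++) {
            for (long i = 0; i < eventsPerRun; i++) {
                sink.store(i);
            }
        }
        long start = System.nanoTime();
        for (long i = 0; i < eventsPerRun; i++) {
            sink.store(i);
        }
        // Guard against a zero reading on very coarse timers.
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return eventsPerSecond(eventsPerRun, elapsed);
    }

    public static void main(String[] args) {
        final long[] checksum = new long[1];
        EventSink inMemory = new EventSink() {
            public void store(long eventId) {
                checksum[0] += eventId; // stand-in for a real write
            }
        };
        double eps = measure(inMemory, 10, 1000000L);
        System.out.println("events/sec: " + (long) eps + " (checksum " + checksum[0] + ")");
    }
}
```

Like our test, this says nothing about JIT effects, GC pauses or whether the data actually reached the disk; a JMH-based benchmark would be the proper tool for that.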
The results are as follows:
Derby and the other relational DBMSs give a write speed of 1000 to 1500 events per second. Caché in JDBC mode is faster (6000 to 7000 eps), but this speed comes at a price: as mentioned above, the default transaction isolation level is READ_UNCOMMITTED. Next, Caché eXtreme gives 45,000 to 50,000 eps in pure-Java mode and more than 80,000 eps when communicating with a local Caché instance via JNI. Finally, if you accept some risk and disable the transaction log (for the single current process), the write speed of a JNI connection on the test machine could be brought up to 100,000 eps.
Anyone who wants more accurate numbers, would like to make the tests more correct, or was simply stung a little by these results is invited to read the source code. To build and run it, you will need JDK 1.6+, Git, Maven (including the Maven Install Plugin to create local artifacts for Caché JDBC and Caché eXtreme), and finally Caché. We recommend ordering a temporary license key: it's free. We also invite universities to join the InterSystems Campus program and get official Caché distributions with an academic license. In any case, InterSystems consultants are ready to help with load testing in your project.
Comments are welcome.
To-do
- We have not yet got around to writing a Caché backend for the Yahoo! Cloud Serving Benchmark.
- It would be interesting to master JMH instead of reinventing home-grown speedometer wheels.