
Hi. This is a post about the new version of Tarantool, "from the author." The Internet is an entertaining place: search for Tarantool and you will find a 2011 article about version 1.3. And some kind of hammer drill, it seems. Forums and message boards are, on the whole, a thick fog: "Tarantool, well, it's like Redis, only..."
Or, a recent discovery of mine: on Toster someone wrote, "Sophia is an append-only store, like Tarantool." With posts like these I will soon become a fan of the "Made by Us" site, the Kalashnikov assault rifle, and the Sayano-Shushenskaya hydroelectric station. Still, I find it hard to understand why we admire Western tools and know nothing about our own. So, Tarantool 1.6. What is the trick?
On the website we describe it as something like a cross between Node.js and Redis. The basic idea is to work quickly with large amounts of data in memory while doing something less trivial than key/value or the data structures Redis provides. You have tens or hundreds of gigabytes of live data and the ability to do something really complicated with them: build an anti-cache, compute some tricky correlations, keep the state of an online game and calculate player ratings in real time, and so on.
You can also look at all this as a clever Memcached, which we can do too, but then you miss the forest for the trees. For example, the main difference between Tarantool and Node.js is that Tarantool uses green threads rather than an event-driven model. You can write classic sequential code instead of festooning it with callbacks and futures, and it works just as well. There is non-blocking access to sockets, files, and external databases (MySQL, PostgreSQL). The main difference from Redis is that we have a data model like MongoDB's rather than "data structures." There are, for example, secondary indexes that are updated automatically, consistently, and atomically. And all of it uses less memory. But, IMHO, even that is not the most important thing. The main thing is that we have transactions: normal begin, commit, and rollback in stored procedures. Also, unlike Redis, we have a disk store. But more on the disk store below.
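To make "secondary indexes updated atomically" plus "begin, commit, rollback" a bit more concrete, here is a toy model in Python. It is only a sketch of the idea, not Tarantool's API or implementation: the `ToySpace` class, its methods, and the undo-log approach are all made up for illustration.

```python
class ToySpace:
    """Hypothetical toy table: primary key -> tuple, plus one secondary index."""

    def __init__(self, secondary_field):
        self.secondary_field = secondary_field
        self.primary = {}     # primary key -> tuple
        self.secondary = {}   # secondary field value -> set of primary keys
        self._undo = None     # undo log, present only inside a transaction

    def begin(self):
        self._undo = []

    def commit(self):
        self._undo = None

    def rollback(self):
        # Replay the undo log backwards; both indexes are restored together.
        undo, self._undo = self._undo or [], None
        for key, old in reversed(undo):
            self._apply(key, old)

    def replace(self, tup):
        key = tup[0]
        if self._undo is not None:
            self._undo.append((key, self.primary.get(key)))
        self._apply(key, tup)

    def _apply(self, key, new):
        # The primary and the secondary index always change in the same step.
        old = self.primary.pop(key, None)
        if old is not None:
            bucket = self.secondary[old[self.secondary_field]]
            bucket.discard(key)
            if not bucket:
                del self.secondary[old[self.secondary_field]]
        if new is not None:
            self.primary[key] = new
            self.secondary.setdefault(new[self.secondary_field], set()).add(key)

    def select_by_secondary(self, value):
        return sorted(self.primary[k] for k in self.secondary.get(value, ()))


if __name__ == "__main__":
    players = ToySpace(secondary_field=1)   # secondary index on the "team" field
    players.replace((1, "red", 100))
    players.begin()
    players.replace((1, "blue", 250))
    players.replace((2, "blue", 50))
    players.rollback()                      # both changes vanish from both indexes
    print(players.select_by_secondary("red"))
    print(players.select_by_secondary("blue"))
```

The point of the sketch is that a reader of the secondary index never sees a state that disagrees with the primary one, whether the transaction commits or rolls back.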
Besides the "main" differences, there are plenty of minor ones. First, as I already wrote, we can, as in MongoDB, set or change the schema and indexes on the fly. Our word for a table or collection is "space." Spaces can be created and modified on a running database, and indexes can be added and dropped as well. Our indexes support iteration, that is, you can walk through all the data in ascending or descending order. In general, iterators are Tarantool's main mechanism for expressing complex logic and implementing arbitrary data structures, the things Redis has already built for you. That freedom, of course, comes at the price of complexity. Tarantool is the only in-memory database with a full-fledged R-tree index, i.e. the ability to search by points and polygons.
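As a rough illustration of "iterators as a building block," here is a Python sketch of an ordered index with ascending and descending iteration, and a leaderboard built on top of it. Again, this is a toy under my own assumptions: `OrderedIndex`, `iterate`, and `top_n` are hypothetical names, and a sorted list stands in for whatever tree a real index would use.

```python
import bisect
import itertools


class OrderedIndex:
    """Hypothetical ordered index: keys kept sorted, rows stored by key."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._rows = {}   # key -> row

    def insert(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def iterate(self, start=None, descending=False):
        """Yield (key, row) pairs starting at `start`, in the chosen direction."""
        if descending:
            hi = len(self._keys) if start is None else bisect.bisect_right(self._keys, start)
            positions = range(hi - 1, -1, -1)
        else:
            lo = 0 if start is None else bisect.bisect_left(self._keys, start)
            positions = range(lo, len(self._keys))
        for i in positions:
            key = self._keys[i]
            yield key, self._rows[key]


def top_n(index, n):
    """A leaderboard is just the first n rows of a descending iterator."""
    return [row for _, row in itertools.islice(index.iterate(descending=True), n)]


if __name__ == "__main__":
    scores = OrderedIndex()
    for score, name in [(300, "ann"), (120, "bob"), (250, "cy")]:
        scores.insert((score, name), name)   # composite key: (score, player)
    print(top_n(scores, 2))
```

Nothing here is a ready-made "sorted set"; the structure falls out of an ordered index plus an iterator, which is the trade of freedom for complexity described above.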
Second, we have a disk store. That is, certain spaces can hold data many times larger than the available RAM. The disk store is built on a pluggable storage engine architecture similar to MySQL's, but unlike MySQL, all of our engines support transactions and share a common binlog. For why this matters, read, for example, this article by Oleg Tsarev, here on Habr. The disk store uses the Sophia library, which was also written by our team.
Third, we have different replication. Tarantool 1.6 uses asynchronous master-master replication, which deserves a closer look. In classic master-slave replication only one server can be the source of changes. With asynchronous master-master you can also update data on a replica, so both reads and writes scale. BUT! There is one very big "but" to keep in mind. The word "asynchronous" is there for a reason. The servers do not coordinate changes with each other at the time they are made; updates reach the replica over the network after the commit on the master. So you can easily "break everything" by updating the same data on two servers at once. Still, there are many cases where this kind of master-master setup is needed, for example when every replica must be highly available and the replicas are far apart, say London and Miami, which makes it impossible to update the same field synchronously on both nodes. Note that in 1.7 we are preparing synchronous replication in addition to asynchronous, as well as automatic scaling of data across many nodes. With that "options package" Tarantool will become the only in-memory database on the OSS market with 100% fault tolerance and transaction support. However, that is 1.7, and we still have to live to see it.
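The "break everything" case is easy to reproduce in a toy model. The Python sketch below is not Tarantool's replication protocol; `Replica` and `sync` are hypothetical, and the only assumption is the one stated above: each node commits locally first and ships its changes to the peer afterwards, without coordination.

```python
class Replica:
    """Hypothetical node that commits locally and replicates later."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.outbox = []   # committed changes not yet sent to the peer

    def write(self, key, value):
        # Commit locally right away; the peer hears about it later.
        self.data[key] = value
        self.outbox.append((key, value))


def sync(a, b):
    """Exchange pending changes; each side applies the other's blindly."""
    # Snapshot both outboxes first, because the changes cross on the wire.
    a_changes, b_changes = list(a.outbox), list(b.outbox)
    a.outbox.clear()
    b.outbox.clear()
    for key, value in a_changes:
        b.data[key] = value
    for key, value in b_changes:
        a.data[key] = value


if __name__ == "__main__":
    london, miami = Replica("london"), Replica("miami")

    # Disjoint keys: both nodes end up with the same data.
    london.write("user:1", "alice")
    miami.write("user:2", "bob")
    sync(london, miami)
    print(london.data == miami.data)

    # Same key on both nodes at once: each keeps the other's value.
    london.write("balance", 10)
    miami.write("balance", 20)
    sync(london, miami)
    print(london.data["balance"], miami.data["balance"])
```

In this model the two nodes converge as long as they touch different keys, and silently diverge when they update the same key concurrently, so the application has to keep concurrent writers off the same data.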
Well, that's it in brief. 1.6 currently runs in production in several Mail.Ru projects and at Sberbank :), and all winter long we were fixing the bugs found in our deployment at Avito. I hope they will want to tell in more detail what they use Tarantool for and why they chose nothing else despite all the beta bugs. As for whether to use Tarantool in your own project, decide for yourself. Just base the decision on the documentation, not on forum posts. Or on this talk. If you have questions, I will try to answer them in the comments.