
How to save a million dollars with Tarantool

Why use a database at all when there are good old files? How are files worse than a database, or what makes a database better than files? A database is more structured storage: it gives you transactions, queries and so on. The simplest case: there is a server with a database and several applications that send queries to it. The database responds, changes something inside itself, and everything is fine until the load on it grows so much that the database can no longer cope.

Assuming the load is read-only, the problem is solved by replication. You can attach as many replicas to the database as you need, send all reads to the replicas and all writes to the master. If the database is under write load, replication does not solve the problem, because every write has to be applied on every replica. So no matter how many replicas you add, you will not reduce the write load on a single machine. Here sharding comes to the rescue.
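
To make the read/write split concrete, below is a minimal sketch of such a router. The `master` and `replicas` objects and their `execute()` method are assumptions for illustration, not any particular driver: reads rotate across replicas, writes always go to the master.

```python
# Minimal read/write split: reads go to replicas, writes to the master.
# The connection objects and their execute() method are hypothetical.
import itertools

class ReadWriteRouter:
    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)   # simple round-robin

    def read(self, sql, *params):
        return next(self._replicas).execute(sql, *params)

    def write(self, sql, *params):
        # Every write lands on the master and is then replicated to all replicas,
        # which is exactly why replication does not reduce per-machine write load.
        return self.master.execute(sql, *params)
```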

If the database does not hold the write load, shards can be added almost indefinitely. A shard is more complex than a replica, because the data somehow has to be distributed across tables or within a table: by hash, by range, and there are many other options. So by adding replicas and shards you can spread any load over the database. It would seem there is nothing left to wish for, so what is there to talk about next?
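
As an illustration of the routing options just mentioned, here is a small sketch; the function names and the choice of MD5 are ours, purely for illustration. Hash-based routing spreads keys evenly, range-based routing keeps ordered ranges together.

```python
import hashlib

def shard_by_hash(key: str, shard_count: int) -> int:
    """The same key always maps to the same shard; distribution is roughly even."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % shard_count

def shard_by_range(user_id: int, split_points: list) -> int:
    """split_points = [1000, 2000, ...]: ids below 1000 go to shard 0, and so on."""
    for shard, upper in enumerate(split_points):
        if user_id < upper:
            return shard
    return len(split_points)
```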

But there is a problem



... which is no longer in the plane of technology. Your boss, watching an ever-growing server fleet, begins to resent it, because it costs a lot of money. The load grows, the number of user requests grows, and you keep adding servers. You are a techie, you don't think about money; let the finance people deal with that. So you tell your boss: "Everything is fine. We have an infinitely scalable system. We add servers, and everything works great." And the boss replies: "Great, but we are losing money. We need to do something about it. And if we do not solve this problem, we will have to close the whole business, because even though the business is growing, our database and server costs are growing faster." And it is you, not the finance people, who have to solve this problem, because it probably does lie in the plane of technology after all. What to do next? Amazon is much more expensive. Optimize? You have already optimized all the queries.

A way out is to cache the data that is selected most often. It can be kept in some kind of cache and served from there over and over, without touching the numerous replicas and shards.
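
A minimal cache-aside read path might look like the sketch below; the `cache` and `db` clients, the SQL statement and the key format are assumptions for illustration.

```python
def get_profile(cache, db, user_id):
    key = "profile:%d" % user_id
    profile = cache.get(key)
    if profile is not None:
        return profile                                   # hot read, no DB round trip
    profile = db.query("SELECT * FROM profiles WHERE id = %s", user_id)
    cache.set(key, profile, expire=300)                  # keep it warm for later reads
    return profile
```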

Cache issues


Well, the problem seems solved: a single memcached replaces a whole rack of replica servers for us. But everything has its price.

  1. The application writes both to the cache and to the database, and the two are not replicated between each other, so the data becomes inconsistent. For example, you write to the cache first, then to the database. For some reason the write to the database fails: the application crashed, the network blinked. The application returns an error to the user, but different data is already sitting in the cache. That is, the cache holds one version of the data and the database another. Nobody knows about it, and the application keeps working with the cache. When the cache restarts, that data is lost, because the database holds a different copy.

    The funny thing is that if you write in the reverse order, the same thing happens. You wrote to the database but failed to write to the cache: the application keeps serving old data from the cache while the database holds the new data, and nobody knows about it. The cache restarts, and again the update is lost. In both cases an update is lost, which means you lose an essential property of a database, namely the guarantee that committed data is stored forever; a commit is no longer a commit. You can cope with the inconsistency by writing a smart cache, so that the application works only with it and never goes to the database directly. It can be a write-through cache: it first writes the data to the database, and only then applies it to itself. If for some reason the data did not make it into the database, it must not be applied to the cache either; that way the two always stay in sync. Writing to the cache cannot fail, because the cache is memory and a write to memory always succeeds, except when the memory itself is corrupted, in which case the smart cache crashes and takes all the cached data into oblivion with it, which is bad, but it does not lead to desynchronization (a minimal write-through sketch is given right after this list).

    But one rare case still remains in which the data gets out of sync: the application writes to the cache, the cache writes to the database, the database commits internally. Then it confirms to the cache that the operation succeeded, but at that very moment the network breaks, and the cache never receives the confirmation. It believes the data was not written to the database and does not apply the change locally, while in the database the change has in fact been applied. The application works with old data, then the cache restarts and the data differs yet again. This case is very rare, but it is possible.

    And most importantly, a smart cache does not solve the sharding problem. And your boss does not like sharding, because it is very expensive: you have to buy many, many servers.
  2. Besides, introducing a cache does not save us from sharding anyway, because writes are not accelerated. Every commit still has to be committed somewhere, and not in the cache.
  3. The next problem: a cache is not a database but an ordinary key/value store. You lose queries and transactions. Indexes and tables are also lost; they can be rebuilt after a fashion on top of the key-value cache, so the application has to be simplified and drastically reworked.
  4. The fourth problem is the cold start. When the cache has only just started, it is empty, there is no data in it, so all selects go straight to the database, bypassing the cache. Consequently, you have to add replicas again, at least some: the cache needs to be warmed up somehow, and while it warms up, a lot of selects hit the database. So you end up keeping a number of replicas just to warm up the cache. Doesn't that look rather wasteful? Yet without these replicas you cannot start up normally. Let us look at the solution to this problem in more detail.
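
Here is the minimal write-through sketch promised in item 1 above. The application talks only to the cache; a change is applied in memory only after the database confirms the write. The `db` client and the SQL statement are assumptions for illustration.

```python
class WriteThroughCache:
    def __init__(self, db):
        self.db = db
        self.data = {}                     # in-memory copy the application reads from

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        # 1. Write to the database first; if this raises, the cache is left untouched.
        self.db.execute("REPLACE INTO kv (k, v) VALUES (%s, %s)", key, value)
        # 2. Only after the database confirmed the commit, apply the change in memory.
        self.data[key] = value
```

The rare failure case described above sits exactly between steps 1 and 2: the database commits, but the confirmation is lost in the network, so step 2 never runs.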


Cold start




At some point an idea came up: for the data to always be "warm", you simply must never let it "cool down". For that the cache has to be persistent, that is, the data has to be stored somewhere on disk, and then everything will be fine: the cache starts and loads its data. But then a doubt arises: a cache is RAM, it is supposed to be fast, and once you pair it with a disk, won't it be just as slow as the database? In fact, it won't.



The easiest way is to "persist" the cache once every N minutes by dumping it to disk in full. The dump can be done asynchronously in the background: it does not slow down any operations and does not load the processor. This speeds up the warm-up many times over: when the cache starts, its own data dump is already at hand, and it reads it linearly and very quickly. That turns out to be faster than warming up from any number of database replicas. But it cannot be that simple, right? Suppose we dump every five minutes. If a failure happens in between, all the changes accumulated since the previous dump are lost. For some applications, such as statistics, this does not matter; for others it matters a lot.
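
A background dump could look roughly like the sketch below; the pickle format, file name and interval are assumptions, and a production cache would use its own binary format.

```python
import os
import pickle
import threading
import time

def start_dumper(cache_dict, path="cache.dump", interval_sec=300):
    """Every interval_sec seconds, write the whole cache to disk in the background."""
    def loop():
        while True:
            time.sleep(interval_sec)
            snapshot = dict(cache_dict)            # copy; the foreground keeps serving
            with open(path + ".tmp", "wb") as f:
                pickle.dump(snapshot, f)
            os.replace(path + ".tmp", path)        # atomic rename: never a half-written dump
    threading.Thread(target=loop, daemon=True).start()
```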

The second problem is that the dump loads the disk heavily, and the disk may be needed for something else, for example for logs. While a dump is running the disk slows down, and this happens over and over. This can be avoided if, instead of periodic dumps, you keep a log. The obvious question is: "How come? This is a cache, it is supposed to be fast, and here we are journaling every operation." This is actually not a problem. If you write the log to a file sequentially, on an ordinary spinning disk the write speed reaches 100 MB/s. With an average transaction size of 100 bytes that is a million transactions per second, so we will never hit the disk throughput limit while journaling the cache. This also solves the IOPS problem: we load the disk exactly as much as is needed for all the data to persist. The data is always fresh, nothing is lost, and the warm-up is fast.
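
The arithmetic above is simply 100 MB/s divided by 100 bytes per record, which is on the order of a million records per second. A sequential, append-only log is what makes that possible; below is a sketch of such a journal (the JSON record format and the LSN field are our assumptions for illustration).

```python
import json

class Journal:
    def __init__(self, path="cache.xlog"):
        self.f = open(path, "ab")          # append-only: purely sequential disk writes
        self.lsn = 0                       # log sequence number of the last record

    def append(self, op, key, value):
        self.lsn += 1
        record = {"lsn": self.lsn, "op": op, "key": key, "value": value}
        self.f.write((json.dumps(record) + "\n").encode())
        self.f.flush()                     # a real system would batch fsync calls
        return self.lsn
```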

But journaling has its drawbacks. When you journal, updates to the same element are not collapsed into a single record. There are many of them, and at startup the cache has to replay all of them, which can take longer than starting from a dump. Besides, the log itself can take up a lot of space and may not even fit on the disk.

To solve these problems you can combine the two approaches, dumping and journaling. Why not? We can dump, that is, take a snapshot, once a day, and keep writing to the log at all times. In the snapshot we save the ID of the last applied change. When the cache needs to start, it reads the snapshot, applies it to memory at once, then reads the log starting from the last change recorded in the snapshot and applies it on top. Done, the cache is warm. This is about as fast as reading from a dump. So we have dealt with the cold start; now let's solve the remaining problems on our list.
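
Putting the two together, recovery reads the snapshot and then replays only the tail of the log. This sketch follows the hypothetical dump and journal formats from the sketches above, assuming the dumper also records the LSN of the last applied change.

```python
import json
import pickle

def recover(snapshot_path="cache.dump", log_path="cache.xlog"):
    with open(snapshot_path, "rb") as f:
        state = pickle.load(f)
    last_lsn = state.pop("_last_lsn", 0)             # ID of the last change in the snapshot
    with open(log_path, "rb") as f:
        for line in f:
            rec = json.loads(line)
            if rec["lsn"] <= last_lsn:               # already contained in the snapshot
                continue
            if rec["op"] == "set":
                state[rec["key"]] = rec["value"]
            elif rec["op"] == "delete":
                state.pop(rec["key"], None)
    return state                                     # the cache is warm again
```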

The remaining three problems


A database has a property called durability, which is provided through transactions. A database usually stores both hot and cold data; in any case, once we have introduced a cache, our data is clearly divided into hot and cold. Usually there is a lot of cold data and very little hot data; that is just how life is. Yet we replicate and shard the whole database onto many, many copies and shards only in order to serve the hot data. We might say to ourselves: "Why are we copying all of this? Let's shard only the hot data." But that does not help, because we would have to use just as many servers: we shard and replicate not because the data does not fit in memory or on disk, but because we run out of CPU, that is, the database simply cannot keep up. So sharding and replicating only the hot data does not solve the problem. And the boss is still angry, because more and more new servers have to be paid for.

What can be done? We have a cache, but the hot data in the database does not let us live in peace: we keep adding replicas and shards for it. However, the cache stores data just as the database does, and if you wish, you can replicate it too. So what prevents us from using the cache as the primary data source? The lack of features such as transactions? Fine, we can add transactions. That solves the other three problems at once, since hot data then does not need to be kept in the database at all, only in the cache. Sharding also becomes unnecessary, because we no longer have to cut the database across many servers: the cache successfully handles the load, including writes. And it handles writes because, as shown above, a cache with a log is just as fast as a cache without one.

So, you can build into a cache all the properties inherent in a database. That is what we did, and the resulting offspring is called Tarantool. In read and write speed it is comparable to a cache, while it has all the database properties we need. Thus we can give up the database for storing hot data. All the problems are solved.

Capabilities and features of Tarantool




So, we replicated and sharded all that abundant cold data only in order to serve a small amount of hot data. Now the cold data, rarely requested and rarely modified, stays in SQL, and the hot data goes to Tarantool. In other words, Tarantool is a database for hot data. As a result, for most tasks two instances (a master and a replica) are more than enough, though you can even get by with one, because its access pattern and RPS are the same as those of an ordinary cache, despite the fact that it is a database. For some people this is a psychological hurdle: how can you abandon the database as the authoritative data store, with its cozy durability and transactions, and move to a cache? In fact, the moment you started using memcached or any other cache, you had already given up the benefits of a database; think of the inconsistency and the lost updates. From this point of view, Tarantool not only speeds things up and saves money, it brings you back to the world of databases, with transactions, queries, indexes and so on.

A few words about how transactions run in parallel. When a Lua procedure is executed in Tarantool, it is treated as a single transaction: all reads are done from memory, all writes go into a temporary buffer, and only at the very end is everything written to disk in one piece. And while that data is being written, another transaction can already read the old data from memory, without any locks! Transactions only start queuing if the throughput of sequential disk writes is exceeded.

How we separate hot data from cold


This process is not automated yet. We analyze the access logs and decide which access patterns mark data as hot. For example, user profiles are hot, so we move them to Tarantool. Naturally, some cold data gets picked up along the way, because some users, for example, no longer use Mail. But even with this overspend it works out more efficiently than with MySQL, if only because Tarantool has a heavily optimized memory footprint. A very interesting fact: an SQL database stores everything on disk, but 10-20% of it has to be cached in memory. At the same time traditional SQL databases have a footprint three to five times larger than Tarantool's, so that 20% effectively turns into 100%. It turns out that under a comparable load the SQL server gains nothing even from the memory, while still failing to hold the load.
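
A rough sketch of how such log analysis might look is below; the log line format and the threshold are assumptions for illustration, not the actual Mail.Ru tooling.

```python
from collections import Counter

def hot_keys(access_log_lines, threshold=100):
    """Count requests per key and treat keys above the threshold as hot candidates."""
    counts = Counter()
    for line in access_log_lines:
        key = line.rsplit(" ", 1)[-1]     # assume each line ends with the requested key
        counts[key] += 1
    return {key for key, n in counts.items() if n >= threshold}
```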

Tarantool vs Redis


From our point of view, there are two key differences between Tarantool and Redis.

  1. According to our tests, Tarantool is 30% faster. Test results are presented on the Tarantool website and in this article.
  2. Tarantool is a database. You can write server-side Lua scripts in it. Redis also has Lua, but it is single-threaded and blocking; you can write your own scripts, but their scope is very limited, and Lua in Redis is not transactional. In this respect Tarantool goes further: it is faster and lets you use transactions wherever you need them. There is no need to fetch a key from the cache, update it and put it back while someone else may be changing it in parallel (see the sketch right after this list).

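The race mentioned in item 2 can be sketched like this. The `cache` and `conn` objects are hypothetical clients, and `increment` stands for a server-side stored procedure executed as one transaction; the sketch illustrates the principle rather than a specific Redis or Tarantool API.

```python
def unsafe_increment(cache, key):
    value = cache.get(key) or 0       # two clients can both read 5 here ...
    cache.set(key, value + 1)         # ... and both write 6, losing one increment

def safe_increment(conn, key):
    # The read-modify-write happens on the server as a single atomic call.
    return conn.call("increment", key)
```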

One million dollars


This amount is not an invention for the sake of a catchy title; it is money actually saved in one of the Mail.Ru Group projects. We needed to store user profiles somewhere. Before that they lived in the old storage, and we were thinking about where to move them. Initially we considered MySQL. We deployed 16 MySQL replicas and shards and began gradually mirroring the read and write load from the profiles onto them. Profiles are small pieces of information (from 500 bytes to a kilobyte) storing the full name, the number of sent letters, various flags and service data, the kind of thing usually needed on every page. This data is requested and updated very often. At 1/8 of our total load the farm of 16 MySQL servers fell over, and that was after all the optimizations we had applied to it. After that we decided to try Tarantool. It turned out that it calmly held, on four servers, the load that had previously been spread across 128 servers. In fact it held the load even on a single server; we put four for safety. The savings of 128 servers, together with the reduced hosting costs, amounted to a million dollars.

And this is just one case. Tarantool has found use in many of our projects. For example, 120 Tarantool instances work in Mail and in the Cloud. If MySQL were used there at the current load level, we would have to install tens of thousands of servers with MySQL or some other SQL database: PostgreSQL, Oracle, whatever. It is hard even to estimate how many millions of dollars that would cost. The moral of the story is that the right tool must be chosen for each task; it lets you save a great deal of money on the simplest feature. Cold data should be stored in an SQL database intended for that, while hot data, which is requested and updated often, should be stored in storage adapted for it, which is Tarantool.

In version 1.7, which is currently under development, we want to make a fully automatic cluster solution with sharding and Raft-like replication. Stay tuned for updates!

Source: https://habr.com/ru/post/273695/

