One of my favorite topics in programming is digging into the negative side effects of operations we consider trivial.
One such operation is deleting records from a database. Most programmers believe that deletion speeds the database up and makes it more compact. The catch is that this is not true: for relational databases it is only partly true, and for NoSQL it can be completely false.
Below we will look at this problem in Apache CouchDB.
How data is stored
Data in any database is stored much like in a file system: there is an allocation map and a file where the data itself is placed. For SQL the structure is usually a table; for NoSQL it is usually a tree.
When we delete data, just as in a file system, the database does not waste time re-creating the map and the data file without the deleted record. It simply marks the record as deleted in the map. This is easy to verify: create a simple MyISAM table in MySQL, insert one record, delete it, and look at the statistics:
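A minimal sketch of that experiment (the guest table matches the OPTIMIZE statement below; the column name and the inserted value are illustrative):

-- create a MyISAM table, insert one record, then delete it
CREATE TABLE guest (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255)
) ENGINE=MyISAM;

INSERT INTO guest (name) VALUES ('test');
DELETE FROM guest WHERE id = 1;

-- Data_free in the output is space held by deleted records that
-- has not been reclaimed; after the DELETE it is non-zero
SHOW TABLE STATUS LIKE 'guest';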

To reclaim this space, we need to re-create the map and data files. Run:
OPTIMIZE TABLE guest;
and get the space back.
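To confirm, run the same status query again; for a MyISAM table Data_free should drop back to 0:

-- the map and data files were rebuilt without the deleted record
SHOW TABLE STATUS LIKE 'guest';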

Interestingly, the non-optimized table works almost as fast as the optimized one. This is because the layout of relational data is usually quite simple, and it is cheap to compute how to skip over deleted records. That does not mean the overhead can be ignored, of course, but a trivial bash script that runs OPTIMIZE on a schedule is enough; no extra work in the application code is needed.
Note that the above does not entirely apply to InnoDB, which has even more nuances of its own, but this article is about CouchDB, not MySQL.
How deletion works in CouchDB
Take a simple document:
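For illustration, a made-up session document (the id, revision, and fields are placeholders):

{
    "_id": "session-42",
    "_rev": "1-<hash>",
    "user": "alice"
}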

and delete it. What happens? The database marks the document as deleted. How? It reads the document, strips out all of its fields, adds a _deleted: true property, and writes the result as a new revision. Example:
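A sketch of the cycle over CouchDB's HTTP API (host, database name, and revision strings are placeholders):

# create the document
curl -X PUT http://127.0.0.1:5984/sessions/session-42 \
     -H 'Content-Type: application/json' \
     -d '{"user": "alice"}'

# delete it, passing the current revision
curl -X DELETE 'http://127.0.0.1:5984/sessions/session-42?rev=1-<hash>'

# what is stored under the new revision is only a tombstone:
#   {"_id": "session-42", "_rev": "2-<hash>", "_deleted": true}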

Now, if you try to fetch the latest version of the document, you will get a 404 error indicating that it has been deleted. However, if you request the first revision explicitly, it is still available:
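For example, with the placeholder names from above:

curl 'http://127.0.0.1:5984/sessions/session-42?rev=1-<hash>'
# returns the original document body, as long as compaction has not yet run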
Next we run compaction. For deleted documents, compaction removes every revision except the one that says the document has been deleted. That revision is kept so that replication can tell other databases about the deletion, and it stays in the database forever; it cannot be removed. (True, there is _purge, but it is a crutch with many negative side effects and is not recommended for production.)
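Compaction is triggered per database over the same HTTP API (server admin rights are required):

curl -X POST http://127.0.0.1:5984/sessions/_compact \
     -H 'Content-Type: application/json'
# afterwards, a deleted document keeps only its tombstone revision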
How this affects performance
In CouchDB, data is stored in a B+ tree. A deleted document, even though it is only a stub, remains part of that tree. This means every dead record is still accounted for, not only when building indexes but also on an ordinary insert, since inserting a document can trigger a rebuild of the tree, and the more records there are, the slower that gets.
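You can see how many tombstones a database is dragging along in its info document:

curl http://127.0.0.1:5984/sessions
# the response includes "doc_count" and "doc_del_count";
# everything counted in doc_del_count still sits in the B+ tree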
The finishing blow
Finally, it remains to consider how fast Erlang is. Synthetic benchmarks show that Erlang's performance is close to PHP's. In other words, the B+ tree is manipulated by far from the fastest language.
How it slows down in reality
If writes are not a rare operation for you, then with several million documents in the tree you can suddenly (and it really is sudden) discover that the database has started to slow down badly. For example, suppose you use CouchDB to store short-lived documents (sessions, lock files, queues). Here is a chart from a real production system:

The graph shows that the peaks are quite sharp, and a sharp rise is not always predictable. Sometimes the database handles about 2 million updates (about 1 million documents in the tree) perfectly well, then another 100 thousand arrive and performance goes down the drain. The sharp drops happen when we re-create the database, after which performance stays acceptable for a few weeks.
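Re-creating here means making a fresh database and pointing the application at it; replicating into the new database would carry the tombstones along with it. With the placeholder names from above:

# create a fresh database and switch the application over to it
curl -X PUT http://127.0.0.1:5984/sessions_v2
# then drop the old, bloated database
curl -X DELETE http://127.0.0.1:5984/sessions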
Conclusions
- CouchDB stores all documents in a B+ tree that is periodically rebuilt, and Erlang is not the fastest language for that job. Do not use CouchDB for short-lived documents, or the tree will grow too large, because documents are never truly deleted from it.
- Even if you never delete documents, adding new ones will lag once you have several million records.
- I also recommend the article "16 practical tips on working with CouchDB".
- It becomes clear why Damien Katz, the creator of CouchDB, decided to fork the project as Couchbase and rewrite the core in C. Incidentally, Couchbase has built-in memcached, which lets you keep short-lived documents in a separate area.