
Pitfalls in Project Voldemort

In one of our projects we used a thing called Project Voldemort.
In short, it is a very interesting implementation of key-value storage, aka a NoSQL database, built in the depths of LinkedIn. You give it a key and a value, and it quickly stores and returns them in memory while also persisting them to disk. What makes it interesting is not so much that, but its clustering implementation and good speed; it is also often used in Java projects. There has been no detailed review of this database on Habr, and one could certainly be written, but here I want to tell you about a few rakes we managed to step on.
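For those who have not seen it, here is a minimal sketch of what reading and writing through its Java client looks like; the bootstrap URL and the store name "test" are just placeholders for the example, not our real configuration:

import voldemort.client.ClientConfig;
import voldemort.client.SocketStoreClientFactory;
import voldemort.client.StoreClient;
import voldemort.client.StoreClientFactory;
import voldemort.versioning.Versioned;

public class VoldemortExample {
    public static void main(String[] args) {
        // bootstrap URL and store name are assumptions for this sketch
        StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));
        StoreClient<String, String> client = factory.getStoreClient("test");

        client.put("some-key", "some-value");                 // store a value
        Versioned<String> versioned = client.get("some-key"); // read it back
        System.out.println(versioned.getValue());
    }
}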
One problem surfaced in production: under heavy traffic the Voldemort database started to swell at a terrifying rate, literally dozens of gigabytes per hour, even though the developers claimed there should not be that much data. We had to dig.
The "digging" revealed the following. By default, Voldemort uses the so-called BDB JE, Berkeley DB Java Edition, as its backend, and it turned out that this JE is nothing like the usual Berkeley DB. It is append-only, built on the same principle as journaling filesystems: for any operation (write, update, delete) the data is appended to the files on disk, and old records are never deleted in place. A special cleaner process then removes stale data: it checks the overall utilization of the database files, and if it drops below bdb.cleaner.minUtilization percent (50% by default) it starts examining each file; any file whose utilization is below bdb.cleaner.min.file.utilization percent (5% by default) is deleted, with its live data copied into a new file.
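In Voldemort these knobs are set in server.properties, and underneath they are, as far as I can tell, passed through to JE's own je.cleaner.* properties. Just to make the mechanism concrete, here is a rough sketch of the same settings applied to a JE environment opened directly; the values simply repeat the defaults mentioned above:

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import java.io.File;

public class CleanerSettings {
    public static void main(String[] args) {
        EnvironmentConfig config = new EnvironmentConfig();
        config.setAllowCreate(true);
        // overall log utilization the cleaner tries to maintain (default 50%)
        config.setConfigParam("je.cleaner.minUtilization", "50");
        // per-file threshold below which a file is cleaned regardless (default 5%)
        config.setConfigParam("je.cleaner.minFileUtilization", "5");
        // number of cleaner threads (default 1)
        config.setConfigParam("je.cleaner.threads", "1");

        Environment env = new Environment(new File("/usr/local/voldemort/data/bdb"), config);
        System.out.println("environment opened, cache size: " + env.getConfig().getCacheSize());
        env.close();
    }
}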
Fine. It looked like we just needed to play with these parameters, but something did not add up: if utilization were really at 50%, there would not be this much data sitting on disk.
Let's check:

# java -jar /usr/local/voldemort/lib/je-4.0.92.jar DbSpace -h /usr/local/voldemort/data/bdb -u

File Size (KB) % Used
-------- --------- ------
00000000 61439 78
00000001 61439 75
00000002 61439 73
00000003 61439 74
...
000013f6 61415 1
000013fd 61392 2
000013fe 61411 3
00001400 61432 2
00001401 61439 1
...
0000186e 61413 100
0000186f 61376 100
00001870 16875 95
TOTALS 112583251 7

Oops. So the cleaning is not working. We try increasing the number of cleaner threads by playing with bdb.cleaner.threads (default 1) - no effect. After some googling we stumble upon a thread on the forum dedicated to BDB JE (which, as it turned out, is a very useful forum; if you use BDB JE in any way, be sure to read it).
The thread (unfortunately, I can't find it now) clearly states that the size of the BDB cache can strongly affect cleaning. If the cache is too small, cleaning may not even start: with a large number of keys it is desirable that they all fit into the cache, otherwise cleaning performance drops dramatically. The required cache size can be estimated with the following command:

# java -jar /usr/local/voldemort/lib/je-4.0.92.jar DbCacheSize -records 1000000 -key 100 -data 300

Inputs: records=1000000 keySize=100 dataSize=300 nodeMax=128 density=80% overhead=10%
Cache Size Btree Size Description
-------------- -------------- -----------
177,752,177 159,976,960 Minimum, internal nodes only
208,665,600 187,799,040 Maximum, internal nodes only
586,641,066 527,976,960 Minimum, internal nodes and leaf nodes
617,554,488 555,799,040 Maximum, internal nodes and leaf nodes
Btree levels: 3


(where key and data are the average key and data sizes in bytes, and records is the number of records).
That is, for our data we need roughly 200 MB of cache per million records, and we had well over a million records. :(
Bottom line: after setting an adequate cache size (bdb.cache.size), database utilization rose from 7% to the required 50% within 24 hours, and the database accordingly shrank to half its size.
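For reference, bdb.cache.size is the Voldemort-level setting; at the JE level the same knob is the environment cache, which could be set directly along these lines (the 2 GB value is purely illustrative, not our production figure):

import com.sleepycat.je.EnvironmentConfig;

public class CacheSizeExample {
    public static void main(String[] args) {
        EnvironmentConfig config = new EnvironmentConfig();
        // roughly 200 MB per million records in our case; 2 GB here is only an example
        config.setCacheSize(2L * 1024 * 1024 * 1024);
        System.out.println("cache size set to " + config.getCacheSize() + " bytes");
    }
}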
The moral: study the technologies you rely on, even when you use them indirectly rather than directly.


Source: https://habr.com/ru/post/137704/

