
What happens at the junction of the database and operating system?



Alexander Krizhanovsky (krizhanovsky)


Based on the talk "Linux Kernel Extension for Databases" at HighLoad++ 2016.

Today I will talk about the internals of the database, the internals of the OS, and what happens at their junction. When I say "database", I mostly mean I/O, memory management, and transaction processing. This is not about SQL, indexes, locks, and so on - only I/O, memory management, and transactions.

Good evening! My name is Alexander Krizhanovsky, and I work in two places. In one, we develop high-performance custom applications and do consulting, mainly around network traffic processing systems and databases. In the other project - Tempesta Technologies (a spin-off of NatSys Lab) - we are developing an open-source application delivery controller. It is a hybrid of a security tool, a web accelerator, and so on, which helps you deliver content quickly and reliably.

Today I want to talk about one part of this project. Since it is mainly a web content accelerator, it has a lot in common with databases. But more about it at the end; I will begin with my very old, unsuccessful experience of building my own lightweight database.



Many years ago I had the task of building very fast storage for instant-messenger history. This store had to keep users' messages under a key of three components: the sender, the recipient, and the message timestamp. Great performance was required. We started from the assumption that consistency did not matter to us at all and wanted to do all of this on top of plain file storage; then we concluded that losing users' messages is not acceptable and we wanted some consistency. And since we had started with file storage, we thought two or three months would be enough to build something on top of files. The deadline was very tight, so it all had to be done quickly.



Where I mention a specific database, I will refer to InnoDB, which is the one most familiar to me.



If we look at a transactional database engine, we see roughly this picture. We have a buffer pool holding index pages and data pages. Changes to these pages go out through the transaction log, and everything passes through the I/O scheduler in both directions; when the I/O scheduler reads something from disk, it also reads ahead. All of this goes through the file system and ends up on disk.



Now let's look at how the Linux I/O subsystem - and, in principle, the OS - is organized; it is on the left of the slide. We see a very, very similar picture: there is the page cache where pages are stored, and they are flushed through VFS to the file system. The file system usually comes with its own transaction log, all of this also goes through the I/O scheduler, and various smart things happen there too.



Traditionally, databases were built on O_DIRECT. This slide is a rather old excerpt from a letter by Linus. In general, OS developers do not welcome this approach, and for a long time there has been a misunderstanding between OS and database developers about how things should be.



At that time I decided that we would not try to throw out the OS, as most database developers do, but would make maximum use of its interface, since we had little time. We have the wonderful mmap system call that can help us. It has all the features we need, which we just saw in the left picture. First, it handles page replacement: when a page is evicted, it is written to disk, and it is written atomically - the file system guarantees this. Further, mmap writes pages back in the background, so we get persistence. The I/O is managed by the OS, which has various advanced I/O schedulers - this is also good for us - and there is one more nice feature: we even get an index for free.
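To make this concrete, here is a minimal sketch (my illustration, not the talk's actual code) of using mmap as a persistent store: the file name and sizes are arbitrary, and msync is called only where we explicitly want durability.

```c
/* Minimal sketch: map a file and use it as the in-memory store;
 * the kernel writes dirty pages back for us in the background. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STORE_SIZE (1UL << 30)  /* 1 GB backing file, illustrative size */

int main(void)
{
    int fd = open("store.db", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, STORE_SIZE) < 0)
        return 1;

    char *store = mmap(NULL, STORE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (store == MAP_FAILED)
        return 1;

    /* Ordinary memory writes become file writes; the page cache and
     * the I/O scheduler decide when they actually hit the disk. */
    strcpy(store, "hello, persistent world");

    /* Force the dirty range to disk when we need durability. */
    msync(store, 4096, MS_SYNC);

    munmap(store, STORE_SIZE);
    close(fd);
    return 0;
}
```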



Let's look at that last feature. An mmap region is just a contiguous memory area, large enough for our needs. There are virtual addresses and physical ones, and every memory access must go through virtual-to-physical address translation. The page table is used for this; it performs the resolution. At the top we have a 48-bit virtual address as the key, of which only 36 bits are used for indexing. We walk down, biting off 9 bits of this key at each level. What we get is essentially a radix tree.



If this picture is redrawn, it looks roughly like what is shown on the left. The first 9 bits of the key determine the index in this array, then we follow a pointer to the next level, and so on. This structure is called a radix tree. And if we build some binary tree in RAM - say, an STL map, which uses a red-black tree - then on every lookup, as we descend the tree and want to move from the blue element to the green one, we must translate the address and thus walk our big page table, making four memory accesses. It turns out that in order to visit each element of our application-level tree, we have to walk another tree: the page table.
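For illustration, here is how a 48-bit x86-64 virtual address decomposes into four 9-bit page-table indices plus a 12-bit page offset - just a demonstration of the radix-tree walk described above, with an arbitrary example address.

```c
/* Illustration only: how x86-64 4-level paging splits a 48-bit virtual
 * address into four 9-bit radix-tree indices and a 12-bit page offset. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t vaddr = 0x00007f1234567890ULL;   /* arbitrary example */

    unsigned pgd = (vaddr >> 39) & 0x1ff;  /* level 4 index */
    unsigned pud = (vaddr >> 30) & 0x1ff;  /* level 3 index */
    unsigned pmd = (vaddr >> 21) & 0x1ff;  /* level 2 index */
    unsigned pte = (vaddr >> 12) & 0x1ff;  /* level 1 index */
    unsigned off = vaddr & 0xfff;          /* offset inside the 4 KB page */

    printf("pgd=%u pud=%u pmd=%u pte=%u offset=%u\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```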



Given the idea that we have an index out of the box, we can mmap a contiguous memory area at some fixed address, take keys from a fixed range and use them directly as addresses, and the page table will do the resolution for us.



If we look at this more closely, it is really nothing more than a plain C array: a large C array is, in fact, already a tree.
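A hedged sketch of this "index for free" idea: reserve a large region at a known address and let plain array indexing do the lookup, with the page table resolving the rest. The base address, key range and record layout are invented for the example.

```c
/* Sketch of the "index for free" idea: reserve a big region at a fixed
 * base address and let key -> &tbl[key] be the whole index.  The base
 * address and record layout here are illustrative, not the talk's code. */
#include <stdio.h>
#include <sys/mman.h>

struct record {
    long value;
};

#define BASE  ((void *)0x600000000000UL)   /* chosen fixed address */
#define NKEYS (1UL << 24)                  /* 16M possible keys    */

int main(void)
{
    struct record *tbl = mmap(BASE, NKEYS * sizeof(struct record),
                              PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                              -1, 0);
    if (tbl == MAP_FAILED)
        return 1;

    /* "Lookup" is plain array indexing: the page table does the work,
     * and only pages that are actually touched get allocated. */
    tbl[123456].value = 42;
    printf("%ld\n", tbl[123456].value);

    munmap(tbl, NKEYS * sizeof(struct record));
    return 0;
}
```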



Perhaps little can be done with this, but what is interesting here is that virtual memory is not cheap for us at all. The first thing I want to point out is the third bullet: if we use virtual memory as an index out of the box with keys that have poor locality, i.e. widely scattered keys, everything becomes very slow, because we use a lot of memory. We store one element but consume the whole page where it lies, and each node of this tree is a page. The tree becomes very expensive.

This tree is cached in the TLB. The TLB is small and, unfortunately, on a context switch it has to be flushed completely, because different processes can use the same virtual addresses, so all these translations fly out. The good thing is that a user-space/kernel-space switch leaves the caches intact, because the kernel and user space work within the same page table.



Actually, back when I started working with mmap, I understood the following. First, if we work with big data, we must maintain spatial locality as much as possible: you cannot simply mmap a large area and use it any which way, you need to keep the keys as local as possible. Second, large mappings are quite expensive, because if you mmap a large area, you spend a lot on the page table: this entire area must somehow be addressed.



In fact, I am not the first to try to reuse the OS mechanisms through mmap. Even 35 years ago people asked the same question: it is somehow odd that we seem to have all the mechanisms, yet operating systems do not meet the requirements of database developers.

Michael Stonebraker formulated these requirements. They are quite interesting, and it is notable that virtually all of them except the last are still relevant. Over these decades processes have become quite lightweight, we have threads, much has changed, mmap appeared (this article even predates mmap). The rest is, in principle, more or less relevant. The first point is that with mmap we do not control eviction: we do not know when which page will be evicted, in what order, and so on.

Further: the file systems we usually deal with are not structured; they simply store sequences of bytes. We have no real transactions at all. On the one hand, we say there is a transaction log; on the other hand, there are no transactions in the database sense. And there is the problem of contiguous block allocation on disk.

The last three problems are now more or less solved, while the first three have not been solved and are not being solved. Let's go through them from the end.



The first point is data locality. File-system extents help us here: modern file systems allocate disk space in fairly large chunks, so file block fragmentation on disk stays rather low.



The fallocate system call also helps here. If we create a large file, or are about to make a large write into a file, we can reserve a sequential area on disk and write the data there.
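A small sketch of how that reservation might look; the path and size are arbitrary.

```c
/* Sketch: reserve a contiguous on-disk region up front with fallocate(),
 * then write into it; file name and size are illustrative. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int reserve_data_file(const char *path, off_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    /* Ask the file system to allocate 'size' bytes in large extents,
     * so later writes land in (mostly) sequential blocks. */
    if (fallocate(fd, 0, 0, size) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```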

Our file systems not only allocate data in extents, they also address them: most file systems index their blocks through trees - B-trees, hash trees - and the more data blocks they have to index, the taller this tree is. Accordingly, if they index large extents, the tree is quite shallow. If we pre-allocate a file with a couple of fallocate calls, the index is even smaller. Thus the tree that the file system uses for block addressing is very shallow, and this problem is largely leveled out too.



Now the most interesting part: how transactions are processed. Here I have schematically tried to draw what transaction processing in InnoDB looks like. We have two dirty pages, each with two records changed, shown in red. The most important thing for us is that InnoDB uses the so-called steal and no-force policy. Steal means these pages can be flushed to disk at any time, even before the commit. No-force means that at commit time it is not required that all your data is already on disk. So we use undo and redo logs: the undo log rolls back data that was flushed before its transaction committed, and the redo log is used for recovery.
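As an illustration of why both logs are needed under steal/no-force, here is a rough sketch of the two kinds of log records such a policy implies. The structure names and fields are invented for the example; this is not InnoDB's actual on-disk format.

```c
/* Illustration only (names invented): with a steal / no-force policy the
 * engine needs both kinds of log records.  An undo record keeps the old
 * image, so an uncommitted change that was stolen (flushed early) can be
 * rolled back; a redo record keeps the new image, so a committed change
 * that was not forced to disk can be replayed after a crash. */
#include <stdint.h>

struct undo_record {
    uint64_t txn_id;      /* transaction that made the change    */
    uint64_t page_no;     /* which page was modified             */
    uint16_t offset, len; /* where in the page                   */
    uint8_t  old_image[]; /* bytes to restore on rollback        */
};

struct redo_record {
    uint64_t txn_id;
    uint64_t page_no;
    uint16_t offset, len;
    uint8_t  new_image[]; /* bytes to re-apply on crash recovery */
};
```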

It is also interesting that the undo log is the basis on which multi-versioning works: it stores old versions of records. In general, the transactional I/O layer of a database must be aware of records; we cannot implement a sufficiently flexible flushing strategy without knowing the semantics of the data we store, without knowing about records.



In principle, I have already said that we cannot guarantee when a page will be written, and we cannot talk in terms of records. Suppose we have two transactions that mostly modify different pages, but on one shared page they modify different records. Linux VMM knows nothing about records, so when the first transaction commits, we can say that we have changed, among others, the 3rd page and must flush it. Suppose we flush the three pages modified by the first transaction. But when we flush that shared page, we do not know whether the other transaction has already managed to make its changes on it or not. Together with the first transaction's changes we may flush the second transaction's changes - or we may not. And the worst thing is that we do not know this at all and cannot even find out.

We understand that a modern database implements more complex mechanisms than the file system, but maybe we can bite off some pieces of the transactional database engine and replace them with file-system mechanisms? Let's now run quickly through how file systems are organized and then try to answer this question.



The first thing to say about file systems is that the popular ones most people use are so-called journaling file systems: a regular file system with a log attached to it. When we make a write, we first write to the transaction log and then write to the disk, so a double write is required. On the one hand, the transaction log is written sequentially, which is good: it is small and consistent. On the other hand, since we have already written to the log, we can postpone writing the changes to their final location, accumulate them, and flush everything together in one large chunk.



Not everyone liked the double-write problem, and people tried to get away from it altogether. They said: OK, let's give up the file system structure and build the whole thing as one big transaction log. When data changes, we append it to the end of this log, which keeps growing and growing. A piece of a file changes - we append it to the end. Now the inode that refers to our original block must be changed, so we copy it to the end of the transaction log as well. And since inodes now float around the disk, we must keep an inode map somewhere else, which we also need to update with pointers to those inodes at the end. It turns out to be quite a complex structure with complex management, showing very poor performance. For example, NILFS is still in the Linux kernel; you can take it and try it out.



The continuation of this idea (something between the first and the second) is copy-on-write file systems: modern ZFS and Btrfs use this approach. Since both file systems support snapshots, they rely on the following property. Suppose we initially have the yellow tree and must add element 19. We copy the leaf block and append 19 to the copy. Now the root must refer to this new block, so we copy the root too and update the link. The old blocks remain, and at this moment we are left with two elements that will eventually be freed. Thanks to this, we can make snapshots. The log-structured idea is visible here: we just keep appending new changes.
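A toy sketch of this copy-on-write update (not ZFS or Btrfs code, structure and names are invented): nothing is modified in place, fresh copies are linked in from the root, and the old versions stay valid for snapshots until they are freed.

```c
/* Toy illustration of a copy-on-write tree update: copy the leaf, add
 * the new element to the copy, then copy the root so it points at the
 * new leaf; the old root and leaf are left untouched (for snapshots). */
#include <stdlib.h>
#include <string.h>

#define NODE_KEYS 4

struct node {
    int          nkeys;
    int          keys[NODE_KEYS];
    struct node *child;            /* single child is enough for the toy */
};

/* Assumes the leaf still has room for one more key. */
static struct node *cow_insert(const struct node *root, int key)
{
    /* 1. Copy the leaf and append the new key to the copy. */
    struct node *new_leaf = malloc(sizeof(*new_leaf));
    memcpy(new_leaf, root->child, sizeof(*new_leaf));
    new_leaf->keys[new_leaf->nkeys++] = key;

    /* 2. Copy the root and point the copy at the new leaf. */
    struct node *new_root = malloc(sizeof(*new_root));
    memcpy(new_root, root, sizeof(*new_root));
    new_root->child = new_leaf;

    /* The old root/leaf stay intact, e.g. referenced by a snapshot. */
    return new_root;
}
```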

And here we get the following property. If you have a database file of roughly constant size - for example, 100 thousand users/clients whose records you occasionally update - so your database files stay about the same size, grow slowly, but see a high rate of updates, then such a file system will keep allocating new disk space and keep seeking across the disk, and there will also be fragmentation problems: those four blocks used to be contiguous on disk, and after we rewrite and relink them, a hole is left, so locality suffers. On the other hand, if we keep appending new data all the time, such a file system makes sense: it writes the data only once, not twice. Therefore Btrfs is probably better for OLAP-style loads, when you periodically load large amounts of data onto disk and read them back sequentially. If you have heavy updates (an OLTP load), a classical file system is better.



The last thing I wanted to mention is the FreeBSD file system (soft updates). It says we can avoid logs entirely if we update carefully. If, for example, we want to add a new block to a file, we first allocate the block, write it, and only then write a reference to it in the inode. If we crash somewhere in the middle, well, we lose the first block, but consistency remains. If, on the contrary, we first changed the inode to refer to the new block while the block was not yet allocated, the consistency of the file would be broken. So it works, but, of course, the file system again knows nothing about the data; it can only maintain the consistency of its own structure, not of the data. As for XFS and EXT4: XFS journals only metadata, while EXT4 can be configured to journal the data as well, i.e. EXT4 lets you keep data more consistent.



And so we come to the question: can we throw out some database mechanisms and use the file system instead? Yves Trudeau from Percona wrote a wonderful blog post about how he tried to remove the doublewrite buffer and replace it with the file system's transaction log. As far as I remember, the post had two parts: the first said that everything was fine and worked out well, the second - that nothing worked.


The problem is that we cannot write big data atomically. Look at what we are doing: we make a big write with one system call that should write three blocks. We write the first, we write the second, and then we crash. The file system will not roll back the changes for us. The file system's transaction log is a very small file. On the other hand, we can call write for many gigabytes - try, for example, to save some large Blu-ray image - and there is nowhere to stash all that data. We can fallocate a large file and write data into it, but there is still nowhere to keep the old contents. The most a file system can do - even EXT4 in its maximal mode, which is supposed to ensure data consistency - is to write some part of the data passed to write() and not the other part. Something gets rolled back, something does not. Generally speaking, "file system data consistency" is a profanation.



Again we arrive at the conclusion that file systems are not structured, they know nothing about the data, and they do not provide data consistency.



Let's look at how data is evicted, flushed from memory to disk. Linux works with two lists: active and inactive. When a new page appears - we read it into memory or modified it - it sits in the active list while we are working with it, and once it becomes stale it moves to the inactive list by timeout. If it stays in the inactive list, from there it is flushed to disk, and from there the allocator can reclaim it. What is the problem here? These are two linear data structures with sequential access. We can have a lot of memory - modern servers have 128-512 GB of RAM - and if we divide it all into 4 KB pages, we get an enormous number of pages sitting on linear lists. If we keep accessing our data, memory keeps changing: we modify some pages, then touch others, they are added to these lists, and the inactive list can also swell by its own rules. At some point we can touch a page again and it moves back. So pages can circulate between these lists all the time; a page that is being flushed to disk can be read back from disk on access, and so on. In general, when the Linux kernel runs out of memory and has to scan these lists, everything becomes very slow. By the way, kswapd is a single-threaded daemon, and it must lock these lists while it walks them, to keep the data structures consistent with respect to the insertions and removals other processes make. Well, in general, everything is very bad.



We have several system calls that help us synchronize the data of an open file, say fsync, or of an mmap region, msync. But if we look closely at all these calls, they take an address and a length, or an offset and nbyte. We can flush a contiguous range, but we cannot say: "Please flush the 1st, 3rd and 5th pages." That we cannot do. If we flush by making one big write, we cannot write it atomically. And we do not control when the Linux kernel writes out a given page - maybe before this sync, maybe after. When we mmap a file, we generally know nothing about the state of the file on disk: which data is synchronized and which is not. This is bad.
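A small sketch of the limitation: msync takes one contiguous [addr, length) range, so flushing non-adjacent pages means several calls, none of them atomic with respect to one another. The helper name is mine, purely for illustration.

```c
/* Sketch: msync()/fsync() only take one contiguous range, so flushing
 * pages 1, 3 and 5 of a mapping means three separate calls (or one call
 * over the whole span, which also flushes pages we did not want). */
#include <sys/mman.h>
#include <unistd.h>

static int sync_pages(char *map, long page, long npages)
{
    long psz = sysconf(_SC_PAGESIZE);
    return msync(map + page * psz, npages * psz, MS_SYNC);
}

/* Usage:
 *   sync_pages(map, 1, 1);
 *   sync_pages(map, 3, 1);
 *   sync_pages(map, 5, 1);
 * and even then nothing guarantees the three writes reach the disk
 * atomically, or that the kernel did not write some of them earlier. */
```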



We can slightly influence the eviction process inside Linux: we can give advice, for example, that a specific page is not needed. If we mark a page of an mmap region as unneeded, the kernel will most likely evict it first. But, again, we get no guarantee about when it will actually be evicted, either.
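A minimal example of such advice, assuming a file-backed mapping; madvise is only a hint, and the kernel may act on it whenever it likes. The wrapper name is invented.

```c
/* Sketch: madvise() lets us hint that a range of the mapping is no
 * longer needed, so the kernel can drop it first; it is only advice,
 * with no guarantee when (or whether) it is acted on. */
#include <stddef.h>
#include <sys/mman.h>

static void hint_page_unneeded(void *map, size_t off, size_t len)
{
    madvise((char *)map + off, len, MADV_DONTNEED);
}
```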






Let's run through the research being done in the area of file systems. First of all, there is Reiser4. They added transactionality, but again only for small writes - in my opinion, atomicity is given for 4-32 KB. So there is some sort of transaction support, but big data cannot be stored consistently, with a guarantee.



Btrfs gives us transactional interfaces. We can say that we want to start a transaction, perform operations on the file system - in my opinion, not only reads and writes of a specific file, but also renames, deletions and so on - and then finish the transaction. There are problems here, and this interface is not recommended for use: if we start a transaction and then our process dies, no one else can access the files it touched, and the transaction is not rolled back.
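For reference, a hedged sketch of what using those (long-deprecated) Btrfs ioctls looked like; they lived in linux/btrfs.h on older kernels and were eventually removed for exactly the reason described above, so treat this purely as an illustration, not a recommendation.

```c
/* Hedged sketch of the deprecated Btrfs userspace transaction ioctls the
 * talk refers to (older kernels only).  A process dying between start and
 * end cannot be rolled back, which is why the interface went away. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>
#include <unistd.h>

static int do_transactional_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    ioctl(fd, BTRFS_IOC_TRANS_START);   /* begin a filesystem transaction */
    ssize_t n = write(fd, buf, len);    /* change belongs to the transaction */
    ioctl(fd, BTRFS_IOC_TRANS_END);     /* commit; there is no rollback */

    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```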



There is Windows TxF. It appeared in some version of Windows and is probably still there, but it is declared deprecated - as far as I remember, for the same reasons as in Btrfs. There is also a fairly fresh research file system that lets you, at the highest, abstract level, wrap system calls over files in transaction semantics: you start a transaction, commit, roll back, and so on.



Further, there is an OS from 2009 that asks: why don't we make all operations in the OS transactional? Any sequence of system calls can be put under a transaction and rolled back, and for that purpose everything is patched - all data structures, whenever you change them. For example, there is the task_struct structure; if you change a process - say, modify its file descriptors or something else - you copy this whole structure, mark one as the old copy and the other as the new one, and work with those copies. Then you discard one of them. And, in general, you can roll it all back.



A more interesting example for us is the msync project. It does quite small things: it simply gives mmap a transactional interface. It lets you map a memory area with a new MAP_ATOMIC flag, patches msync so that it synchronizes only the changed data - not a contiguous range, but exactly what was recently modified - adds REDO logging and patches page write-back. These are quite small changes, but they let us work with an mmap region atomically.



The last item that Michael Stonebraker raised was that our file systems are unstructured. In fact, there are file systems that are structured and give you record-oriented interfaces. RMS - in my opinion it was on IBM mainframes and, accordingly, in the mainframe file systems - gives you files that you open with special system calls: not open, read, write as in Linux, but dedicated calls to open a file for sequential access, for access by index, and so on.

In practice it works so that when you close a file, everything is flushed. It looks like a database, it smells like a database - essentially there is a database inside. You have an index, you have transactions, you have an interface for adding a record, iterating over records, and so on. In essence, it is an embedded database implemented at the VMS kernel level. And this is not the 80s; it is quite a modern thing - you can download the documentation for it right now, and it is the documentation of a modern OS.



We are getting close to our own solution. We are also building a database - we needed one because we are building a web accelerator, and the accelerator must store page content. There is, for example, Apache Traffic Server, which implements a kind of database: it does not store each cached object in a separate file, but in a specially allocated space that can be split up and addressed efficiently.

We use our database the same way: the main thing we use TempestaDB for is the web cache, i.e. the cache of server responses. We also keep filtering rules, blocked clients, user sessions, etc. there.

Tempesta itself works inside softirq. It is a completely in-kernel solution, and we have now started working on opening up these interfaces and building something more general that could be used by new database applications. We ourselves needed new interfaces to access the database in order to manage the web cache conveniently, and a purely in-kernel database that is hard to reach from user space was no longer enough. All of this is still in development; for now it lives in a separate branch on GitHub.



Let's see what is being done and how all this machinery is used. We have a user-space utility that lets you run queries against the database, retrieve records, and so on - simple enough. Everything else happens at the kernel level in our database: the filter stores its filtering rules there, the web cache is stored there, and logging will be stored there.



The interface that is now in master works like this: the entire database lives inside the kernel, and we use the netlink-mmap interface to send a request through a socket to the kernel. The kernel runs the query and then returns the results to user space. netlink-mmap itself can map any kernel buffer into your user space, i.e. with zero copy we can transfer all data from the kernel to user space. The problem is that the database contains a lot of things. If you send a request like "give me all server responses for some period or by some index", we cannot simply map the database's own space into user space: first, those pages contain both records that match the query and records that do not; second, those records can be changing right now, online. So we act as follows: we allocate new pages on the side, collect the query results there, and then map those pages into user space without copying. After all, we wanted to go completely without copying, so that both the user-space process and Tempesta itself work with one buffer inside the Linux kernel as with one shared database, i.e. we get a multi-threaded database over the same data.




Now a new file system is being written that should provide the interface to TempestaDB. Through this interface we will have our own mmap, similar to the msync project. We now use a no-steal strategy with force (or no-force). No-steal, I remind you, means that we cannot flush a page to disk until there is a commit. Force means that when the commit happens, we are obliged to sync everything to disk; with no-force the commit happens, but the sync may come later.



Let's see how it works - first, how it is organized now. There is a segment of large pages that are allocated. We cut off a small area at the beginning: this is the transactional segment. Here we have the mmap interfaces, a thin virtual file system, and the application layer works through these interfaces. The entire database is mapped into user space through mmap, so both the index and the data are visible at the user-space level. Synchronization is handled by a separate management thread.



Suppose we have some transaction that changes two pages and three records. We change the data in huge pages, but we do not flush it; all of this is integrated into the Linux VMM, but this special area is not written back in the background.



Now, when we have recorded this data and a commit arrives from user space, we copy all this data into the transactional segment, lay it out sequentially, and later flush it with a sequential write in the flushing thread. At this point nothing has been flushed yet.



Now that the commit has happened, we flush all the data.



In another scenario, if we perform some transaction from inside the Linux kernel, the data we change cannot be written back from inside softirq; we have to ask a separate thread to flush it for us.







In TempestaDB we have several modes of operation. If you have a very large amount of content accumulated in the cache, it makes sense for each processor to go to the remote NUMA node: let it be slower, but the data is stored only once. On the other hand, if you need very high performance but there is little data, we can replicate the data into the local memory of each processor package, and each processor will work quickly enough with its local copy.



This is the mode for when we have little memory and a lot of content: we shard it into different areas, and when a request comes in, it is routed to one or another processor package.

And replication is when we fully copy all the content to each node.



We use a rather specific index: a cache-conscious data structure that knows about CPU cache lines. We try to pack as much data as possible into each cache line, and the whole algorithm operates on cache lines.



This is how our tree, the burst hash trie, works. It starts as a small hash table of only 16 elements. When a key arrives, we compute its hash, take the last 4 bits, and those 4 bits determine the slot in this hash table. While the hash table is empty, our key is stored right there, as in an array.



Then, as more values arrive, we can pack small values into a bucket of at most one page and work with it as with the chain of a regular hash table.



Then, when we run out of space in this bucket, a burst happens.



This means we grow a second level of the hash table. On a collision, the first 4 bits of the hash determine the index in the first node and the next 4 bits determine the index at the second level - it works roughly like a radix tree, only of non-constant height. And all the values have to be moved into the new buckets.
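A rough sketch of the burst logic follows; the names, fanout and bucket size are invented here for illustration, this is not TempestaDB's actual code.

```c
/* Illustration only: a burst hash trie node is a tiny 16-slot table
 * indexed by 4 bits of the key's hash.  A slot holds either a small
 * bucket of entries or a pointer to a deeper node; when a bucket
 * overflows, it "bursts" into a new node one level down, consuming
 * the next 4 bits of the hash.  (The sketch ignores running out of
 * hash bits at great depth.) */
#include <stdint.h>
#include <stdlib.h>

#define FANOUT     16   /* 2^4 slots per node           */
#define BUCKET_MAX  8   /* entries per bucket (example) */

struct bucket {
    int      n;
    uint64_t hashes[BUCKET_MAX];   /* real code would keep records here */
};

struct htrie_node {
    int   is_node[FANOUT];  /* slot holds a child node (1) or a bucket (0) */
    void *slot[FANOUT];
};

static void insert(struct htrie_node *node, uint64_t hash, int level)
{
    unsigned idx = (hash >> (4 * level)) & (FANOUT - 1);

    if (node->is_node[idx]) {               /* descend one level */
        insert(node->slot[idx], hash, level + 1);
        return;
    }

    struct bucket *b = node->slot[idx];
    if (!b) {
        b = calloc(1, sizeof(*b));
        node->slot[idx] = b;
    }
    if (b->n < BUCKET_MAX) {                /* room in the bucket */
        b->hashes[b->n++] = hash;
        return;
    }

    /* Burst: replace the full bucket with a new child node and move
     * everything down a level, consuming the next 4 bits of the hash. */
    struct htrie_node *child = calloc(1, sizeof(*child));
    node->slot[idx] = child;
    node->is_node[idx] = 1;
    for (int i = 0; i < b->n; i++)
        insert(child, b->hashes[i], level + 1);
    free(b);
    insert(child, hash, level + 1);
}
```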



Now about transactions. When we make a change, we write it into the transaction log inside the nodes of the tree: the tree nodes are written to the transaction log, the records themselves are also written to the transaction log, and once a record has been fully added, the index together with the data lands on disk consistently.

Thanks to all!

Contacts


» Krizhanovsky
» ak@tempesta-tech.com
» Project source code
» Blog

This report is a transcript of one of the best talks at HighLoad++, the professional conference for developers of high-load systems.


Source: https://habr.com/ru/post/335260/

