
Teaching the file system to read

What will happen in this article



We continue the series of articles on creating a file system in the Linux kernel, based on the materials of the OS course at the Academic University.

Last time, we set up the environment needed to get acquainted with the kernel. Then we took a look at loadable kernel modules and wrote a simple "Hello, World!". Finally, we wrote a simple and useless file system. It's time to continue.
The main purpose of this article is to teach the file system to read from disk. For now it will only read service information (the superblock and inodes), so it is still rather hard to use.

Why so little? The thing is, in this post we need to define the structure of our file system, that is, how it will be stored on disk. In addition, we will run into a couple of interesting topics, such as SLAB and RCU. All this will require some explanation (many words and little code), so the post will be quite long.



Boring start



Our file system should be simple, so it will be stored on disk simply: a superblock, a bitmap of blocks, a bitmap of inodes, an inode table, and then the data blocks.



Let's go over these pieces, starting from the end.



Actually, a similar disk layout is used in other file systems. For example, a block group in ext2/3 has a similar structure, only ext2/3 works with several such groups, while we confine ourselves to one. This format is considered obsolete, and new file systems are moving away from it. For example, btrfs uses a more interesting scheme, which gives it a number of advantages over the ext family. But back to the matter at hand.

We determined that the superblock and the bitmaps of blocks and inodes occupy the first three blocks of the file system, but how many blocks does the inode table occupy? Generally speaking, it is wrong to fix this value; it depends heavily on how the file system will be used. For example, if you are going to store mostly large files, it makes sense to make this table smaller, because you are unlikely to run out of inodes before you run out of disk space. On the other hand, if you intend to store many small files, there is a chance of exhausting the inodes before free disk space if the table is too small.

The mke2fs utility, which is used to format a disk for ext2/3 file systems, has an -i switch that sets the bytes-per-inode ratio; i.e., if you specify -i 16384, then one inode is created for every 16 kilobytes of disk space. I will use the simplest option: create an inode for every 16 KB of disk space, without the possibility of changing this value (for now at least).
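To make the arithmetic concrete, here is a hedged userspace sketch (not the actual formatting utility's code; the helper name and the structure size are assumptions) of how the number of inode-table blocks could be derived from the 16 KB-per-inode rule:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the total number of blocks and the block
 * size, compute how many blocks the inode table needs when one inode
 * is created per 16 KB of disk space (as mke2fs -i 16384 would do). */
static uint32_t inode_table_blocks(uint32_t blocks, uint32_t block_size,
                                   uint32_t disk_inode_size)
{
	uint64_t disk_bytes = (uint64_t)blocks * block_size;
	uint32_t inodes = disk_bytes / 16384;          /* one inode per 16 KB */
	uint32_t per_block = block_size / disk_inode_size;

	/* round up to whole blocks */
	return (inodes + per_block - 1) / per_block;
}
```

For a 100 MB image with 4 KB blocks and a 32-byte on-disk inode, this gives 6400 inodes packed 128 per block, i.e. 50 blocks of inode table.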

The last general point to touch on is the block size. A file system can work with blocks of different sizes; I will support blocks of 512, 1024, 2048, and 4096 bytes, nothing unusual. The reason is that it is easier to work with blocks that fit into a page (we will return to this a bit later), but it is not strictly necessary; moreover, larger block sizes can yield better performance.

In general, choosing the right block size for classic file systems is a rather interesting topic. For example, a well-known book on operating systems cites data showing that with a block size of 4 KB, 60-70% of files fit in one block. The more files fit in one block, the less fragmentation and the higher the read speed, but the more space is wasted. In our case, on top of everything else, the block size is the main limiter of file system size: with a block size of 4 KB, the bitmap of free blocks can cover only 128 MB of disk space.
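The 128 MB figure follows from simple arithmetic: a one-block bitmap holds block_size * 8 bits, each tracking one block. A quick sanity check (a userspace sketch, not driver code):

```c
#include <assert.h>
#include <stdint.h>

/* Maximum disk size (in bytes) that a single-block free-block bitmap
 * can describe: block_size * 8 bits, one bit per block. */
static uint64_t max_disk_bytes(uint32_t block_size)
{
	uint64_t tracked_blocks = (uint64_t)block_size * 8;

	return tracked_blocks * block_size;
}
```

With 4096-byte blocks the bitmap tracks 32768 blocks, i.e. 128 MB; with 512-byte blocks the limit drops to 2 MB.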

Back to practice



It's time to write some code. Let's look at the structures we will use, beginning with the simplest:

struct aufs_disk_super_block {
	__be32	dsb_magic;
	__be32	dsb_block_size;
	__be32	dsb_root_inode;
	__be32	dsb_inode_blocks;
};


The superblock structure is stored at the very beginning of disk block 0. It begins with a magic number, which lets us make sure that it is indeed aufs that is stored on the disk (I mentioned this last time).

Given that we will support different block sizes (which, incidentally, will require almost no effort from us), we need to know which size is used in this particular file system, and the superblock stores this information for us in the dsb_block_size field.

In addition, the number of blocks in the inode table is stored in the dsb_inode_blocks field; I have already mentioned that the size of this table may vary.

The dsb_root_inode field stores the inode number of the root directory. Of course, it is not strictly necessary to store it; you could simply use a fixed number for the root, but then the structure would be almost empty, and with this field it looks more solid.

Notice that fixed-size types are used for the fields. This is because we will write this structure to disk "as is". However, fixing the size is not enough; we also need to fix the byte order. I will use big endian, i.e., network byte order, which is reflected in the type name (__be32).

In principle, it does not particularly matter which order we use; the main thing is that it is fixed. Little-endian platforms are believed to be more common, so little endian is arguably preferable, but let's get back to business.

The __be32 type is, in fact, a synonym for uint32_t, but its name emphasizes that the variable stores data in big endian (a sort of documentation). The kernel has a similar type for little endian (__le32).
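In userspace the same conversion can be sketched by hand; be32_to_cpu itself is kernel-only, so the function below is a stand-in that decodes a big-endian 32-bit value byte by byte, independent of the host's byte order:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for be32_to_cpu: interpret 4 bytes stored in
 * big-endian (network) order as a host-order uint32_t. */
static uint32_t be32_decode(const void *buf)
{
	const uint8_t *b = buf;

	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}
```

Decoding byte by byte (rather than casting and swapping) works identically on both little- and big-endian hosts.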

Now let's look at what is perhaps the most important file system structure, the inode:

struct aufs_disk_inode {
	__be32	di_first;
	__be32	di_blocks;
	__be32	di_size;
	__be32	di_gid;
	__be32	di_uid;
	__be32	di_mode;
	__be64	di_ctime;
};


The inode, first of all, determines where a file or directory is stored on disk. There are many ways to store files; I will use a rather simple one: one extent per file. An extent is a contiguous sequence of disk blocks, i.e., each file or directory will be stored in one contiguous run of blocks. The di_first and di_blocks fields store the first block and the number of blocks in the extent, respectively.
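With one extent per file, mapping a position inside a file to a disk block is a single addition. A hedged sketch (the helper is illustrative, not part of the driver; returning 0 as "out of range" is my convention, since block 0 holds the superblock and never file data):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: translate a block index inside a file into a disk
 * block number, given the file's extent (first block + length).
 * Returns 0 if the index lies past the end of the extent. */
static uint32_t extent_map(uint32_t di_first, uint32_t di_blocks,
                           uint32_t file_block)
{
	if (file_block >= di_blocks)
		return 0;
	return di_first + file_block;
}
```

Compare this with ext2-style indirect blocks, where the same lookup may require reading up to three extra metadata blocks.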

You may ask: how, in such a file system, do you append data to the end of a file or add entries to a directory? Indeed, with this storage scheme a full implementation of operations that change the size of a file or directory is a headache (not to mention the efficiency of such an implementation), so we will not implement writing fully; more on that in another article.

However, such an organization has its upsides: files are not fragmented, which is good for sequential read speed. That is why this scheme works remarkably well in read-only file systems, for example in ISO 9660 (although it already supports file fragmentation).

Granted, extents are a simple structure and cannot do much on their own, but combined with the classic tree structures for storing file layout on disk, they turn out to be a pretty good option even for file systems with fragmentation.

Besides pointing to the file's disk space, the inode also stores the actual file size in the di_size field; after all, a file's size is not obliged to be an exact multiple of the block size.

The di_gid and di_uid fields are the group and user IDs. Storing such information in a file system does not always make sense; I included them as an example.

The di_mode field stores the access rights for the file's owner, the owner's group, and all other users. Since I stored the group and the owner, the access rights should be kept as well. di_mode also stores the type of object the inode describes, for example, whether it is a directory or a regular file.

Finally, the di_ctime field stores the file's creation date. File systems usually store, along with the creation date, the dates of last modification and last access, but we will skip those.

Formatting the disk



So, having determined the on-disk storage format of the file system, it's time to write a utility that will bring a disk into that format. In Linux, a disk is just a file (it is appropriate to recall the well-known Unix design decision). So formatting a disk is simply writing the necessary data to a file. And the necessary data in our case is the superblock, the bitmaps, and the root directory (empty for now).

In order not to turn an article about the Linux kernel into an article about C++ (especially in light of Linus's attitude toward the latter), I suggest you take a look at the utility's source code on GitHub yourself; here I will only note the essentials.



The utility allows you to change the block size via the -s or --block_size option, and the number of blocks to be used for the file system via the -b or --blocks option.

Overall the utility is quite simple, yet many may rightly think that the code is over-engineered for such a simple task. The thing is, we will first teach the file system to read from disk, and only then to write. But to test that our driver works, something must be written to the disk. Therefore, we will later add to the utility the ability to import files and directories while formatting a disk, which will help us a lot.

Back to the file system



Now back to our loadable module. We begin by reading the superblock from disk. But before that, we will create another structure for the superblock:

struct aufs_super_block {
	uint32_t	asb_magic;
	uint32_t	asb_inode_blocks;
	uint32_t	asb_block_size;
	uint32_t	asb_root_inode;
	uint32_t	asb_inodes_in_block;
};


This structure will represent the superblock in memory. The idea is simple: we read aufs_disk_super_block from the disk and convert it into aufs_super_block, converting the byte order and computing various useful values along the way (in this case, asb_inodes_in_block). In general, this structure is a great place for any file-system-wide "global" variables.

Counting the structure from the last post, we now have three structures representing the superblock: the kernel's struct super_block, the on-disk aufs_disk_super_block, and the in-memory aufs_super_block.



The first two are understandable, but why do we need the third? The thing is that Linux knows nothing about our file system, so naturally super_block (like inode, like any other Linux kernel structure) does not contain all the fields we need. Therefore, we have to introduce our own auxiliary structures and link them to the kernel's structures. How do we link them?

The kernel has two common ways of organizing such a link (let's call them composition and inheritance). For the superblock we will use composition. This method requires support from the kernel: there is a handy field inside the super_block structure:

struct super_block {
	...
	void *s_fs_info;
	...
};


In this field we can save a pointer to arbitrary data; here we will store a pointer to aufs_super_block. Then, wherever we have access to the super_block structure, we can also reach the aufs_super_block structure. But enough digression; our task is to read the superblock from disk. To do this, we will write a couple of functions:

static struct aufs_super_block *aufs_super_block_read(struct super_block *sb)
{
	struct aufs_super_block *asb = (struct aufs_super_block *)
		kzalloc(sizeof(struct aufs_super_block), GFP_NOFS);
	struct aufs_disk_super_block *dsb = NULL;
	struct buffer_head *bh = NULL;

	if (!asb) {
		pr_err("aufs cannot allocate super block\n");
		return NULL;
	}

	bh = sb_bread(sb, 0);
	if (!bh) {
		pr_err("cannot read 0 block\n");
		goto free_memory;
	}

	dsb = (struct aufs_disk_super_block *)bh->b_data;
	aufs_super_block_fill(asb, dsb);
	brelse(bh);

	if (asb->asb_magic != AUFS_MAGIC) {
		pr_err("wrong magic number %u\n", (unsigned)asb->asb_magic);
		goto free_memory;
	}

	return asb;

free_memory:
	kfree(asb);
	return NULL;
}


The first thing this function does is allocate memory for the superblock structure. There are quite a few ways to allocate memory in the kernel; kzalloc (and kmalloc along with it) is the simplest. It works like a regular malloc, only it additionally requires a set of flags. The difference between kzalloc and kmalloc is that the former fills the allocated memory with zeros (which simply boils down to passing an extra flag to kmalloc).

I mentioned flags; what are they for? The thing is that different parts of the kernel must satisfy different guarantees. For example, in the context of processing a network packet you cannot block, and to use DMA you need to allocate memory in a special region. Since memory allocation is used everywhere, a "tuning" mechanism is required. In our case the GFP_NOFS flag is used, which says that the allocator must not call back into file system code, which is logical when implementing a file system, although in this particular case it is not strictly necessary.

Naturally, as usual in the kernel, we do not forget to check that the memory was allocated successfully.

The next crucial point is the call to the sb_bread function. This is where reading from the disk happens! The function takes a pointer to the superblock and the number of the block to be read; quite simple. It returns a pointer to a buffer_head structure, and the block data itself is accessible via that structure's b_data field.

Naturally, here too we do not forget to check that the read was successful.

Next we simply cast the char pointer to a pointer to the aufs_disk_super_block structure. The aufs_super_block_fill function fills the aufs_super_block structure from aufs_disk_super_block without doing anything unusual:

static inline void aufs_super_block_fill(struct aufs_super_block *asb,
			struct aufs_disk_super_block const *dsb)
{
	asb->asb_magic = be32_to_cpu(dsb->dsb_magic);
	asb->asb_inode_blocks = be32_to_cpu(dsb->dsb_inode_blocks);
	asb->asb_block_size = be32_to_cpu(dsb->dsb_block_size);
	asb->asb_root_inode = be32_to_cpu(dsb->dsb_root_inode);
	asb->asb_inodes_in_block =
		asb->asb_block_size / sizeof(struct aufs_disk_inode);
}


As you might guess, the be32_to_cpu function converts a number from big endian to the byte order used by the platform.

Once we have finished working with the block, we need to release it; the brelse function exists for this. It actually just decrements the block's reference count. The block is not necessarily released as soon as the count reaches 0: the kernel essentially runs a garbage collector for blocks and will not free one without serious need. The reason is that reading blocks from disk is a rather expensive operation, so it makes sense to keep a cache of read blocks and, when the same block is read again, return the already-read copy (if, of course, it is still present in the cache).

The last thing we do is check the magic number: we need to make sure the disk actually contains aufs.

For those who noticed the goto: goto is used quite often in the kernel, mainly to organize error handling. There are no exceptions in C, and the idea of separating the main execution path from error handling is quite attractive, so goto comes to the rescue. In this case the use of goto buys almost nothing; I put it here deliberately, as an example of why it is used. In general, there are not many goto haters among kernel developers, so there are places in the code that abuse the hapless operator; you should be ready for this.
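For readers less familiar with the idiom, here is a minimal userspace illustration of the same pattern: one forward path, one label per cleanup stage, and all errors funneled through goto (the resources here are just two malloc'd buffers, made up for the example):

```c
#include <stdlib.h>

/* Minimal illustration of kernel-style goto error handling:
 * acquire two resources, release them in reverse order on the way out.
 * Returns 0 on success, -1 on failure. */
static int setup(size_t n)
{
	int ret = -1;
	char *a = malloc(n);
	char *b = NULL;

	if (!a)
		goto out;
	b = malloc(n);
	if (!b)
		goto free_a;

	/* ... the main path would use a and b here ... */
	ret = 0;

	free(b);
free_a:
	free(a);
out:
	return ret;
}
```

Each failure jumps to the label that releases exactly the resources acquired so far, which keeps the cleanup code in one place instead of duplicating it at every error check.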

An attentive reader has probably noticed one discrepancy. As I said, file systems can work with different block sizes, and that information is stored in the superblock itself. So what block size does the sb_bread function use when reading the superblock? In our case everything is simple: by default, the block size is set to the block size of the underlying block device (so many kinds of blocks...). And we hope it is large enough to hold the superblock structure; in our case it is.

We have written a function to read the superblock; we will call it from aufs_fill_sb (see the previous post), which now looks like this:

static int aufs_fill_sb(struct super_block *sb, void *data, int silent)
{
	struct inode *root = NULL;
	struct aufs_super_block *asb = aufs_super_block_read(sb);

	if (!asb)
		return -EINVAL;

	sb->s_magic = asb->asb_magic;
	sb->s_fs_info = asb;
	sb->s_op = &aufs_super_ops;

	if (sb_set_blocksize(sb, asb->asb_block_size) == 0) {
		pr_err("device does not support block size %u\n",
			(unsigned)asb->asb_block_size);
		return -EINVAL;
	}

	root = aufs_inode_get(sb, asb->asb_root_inode);
	if (IS_ERR(root))
		return PTR_ERR(root);

	sb->s_root = d_make_root(root);
	if (!sb->s_root) {
		pr_err("aufs cannot create root\n");
		return -ENOMEM;
	}

	return 0;
}


As I already mentioned, we save the pointer to aufs_super_block in the s_fs_info field. In addition, we set the correct block size by calling sb_set_blocksize. As the comment inside that function says, the block size must be between 512 bytes and the page size; this is the reason for our choice of supported block sizes. If the file system needed to work with a larger block size, additional effort would be required (though not that much).

Since we allocated aufs_super_block dynamically, we have to free it. To do this, we make some changes to another function from the previous post:

static void aufs_put_super(struct super_block *sb)
{
	struct aufs_super_block *asb =
		(struct aufs_super_block *)sb->s_fs_info;

	if (asb)
		kfree(asb);
	sb->s_fs_info = NULL;
	pr_debug("aufs super block destroyed\n");
}


It is not hard to guess that kfree is the counterpart of kmalloc (more precisely, one of them, since there are several kfree implementations in the kernel), but we will not go into details.

Another important change inside the aufs_fill_sb function is the aufs_inode_get call. In the last article we created a dummy inode; now we will learn how to read them from disk.

But before that, let me draw your attention to an interesting point: the IS_ERR and PTR_ERR pair. These are simple conversions between pointers and numbers, relying on the fact that the kernel has full knowledge of its memory layout and, accordingly, of which pointer values can be repurposed for something other than addressing. This is the simplest example of exploiting knowledge about pointer structure; there are more interesting ones, and not only in the kernel.

We will start working with inodes by extending the super_operations structure, which we met last time. Now we fill it in like this:

static struct super_operations const aufs_super_ops = {
	.alloc_inode = aufs_inode_alloc,
	.destroy_inode = aufs_inode_free,
	.put_super = aufs_put_super,
};


We added two more pointers, to the aufs_inode_alloc and aufs_inode_free functions. These are our functions for inode allocation and release; in them we will encounter SLAB (in fact, we have already met this beast in the form of kmalloc) and, just a little, RCU.

So, memory allocation for inodes starts with the definition of another structure, the in-memory representation of the inode (as was the case with the superblock):

struct aufs_inode {
	struct inode	ai_inode;
	uint32_t	ai_block;
};


This time we will use "inheritance" instead of composition. Inheritance in C looks unremarkable (which is not surprising, given that C does not support inheritance). We simply make the first field of the aufs_inode structure the base structure (the "base class"), struct inode. Thus a pointer to aufs_inode can be used as a pointer to inode, and vice versa (if, of course, we know for sure that the pointer actually refers to an aufs_inode).
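Because the base structure comes first, the conversion is just a pointer cast; the kernel's container_of macro generalizes this to a field at any offset. A userspace sketch of the downcast, with stub types standing in for the kernel structures:

```c
#include <stddef.h>
#include <stdint.h>

struct inode_stub { int i_ino; };        /* stand-in for struct inode */

struct aufs_inode_stub {
	struct inode_stub ai_inode;      /* base structure comes first */
	uint32_t ai_block;
};

/* Downcast in the style of the kernel's container_of: subtract the
 * field's offset, so it works even if the field is not first. */
static struct aufs_inode_stub *AUFS_INODE(struct inode_stub *inode)
{
	return (struct aufs_inode_stub *)((char *)inode -
		offsetof(struct aufs_inode_stub, ai_inode));
}

static struct aufs_inode_stub sample = { { 42 }, 7 };

/* Pretend the kernel handed us only the "base" pointer. */
static struct inode_stub *sample_inode(void)
{
	return &sample.ai_inode;
}
```

Since ai_inode sits at offset 0, the offsetof term vanishes here, but writing it this way documents the intent and survives field reordering.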

Compared to composition, "inheritance" requires no support from the kernel per se; moreover, it is cheaper in terms of the number of memory allocations: one allocation per inode instead of two (as was the case with the superblock). Also, unlike with the superblock, almost all the fields we need are already present inside struct inode. However, this is the exception rather than the rule, only because our file system stores its data on disk very simply.

To allocate memory for inodes we will use the SLAB allocator. The SLAB allocator is a caching allocator that hands out memory blocks of a single fixed size. It is easy to guess that this restriction simplifies memory management and speeds up allocation. The SLAB allocator requests large chunks of memory from the OS and carves small areas out of them on demand; as a result, requests to the OS memory manager happen less often, and user requests are satisfied faster.

But originally the speed gain from SLAB came not only (and not so much) from simpler memory management as from lowering the cost of initializing that memory. Indeed, a SLAB allocator is often used not just to allocate objects of the same size but objects of the same type, which lets you skip the initialization of some fields when a chunk of memory is reused. For example, mutexes, spinlocks, and similar objects most likely hold a "correct" value when an object is freed, and on reallocation they do not need to be reinitialized. For details and measurements, please refer to the original paper.
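To convey the caching idea only (this is nowhere near the real SLAB/SLUB code, and all names here are made up), here is a toy fixed-size cache: it hands out equal-sized chunks from one preallocated arena and threads freed chunks onto a free list for reuse:

```c
#include <stdlib.h>

/* Toy caching allocator for fixed-size objects. One malloc'd arena;
 * freed chunks are linked through their own first bytes. The arena is
 * deliberately never shrunk or returned, just like a slab cache keeps
 * its pages around for reuse. */
struct toy_cache {
	size_t obj_size;
	void *free_list;   /* next free chunk, linked through the chunk */
};

static struct toy_cache *toy_cache_create(size_t obj_size, size_t count)
{
	struct toy_cache *c = malloc(sizeof(*c));
	char *arena;
	size_t i;

	if (!c)
		return NULL;
	if (obj_size < sizeof(void *))
		obj_size = sizeof(void *);
	c->obj_size = obj_size;
	c->free_list = NULL;

	arena = malloc(obj_size * count);
	if (!arena) {
		free(c);
		return NULL;
	}
	/* thread every chunk onto the free list */
	for (i = 0; i < count; ++i) {
		void *chunk = arena + i * obj_size;
		*(void **)chunk = c->free_list;
		c->free_list = chunk;
	}
	return c;
}

static void *toy_cache_alloc(struct toy_cache *c)
{
	void *chunk = c->free_list;

	if (chunk)
		c->free_list = *(void **)chunk;
	return chunk;
}

static void toy_cache_free(struct toy_cache *c, void *chunk)
{
	*(void **)chunk = c->free_list;
	c->free_list = chunk;
}

/* Self-check: allocations are distinct, exhaustion yields NULL,
 * and a freed chunk is handed out again on the next allocation. */
static int toy_cache_selftest(void)
{
	struct toy_cache *c = toy_cache_create(32, 2);
	void *a, *b;

	if (!c)
		return 0;
	a = toy_cache_alloc(c);
	b = toy_cache_alloc(c);
	if (!a || !b || a == b)
		return 0;
	if (toy_cache_alloc(c) != NULL)   /* arena exhausted */
		return 0;
	toy_cache_free(c, a);
	if (toy_cache_alloc(c) != a)      /* freed chunk is reused */
		return 0;
	return 1;
}
```

Allocation and release are a couple of pointer moves with no size bookkeeping, which is exactly where the fixed-size restriction pays off; the real kernel allocator adds per-CPU caching, NUMA awareness, and constructor callbacks on top of this skeleton.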

Linux currently has three different SLAB allocator implementations: SLAB, SLUB, and SLOB. We will not go into the differences between them; they provide the same interface. So, to create a SLAB cache, we will use the following function:

int aufs_inode_cache_create(void)
{
	aufs_inode_cache = kmem_cache_create("aufs_inode",
		sizeof(struct aufs_inode), 0,
		(SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD),
		aufs_inode_init_once);
	if (aufs_inode_cache == NULL)
		return -ENOMEM;
	return 0;
}


When creating a SLAB cache, the kmem_cache_create function is passed the name, the object size, an initialization function (called only once, when an object is first allocated), and a couple more parameters whose essence I will not go into. But so as not to leave the curious entirely without information, I will say that creating a SLAB cache for inodes looks the same in all file systems; the differences are insignificant.

We will call the aufs_inode_cache_create function when the module is loaded, before registering the file system with the kernel. There is also a paired function that we will call when the module is unloaded:

void aufs_inode_cache_destroy(void)
{
	rcu_barrier();
	kmem_cache_destroy(aufs_inode_cache);
	aufs_inode_cache = NULL;
}


The kmem_cache_destroy function destroys the SLAB cache. By the time of destruction, all objects from this cache must have been freed; otherwise we will get an unpleasant error message in the system log, along with a whole bunch of hard-to-catch problems.

Now for the promised touch of RCU. In a nutshell, RCU is a synchronization mechanism widely used in the kernel (as well as a safe memory-reclamation scheme for lock-free algorithms). RCU deserves an article of its own, and there is one on Habr. Moreover, the creator of this technique, who is also the RCU maintainer in the Linux kernel, wrote a whole book that covers his brainchild as well.

But out of the entire zoo of RCU functions, we only need rcu_barrier (as with kfree, there is more than one implementation of it). Put simply, this function waits until all deferred work on RCU-protected data has completed and only then returns control to the caller; accordingly, it is a blocking function. Why we need it, we will see below.

Let's return to memory allocation and look at the function already mentioned above:

struct inode *aufs_inode_alloc(struct super_block *sb)
{
	struct aufs_inode *const i = (struct aufs_inode *)
		kmem_cache_alloc(aufs_inode_cache, GFP_KERNEL);

	if (!i)
		return NULL;

	return &i->ai_inode;
}


It uses the previously created SLAB cache (through one of the kmem_cache_alloc implementations) and returns a pointer to the inode; nothing unusual. The release function is a bit more interesting:

void aufs_inode_free(struct inode *inode)
{
	call_rcu(&inode->i_rcu, aufs_free_callback);
}


Here we meet RCU again. A few words about lock-free algorithms are in order. Their problem is that without locks there is no guarantee that an object is not being used concurrently by some other thread of execution, which means you cannot simply free the memory that object occupies, since another thread may still hold a pointer to it. Therefore, in lock-free algorithms you have to think about safe memory-reclamation strategies, and RCU provides the means to solve this problem. All implementations of the call_rcu function defer the execution of a given function (in our case, the release callback aufs_free_callback) until it becomes safe to run. And the rcu_barrier already mentioned above waits for all such deferred functions to complete.

Tired? It's okay, we are approaching the finale. Now we will read an inode from disk. For this I wrote the already mentioned aufs_inode_get function:

struct inode *aufs_inode_get(struct super_block *sb, uint32_t no)
{
	struct aufs_super_block const *const asb = AUFS_SB(sb);
	struct buffer_head *bh = NULL;
	struct aufs_disk_inode *di = NULL;
	struct aufs_inode *ai = NULL;
	struct inode *inode = NULL;
	uint32_t block = 0, offset = 0;

	inode = iget_locked(sb, no);
	if (!inode)
		return ERR_PTR(-ENOMEM);

	if (!(inode->i_state & I_NEW))
		return inode;

	ai = AUFS_INODE(inode);
	block = aufs_inode_block(asb, no);
	offset = aufs_inode_offset(asb, no);

	pr_debug("aufs reads inode %u from %u block with offset %u\n",
		(unsigned)no, (unsigned)block, (unsigned)offset);

	bh = sb_bread(sb, block);
	if (!bh) {
		pr_err("cannot read block %u\n", (unsigned)block);
		goto read_error;
	}

	di = (struct aufs_disk_inode *)(bh->b_data + offset);
	aufs_inode_fill(ai, di);
	brelse(bh);
	unlock_new_inode(inode);

	return inode;

read_error:
	pr_err("aufs cannot read inode %u\n", (unsigned)no);
	iget_failed(inode);
	return ERR_PTR(-EIO);
}


I will start the explanation with the simple parts: the AUFS_SB and AUFS_INODE functions return pointers to the aufs_super_block and aufs_inode structures given pointers to super_block and inode, respectively. I will not show their code (it is quite simple), since I have already described above how these structures are related.

The aufs_inode_block and aufs_inode_offset functions compute the block number and the offset within that block from the inode number: no magic, simple arithmetic, so I will not dwell on them either.
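For the curious, under the layout described earlier (superblock plus two bitmaps in blocks 0-2), the arithmetic could look like the sketch below. This is a hedged userspace reconstruction, not the driver's code: the real helpers take the aufs_super_block pointer, and the inode table's starting block is my assumption.

```c
#include <assert.h>
#include <stdint.h>

#define AUFS_INODE_TABLE_START 3  /* superblock + two bitmaps (assumption) */

/* Which block of the inode table holds inode `no`. */
static uint32_t inode_block(uint32_t inodes_in_block, uint32_t no)
{
	return AUFS_INODE_TABLE_START + no / inodes_in_block;
}

/* Byte offset of inode `no` inside that block. */
static uint32_t inode_offset(uint32_t inodes_in_block,
                             uint32_t disk_inode_size, uint32_t no)
{
	return (no % inodes_in_block) * disk_inode_size;
}
```

With 4 KB blocks and a 32-byte on-disk inode (128 inodes per block), inode 130 lands in block 4 at offset 64, matching the kind of numbers the pr_debug line above prints.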

And now the interesting part: the iget_locked and unlock_new_inode pair. As with blocks, the kernel maintains an inode cache, and not only to avoid rereading an inode from disk. The thing is that the same file or directory can be open in several processes at once, in which case all of them must operate on a single inode instance so that they can synchronize with each other. Similar reasoning holds for blocks, perhaps to an even greater degree, so it applies to the block cache as well.

So the iget_locked function first looks for the inode in the cache and allocates memory for a new one if it is not found. If the inode was freshly allocated rather than found in the cache, the I_NEW flag is set in its i_state field and the inode's spinlock (the i_lock field) is held. Therefore, our function first checks the i_state field, and if the I_NEW flag is cleared, it simply returns the cached inode. Otherwise we have to fill in the inode, for which we read the necessary block from disk (using the already familiar sb_bread).

The aufs_inode_fill function does exactly that filling:

static void aufs_inode_fill(struct aufs_inode *ai,
			struct aufs_disk_inode const *di)
{
	ai->ai_block = be32_to_cpu(di->di_first);
	ai->ai_inode.i_mode = be32_to_cpu(di->di_mode);
	ai->ai_inode.i_size = be32_to_cpu(di->di_size);
	ai->ai_inode.i_blocks = be32_to_cpu(di->di_blocks);
	ai->ai_inode.i_ctime.tv_sec = be64_to_cpu(di->di_ctime);
	ai->ai_inode.i_mtime.tv_sec = ai->ai_inode.i_atime.tv_sec =
		ai->ai_inode.i_ctime.tv_sec;
	ai->ai_inode.i_mtime.tv_nsec = ai->ai_inode.i_atime.tv_nsec =
		ai->ai_inode.i_ctime.tv_nsec = 0;
	i_uid_write(&ai->ai_inode, (uid_t)be32_to_cpu(di->di_uid));
	i_gid_write(&ai->ai_inode, (gid_t)be32_to_cpu(di->di_gid));
}


Again, no magic, except for the pair of functions i_uid_write and i_gid_write. But they do nothing special either: they just assign values to the corresponding fields.

In addition, note the representation of time as a timespec structure: it consists of just two numbers, the seconds and the nanoseconds. So time can potentially be stored quite precisely.

Finally, at the very end of the function, we must release the spinlock and return the pointer; the unlock_new_inode function is used for this.

Instead of a conclusion



The post turned out to be really long and still does not cover everything. Nevertheless, I have tried to explain all the key parts of the implementation.

All sources are available at the link. There are now two folders in the repository: kern and user. As you might guess, one holds the code of our module, the other the code of the formatting utility. There is more code now, and therefore a greater probability of errors in it; comments, corrections, any constructive criticism, and pull requests are welcome.

To get a disk image for mounting, you can do this:

dd bs=1M count=100 if=/dev/zero of=image
./mkfs.aufs ./image


Now you can use the image file as shown in the previous post.

Some of the debugging output has been removed from the code examples in this article, but if you use the code from the repository, you can verify that the module works using the dmesg command.

Source: https://habr.com/ru/post/224589/

