This article is a translation of the part of the KernelNewbies article that describes the features of the Ext4 file system. The last section of that article, on using Ext4, has already been published on Habr.
Ext4 is the result of the evolution of Ext3, the most popular file system in Linux. In many respects, Ext4 is a bigger step forward from Ext3 than Ext3 was from Ext2. The most significant improvement of Ext3 over Ext2 was journaling, whereas Ext4 changes important on-disk data structures, such as those that store file data.
The result is a file system with a better design, better performance and reliability, and a richer set of features.
1. Compatibility
Any existing Ext3 file system can be converted to Ext4 with a simple procedure that consists of running a couple of commands in read-only mode. This means you can improve the performance, capacity, and features of an existing file system without reformatting and without reinstalling the OS and programs. If you need the advantages of Ext4 on a production system, you can upgrade its file system as well. The procedure is safe and does not put your data at risk (although backing up important data is, of course, still recommended, as it is even if you are not changing the file system).
Ext4 uses the new structures only for new data; old data stays in the old format and can still be read and modified when necessary. This means, of course, that once you convert a file system to Ext4 you can no longer go back to Ext3.
It is also possible to mount an Ext3 file system as Ext4 without using the new on-disk format, which lets you mount it as Ext3 again later. In that case, of course, you give up many of the advantages of Ext4.
2. Larger file and file system size
Today, the maximum size of an Ext3 file system is 16 terabytes, and file size is limited to 2 terabytes. Ext4 adds 48-bit block addressing, which raises the maximum file system size to one exabyte and the maximum file size to 16 terabytes. 1 EB (exabyte) = 1,048,576 TB (terabytes); 1 EB = 1024 PB (petabytes), 1 PB = 1024 TB, 1 TB = 1024 GB. Why 48 bits and not 64? A number of limitations would have to be removed to make Ext4 fully 64-bit capable, and doing so was not among the goals for Ext4. The Ext4 data structures were designed with those changes in mind, so 64-bit support will appear at some point in the future. In the meantime, you will have to make do with one exabyte.
Note: at the time of writing, the code for creating file systems larger than 16 terabytes is not in any stable release of e2fsprogs. It will be added in the future.
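The size limits above follow from simple arithmetic on the width of a block number. The sketch below assumes 4-kilobyte blocks and 32-bit block numbers for Ext3; neither assumption is spelled out in the text, and the function name is purely illustrative.

```python
# Assumption: 4 KiB blocks (a common default), binary units throughout.
BLOCK_SIZE = 4 * 1024
TIB, EIB = 2**40, 2**60


def max_bytes(address_bits, block_size=BLOCK_SIZE):
    """Largest volume addressable with block numbers of the given width."""
    return (2 ** address_bits) * block_size


ext3_fs_limit = max_bytes(32)  # assumed 32-bit block numbers
ext4_fs_limit = max_bytes(48)  # 48-bit block numbers, as described above

print(ext3_fs_limit // TIB, "TB")  # 16 TB
print(ext4_fs_limit // EIB, "EB")  # 1 EB
```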
3. Scalable Subdirectories
Currently, one Ext3 directory cannot contain more than 32,000 subdirectories. Ext4 removes this restriction and allows you to create an unlimited number of subdirectories.
4. Extents
Traditional Unix-derived file systems, such as Ext3, use an indirect block mapping scheme to keep track of every block that stores file data. This approach is inefficient for large files, especially when deleting or truncating them, because the map contains one entry for every single block. Large files have many blocks, so their block maps are large and slow to process.
Modern file systems take a different approach, based on so-called extents. An extent is basically a set of contiguous physical blocks. It effectively says: "This data is in the next n blocks." For example, a 100-megabyte file can be stored in a single extent of that size, instead of being tracked as 25600 4-kilobyte blocks addressed through indirect mapping. Huge files are split into several extents.
Extents improve performance and also reduce fragmentation, since they encourage contiguous placement of data on disk.
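To make the difference concrete, here is a minimal sketch that maps the same contiguous 100-megabyte file both ways. The Extent record and the helper functions are hypothetical simplifications, not the actual Ext4 on-disk format, and 4-kilobyte blocks are assumed.

```python
from collections import namedtuple

BLOCK_SIZE = 4 * 1024  # assumed 4 KiB blocks

# Hypothetical, simplified extent record: one run of contiguous blocks.
Extent = namedtuple("Extent", ["logical_start", "physical_start", "length"])


def blocks_needed(file_size):
    return -(-file_size // BLOCK_SIZE)  # ceiling division


def block_map(first_physical, file_size):
    """Per-block mapping: one entry for every block of the file."""
    return [first_physical + i for i in range(blocks_needed(file_size))]


def extent_map(first_physical, file_size):
    """Extent mapping of a fully contiguous file: a single record."""
    return [Extent(0, first_physical, blocks_needed(file_size))]


def lookup(extents, logical_block):
    """Translate a logical block number into a physical block number."""
    for ext in extents:
        if ext.logical_start <= logical_block < ext.logical_start + ext.length:
            return ext.physical_start + (logical_block - ext.logical_start)
    raise KeyError(logical_block)


size = 100 * 1024 * 1024
per_block = block_map(5000, size)
extents = extent_map(5000, size)
print(len(per_block), len(extents))  # 25600 1
```

Both maps answer the same question (where is logical block k?), but truncating or deleting the file touches one record in the extent case and 25600 entries in the per-block case.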
5. Multiblock allocation
When Ext3 needs to write new data to disk, a block allocator decides which free blocks will be used. The problem is that the Ext3 allocator hands out only one block (4 kilobytes) per call. This means that to write, say, the 100 megabytes of data mentioned earlier, the allocator has to be called 25600 times (and that is for a mere 100 megabytes!). Not only is this inefficient, it also rules out any optimization of the allocation policy, because the allocator has no idea how much data is actually being written; it only ever sees a single block.
Ext4 uses a multiblock allocator (mballoc), which can allocate any number of blocks in a single call and so avoids this large overhead. Performance improves considerably as a result, which is especially noticeable when delayed allocation (see below) is combined with extents. This feature does not affect the on-disk format.
It is also worth noting that the Ext4 block and inode allocators received other improvements as well, which are described in detail in this document.
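The call-count argument for multiblock allocation can be illustrated with a toy allocator. This is only a sketch: ToyAllocator and its methods are invented for the example and have nothing to do with the real mballoc code, and 4-kilobyte blocks are assumed.

```python
class ToyAllocator:
    """Hands out block numbers from a free region and counts calls."""

    def __init__(self, first_free=0):
        self.next_free = first_free
        self.calls = 0

    def alloc_one(self):
        """One block per call, as the text describes for Ext3."""
        self.calls += 1
        block = self.next_free
        self.next_free += 1
        return block

    def alloc_many(self, count):
        """Any number of blocks in one call, in the spirit of mballoc."""
        self.calls += 1
        start = self.next_free
        self.next_free += count
        return start, count


BLOCKS = 100 * 1024 * 1024 // 4096  # 100 MB in 4 KiB blocks

single = ToyAllocator()
for _ in range(BLOCKS):
    single.alloc_one()

multi = ToyAllocator()
multi.alloc_many(BLOCKS)

print(single.calls, multi.calls)  # 25600 1
```

Beyond the call count, the second form also tells the allocator the full size of the request up front, which is what allows it to pick a suitably large free region.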
6. Delayed allocation
Delayed allocation is a way to improve performance without affecting the on-disk format. It is present in modern file systems such as XFS, ZFS, btrfs, and Reiser 4.
The idea is to postpone block allocation for as long as possible, in contrast to traditional file systems (such as Ext3, reiser3, etc.), which allocate blocks as soon as they can. For example, when a process calls write(), such a file system allocates the blocks for the data immediately, even though the data is not written to disk yet and will sit in the cache for some time. One drawback of this approach is that when a process keeps writing to a growing file, each successive write() call allocates more data blocks without knowing whether the file will keep growing.
With delayed allocation, blocks are not allocated when write() is called. Instead, allocation is postponed until the file is flushed from the cache to disk, which gives the allocator a chance to optimize its decisions. The gain is greatest in combination with the two features described earlier, extents and multiblock allocation, because very often the whole file ends up being written to disk as extents allocated by mballoc. This gives a significant performance boost and in some cases greatly reduces fragmentation.
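The effect on fragmentation can be shown with a deliberately exaggerated toy model in which two files are written in alternating write() calls. Everything here (the function names, the single shared free region, the absence of any reservation logic) is an assumption made for illustration; real allocators are more sophisticated.

```python
def count_runs(blocks):
    """Number of contiguous runs in a list of block numbers."""
    return sum(
        1 for i, b in enumerate(blocks) if i == 0 or b != blocks[i - 1] + 1
    )


def immediate(writes):
    """Allocate blocks at every write() call, in arrival order."""
    next_free, files = 0, {}
    for name, nblocks in writes:
        for _ in range(nblocks):
            files.setdefault(name, []).append(next_free)
            next_free += 1
    return files


def delayed(writes):
    """Only count blocks at write() time; allocate once, at flush time."""
    pending = {}
    for name, nblocks in writes:
        pending[name] = pending.get(name, 0) + nblocks
    next_free, files = 0, {}
    for name, total in pending.items():  # the flush
        files[name] = list(range(next_free, next_free + total))
        next_free += total
    return files


# Two files, each written one block at a time, interleaved.
writes = [("a", 1), ("b", 1)] * 4

now = immediate(writes)
later = delayed(writes)
print(count_runs(now["a"]), count_runs(later["a"]))  # 4 1
```

In the immediate case the two files interleave block by block; in the delayed case each file is known in full at flush time and can be placed as one contiguous run.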
7. Fast fsck
Fsck is a very slow operation, especially its first phase, which checks every inode in the file system.
In Ext4, a list of unused inodes is stored at the end of each group's inode table (protected by a checksum for safety), so fsck does not check those inodes. As a result, the check runs 2 to 20 times faster, depending on how many inodes are in use (see http://kerneltrap.org/Linux/Improving_fsck_Speeds_in_Ext4 ).
Note that the list of unused inodes is built by fsck, not by Ext4. This means you have to run fsck once to build the list, and only the next fsck run will be faster (you need to run fsck anyway when converting Ext3 to Ext4).
Fsck is also sped up by another feature, "flexible block groups", which speeds up other file operations as well.
8. Journal checksums
The journal is the most heavily used part of the disk, which makes the blocks it consists of especially prone to hardware failure. Moreover, trying to recover from a corrupted journal can cause even more extensive data corruption. Ext4 checksums the journal data, which makes it possible to detect corrupted blocks. This has a further advantage: checksums make it possible to turn the two-phase commit of Ext3 into a single-phase one, which speeds up file system operations by up to 20% in some cases, so reliability and performance improve at the same time.
Note: the part responsible for the performance gain, asynchronous logging, is currently disabled by default and will be enabled in a later release, once it works reliably.
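The principle behind journal checksums can be sketched in a few lines: the commit record carries a checksum of the blocks of the transaction, and recovery replays the transaction only if the blocks still match. The record layout and function names below are hypothetical, and CRC32 is used simply because it is readily available; the real journal format is not shown here.

```python
import zlib


def commit(blocks):
    """Build a toy commit record with a CRC32 over the transaction blocks."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)  # running checksum across all blocks
    return {"nblocks": len(blocks), "crc32": crc}


def is_replayable(blocks, commit_record):
    """Replay a transaction only if its blocks match the commit record."""
    return (
        len(blocks) == commit_record["nblocks"]
        and commit(blocks)["crc32"] == commit_record["crc32"]
    )


txn = [b"inode table update", b"bitmap update", b"directory entry"]
record = commit(txn)

damaged = [txn[0], b"bitmap updatE", txn[2]]  # one corrupted byte

print(is_replayable(txn, record))      # True
print(is_replayable(damaged, record))  # False
```

Because the commit record itself vouches for the data blocks, a commit record that reaches the disk before its data is detected at recovery time instead of being trusted, which is what makes a single-phase commit possible.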
9. No-journal mode
Journaling ensures the integrity of the file system by logging all changes made to the disk. But it also adds overhead to disk operations. In some special situations, journaling and the guarantees it provides may be unnecessary. Ext4 lets you disable journaling, which gives a small performance boost.
10. Online defragmentation
This feature is still in development and will be included in one of the future releases.
Although delayed allocation, multiblock allocation, and extents help reduce file system fragmentation, it can still build up over time.
For example, you create three files in one directory, and they are placed on disk one after another. Later you update the second file and it grows a little, so that it no longer fits in its original place. The only options are to split off the part that does not fit and store it elsewhere on the disk, or to give the whole file a larger contiguous area somewhere else, far from the other two files. Either way, the disk head has to seek when an application reads all the files in the directory (for example, a file manager creating thumbnails for image files).
In addition, the file system can only deal with certain kinds of fragmentation. It cannot know, for example, that it should keep all the files needed at boot next to each other, because it simply does not know which files those are. To solve this problem, Ext4 will support online defragmentation.
There is also an e4defrag utility that allows you to defragment both individual files and the entire file system.
11. Improvements related to inode
Larger inodes, nanosecond timestamps, fast extended attributes, inode reservation...
- Larger inodes: Ext3 supports configurable inode sizes (via the -I option of mkfs), but the default inode size is 128 bytes. In Ext4 the default will be 256 bytes. This was needed to make room for several additional fields (such as nanosecond timestamps and the inode version), and the remaining inode space is used to store extended attributes that are small enough to fit there. This makes access to such attributes much faster and improves the performance of applications that use them by a factor of 3 to 7.
- Inode reservation: several inodes are reserved when a directory is created, in the expectation that they will be used later. This improves performance because files created in that directory can use the reserved inodes, so creating and deleting files is more efficient.
- Nanosecond timestamps: inode fields such as the modification time have nanosecond resolution (in Ext3 the resolution was one second).
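From user space, the sub-second part of a timestamp can be inspected through the nanosecond fields of a stat result. The snippet below is a small sketch using a temporary file; whether the nanosecond part is actually non-zero depends on the file system that holds the temporary directory, which is an assumption about the environment.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello")
    f.flush()
    st = os.stat(f.name)

# st_mtime_ns is the modification time as an integer number of nanoseconds.
seconds, nanos = divmod(st.st_mtime_ns, 1_000_000_000)
print("mtime:", seconds, "s +", nanos, "ns")
```

On a file system with one-second timestamp resolution, the nanosecond part printed here would always be zero.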
12. Persistent preallocation
This feature, already available in Ext3 in recent kernel versions and emulated by glibc on file systems that do not support it, lets applications preallocate disk space by telling the file system how much they need. The file system then allocates the required blocks and data structures, which stay empty until the application actually writes to them.
This is exactly what P2P applications do, for example, when they allocate space for data that will arrive only hours or days later. Here, however, it is implemented far more efficiently, at the file system level and with a generic API.
This has several uses. First, it spares applications (such as P2P clients) from doing the same thing themselves, inefficiently, by filling a file with zeros: the necessary blocks are allocated in one go.
Second, it reduces fragmentation, again because the blocks are allocated once and as contiguously as possible.
Third, it guarantees that an application has as much space as it needs, which is especially important for real-time applications, since otherwise the file system could fill up in the middle of an important operation.
This feature is available through the libc posix_fallocate() interface.
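As a minimal sketch of how an application might use this interface, the example below goes through the Python wrapper os.posix_fallocate, which is available on Unix-like systems. The 1-megabyte size and the temporary file are arbitrary choices for the example.

```python
import os
import tempfile

SIZE = 1024 * 1024  # 1 MiB, arbitrary for the example

fd, path = tempfile.mkstemp()
try:
    # Ask the file system to reserve space for bytes [0, SIZE).
    os.posix_fallocate(fd, 0, SIZE)
    size_after = os.fstat(fd).st_size
finally:
    os.close(fd)
    os.unlink(path)

print(size_after == SIZE)  # True: the file now spans the reserved range
```

No zeros were written by the application itself; the file simply grows to cover the reserved range, and later writes into that range should not fail for lack of space.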
13. The “barrier” mechanism is enabled by default
This option ensures file system integrity at the cost of some performance (it can be disabled with “mount -o barrier=0”, which is recommended when benchmarking).
Excerpt from the LWN article: “The file system code must, before writing the [journal] commit record, be absolutely sure that all of the information about the transaction has made it to the journal. Just doing the writes in the proper order is not enough; modern drives have large caches and reorder writes to improve performance. So the file system must explicitly tell the disk to get all of the journal data onto the media before writing the commit record; if the commit record is written first, the journal may be corrupted. The block I/O subsystem of the kernel provides this capability through the use of barriers; in essence, a barrier forbids the writing of any blocks sent after it until everything sent before the barrier has been committed to the media. By using barriers, the file system can guarantee that what is on the disk is consistent at all times.”