
ZFS Uncovered

When Sun designed ZFS, it threw away the rulebook and created something with no direct analogue in any other UNIX-like system. David Chisnall examines how ZFS departs from conventional storage models, what foundations the system is built on, and how it all fits together.



Every few years, someone predicts how much computing resource people are likely to need in the future. A few years later, everyone laughs at how naive the prediction was. In designing ZFS, Sun tried to avoid this mistake.
While the rest of the world is moving to 64-bit file systems, Sun is introducing a 128-bit one. Will we ever need sizes that large? Not any time soon. The mass of the Earth is roughly 6 * 10^24 kg; the equivalent mass of hydrogen would contain about 3.6 * 10^48 atoms. A 128-bit file system can index 2^128, or roughly 10^38, allocation units. If you built storage in which each bit was held in a single hydrogen atom (ignoring the space needed for control logic), you could fill around 300,000 such 128-bit file systems with 4 KB allocation blocks before the storage outweighed the Earth. We will be building continent-sized hard drives long before we hit the limits of ZFS's address space.

So is there any point in a 128-bit file system? Not yet. However, if current trends continue, we will start running into the limits of 64-bit file systems within the next 5-10 years. An 80-bit file system would probably suffice, since some other unforeseen limitation would likely force a replacement before the space ran out, but most computers handle 80-bit numbers less efficiently than 128-bit ones, so Sun went straight to 128 bits.
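To put the jump in perspective, here is a quick back-of-the-envelope calculation (a sketch in Python, assuming the 4 KB allocation blocks mentioned above) comparing the largest storage a 64-bit and a 128-bit file system could address:

    # Rough capacity comparison for 64-bit vs. 128-bit block addressing,
    # assuming 4 KiB allocation units as in the example above.

    BLOCK_SIZE = 4 * 1024  # bytes per allocation unit

    def max_bytes(address_bits: int) -> int:
        """Largest storage addressable with the given block-pointer width."""
        return (2 ** address_bits) * BLOCK_SIZE

    for bits in (64, 128):
        print(f"{bits}-bit addressing: {max_bytes(bits):.3e} bytes")

A 64-bit file system with 4 KiB blocks tops out at about 7.6 * 10^22 bytes (64 ZiB); the 128-bit version multiplies that by a further factor of 2^64.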



Endian independence



When you write data to a disk (or a network), you have to be careful about byte order. If you only ever load and store the data on one machine, you can simply dump the contents of the machine's registers without worrying about their representation. The trouble starts when you begin sharing data. Anything a byte or smaller is not a problem (unless you are unlucky enough to be using a VAX), but anything larger requires a well-defined byte order.

The two most common orderings take their names from the factions in Jonathan Swift's Gulliver's Travels who argued over which end of an egg to break. "Big-endian" machines store bytes in 1234 order, while "little-endian" machines store them as 4321. A few machines use orderings such as 1324, but most people try to avoid them.

Most file systems are designed on one particular architecture, and even when later ported elsewhere they tend to keep their metadata in the byte order of the architecture they came from. Apple's HFS+ is a good example of this practice. Because HFS+ originated on PowerPC, its data structures are stored in big-endian format; on an Intel-based Mac, the bytes have to be swapped every time data is read from or written to disk. The BSWAP instruction on x86 chips makes the swap fast, but it still isn't free.

Sun found itself in an interesting position with respect to byte order when it began selling and supporting Solaris on both SPARC64 and x86-64. SPARC64 is big-endian and x86-64 is little-endian; whichever order Sun chose, the file system would have ended up slower on one of the two architectures it supported.

Sun's solution? Don't choose. Every data structure in ZFS is written in the byte order of the machine that wrote it, together with a flag saying which order was used. A ZFS pool written by an Opteron will be little-endian, one written by an UltraSPARC will be big-endian, and if you move the disks between the two machines everything keeps working, and the more you write, the more the data becomes optimized for native reads.
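The principle is easy to sketch. Here is a minimal illustration in Python (a toy format, not ZFS's real on-disk layout): each record carries a one-byte flag naming the order it was written in, so a reader on either kind of machine can interpret it correctly.

    import struct
    import sys

    # Toy adaptive byte order: write in the machine's native order, plus a flag.
    LITTLE, BIG = 0, 1

    def write_record(value: int) -> bytes:
        flag = LITTLE if sys.byteorder == "little" else BIG
        order = "<" if flag == LITTLE else ">"
        return bytes([flag]) + struct.pack(order + "Q", value)  # flag + 64-bit value

    def read_record(blob: bytes) -> int:
        order = "<" if blob[0] == LITTLE else ">"   # honour the writer's order
        return struct.unpack(order + "Q", blob[1:9])[0]

    record = write_record(0x0123456789ABCDEF)
    assert read_record(record) == 0x0123456789ABCDEF   # correct on either architecture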

A rampant layering violation



ZFS has been described on the Linux kernel mailing list as "a rampant layering violation." This is not entirely accurate: ZFS is not a file system in the traditional UNIX sense, but rather a well-defined set of layers that together provide a superset of the usual file system functionality. At this point, any VMS administrators in the audience are allowed to feel smug and mutter to themselves: "UNIX has finally got a real file system. Maybe it is finally ready for production use."

The three ZFS layers are the interface layer, the transactional object layer, and the pooled storage layer. Going down the stack, they turn file system requests into object transactions, transactions into operations on virtual block devices, and finally virtual operations into real ones.

Some parts of this stack are optional, as we will see later.

Volume manager



At the bottom of the ZFS stack sits the pooled storage layer. It plays a role similar to that of the volume manager on an existing system.

Each virtual device (vdev) is created by combining physical devices in one of several ways, such as mirroring or RAID-Z. Once you have created your vdevs, you join them into a storage pool. This arrangement gives you a lot of flexibility: if some of your data needs to be fast and some needs to be stored very safely, you can create a mirrored pool and a RAID-Z pool, and put each file system on whichever suits it best. Note that file systems are not laid out contiguously on a vdev; although they look like sequential blocks of storage to the upper layers, they need not be contiguous at all.

One of the key ideas behind the ZFS design is that creating a file system should be as cheap as creating a directory. This is how quotas are handled, for example: give each user a separate file system for their home directory and let it grow dynamically within the shared storage pool.
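A toy model of that idea, in which many lightweight file systems draw space from one shared pool and each can carry an optional quota (hypothetical names, nothing like ZFS's real code):

    class Pool:
        """Toy shared storage pool: every file system allocates from the same free space."""
        def __init__(self, size):
            self.size = size
            self.used = 0

    class FileSystem:
        """Toy dataset: grows dynamically within the pool, optionally capped by a quota."""
        def __init__(self, pool, quota=None):
            self.pool, self.quota, self.used = pool, quota, 0

        def write(self, nbytes):
            if self.quota is not None and self.used + nbytes > self.quota:
                raise OSError("quota exceeded")
            if self.pool.used + nbytes > self.pool.size:
                raise OSError("pool out of space")
            self.used += nbytes
            self.pool.used += nbytes

    pool = Pool(size=100 * 2**30)                 # one 100 GiB pool
    alice = FileSystem(pool, quota=10 * 2**30)    # per-user file system with a 10 GiB cap
    bob = FileSystem(pool)                        # no quota: grows with the pool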

Unlike other volume managers, ZFS also handles I/O scheduling. Each transaction is given a priority and a deadline, and the scheduler processes them at the vdev level.
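The general shape of such a scheduler (a simplified sketch, not ZFS's actual algorithm) might look like this:

    import heapq

    # Simplified deadline/priority I/O scheduler: requests are ordered first by
    # deadline, then by priority (lower numbers are more urgent).

    class IOScheduler:
        def __init__(self):
            self._queue = []
            self._seq = 0    # tie-breaker so equal entries stay comparable

        def submit(self, deadline, priority, request):
            heapq.heappush(self._queue, (deadline, priority, self._seq, request))
            self._seq += 1

        def next_request(self):
            return heapq.heappop(self._queue)[3] if self._queue else None

    sched = IOScheduler()
    sched.submit(deadline=20, priority=1, request="async write")
    sched.submit(deadline=5, priority=0, request="sync read")
    assert sched.next_request() == "sync read"   # earliest deadline goes first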

Object layer



The middle layer of ZFS is the transactional object layer. At its heart is the Data Management Unit (DMU), and on many ZFS block diagrams the DMU is all you will see of this layer. The DMU presents objects to the layer above and allows operations on them to be grouped into atomic transactions.

If you have ever lost power partway through writing a file, you have probably run fsck, scandisk, or something similar, and you probably still ended up with some corrupted files. If they were text files you may have been lucky, since the damage is easy to repair; if they had a more complex structure, you may have lost the whole file. Database applications solve this problem with transactions: they write something to disk that says "I am about to do this," then do it, then write "I have done it" to the log. If anything goes wrong along the way, the database simply rolls back to the state before the operation began.

Many newer file systems use journaling, which does at the file system level what databases do with their logs. The advantage of journaling is that the file system's own state is always consistent: after a power failure you only have to replay the journal rather than scan the entire disk. Unfortunately, that consistency does not extend to file contents. If you issue two writes from a user application, it is quite possible that one completes and the other does not. This model causes some problems.

ZFS uses a transactional model. You can open a transaction, issue a number of writes, and either they all succeed or they all fail. This is possible because ZFS uses a copy-on-write mechanism: whenever ZFS writes data, it writes it to free space on the disk and only then updates the metadata to say "this is the new version." If the write never reaches the metadata-update stage, none of the old data has been overwritten.
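A minimal sketch of the commit idea (illustrative only, nothing like the real DMU): new data always lands in a fresh copy, and a single pointer switch makes it visible, so a crash before that switch leaves the old version untouched.

    # Toy copy-on-write store: writes go into a copy of the current state,
    # and appending it to the version list is the atomic "metadata update".

    class CowStore:
        def __init__(self, data):
            self._versions = [dict(data)]        # last entry is the committed state

        def read(self):
            return dict(self._versions[-1])

        def transaction(self, updates):
            new_version = dict(self._versions[-1])   # copy, never modify in place
            new_version.update(updates)              # all writes land in the copy
            # A crash anywhere above this line leaves the committed data untouched.
            self._versions.append(new_version)       # the switch: all-or-nothing

    store = CowStore({"a": 1})
    store.transaction({"a": 2, "b": 3})    # both updates become visible together
    assert store.read() == {"a": 2, "b": 3}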

One side effect of copy-on-write is that it makes persistent snapshots cheap. Some file systems, such as UFS2 on FreeBSD and XFS on IRIX, already support snapshots, so the concept is not new. The standard technique is to set aside a snapshot area; once a snapshot is taken, every write is replaced by a sequence that first copies the original block into the snapshot area and then performs the write. Needless to say, this approach is expensive.

With ZFS, all that is needed to create a snapshot is to bump a reference count on the file system. Every write is already non-destructive; the only change is that the metadata update no longer releases the reference to the old block location. Another side effect of this mechanism is that snapshots are first-class file systems in their own right: you can write to a snapshot just as you can to any other file system. You could, for example, create a snapshot of a file system for each user and let them do whatever they like with it without affecting anyone else. This is especially useful in combination with Solaris Zones.
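Reference counting is the whole trick. A toy sketch (again illustrative, not ZFS's real design): a snapshot is just one more reference to the current blocks, and copy-on-write means old blocks are only freed when nothing references them any more.

    # Toy reference-counted block store: snapshots share blocks with the live
    # file system until a copy-on-write replaces them.

    class BlockStore:
        def __init__(self):
            self.blocks = {}      # block id -> [data, refcount]
            self.next_id = 0

        def put(self, data):
            self.blocks[self.next_id] = [data, 1]
            self.next_id += 1
            return self.next_id - 1

        def ref(self, block_id):
            self.blocks[block_id][1] += 1

        def unref(self, block_id):
            self.blocks[block_id][1] -= 1
            if self.blocks[block_id][1] == 0:
                del self.blocks[block_id]     # freed only when no one needs it

    store = BlockStore()
    current = store.put("version 1")

    snap = current           # a snapshot is just another reference to the same block...
    store.ref(snap)          # ...so only the reference count changes

    new = store.put("version 2")   # copy-on-write: new data goes into a new block
    store.unref(current)           # the live file system drops the old block...
    current = new

    assert store.blocks[snap][0] == "version 1"   # ...but the snapshot still sees it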

Pretending to be a file system



An object-based, transactional storage layer is all very well, but who is going to use it? All of my applications want to talk to something that looks very much like a UNIX file system. This is where the ZPL, the ZFS POSIX Layer, comes in. The ZPL translates between POSIX file operations (read, write, and so on) and the DMU operations underneath. It is responsible for managing the directory structure and for supporting ACLs (access control lists).

Besides the ZPL, ZFS has another module in the interface layer, known as ZVOL. It performs a simpler translation: instead of presenting a POSIX-compatible file system, it presents a raw block device, which is useful for running existing file systems on top of ZFS storage pools. The FreeBSD port initially ran the existing UFS2 file system on top of a ZVOL device, and it seems likely that an Apple port would run HFS+ over ZVOL so that Apple could keep supporting HFS+ metadata.
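Conceptually, a ZVOL just maps a flat range of fixed-size blocks onto storage managed by the layers below. A minimal sketch (hypothetical names, far simpler than the real thing):

    # Toy "ZVOL": expose a flat array of fixed-size blocks backed by a single
    # byte object, the way a raw disk would look to a guest file system.

    BLOCK = 4096

    class Zvol:
        def __init__(self, size_blocks):
            self._data = bytearray(size_blocks * BLOCK)   # stand-in for pooled storage

        def read_block(self, n):
            return bytes(self._data[n * BLOCK:(n + 1) * BLOCK])

        def write_block(self, n, payload):
            assert len(payload) == BLOCK
            self._data[n * BLOCK:(n + 1) * BLOCK] = payload

    vol = Zvol(size_blocks=1024)          # a 4 MiB "raw device"
    vol.write_block(7, b"x" * BLOCK)
    assert vol.read_block(7)[:1] == b"x"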

Some intriguing possibilities open up for future work in this layer. Since ZFS already supports transactions, an SQL-like interface could be offered here. Given how cheap file systems are to create, each user could spin up databases on the fly and get a far more flexible interface than the POSIX layer provides. The problem Microsoft had with WinFS, namely that it is too hard to get everyone to adopt an approach that is not file-based, would not apply, since such a tool would augment rather than replace the existing file system.

What doesn't it do?



Currently, the biggest disadvantage of ZFS is the lack of encryption.

NTFS has file-based encryption, and most volume managers provide block-level encryption. Fortunately the problem has not gone unnoticed, and ZFS encryption should be able to hook into the same mechanism that is used for compression.

Fine-grained quotas are also missing. You can create file systems that grow up to a maximum size, but you cannot limit the number of files a user may create within one.

Last word in RAID?



One of the most exciting features of ZFS is RAID-Z. A modern hard drive presents a rather boring interface: an array of fixed-size blocks that can be read or written. Because RAID is usually implemented as a block layer (often in hardware, transparently to the operating system), RAID devices present the same interface. In a three-disk RAID-5 array, writing a block means storing the block itself on one disk and updating the parity (the XOR of the corresponding blocks) on another. This causes two interrelated problems:

First, updating a block requires writing both the data and the parity, and a failure between the two leaves the stripe inconsistent (the classic RAID-5 "write hole"). Second, writing less than a full stripe forces the array to read back the old data and old parity just to recompute the parity, turning one logical write into several physical I/O operations.

So what does RAID-Z do differently? First, a RAID-Z array is not as dumb as a conventional RAID array; it has some awareness of what is stored on it. The key ingredient is variable stripe width. In existing RAID implementations the stripe unit is either very small (say, every odd byte goes to disk 1, every even byte to disk 2, and the parity byte to disk 3) or it is the block size. In ZFS, the stripe size is determined by the size of the write: every time you write to disk, you write a whole stripe.

This structure solves both of the problems mentioned above. Because ZFS is transactional, either the stripe is written and the metadata updated, or neither happens; if the hardware fails mid-write, the write has simply failed, and the data already on disk is untouched. And because the stripe contains only the data being written, you never need to read anything back from disk in order to write.

RAID-Z is only possible because of ZFS's new layering. You can rebuild a RAID-5 array when a disk fails by reasoning, "the XOR of all the bits at index 0 across the disks must come out to zero, so what must the missing disk have contained?" With RAID-Z that is not possible; instead, you have to walk the file system metadata. A RAID controller that only sees a block device cannot do this. One extra benefit is that a hardware RAID controller asked to rebuild a disk must recreate all of it, even the blocks that were never used, whereas RAID-Z only has to reconstruct the blocks actually in use.

Although not strictly part of RAID-Z, ZFS includes another feature that helps against data loss: every block is protected by a SHA256 checksum, so a bad sector shows up as an error even if the disk controller never notices it. This is an advantage over existing RAID implementations. RAID-5, for example, can rebuild an entire disk, but if a single sector silently returns bad data the array cannot even tell which copy is wrong. A RAID-Z pool can tell which disk is at fault (the one whose block does not match the checksum) and recover the data from the others. The checksums also serve as an early warning that a disk may be failing.
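The detection side is easy to illustrate. A small sketch with Python's hashlib (illustrative only; real ZFS keeps checksums alongside its block pointers rather than in a toy dictionary like this):

    import hashlib

    def checksum(block: bytes) -> bytes:
        return hashlib.sha256(block).digest()

    block = b"important data".ljust(4096, b"\0")   # one 4 KiB block
    stored_sum = checksum(block)                   # recorded when the block was written

    corrupted = bytearray(block)
    corrupted[100] ^= 0x01                         # one bit flips, controller says nothing

    # The read path still catches it: the data no longer matches the stored checksum.
    assert checksum(bytes(corrupted)) != stored_sum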

Given all this talk of variable stripe width, you might wonder what happens when a write is too small to be striped across the disks. The answer is simple: instead of computing parity, ZFS just mirrors the data.
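A very rough sketch of these two cases (a hypothetical layout, far simpler than real RAID-Z): a full-sized write is split across the data disks plus one XOR parity chunk, while a tiny write is simply mirrored.

    # Full-stripe writes with XOR parity; small writes fall back to mirroring.
    # Purely illustrative; the real RAID-Z layout is considerably more involved.

    def xor_parity(chunks):
        parity = bytearray(len(chunks[0]))
        for chunk in chunks:
            for i, b in enumerate(chunk):
                parity[i] ^= b
        return bytes(parity)

    def write_stripe(record, n_data_disks):
        if len(record) < n_data_disks:                 # tiny write: just mirror it
            return [record, record]
        chunk_len = -(-len(record) // n_data_disks)    # ceiling division
        padded = record.ljust(chunk_len * n_data_disks, b"\0")
        chunks = [padded[i * chunk_len:(i + 1) * chunk_len]
                  for i in range(n_data_disks)]
        return chunks + [xor_parity(chunks)]           # data chunks + one parity chunk

    stripe = write_stripe(b"some freshly written record", n_data_disks=3)
    # Any single lost chunk can be rebuilt by XOR-ing the remaining chunks together.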

One thing I find particularly interesting about ZFS is that it works better on block devices where random reads are cheap. It is almost as if the designers had flash drives, rather than hard disks, in mind.

How can I get this?



Will ZFS be available for your operating system? For [Open]Solaris users the answer is "yes"; for everyone else, "maybe." If you are using Windows, probably not. For Linux the situation is more complicated: the OpenSolaris implementation is released under the CDDL, with which the GPL is incompatible. There are two ways ZFS could be supported on Linux. The first is a completely independent reimplementation, which would take an enormous effort and is therefore unlikely to happen any time soon. The other is porting ZFS to FUSE and running it as a user-space process. This work is already under way, but the result will likely be much slower than a kernel-level implementation and cannot be used for boot partitions. Ubuntu users who want ZFS support can switch to Nexenta, which combines the OpenSolaris kernel with a GNU user environment.

The CDDL is less of an obstacle for the BSD family, whose licenses do not share the GPL's incompatibility. Ports of ZFS are already under way for FreeBSD and DragonFlyBSD, and Apple has indicated that MacOS X will get ZFS support as well.

Source: https://habr.com/ru/post/62681/

