
How ZFS Works - Part 2: metaslab

In the first part, I described how data is organized inside a vdev in ZFS. This second part describes the algorithm that selects the place where a write will actually land.

Here I will complicate the task a bit: the first part described only one vdev, while here we will have several of them, since the algorithm has to choose both the vdev where the data block will be written and the metaslab inside that vdev. A production system may have several dozen vdevs, and it is critical to distribute data across them correctly, because we cannot rebalance them later without copying all the data. A correct algorithm parallelizes the data so that each device holds approximately the same amount, evens out uneven filling, and yet does not overload any single device (which would slow down writes for the entire pool).

 NAME        STATE     READ WRITE CKSUM
 tank        ONLINE       0     0     0
   c1t6d0    ONLINE       0     0     0
   c1t5d0    ONLINE       0     0     0

To start, an important note: ZFS is designed on the assumption that all devices in a pool are the same size. Otherwise, if you add, say, a 2TB disk to a pool of 1TB disks, the 2TB disk will end up holding twice as much data, which will affect the total IOPS of the system: the allocator algorithm takes into account the percentage full, not the amount of data in bytes.

At the moment, there are four allocator algorithms in ZFS. The variable zfs_metaslab_ops holds a pointer to a space_map_ops_t structure, which contains pointers to the seven functions that each particular algorithm uses. In Illumos, for example, the default algorithm is metaslab_df, and its table of function pointers looks like this:

 static space_map_ops_t metaslab_df_ops = {
         metaslab_pp_load,
         metaslab_pp_unload,
         metaslab_df_alloc,
         metaslab_pp_claim,
         metaslab_pp_free,
         metaslab_pp_maxsize,
         metaslab_df_fragmented
 };


Five of these functions are shared by all the algorithms; the real difference is only in metaslab_*_alloc(), the allocator itself, and metaslab_*_fragmented(), a function that decides how fragmented the free space in a given metaslab is. The available allocators are DF (Dynamic-Fit), FF (First-Fit), and two experimental ones, CDF and NDF, whose names no one has explained.

FF is the simplest of them: it writes chunks of data into the first available space found while traversing the AVL tree in order, splitting the block being written into segments if it does not fit. This makes FF a very slow algorithm, since to write a single block it may have to traverse the whole tree until enough segments have been collected. Writing 1GB of data in the worst case means about 2 million segments of 512 bytes each, with very heavy fragmentation of the data as a result. FF is used by the other algorithms as a last resort when they cannot find space any other way; for example, DF falls back to FF when a metaslab has less than 4% free space ( int metaslab_df_free_pct = 4; ). The only advantage of FF is that it can fill even a heavily fragmented metaslab to 100%.

The DF family works differently: it builds a map of free space (the freemap) in the metaslab currently being written to, sorts it by size and/or by the proximity of contiguous free regions, and tries to choose the placement that is optimal in terms of write speed, the number of disk head movements, and minimal fragmentation of the written data.

For example, the metaslab_weight() function runs for all of them and gives a small priority to metaslabs located on the outer regions of the disk platter (for the short-stroking effect). If you use only SSDs, it makes sense to tune ZFS and disable this part of the algorithm, because short-stroking does not apply to SSDs.

So, data arrives at the allocator from the ZIO pipeline, which calls metaslab_alloc() (the allocator itself, for writes) and metaslab_free() (freeing space, garbage collection).

 metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize, blkptr_t *bp, int ndvas, uint64_t txg, blkptr_t *hintbp, int flags) 


Its parameters are: *spa, a pointer to the structure describing the array itself (the zpool); *mc, the metaslab class, which among other things holds the pointer to zfs_metaslab_ops; psize, the data size; *bp, a pointer to the block itself; ndvas, the number of independent copies of the data required for this block (1 for data; 2 for most metadata; 3 in some cases for metadata high up in the AVL tree. The point of duplicating metadata is that if a single block of metadata for a subtree is lost, we lose everything under it. Such blocks are called ditto blocks, and the algorithm tries to write them to different vdevs).

Further, txg is the sequence number of the transaction group being written; *hintbp is a hint used so that blocks logically adjacent to *hintbp also end up adjacent on disk and go to the same vdev; flags is a set of bits that tells the allocator whether to apply specific allocation options: whether to use or ignore *hintbp, and whether to use ganging (write a group of child blocks to the same vdev as their header, for more efficient operation of ZFS prefetch and the vdev cache).

 #define METASLAB_HINTBP_FAVOR  0x0
 #define METASLAB_HINTBP_AVOID  0x1
 #define METASLAB_GANG_HEADER   0x2
 #define METASLAB_GANG_CHILD    0x4
 #define METASLAB_GANG_AVOID    0x8


 /*
  * Allow allocations to switch to gang blocks quickly. We do this to
  * avoid having to load lots of space_maps in a given txg. There are,
  * however, some cases where we want to avoid "fast" ganging and instead
  * we want to do an exhaustive search of all metaslabs on this device.
  * Currently we don't allow any gang, zil, or dump device related allocations
  * to "fast" gang.
  */
 #define CAN_FASTGANG(flags) \
         (!((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER | \
         METASLAB_GANG_AVOID)))


 /*
  * If we are doing gang blocks (hintdva is non-NULL), try to keep
  * ourselves on the same vdev as our gang block header. That
  * way, we can hope for locality in vdev_cache, plus it makes our
  * fault domains something tractable.
  */


Then comes the most important part of the code, the logic that actually chooses the metaslab to write to: metaslab_alloc_dva(). The function is almost 200 lines of clever code, which I will try to explain.

First, we take the metaslab groups of all the vdevs and walk through them with the allocator's rotor ( mg_rotor ), using hints if there are any. We skip vdevs to which writing is currently undesirable, for example those where one of the disks has died or a raidz group is being rebuilt ("Don't allocate from faulted devices"). We also skip disks that have had write errors when only a single copy of the block would be stored ("Avoid writing single-copy data to a failing vdev").

The rotor runs in a circle until the data sent for writing runs out. Inside this loop, the optimal vdev is selected in turn; then, within it, metaslab_group_alloc() picks the best metaslab; then we decide how much data to write to this metaslab by comparing this vdev's utilization percentage with the others. This part of the code is critical, so I quote it in full:

 offset = metaslab_group_alloc(mg, psize, asize, txg, distance,
     dva, d, flags);
 if (offset != -1ULL) {
         /*
          * If we've just selected this metaslab group,
          * figure out whether the corresponding vdev is
          * over- or under-used relative to the pool,
          * and set an allocation bias to even it out.
          */
         if (mc->mc_aliquot == 0) {
                 vdev_stat_t *vs = &vd->vdev_stat;
                 int64_t vu, cu;

                 vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
                 cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);

                 /*
                  * Calculate how much more or less we should
                  * try to allocate from this device during
                  * this iteration around the rotor.
                  * For example, if a device is 80% full
                  * and the pool is 20% full then we should
                  * reduce allocations by 60% on this device.
                  *
                  * mg_bias = (20 - 80) * 512K / 100 = -307K
                  *
                  * This reduces allocations by 307K for this
                  * iteration.
                  */
                 mg->mg_bias = ((cu - vu) *
                     (int64_t)mg->mg_aliquot) / 100;
         }


For example, if we need to write 1MB of data to an array of two disks, one 20% full and the other 80% full, we will write 819KB to the first and 205KB to the second. A correction: it is the pool's fill percentage that is compared with each vdev's, so the actual result will differ slightly. By the way, one very interesting thing can be done here: a few months ago I added per-vdev latency statistics to ZFS (they live in vdev_stat_t->vs_latency[] in NexentaStor; they have not been added to Illumos yet), and they can be used as one of the factors when placing new data, either weighted together with free space in some proportion, or on their own. I have written such a modified algorithm, but it is not used in production systems yet. It makes sense whenever the array contains disks of different types and speeds, or when one of the disks starts to die (slow down) while things are not yet bad enough for it to report errors.

Finally, iteration after iteration, the loop sends all the data to metaslab groups for writing, and finishes for this transaction group (txg). Inside a metaslab group, metaslab_weight() (see the beginning of the article) ranks the metaslabs; then, through the space map machinery, taking into account maxfree (the largest piece of contiguous free space) and traversing the AVL trees, the chosen algorithm (DF, FF, CDF or NDF) packs the data in the way it considers optimal. At this point we finally obtain the physical address of the block to be written, and the data goes into the write queue of the sd (SCSI Device) driver.

Source: https://habr.com/ru/post/161055/

