```
NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  c1t6d0    ONLINE       0     0     0
  c1t5d0    ONLINE       0     0     0
```
zfs_metaslab_ops contains a pointer to a space_map_ops_t structure, which holds pointers to the seven functions that a particular allocation algorithm uses. For example, in Illumos the default algorithm is metaslab_df, and its table of function pointers looks like this:

```c
static space_map_ops_t metaslab_df_ops = {
	metaslab_pp_load,
	metaslab_pp_unload,
	metaslab_df_alloc,
	metaslab_pp_claim,
	metaslab_pp_free,
	metaslab_pp_maxsize,
	metaslab_df_fragmented
};
```
metaslab_*_alloc() and metaslab_*_fragmented() are the allocator itself and the function that decides how fragmented the free space in a particular metaslab is. The available allocators are DF (Dynamic-Fit), FF (First-Fit), and two experimental ones, CDF and NDF, whose names no one seems to be able to decode. DF switches to first-fit behaviour once the free space in a metaslab drops below a threshold (int metaslab_df_free_pct = 4;, i.e. 4%). The only advantage of FF is that it can fill even a heavily fragmented metaslab to 100%.

The idea of DF: load the freemap of the metaslab currently being written to, sort it by size and/or by proximity of contiguous chunks of free space, and try to choose the placement that is optimal in terms of write speed, number of disk-head movements, and minimal fragmentation of the written data.

Before that, the metaslab_weight() function runs; it gives a small priority boost to metaslabs located on the outer regions of the disk platter (the short-stroking effect). If you use only SSDs, it makes sense to tune ZFS and disable this part of the algorithm, because short-stroking does not apply to SSDs.

When writing, metaslab_alloc() is called (the allocator entry point itself), and metaslab_free() frees space and collects garbage:

```c
metaslab_alloc(spa_t *spa, metaslab_class_t *mc, uint64_t psize,
    blkptr_t *bp, int ndvas, uint64_t txg, blkptr_t *hintbp, int flags)
```
Its parameters:

*spa - pointer to the structure of the pool (zpool) itself;
*mc - the metaslab class, which also contains the pointer to zfs_metaslab_ops;
psize - the data size;
*bp - pointer to the block itself;
ndvas - the number of independent copies of the data required for the given block (1 for data; 2 for most metadata; 3 in some cases for metadata high up in the AVL tree). The point of duplicating metadata is that if a single block holding metadata for a subtree is lost, everything below it is lost too. Such blocks are called ditto blocks, and the algorithm tries to write them to different vdevs;
txg - the sequence number of the transaction group being written;
*hintbp - a hint used so that blocks logically adjacent to *hintbp end up physically adjacent on disk and go to the same vdev;
flags - 5 bits that let the allocator know whether specific allocation options are needed: use or ignore *hintbp, and whether to use ganging (write a group of child blocks to the same vdev as their gang header, for more efficient operation of the ZFS prefetch and the vdev cache):

```c
#define METASLAB_HINTBP_FAVOR	0x0
#define METASLAB_HINTBP_AVOID	0x1
#define METASLAB_GANG_HEADER	0x2
#define METASLAB_GANG_CHILD	0x4
#define METASLAB_GANG_AVOID	0x8
```
```c
/*
 * Allow allocations to switch to gang blocks quickly. We do this to
 * avoid having to load lots of space_maps in a given txg. There are,
 * however, some cases where we want to avoid "fast" ganging and instead
 * we want to do an exhaustive search of all metaslabs on this device.
 * Currently we don't allow any gang, zil, or dump device related allocations
 * to "fast" gang.
 */
#define CAN_FASTGANG(flags) \
	(!((flags) & (METASLAB_GANG_CHILD | METASLAB_GANG_HEADER | \
	METASLAB_GANG_AVOID)))
```
```c
/*
 * If we are doing gang blocks (hintdva is non-NULL), try to keep
 * ourselves on the same vdev as our gang block header. That
 * way, we can hope for locality in vdev_cache, plus it makes our
 * fault domains something tractable.
 */
```
The main work is done in metaslab_alloc_dva(). There are almost 200 lines of rather clever code in that function, which I will try to explain.

We walk through the vdevs in a circle (the rotor, mg_rotor), using the hints if there are any. We skip vdevs to which writing is currently undesirable, for example those in which one of the disks has died, or in which a raidz group is being rebuilt ("Don't allocate from faulted devices"). We also skip disks that have had some kind of write error, for blocks that will exist in only one copy ("Avoid writing single-copy data to a failing vdev").

Then, via metaslab_group_alloc(), the best metaslab is selected, and we decide how much data to write to it by comparing the utilization of this vdev with the others. This part of the code is quite critical, so I quote it in full:

```c
offset = metaslab_group_alloc(mg, psize, asize, txg, distance,
    dva, d, flags);
if (offset != -1ULL) {
	/*
	 * If we've just selected this metaslab group,
	 * figure out whether the corresponding vdev is
	 * over- or under-used relative to the pool,
	 * and set an allocation bias to even it out.
	 */
	if (mc->mc_aliquot == 0) {
		vdev_stat_t *vs = &vd->vdev_stat;
		int64_t vu, cu;

		vu = (vs->vs_alloc * 100) / (vs->vs_space + 1);
		cu = (mc->mc_alloc * 100) / (mc->mc_space + 1);

		/*
		 * Calculate how much more or less we should
		 * try to allocate from this device during
		 * this iteration around the rotor.
		 * For example, if a device is 80% full
		 * and the pool is 20% full then we should
		 * reduce allocations by 60% on this device.
		 *
		 * mg_bias = (20 - 80) * 512K / 100 = -307K
		 *
		 * This reduces allocations by 307K for this
		 * iteration.
		 */
		mg->mg_bias = ((cu - vu) *
		    (int64_t)mg->mg_aliquot) / 100;
	}
```
There is also a per-vdev latency statistic (vdev_stat_t->vs_latency[] in NexentaStor; in Illumos it has not been added yet), and it can be used as another factor when placing new data, combined with free space in any proportion, or even used on its own. I wrote such a modified algorithm as well, but it is not used in production systems yet. It makes sense when the array contains disks of different types and speeds, or when one of the disks begins to die (to slow down) while it is not yet bad enough to produce errors.
Then metaslab_weight() (see the beginning of the article) runs within the selected metaslab group, and through the space map machinery, taking into account maxfree (the largest piece of contiguous free space) and traversing the AVL trees with the chosen algorithm (DF, FF, CDF, NDF), the data is packed in the way that algorithm considers optimal. After that we finally obtain the physical address of the block to be written, and the data goes into the write queue of the sd (SCSI disk) driver.

Source: https://habr.com/ru/post/161055/