
GlusterFS + erasure coding: when you need a lot, cheap and reliable

Very few people in Russia run Gluster, so any experience is interesting. Ours is large, industrial and, judging by the discussion under the last post, in demand. In that post I talked about the very beginning of our experience of moving backups from enterprise storage to GlusterFS.

That was not hardcore enough, so we did not stop there and decided to build something more serious. This time we will cover erasure coding, sharding, rebalancing and its throttling, load testing, and so on.



What we wanted


The task is simple: to build a cheap but reliable storage stack. As cheap as possible, and reliable enough that keeping our own production files on it is not scary. Our own for now; later, after long testing and with backups to another storage system, client files as well.

Use cases (sequential IO):

- Backups
- Test Infrastructure
- Test storage for heavy media files.
We are here.
- Production file dumps and serious test infrastructure
- Storage for important data.

As last time, the main requirement is the network speed between Gluster instances. 10GbE is fine to start with.

Theory: what is a dispersed volume?


A dispersed volume is based on erasure coding (EC), which provides fairly effective protection against disk or server failures. It is like RAID 5 or 6, but not quite. It stores an encoded fragment of each file on every brick in such a way that only a subset of the fragments stored on the remaining bricks is needed to recover the file. The number of bricks that can be unavailable without losing access to the data is set by the administrator when the volume is created.



What is a subvolume?


The notion of a subvolume in GlusterFS terminology shows up with distributed volumes. In a distributed-dispersed volume, erasure coding works within a single subvolume; likewise, in a distributed-replicated volume, data is replicated within a subvolume.
Each subvolume is spread across different servers, which makes it possible to lose a server or take it out for maintenance. In the figure, physical servers are marked in green, dashed lines are subvolumes. Each of them is presented to an application server as a single disk (volume):



We decided that a distributed-dispersed 4+2 configuration on 6 nodes looks reliable enough: we can lose any 2 servers, or 2 disks within each subvolume, and still keep access to the data.

We had 6 old Dell PowerEdge R510 servers with 12 disk slots each and 48 x 2TB 3.5" SATA disks. In principle, with servers that have 12 disk slots and disks on the market of up to 12TB, we could build storage with up to 576TB of usable space. But do not forget that although maximum HDD capacities keep growing year after year, their performance stays the same, and rebuilding a 10-12TB disk can take a week.
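As a sanity check of that 576TB figure, a back-of-the-envelope calculation (a sketch; in a 4+2 layout 4 of every 6 fragments carry data):

 # raw capacity: 6 servers x 12 slots x 12TB disks
 echo $((6 * 12 * 12))         # 864 TB raw
 # usable with 4+2 erasure coding: 4/6 of raw
 echo $((6 * 12 * 12 * 4 / 6)) # 576 TB usable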



Volume creation:
A detailed description of how to prepare the bricks can be found in my previous post.

gluster volume create freezer disperse-data 4 redundancy 2 transport tcp \ $(for i in {0..7} ; do echo {sl051s,sl052s,sl053s,sl064s,sl075s,sl078s}:/export/brick$i/freezer ; done) 

We create the volume, but do not rush to start and mount it, as we still have to apply several important parameters.
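The resulting layout can be checked with the standard CLI before going further (a sketch):

 gluster volume info freezer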

What we got:



Everything looks quite normal, but there is one nuance.

It concerns how such a volume writes to its bricks:
Files are placed into subvolumes in turn rather than being spread evenly across them, so sooner or later we will run into the size of a single subvolume, not the size of the whole volume. The maximum file size we can put into this storage equals the usable size of a subvolume minus the space already occupied on it; in my case that is < 8 TB.
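A rough illustration with our numbers (an assumption of roughly 1.9 TB usable per 2 TB brick): each 4+2 subvolume stores data on 4 of its 6 bricks, so its usable size is about 4 × 1.9 ≈ 7.6 TB, which is where the < 8 TB single-file limit comes from.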

What to do? How do we handle it?
This problem is solved either by sharding or by a striped volume, but, as practice has shown, stripe works very badly.

Therefore, we will try sharding.

What sharding is, in detail: here.

What sharding is, briefly:
Each file you put into the volume is divided into parts (shards), which are spread relatively evenly across the subvolumes. The shard size is set by the administrator; the default value is 4 MB.

We enable sharding after creating the volume, but before starting it:

 gluster volume set freezer features.shard on 

We set the shard size (which one is optimal? The oVirt folks recommend 512MB):

 gluster volume set freezer features.shard-block-size 512MB 
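With the options in place, the volume can be started and mounted in the usual way (a sketch; the mount host and mount point are placeholders):

 gluster volume start freezer
 mount -t glusterfs sl051s:/freezer /mnt/freezer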

Empirically, it turns out that the actual shard fragment size on a brick of a 4+2 dispersed volume equals shard-block-size / 4, in our case 512M / 4 = 128M.

Following the erasure coding logic, each shard is decomposed across the bricks of a subvolume like this: 4 x 128M + 2 x 128M.

Let's run through the failure cases this Gluster configuration survives:
In this configuration we can survive the failure of 2 nodes, or of any 2 drives within one subvolume.

For the tests, we decided to plug the resulting storage into our cloud and run fio from the virtual machines.

We start sequential writes from 15 VMs and do the following.
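The load was generated with fio; a minimal job of the kind used here might look like this (a sketch based on the parameters given at the end of the post; /dev/vdb is the VM's test disk):

 fio --name=seqwrite --ioengine=libaio --direct=1 --buffered=0 --iodepth=1 --bs=64k \
     --rw=write --filename=/dev/vdb --runtime=6000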

Reboot of the 1st node:
17:09
Looks uncritical (~5 seconds of unavailability, as set by the network.ping-timeout parameter).

17:19
Started a full heal.
The number of heal entries keeps growing, probably because of the heavy write load on the cluster.
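For reference, the full heal was started and monitored with the standard commands (a sketch):

 gluster volume heal freezer full
 # watch the number of entries still pending heal
 watch -n 60 "gluster volume heal freezer info | grep 'Number of entries'"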

17:32
We decided to stop writes from the VMs.
The number of heal entries started to decrease.

17:50
heal is done.

Reboot of 2 nodes:

The same results are observed as with the 1st node.

Reboot of 3 nodes:
The mount point went away with "Transport endpoint is not connected", and the VMs got I/O errors.
After the nodes came back up, Gluster recovered by itself, without any intervention on our part, and the healing process began.

But 4 out of 15 VMs could not come back up. The hypervisor showed errors:

 2018.04.27 13:21:32.719 ( volumes.py:0029): I: Attaching volume vol-BA3A1BE1 (/GLU/volumes/33/33e3bc8c-b53e-4230-b9be-b120079c0ce1) with attach type generic...
 2018.04.27 13:21:32.721 ( qmp.py:0166): D: Querying QEMU: __com.redhat_drive_add({'file': u'/GLU/volumes/33/33e3bc8c-b53e-4230-b9be-b120079c0ce1', 'iops_rd': 400, 'media': 'disk', 'format': 'qcow2', 'cache': 'none', 'detect-zeroes': 'unmap', 'id': 'qdev_1k7EzY85TIWm6-gTBorE3Q', 'iops_wr': 400, 'discard': 'unmap'})...
 2018.04.27 13:21:32.784 ( instance.py:0298): E: Failed to attach volume vol-BA3A1BE1 to the instance: Device 'qdev_1k7EzY85TIWm6-gTBorE3Q' could not be initialized
 Traceback (most recent call last):
   File "/usr/lib64/python2.7/site-packages/ic/instance.py", line 292, in emulation_started
     c2.qemu.volumes.attach(controller.qemu(), device)
   File "/usr/lib64/python2.7/site-packages/c2/qemu/volumes.py", line 36, in attach
     c2.qemu.query(qemu, drive_meth, drive_args)
   File "/usr/lib64/python2.7/site-packages/c2/qemu/_init_.py", line 247, in query
     return c2.qemu.qmp.query(qemu.pending_messages, qemu.qmp_socket, command, args, suppress_logging)
   File "/usr/lib64/python2.7/site-packages/c2/qemu/qmp.py", line 194, in query
     message["error"].get("desc", "Unknown error")
 QmpError: Device 'qdev_1k7EzY85TIWm6-gTBorE3Q' could not be initialized
 qemu-img: Could not open '/GLU/volumes/33/33e3bc8c-b53e-4230-b9be-b120079c0ce1': Could not read image for determining its format: Input/output error

Hard power-off of 3 nodes with sharding disabled:

 Transport endpoint is not connected (107)
 /GLU/volumes/e0/e0bf9a42-8915-48f7-b509-2f6dd3f17549: ERROR: cannot read (Input/output error)

We lose data here as well, and it cannot be recovered.

Gracefully shut down 3 nodes with sharding enabled: will there be data corruption?
Yes, but significantly less (coincidence?): we lost 3 disks out of 30.

Findings:

  1. Heal of these files hangs endlessly, and rebalance does not help. We conclude that the files that were being actively written when the 3rd node went down were lost for good.
  2. Never reboot more than 2 nodes in a 4 + 2 configuration in production!
  3. How do you avoid losing data if you really need to reboot 3+ nodes? Stop writing to the mount point and/or stop the volume (see the sketch after this list).
  4. A node or a brick should be replaced as soon as possible. For this it is highly desirable to have, say, 1-2 hot-spare bricks in each node for quick replacement, and a spare node with bricks in case a whole node dies.
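A minimal sketch of what a planned shutdown of 3+ nodes could look like (assuming writes can be paused):

 # stop all writers first, then (optionally) stop the volume entirely
 gluster volume stop freezer
 # ... reboot / service the nodes ...
 gluster volume start freezer
 gluster volume heal freezer full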



It is also very important to test disk replacement cases.

Brick (disk) failures:
17:20
We knock out a brick:

 /dev/sdh 1.9T 598G 1.3T 33% /export/brick6 

17:22
 gluster volume replace-brick freezer sl051s:/export/brick_spare_1/freezer sl051s:/export/brick2/freezer commit force 
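For reference, the general form of the command is old brick first, then its replacement:

 gluster volume replace-brick <VOLNAME> <old-brick> <new-brick> commit force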

You can see a dip like this at the moment of brick replacement (writing from a single source):



The replacement process is quite long: with a light write load on the cluster and default settings, 1 TB heals in about a day.

Tunable heal options:

 gluster volume set freezer cluster.background-self-heal-count 20
 # Default Value: 8
 # Description: This specifies the number of per client self-heal jobs that can perform parallel heals in the background.

 gluster volume set freezer cluster.heal-timeout 500
 # Default Value: 600
 # Description: time interval for checking the need to self-heal in self-heal-daemon

 gluster volume set freezer cluster.self-heal-window-size 2
 # Default Value: 1
 # Description: Maximum number of blocks per file for which the self-heal process would be applied simultaneously.

 gluster volume set freezer cluster.data-self-heal-algorithm diff
 # Default Value: (null)
 # Description: Select between "full" and "diff". The "full" algorithm copies the entire file from source to sink.
 # The "diff" algorithm copies to sink only those blocks whose checksums don't match those of the source.
 # If no option is configured, the option is chosen dynamically: if the file does not exist on one of the sinks,
 # or an empty file exists, or the source file size is about the same as page size, the entire file is read and
 # written (i.e. the "full" algorithm); otherwise "diff" is chosen.

 gluster volume set freezer cluster.self-heal-readdir-size 2KB
 # Default Value: 1KB
 # Description: readdirp size for performing entry self-heal

Option: disperse.background-heals
Default Value: 8
Description: controls the number of parallel heals that can run in the background.

Option: disperse.heal-wait-qlength
Default Value: 128
Description: controls how many heals can wait in the queue while the background heals are busy.

Option: disperse.shd-max-threads
Default Value: 1
Description: maximum number of parallel heals SHD can do per local brick. This can substantially lower heal times, but can also crush your bricks if the storage hardware cannot keep up.

Option: disperse.shd-wait-qlength
Default Value: 1024
Description: controls how many heals can wait in the SHD queue per brick.

Option: disperse.cpu-extensions
Default Value: auto
Description: force the use of CPU extensions to accelerate the Galois field computations.

Option: disperse.self-heal-window-size
Default Value: 1
Description: maximum number of blocks (128KB each) per file for which the self-heal process is applied simultaneously.

We settled on:

 disperse.shd-max-threads: 6
 disperse.self-heal-window-size: 4
 cluster.self-heal-readdir-size: 2KB
 cluster.data-self-heal-algorithm: diff
 cluster.self-heal-window-size: 2
 cluster.heal-timeout: 500
 cluster.background-self-heal-count: 20
 cluster.disperse-self-heal-daemon: enable
 disperse.background-heals: 18
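Each of these is applied with a plain volume set call (a sketch):

 gluster volume set freezer disperse.shd-max-threads 6
 gluster volume set freezer disperse.background-heals 18
 # ...and so on for the rest of the list above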

With the new parameters, 1 TB of data healed in 8 hours (3 times faster!).

An unpleasant detail: the resulting brick ends up with more used space than the original had.

Before:
 Filesystem Size Used Avail Use% Mounted on /dev/sdd 1.9T 645G 1.2T 35% /export/brick2 

After:
 Filesystem Size Used Avail Use% Mounted on /dev/sdj 1.9T 1019G 843G 55% /export/hot_spare_brick_0 

This still needs investigating; most likely it is due to thin-provisioned disks inflating. On subsequent replacements of the inflated bricks, the size stayed the same.

Rebalancing:
After expanding or shrinking a volume (using the add-brick and remove-brick commands respectively), you need to rebalance the data among the servers. In a non-replicated volume, all bricks must be up to perform the rebalance; in a replicated volume, at least one brick in each replica must be up.
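The rebalance itself is started and monitored with the standard commands (a sketch):

 gluster volume rebalance freezer start
 gluster volume rebalance freezer status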

Throttling the rebalance:

Option: cluster.rebal-throttle
Default Value: normal
Description: Sets the maximum number of parallel file migrations allowed on a node during the rebalance operation. The default value is normal and allows a maximum of [($(processing units) - 4) / 2), 2] files to be migrated at a time; lazy allows only one file at a time, and aggressive allows a maximum of [($(processing units) - 4) / 2), 4].

Option: cluster.lock-migration
Default Value: off
Description: If enabled, the locks associated with a file are migrated during rebalance.

Option: cluster.weighted-rebalance
Default Value: on
Description: When enabled, files are allocated to bricks with a probability proportional to their size; otherwise, all bricks have the same probability.
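The throttle mode is set like any other volume option; valid values are lazy, normal and aggressive (a sketch):

 gluster volume set freezer cluster.rebal-throttle aggressive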

Comparing writes and then reads with the same fio parameters (more detailed performance test results are available on request by PM):

 fio --fallocate=keep --ioengine=libaio --direct=1 --buffered=0 --iodepth=1 --bs=64k \
     --name=test --rw=write --filename=/dev/vdb --runtime=6000
 # the same job was then repeated with --rw=read







For those interested, here is a comparison of rsync write speed with the traffic to the Gluster nodes:





You can see roughly 170 MB/s of traffic for about 110 MB/s of payload, i.e. about a third of the traffic is overhead, which matches the 1/3 redundancy of the 4+2 erasure coding (2 of every 6 fragments are parity).

Memory consumption on the server side is practically the same under load and nearly idle:




The load on the cluster hosts at maximum load on the volume:

Source: https://habr.com/ru/post/417475/

