
A bug in Linux 5.1 caused data loss, and a fix has already been released

A couple of weeks ago, a bug that led to data loss on SSDs was discovered in the Linux 5.1 kernel. The developers have since released a fix in Linux 5.1.5 that closes the hole.

Below, we discuss what caused it.


Photo: glen carrie / Unsplash

What the bug was


At the beginning of the year, developers made a number of changes to the Linux 5.1 kernel. After that, systems with Samsung SSDs that use dm-crypt/LUKS encryption on top of device-mapper/LVM began to show an error that led to data loss. The problem only became known in mid-May, when it also began to be actively discussed on specialist forums.

At least two people are known to have run into the bug: LKML mailing list member Michael Laß, who first reported the problem, and an Arch Linux user.

Michael ran the fstrim command, which tells the drive which data blocks are no longer in use by the mounted btrfs volume. He then received the following system messages:

    attempt to access beyond end of device
    sda1: rw=16387, want=252755893, limit=250067632
    BTRFS warning (device dm-5): failed to trim 1 device(s), last error -5
    BTRFS warning (device dm-5): csum failed root 257 ino 16634085 off 21504884736 csum 0xd47cc2a2 expected csum 0xcebd791b mirror 1

After that, he discovered that the btrfs volume was damaged, and the remaining logical volumes on the physical device were destroyed.
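To get a sense of how far past the end of the partition the request reached, the numbers in the first message can be compared directly. The sketch below is only an illustration: it assumes that want is the sector at which the request would have ended, that limit is the size of sda1, and that both are counted in 512-byte sectors. The variable names are made up for the example.

```python
# Values copied from the "attempt to access beyond end of device" message.
SECTOR_SIZE = 512  # assumption: the message counts 512-byte sectors

want = 252755893   # assumed: sector at which the rejected request would end
limit = 250067632  # assumed: size of sda1 in sectors

overshoot_sectors = want - limit
overshoot_bytes = overshoot_sectors * SECTOR_SIZE

print(f"request ends {overshoot_sectors} sectors past the end of sda1")
print(f"that is about {overshoot_bytes / 2**30:.2f} GiB")
```

Under these assumptions, the request pointed more than a gigabyte beyond the end of the partition.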

In the Arch Linux user's case, the problem affected LUKS encryption. After rebooting the operating system and running fstrim, the LUKS headers (which are used to locate the volumes) turned out to be unreadable, so the encrypted data could no longer be decrypted.

What caused it


The problem was in the device mapper (DM) subsystem, whose task is to create virtual block devices. It is used to implement the LVM logical volume manager, software RAID, and the dm-crypt disk encryption system.

“The fstrim command marked too many blocks at a time without taking the max_io_len_target_boundary limit into account. As a result, areas of storage that were still in use were freed,” commented Sergey Belkin, head of the development department at 1cloud.ru. “Since the error was in the device mapper, in theory data loss could occur on any file system.”
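The effect of ignoring the target boundary can be illustrated with a small model. This is a rough sketch, not kernel code: the Target class and the function names are hypothetical, and the only idea taken from the kernel is that the length of a discard request should be capped at the number of sectors remaining in the current device-mapper target.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Hypothetical stand-in for one device-mapper target (a table segment)."""
    start: int   # first sector of the target on the virtual device
    length: int  # number of sectors covered by the target


def max_len_to_boundary(sector: int, target: Target) -> int:
    """Sectors left between `sector` and the end of `target`."""
    return target.start + target.length - sector


def discard_len_unclamped(sector: int, sector_count: int, target: Target) -> int:
    # Behaviour described in the quote: the whole request goes to one target.
    return sector_count


def discard_len_clamped(sector: int, sector_count: int, target: Target) -> int:
    # Capped variant: never cross the end of the current target.
    return min(sector_count, max_len_to_boundary(sector, target))


target = Target(start=0, length=1000)
sector, sector_count = 900, 500  # request would run 400 sectors past the target

buggy = discard_len_unclamped(sector, sector_count, target)
fixed = discard_len_clamped(sector, sector_count, target)
print("unclamped length:", buggy, "-> ends at sector", sector + buggy)
print("clamped length:", fixed, "-> ends at sector", sector + fixed)
```

In this model the unclamped request ends at sector 1400, well outside a target that stops at sector 1000, while the clamped one stops exactly at the boundary and leaves the remainder for the next target.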

Patch


Kernel developers released a patch for the bug in late May. Only four lines in drivers/md/dm.c were changed. Corresponding changes were also made to the upcoming Linux 5.2 kernel (added and removed lines are marked with “+” and “-”, respectively):

    @@ -1467,7 +1467,7 @@ static unsigned get_num_write_zeroes_bios(struct dm_target *ti)
     static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti, unsigned num_bios)
     {
    -	unsigned len = ci->sector_count;
    +	unsigned len;
    @@ -1478,6 +1478,8 @@ static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *
     	if (!num_bios)
     		return -EOPNOTSUPP;
    +	len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti));
    +
     	__send_duplicate_bios(ci, ti, num_bios, &len);
     	ci->sector += len;

The patch has already been applied by the Arch Linux, Manjaro, and Fedora distributions. Ubuntu was not affected, since it had not moved to the Linux 5.1 kernel.
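A quick way to see whether a given kernel release string falls between the first affected version and the fixed one is to compare version tuples. The sketch below assumes the affected range is 5.1.0 up to, but not including, 5.1.5, based on the versions named in this article. It cannot know whether a distribution has backported the fix to an earlier build or whether a 5.2 pre-release already contains it, so the result is only a hint. The function names are illustrative.

```python
import re
from typing import Tuple

FIRST_AFFECTED = (5, 1, 0)  # assumption: the bug appeared in 5.1
FIRST_FIXED = (5, 1, 5)     # the fix is described as part of 5.1.5


def parse_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string such as '5.1.4-arch1-1'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def in_affected_range(release: str) -> bool:
    """True if the release lies in the stable 5.1.x range before the fix."""
    return FIRST_AFFECTED <= parse_release(release) < FIRST_FIXED


for sample in ("5.0.21", "5.1.4-arch1-1", "5.1.5"):
    print(sample, in_affected_range(sample))
```

To check the running system, the string returned by platform.release() from the Python standard library can be passed to in_affected_range.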


Photo: Andy Melton / Flickr / CC BY-SA

You can avoid the data loss without installing the patch. It is enough to disable the fstrim service and its timer using the commands:

    systemctl disable fstrim.timer
    systemctl stop fstrim.timer

Another option is to rename the fstrim executable or to remove the discard option from the mount entries in fstab. You can also turn off the allow-discards option for LUKS via dmsetup. However, all of these are only temporary workarounds and do not address the root of the problem.
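To find the fstab entries that would need editing, the file can be scanned for the discard mount option. The following is a minimal sketch that works on the text of an fstab file; the function name and the sample device paths are hypothetical, and the script only reports matching entries without modifying anything.

```python
def entries_with_discard(fstab_text: str):
    """Return (device, mount_point) pairs whose options include 'discard'."""
    found = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        device, mount_point, _fstype, options = fields[:4]
        # Exact match only, so that 'nodiscard' is not reported.
        if "discard" in options.split(","):
            found.append((device, mount_point))
    return found


# Hypothetical sample; on a real system the text would come from /etc/fstab.
sample = """\
# <device>            <dir>  <type>  <options>         <dump> <pass>
/dev/mapper/vg0-root  /      btrfs   defaults,discard  0      0
/dev/mapper/vg0-home  /home  ext4    defaults,noatime  0      2
"""
print(entries_with_discard(sample))
```

On a real machine the same function could be fed the contents of /etc/fstab, read with open("/etc/fstab").read().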

Not the first time


This is not the first time that a commit to the Linux kernel has led to data corruption. A similar story happened with Linux 4.19, where the blk-mq I/O schedulers were to blame. The problem showed up when the kernel was built with the CONFIG_SCSI_MQ_DEFAULT=y option enabled by default. In some cases, data on the volume was damaged.

    sed: error while loading shared libraries: /lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107
    sed: error while loading shared libraries: /lib/x86_64-linux-gnu/libattr.so.1: unexpected PLT reloc type 0x00000107

Most often, the problem manifested itself with EXT4, but in theory it could affect other file systems.

One of the kernel maintainers then prepared a small fix that solved the problem. However, the same bug was later found in Linux 4.20 as well. It was finally eliminated at the end of December 2018 with a new update.


Source: https://habr.com/ru/post/454978/

