
SSDs Turn Out to Have a Weak Spot

Data storage technology is a topic in its own right. Not long ago we touched on it indirectly in our article on managing server disk space.

Today we will talk about how the team behind the Algolia search service tracked down a sudden problem with their SSDs.


/ photo by Aaron und Ruth Meder CC
Engineers at Algolia, who work on search technology, ran into a problem with the indexing behind their search API. During the incident, new API requests were correctly redirected to other machines in the cluster, but it was not clear what had actually happened.

Indexing was managed by the supervise process, which kept restarting it, but the restart loop was a symptom rather than the cause: the file system had switched to read-only mode. Most interesting of all, a similar incident occurred on other machines exactly one day later. Suspicion therefore fell on the software in the storage stack and on recent changes to it.
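A read-only file system is easy to spot from the mount table. Below is a minimal sketch that parses text in the format of /proc/mounts and lists the mount points carrying the "ro" option. The device names and mount points in the sample are hypothetical, and this is not the check Algolia used.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose option list contains 'ro'.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "ro" in options:
            result.append(mount_point)
    return result


# Hypothetical sample; on a Linux machine one could pass
# open("/proc/mounts").read() instead.
SAMPLE = """\
/dev/md0 / ext4 rw,relatime,discard 0 0
/dev/md1 /data ext4 ro,relatime 0 0
"""

print(read_only_mounts(SAMPLE))  # ['/data']
```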

The cause and the possible fixes could have been very different, but the Algolia team focused on a few hypotheses, among them a bug in ext4 and a bug in mdadm.


Analysis of the next affected machine showed that parts of files were missing: the modification date and size of the files stayed the same, but some of their sections were filled with zeros. Small files were wiped completely.
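Because the size of a damaged file does not change, this kind of corruption only shows up when the contents themselves are compared. Here is a rough sketch of that idea: record a content hash, zero out a section in place, and compare again. The file name, the fill byte, and the offsets are arbitrary assumptions for the demonstration. Note that in this demo the operating system updates the modification time on write, unlike in the incident.

```python
import hashlib
import os
import tempfile


def fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for a file."""
    size = os.stat(path).st_size
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return size, digest


def demo():
    """Zero out part of a file in place and return both fingerprints."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "index.bin")  # hypothetical file name
        with open(path, "wb") as f:
            f.write(b"\xab" * 2048)
        before = fingerprint(path)

        # Simulate the damage: overwrite one section with zeros.
        with open(path, "r+b") as f:
            f.seek(512)
            f.write(b"\x00" * 512)
        after = fingerprint(path)
    return before, after


before, after = demo()
print("size unchanged:", before[0] == after[0])
print("content changed:", before[1] != after[1])
```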

Areas of system memory were inaccessible, and the engineers started with the ext4 bug hypothesis. The kernel changelog listed a huge number of bugs that could adversely affect the servers. An ext4 bug could not be ruled out completely, but its probability was judged to be close to zero. The mdadm hypothesis was checked next: a review of its changelog convinced the engineers that the root of the problem was definitely not there.

The machines kept "dying" more and more often, and Algolia kept improving its recovery procedure, until it became clear that the only difference between the machines was the SSDs, even though all of them came from the same manufacturer. Moving every machine to identical software still produced the same file corruption, so it was now safe to say that the problem was in the drives.

During the analysis, the engineers found that data was always lost in units of 512 bytes, which is the size of one disk block. What can zero out a block? TRIM. To test the theory, TRIM was disabled on all servers. This solved the problem, but only for a while: a month after the problem was identified, one server rebooted and came up with corrupted data.
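The 512-byte pattern suggests a simple heuristic: look for block-aligned runs of zeros. The sketch below, which is not Algolia's tooling, returns the indices of full 512-byte blocks that contain nothing but zero bytes. Legitimate files can also contain zero-filled blocks, so a hit is a hint and not proof of damage.

```python
BLOCK_SIZE = 512  # one disk block, as described in the article


def zero_blocks(data, block_size=BLOCK_SIZE):
    """Return indices of full blocks that contain only zero bytes."""
    zero = bytes(block_size)
    return [
        offset // block_size
        for offset in range(0, len(data) - block_size + 1, block_size)
        if data[offset:offset + block_size] == zero
    ]


# Four blocks: payload, zeros, payload, zeros.
sample = b"\x01" * 512 + bytes(512) + b"\x02" * 512 + bytes(512)
print(zero_blocks(sample))  # [1, 3]
```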

Digging through the kernel source for any code related to TRIM, the engineers came across a TRIM blacklist. It configures special behavior for certain SSD models, identifying each drive by name using regular expressions.
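The idea of such a list can be sketched as a table of name patterns mapped to quirk flags. The example below uses shell-style wildcards from Python's fnmatch module as a simple stand-in for the kernel's matching. The vendor names, patterns, and the flag name are invented for illustration and are not copied from the kernel source.

```python
from fnmatch import fnmatchcase

# Illustrative entries only: hypothetical model patterns and flag name,
# NOT the contents of the real kernel blacklist.
QUIRKS = [
    ("ExampleVendor SSD 8*", "no_queued_trim"),
    ("OtherVendor_X500*", "no_queued_trim"),
]


def quirks_for(model):
    """Return the quirk flags whose pattern matches the drive model name."""
    return [flag for pattern, flag in QUIRKS if fnmatchcase(model, pattern)]


print(quirks_for("ExampleVendor SSD 840 PRO"))  # ['no_queued_trim']
print(quirks_for("ExampleVendor SSD 750"))      # []
```

A drive whose name matches no pattern gets no special handling, which is why a model missing from the list is treated like any other drive.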

The system issued TRIM to erase unused blocks, but the drive interpreted the command incorrectly, and as a result the controller erased blocks that should not have been touched. That is why 512-byte blocks of zeros appeared in some files, and files smaller than that were erased completely. Queued TRIM was not the cause: the company's drives were using ordinary TRIM commands.

As a result, the drive supplier was notified and in turn informed Samsung representatives about the problem. Algolia replaced the drives with other models and does not recommend using any SSD that the Linux kernel flags as bad.

Non-working SSD:


Operating SSDs:


Epilogue: "Samsung's solid-state drives have been vindicated. The problem was in the Linux kernel."

P.S. On our blog on Habr we share not only our own experience of running the 1cloud virtual infrastructure service, but also the experience of Western experts. Don't forget to subscribe to updates, friends!

Source: https://habr.com/ru/post/262257/
