
Why RAID write speed drops as the SSD fills up, or why we need TRIM

This problem is most relevant for hardware RAID or firmware RAID (such as Intel RST RAID 1/10/5/6) with non-industrial SSDs.

SSD specifics


SSDs write and read data in pages. You can write only to cleared pages, and pages can be cleared only in large blocks. For example, if a disk has a page size of 8 KB and a block contains 128 pages, the block size is 1024 KB (hereinafter, unless otherwise indicated, KB and MB are binary).

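For example, if you change 40 KB in one file, then at the physical level the five affected 8 KB pages are not overwritten in place: the new data is written to five clean pages, and the five old pages are marked as ready to be cleared.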


At the logical level, everything looks as usual: the data is overwritten in place in the corresponding sectors. As soon as a block contains only empty and ready-to-clear pages, it is erased and becomes empty entirely.

To hide the physical implementation, the disk maintains a map between logical and physical page numbers (the Flash Translation Layer).

So far we have seen only one way for a physical page to become ready for clearing: new data is written to its logical address. The disk controller works at the page level and knows nothing about the file system, and the operating system does not tell the disk which sectors of deleted files can be cleared. It is easy to see that sooner or later every page of the disk will be occupied and there will be nowhere left to write data.

To solve this problem, the ATA TRIM command was added. The operating system sends it to the disk, indicating the sectors that can be cleared. The analogs of this command are SCSI UNMAP and CF ERASE. Unfortunately, in some cases it cannot be sent to the disk:
- if the disk is in RAID with a hardware controller ( LSI , Adaptec , etc.),
- if the disks are in firmware-RAID, in particular, Intel RST RAID 1/10/5/6,
- if the disk is connected via USB (protocol limitation),
- if the disk is encrypted in software via TrueCrypt, dm-crypt, GELI, etc. (TRIM may be supported, but it is usually not enabled for security reasons).
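The difference between deleting a file with and without TRIM can be illustrated with a toy model. This is only a sketch: the class `ToyFTL`, its method names and the three page states are invented for illustration and do not describe any real controller firmware.

```python
class ToyFTL:
    """Toy flash translation layer (hypothetical, for illustration only)."""

    def __init__(self, physical_pages):
        self.mapping = {}  # logical page number -> physical page number
        self.state = ["clean"] * physical_pages  # "clean", "valid" or "stale"

    def write(self, logical_page):
        # Flash cannot be overwritten in place: new data goes to a clean page.
        new_phys = self.state.index("clean")  # ValueError if no clean page is left
        old_phys = self.mapping.get(logical_page)
        if old_phys is not None:
            self.state[old_phys] = "stale"  # the old copy may now be cleared
        self.state[new_phys] = "valid"
        self.mapping[logical_page] = new_phys

    def trim(self, logical_page):
        # TRIM tells the disk that a logical page no longer holds useful data.
        old_phys = self.mapping.pop(logical_page, None)
        if old_phys is not None:
            self.state[old_phys] = "stale"

    def count(self, state):
        return self.state.count(state)


if __name__ == "__main__":
    ftl = ToyFTL(physical_pages=8)
    for page in range(6):  # a "file" occupying logical pages 0..5
        ftl.write(page)
    # The file is deleted in the file system, but no TRIM is sent:
    print("without TRIM, valid pages:", ftl.count("valid"))
    for page in range(6):  # the same deletion, this time with TRIM
        ftl.trim(page)
    print("with TRIM, valid pages:", ftl.count("valid"))
```

Without TRIM the model keeps treating all six pages as valid, which is exactly the situation the disk controller is in when the file system deletes a file silently.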

If testing shows that the disk does not receive TRIM commands, very few free pages may remain for writing. But some will remain: each disk contains a reserved area that serves as a pool of free pages and of replacement blocks for those that are completely worn out. To find the size of this area, compare how much physical memory is installed on the disk with the number of LBA sectors stated in the documentation.

For example, the Samsung SSD 840 Pro 512 GB has 512 GB of memory, while 1000215216 LBA sectors are available. The reserve is 512 × 1024 × 1024 × 1024 - 1000215216 × 512 = 35 GB, or 6.85%. The available disk capacity in this case is (512 - 35) × 1024 × 1024 × 1024 ≈ 512 × 10^9 = 512 GB, this time decimal. The Samsung SSD 850 Pro 128 GB has 129 GB onboard, 128 decimal GB are available to the user, and the reserve is 7.6%.
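The same arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function name `spare_area` and the 512-byte sector size are assumptions made for the example, while the capacity and sector count are the 840 Pro figures quoted above.

```python
GIB = 1024 ** 3  # binary gigabyte, as used throughout the article


def spare_area(physical_gib, lba_sectors, sector_bytes=512):
    """Return (spare bytes, spare as a percentage of physical memory)."""
    physical_bytes = physical_gib * GIB
    user_bytes = lba_sectors * sector_bytes
    spare_bytes = physical_bytes - user_bytes
    return spare_bytes, spare_bytes / physical_bytes * 100


if __name__ == "__main__":
    spare_bytes, spare_pct = spare_area(512, 1000215216)
    print(f"{spare_bytes / GIB:.2f} GiB, {spare_pct:.2f}%")  # 35.06 GiB, 6.85%
```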

So, if we fill the entire disk and then delete all the files, then without TRIM support the disk can write only to some part of that 6.85% of its volume. Only a part, because the reserve will partly consist of blocks that are not completely empty due to fragmentation. This area is what allows files on the disk to keep being overwritten at all.

An example of a worse situation: there is nowhere to write, although the volume of pages occupied does not exceed the amount available to the user without a reserve.



In this case, the garbage collector runs alongside the writes: it reads a block into RAM, erases the block on the disk (a long operation: erasing takes 3000 µs compared to 900 µs for writing to a blank page) and writes the block back from RAM. The delay also grows because of write amplification: there are 5-10 physical write operations per logical write operation.
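To get a feel for where such write amplification figures come from, here is a deliberately simplified model. It assumes (this is not from the measurements in this article) that the garbage collector frees one block at a time, copies every still-valid page of that block elsewhere, and that only the freed pages can take new host data. The function name is hypothetical.

```python
def write_amplification(pages_per_block, valid_pages):
    """Physical page writes per logical page write when reclaiming one block.

    Simplified model: reclaiming a block costs `valid_pages` copy writes and
    makes room for `pages_per_block - valid_pages` pages of new host data.
    """
    if not 0 <= valid_pages < pages_per_block:
        raise ValueError("valid_pages must be in [0, pages_per_block)")
    freed_pages = pages_per_block - valid_pages
    # physical writes = copied valid pages + new host pages = pages_per_block
    return pages_per_block / freed_pages


if __name__ == "__main__":
    for valid in (0, 64, 102, 112, 115):
        print(valid, round(write_amplification(128, valid), 2))
```

In this model an amplification of 5-10 corresponds to reclaiming blocks in which about 80 to 90% of the pages are still valid, which is what a nearly full disk without TRIM looks like to the controller.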

Therefore, the more free space the disk has to maneuver, the higher the write speed. In the background, the garbage collector not only cleans and defragments blocks, but also distributes program/erase (P/E) cycles evenly among the blocks so that they wear at the same rate.

There is a popular myth that modern disks have such a good garbage collector that they don’t need TRIM. This is absolutely not true; the garbage collector and the TRIM solve different problems.

Industrial drives often have a spare area of 50% or more, so the absence of TRIM is not critical for them. Other disks most often either do not state their spare area explicitly, or it is insufficient. Tests show that over-provisioning of 25 to 29% of the total physical memory (including the built-in spare area) has a good effect. Therefore, if the disk does not have enough spare area, you need to do the over-provisioning yourself.

There are three ways:
- partition the disk so that part of it remains unallocated after the RAID is created,
- use the ATA command to create a Host Protected Area before creating the RAID,
- configure the RAID controller so that it uses only part of the disk capacity.
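Whichever of these ways is used, the size of the area to leave unused follows from the target percentage. The sketch below uses the 840 Pro figures from above and a 27% target picked from the 25 to 29% range; the function names are made up for the example, and the percentage is taken relative to total physical memory, as in the text.

```python
GIB = 1024 ** 3


def unallocated_for_target(physical_bytes, user_bytes, target_pct):
    """Bytes to leave unallocated so the total spare area reaches target_pct."""
    built_in_spare = physical_bytes - user_bytes
    needed = physical_bytes * target_pct / 100 - built_in_spare
    return max(0.0, needed)  # nothing to add if the built-in spare is enough


def total_spare_pct(physical_bytes, user_bytes, unallocated_bytes):
    """Total spare area (built-in plus unallocated) as % of physical memory."""
    spare = physical_bytes - user_bytes + unallocated_bytes
    return spare / physical_bytes * 100


if __name__ == "__main__":
    physical = 512 * GIB
    user = 1000215216 * 512  # LBA sectors * 512 bytes
    extra = unallocated_for_target(physical, user, target_pct=27)
    print(f"leave about {extra / GIB:.1f} GiB unallocated")
    print(f"check: {total_spare_pct(physical, user, extra):.2f}% total spare")
```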



Before setting aside the free area, you need to let the disk know that this area is not occupied by anything, in one of two ways:
- connect the drive to another controller and send it an ATA TRIM command (for example with O&O Defrag, which has a CLI, with the disk optimizer built into Windows 8, or with Anvil's Storage Utilities),
- completely reset the FTL table by sending the ATA Secure Erase command.

There is also a claim that you can make the disk understand that blocks are unused by writing 0x00 or 0xFF to them (the so-called “Tony TRIM” method). Perhaps it works for some controllers, but my tests showed no change.

In practice


I have two Samsung SSD 840 Pro 512 GB disks in Intel RST RAID 1, on which Windows 8.1 is installed. After a year of use, I measured the performance and was unpleasantly surprised. After checking TRIM, I saw that it was not working.

- Check TRIM under Windows
- Check TRIM for Linux:
# lsblk -D 
The DISC-GRAN and DISC-MAX columns should both be greater than 0 for all participating components.
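The same check can be scripted. The sketch below reads the kernel's `discard_granularity` and `discard_max_bytes` attributes under `/sys/block/<device>/queue/`, which correspond to the DISC-GRAN and DISC-MAX columns; the function name and the `sysfs_root` parameter (there only to make the function testable) are assumptions of this example.

```python
from pathlib import Path


def discard_supported(device, sysfs_root="/sys/block"):
    """Return True if the block device advertises discard (TRIM) support."""
    queue = Path(sysfs_root) / device / "queue"
    try:
        granularity = int((queue / "discard_granularity").read_text())
        max_bytes = int((queue / "discard_max_bytes").read_text())
    except (OSError, ValueError):
        # Missing device or unreadable attribute: treat as "no discard".
        return False
    return granularity > 0 and max_bytes > 0


if __name__ == "__main__":
    # "sda" and "md0" are example device names, not taken from a real system.
    for name in ("sda", "md0"):
        print(name, discard_supported(name))
```

For an array, every layer has to pass the check: the member disks as well as the md or RAID device on top of them.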

Alternative option:
 # dd if=/dev/urandom of=tempfile bs=1M count=3
 # hdparm --fibmap tempfile
 # hdparm --read-sector [ADDRESS] /dev/sda
 # rm tempfile && sync && sleep 120
 # hdparm --read-sector [ADDRESS] /dev/sda
After the file is deleted, the sector should read back as 0x00 or 0xFF, but this is not a reliable method: different disks behave differently.

TRIM in real time is enabled with the “discard” option when mounting a disk:
 # grep -i discard /etc/fstab
 # mount | grep -i discard

TRIM for single-disk file systems and LVM has been supported since Linux kernel 2.6.33. TRIM for mdraid has been supported since Linux kernel 3.7. It can also be backported to older kernel versions; for example, it is supported on CentOS 6.

By default, Ubuntu runs TRIM on a schedule once a week using fstrim, but only for single (not mdraid) disks from the following manufacturers: Intel, Samsung, OCZ, SanDisk and Patriot, and only if hdparm is installed.

- TRIM check in FreeBSD:
ZFS, by default, supports TRIM since FreeBSD 9.2:
 # sysctl -a | grep -i 'zfs.*trim' 

GEOM RAID gmirror has supported TRIM since FreeBSD 9.1:
 # gstat -d
The "d/s" column shows the number of BIO_DELETE operations per second.

Intel RST RAID supports TRIM only for RAID 0, as an exception. To enable it, make sure that the Intel RST driver is version 11 or higher and that the firmware, OROM (Legacy boot) or SataDriver (UEFI boot), is also version 11 or higher, or an older version that has been patched. Intel RSTe supports TRIM for RAID 0/1/10 since version 3.7.0.1093.

I decided to create an unallocated disk partition for over-provisioning.
1. Took a disk image with Acronis Backup. Also saved the partition table (it is important to keep the first sector, last sector, GPT partition type, GPT unique identifier and partition name of each partition).

2. Rebooted into the BIOS and ran SSD Secure Erase. If the BIOS has no such item, the command can be issued with hdparm (builds for Windows also exist), HDDErase or HDAT2.

3. Assembled RAID 1 on two disks.

An important note is needed here: when the array is initialized, the RAID controller reads each sector from one disk and writes it to the second. In theory, this should defeat the whole undertaking, leaving one of the disks without any over-provisioning. But tests have shown that for some reason the method works. I have no explanation for this.

4. Booted from a LiveCD and created the required partition table with GPT fdisk: the last partition is 104 GB smaller than before. Partitions must be aligned to the disk page size, not to the block size.

5. Restored each partition from the backup.
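The alignment requirement from step 4 is easy to verify from the partition table. A minimal sketch, assuming 512-byte logical sectors and the 8 KB page size used as an example earlier in the article (the real page size depends on the drive); the function name is hypothetical.

```python
def is_page_aligned(start_lba, sector_bytes=512, page_bytes=8 * 1024):
    """True if a partition starting at start_lba begins on a page boundary."""
    return (start_lba * sector_bytes) % page_bytes == 0


if __name__ == "__main__":
    # 2048 is the default 1 MiB alignment of GPT fdisk; 34 and 63 are
    # classic unaligned start sectors, used here only as examples.
    for lba in (2048, 34, 63):
        print(lba, is_page_aligned(lba))
```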

After that, I filled the disk completely and ran the tests. This should show the worst case. The Windows cache is turned on, the regular write cache is turned off, Intel RST write-back is turned off, and all tests use a fixed-size disk area of 40 GB. Testing disks is not easy, since the figures may vary over time. The steady-state figures are given below.

I will compare four states:
- One disk without RAID, fully populated, with the standard hidden spare area of 6.85%.
- One disk without RAID, after running TRIM on its free space.
- Two disks in RAID 1, fully populated, with the standard hidden spare area of 6.85%.
- Two disks in RAID 1, fully populated, with over-provisioning of 27.24% (including the hidden spare area).


Latency and standard deviation

Table


Analysis of the results:
- Reading from RAID 1 is faster than from a single disk, even though this is only firmware RAID.
- Writing is faster the more unallocated space there is: TRIM comes first, our home-made over-provisioning second.

The steady state is not always reached quickly. Let's look at the test of the last configuration (over-provisioning of 27.24%) over time and see the worst-case scenario:





A curious process takes place during the first 400 seconds, after which performance increases and stabilizes. I think that the garbage collector is working in parallel with the writes, defragmenting blocks and preparing them for writing. This behavior is not observed every time, only occasionally. It can be seen that sequential writing drops to 70 MB/s and random writing to 18000 IOPS. These figures are still twice as good as without over-provisioning (32 MB/s and 7139 IOPS, respectively). To make sure that the steady state really has such high performance, I also ran the test for 30 minutes, writing 490 GB to the disk at an average of 69721 IOPS.

You can compare these results with those of colleagues and choose the optimal over-provisioning size:

Briefly


- If the disk receives ATA TRIM from the OS, there is nothing to worry about: it is enough to leave part of the disk space free.
- If expensive industrial disks are used, check the size of the built-in spare area; if it is sufficient, there will be no problems with writing.
- In other cases, you need to leave an unallocated area: the larger it is, the smaller the standard deviation of the write latency.
- Sometimes the garbage collector does not manage to prepare clean blocks in time, and the write speed may sag and become intermittent.
- After over-provisioning, the steady-state maximum write speed increased from 7000 to 68000 IOPS, and the average minimum from 6000 to 19000 IOPS.

Source: https://habr.com/ru/post/242199/

