Epigraph
“Never trust a computer that you cannot throw out of the window.”
Steve Wozniak
Two months ago I put an SSD drive into my laptop. It worked great, but last week it died suddenly, due (I suppose) to worn-out cells. This article is about how that happened and what I did wrong.
Description of the environment
- User: web developer. That is, day-to-day work means things like virtual machines, Eclipse, and frequent repository updates.
- OS: Gentoo. That is, the "world" gets rebuilt often.
- FS: ext4. That is, a journal is constantly being written.
So, the story begins in April, when I finally got around to copying my partitions onto the 64GB SSD drive I had bought back in September. I am deliberately not naming the manufacturer and model, since I have not yet figured out exactly what happened, and it does not really matter anyway.
The performance gain was, of course, enormous: everything started loading twice as fast, but most importantly, that parasitic parameter, access time, all but disappeared. As a result you can "rebuild the world" in the background and launch three or four applications that actively hit the disk, and all of this has virtually no effect on your work. No processor upgrade will give you that.
What I did to make it last longer
Of course, I had studied the numerous publications on how to take care of SSD drives. Here is what I did:
- Mounted the partitions with noatime, so that reading a file does not update its last-access timestamp (sketched just below).
- Maxed out the RAM and disabled swap.
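For reference, this is roughly what those two steps look like; a minimal sketch, where the device names and mount points are assumptions rather than my actual layout:

    # /etc/fstab: mount the SSD partitions with noatime
    /dev/sda1   /       ext4    noatime    0 1
    /dev/sda2   /home   ext4    noatime    0 2

    # turn off any active swap (and remove the swap line from /etc/fstab)
    swapoff -a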
I did nothing else, because I believe the computer should serve the user and not the other way around, and excessive hoop-jumping is wrong.
SMART
Three days before the crash I started wondering: how do I find out how much of this happiness I have left? I tried the smartmontools utilities, but they displayed incorrect information. I had to download the datasheet and write a patch for them.
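If you want to check your own drive, the standard way to dump the SMART attributes looks roughly like this (the device name is an assumption; in my case the attributes were decoded incorrectly until I patched the drive database):

    # print the drive's SMART attributes, including vendor-specific ones
    smartctl -A /dev/sda

    # full identification, health status and attributes in one go
    smartctl -a /dev/sda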
Having written the patch, I dug up one interesting parameter: average_number_of_erases / maximal_number_of_erases = 35000/45000. But having read that MLC cells can only withstand about 10,000 cycles, I decided these parameters did not mean what I thought they did, and ignored them.
Chronicle of the failure
Suddenly, in the middle of work, inexplicable things started happening; for example, new programs would not launch. Out of curiosity I looked at that same SMART parameter: it was already at 37000/50000 (+2000/+5000 in three days). A reboot no longer worked: the file system on the main partition could not be read.
I booted from a live CD and ran a file-system check. The check found a lot of broken inodes. During the repair the utility started testing for bad sectors and marking them. It all ended the next day with the following result: 60GB out of 64GB marked as bad.
A note: in SSD drives a cell is considered bad when new data can no longer be written to it; reading from such a cell is still possible. That is why running the badblocks utility in read-only mode is unlikely to find anything.
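In other words, to detect worn-out cells badblocks has to actually write to the disk; a read-only pass will most likely report the drive as clean. A minimal sketch (the device name is an assumption, and the -n mode writes to the disk, so do not run it on a mounted file system):

    # read-only scan: on a worn-out SSD this will probably find nothing
    badblocks -sv /dev/sda

    # non-destructive read-write test: actually exercises writes,
    # so it has a chance of catching cells that can no longer be written
    badblocks -nsv /dev/sda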
I decided to run the manufacturer's firmware utility, because it not only reflashes the drive but also reformats it. The utility started formatting, grumbled a bit, and reported that the allowable number of bad sectors had been exceeded and that there were failures, so formatting could not be completed.
After that, the disk identified itself with a very strange name and model number and a size of 4GB. Since then, nothing but specialized utilities can see it at all.
I wrote to the manufacturer's support. They recommended reflashing it and, if that does not work, returning it to the seller. There are 2 more years of warranty left, so I will give it a try.
I conclude this section with thanks to Steve Wozniak, who taught me to make regular backups.
What happened
Honestly, I do not know myself. My assumption is this: SMART did not lie and the cells really did wear out (indirectly confirmed by the backup I made two days before the crash: when unpacking it, the creation dates of some files turned out to be reset to zero). And during the bad-sector check the disk controller simply allowed every cell that had exceeded its permissible number of write cycles to be marked as bad.
What to do if you have an SSD
Windows
Install Windows 7 on it; everything there is optimized for such disks. Also install plenty of RAM.
Mac OS X
Most likely, only the machines that ship with an SSD from the factory are optimized for it.
FreeBSD
Install 9.0. Read the Linux tips and think about which of them you can apply.
Linux
- Install kernel 2.6.33, which has an optimization for such disks in the form of the TRIM command (a sample setup is sketched after this list).
- Add enough memory that you can safely disable swap.
- Mount the partitions with noatime.
- Use a file system built on the copy-on-write principle, or a non-journaling one (for example, ext2).
Copy-on-write file systems are still quite hard to use at the moment: ZFS only works through FUSE so far, and nilfs and btrfs warn at mount time that their on-disk format has not been finalized yet.
- Enable the noop I/O scheduler; it avoids work that is pointless for an SSD.
- Conceptually right, although it will not help the disk much: move temporary files to tmpfs.
- For systems that write logs intensively, the logs should be stored somewhere else. This mostly applies to servers, where a dedicated log server is easy to set up.
- Get SMART utilities that correctly display the status of your SSD, so that you can monitor it periodically.
- Just go easy on the drive. For Gentoo users this additionally means not rebuilding the "world" all the time.
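To tie several of these tips together, here is a rough sketch of what it looks like in practice. The device names and sizes are assumptions; the discard mount option needs kernel 2.6.33 or newer with ext4:

    # /etc/fstab excerpt
    # noatime - do not update last-access timestamps
    # discard - send TRIM to the SSD when files are deleted (kernel 2.6.33+, ext4)
    /dev/sda1   /      ext4    noatime,discard   0 1
    # keep temporary files in RAM rather than on the SSD
    tmpfs       /tmp   tmpfs   size=1G           0 0

    # switch the SSD to the noop I/O scheduler (not persistent across reboots;
    # put it in a boot script or pass elevator=noop on the kernel command line)
    echo noop > /sys/block/sda/queue/scheduler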
Questions to the community
- Is it really possible to kill MLC cells in 2 months? Of course, I understand I did not exactly go easy on the drive, but I did nothing out of the ordinary either; I just worked as usual.
- Is this a warranty case?
UPD: The drive was a Transcend TS64GSSD25S-M.
UPD2: In the comments there are very good reviews of Intel and Samsung SSDs. People also wonder how one can kill an SSD drive so quickly. Believe me, I was just as puzzled. Nevertheless, it seems this hastily designed SSD series really can be killed that fast.
UPD3: In the comments and in the next article people suggest that my disk uses a JMicron controller, meaning there is no cache and "every time 4KB of data had to be changed at a random location, a whole 64-512KB block had to be erased". I can add that I saw my drive on sale in Germany in March, so everyone has a chance to run into this.
P.S. In the meantime I have put the old hard drive back in and am looking toward a Hitachi SSD or an Intel X25-M.
UPD4: The manufacturer acknowledged the problem with the controller and refunded the money.
UPD5: Switched to an Intel X25-M 80G, happy as a clam.