
A couple of stories about RAID lawlessness



Our Friday series about failures, outages, and other screw-ups continues. If you missed the previous stories, here are the links: one, two, three. Today we'll tell you about a RAID incident at one "small but proud" data center.

A story of data inconsistency


The story began when one of the disks in a RAID 5 group failed. Well, it happens - a routine thing. The array started rebuilding, and then a second disk failed. An all-too-common problem with RAID 5. As a result, the entire disk pool that contained this mdisk went offline. For those who don't know: an mdisk is a small RAID group, and a disk pool is made up of a bunch of mdisks. We decided to switch over to the backup data center. Everything went smoothly: the servers came up normally, no errors. Everything seemed to work. While the data was being served from the backup data center, we rebuilt the failed mdisk at the main site. No errors there either: the array shows green, and data replicates between the arrays at the primary and backup sites.

We switch back and find that some systems start up fine, while others - the database servers, for example - start with errors. They report data inconsistency.
We run integrity checks and find a bunch of errors. Strange, because the data in the backup data center was fine, yet after replicating back to the main data center it turned out corrupted. Baffling. At first we suspected the array, but the logs were clean - not a single error.

Then we started to suspect the replication procedure, because the firmware versions on the arrays in the primary and backup data centers differed by two or three minor releases. The vendor's documentation does mention replication errors between mismatched firmware versions. So we used a proprietary vendor utility to check replication consistency: did every frame that left one array actually reach the other? Everything checked out, yet as soon as we transferred data from the backup data center back to the main one, we got inconsistency - some blocks arrived broken.

Then we tried simply copying the data over the network: dump the database and load it onto the array. The dump arrived with different hash sums. Apparently data was getting lost somewhere along the way. We tried sending it over different networks - same result. So the problem was in the array after all, even though it cheerfully reported that everything was fine.
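The post doesn't show how the hash sums were compared, so here is a minimal sketch of that kind of check in Python; the file paths are hypothetical, and it assumes the dump is an ordinary file copied between sites.

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 of a file in 1 MiB chunks to avoid loading it whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: the dump as written at the source site and the copy
# that landed on the array at the destination site.
src = sha256_of("/backup_dc/db_dump.sql")
dst = sha256_of("/main_dc/db_dump.sql")

if src != dst:
    print("Dump corrupted in transit or on write:", src, "!=", dst)
else:
    print("Checksums match:", src)
```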

We decided to rebuild the disk pool. The data still running at the main site was moved back to the backup data center, the disk array was freed up, and the entire pool was formatted and rebuilt. Only after that did the data inconsistency disappear.

The cause was a failure in the logic of the RAID controller. One of the mdisks - that is, one RAID group - was not working correctly after the rebuild. The array considered it healthy, but it was not. Whenever blocks landed on that mdisk, it wrote them incorrectly. File servers, for example, don't write much, so inconsistency never showed up on them. A database, on the other hand, changes constantly: blocks are written frequently and all over the disk pool. That is why we hit the inconsistency problem on the database servers.
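To make the "write often and you'll hit the bad mdisk sooner" point concrete, here is a hedged sketch of a write-and-read-back check that can expose this kind of silent write corruption. The file path, block size, and block count are made up for illustration, and a real check would bypass the page cache (O_DIRECT or a cache drop) so reads actually come from the array.

```python
import hashlib
import os

BLOCK_SIZE = 4096                     # size of each test block
BLOCK_COUNT = 1024                    # how many blocks to spread across the volume
TEST_FILE = "/mnt/pool/verify.bin"    # hypothetical file on the suspect disk pool

# Write random blocks and remember their checksums.
expected = []
with open(TEST_FILE, "wb") as f:
    for _ in range(BLOCK_COUNT):
        block = os.urandom(BLOCK_SIZE)
        expected.append(hashlib.sha256(block).hexdigest())
        f.write(block)
    f.flush()
    os.fsync(f.fileno())              # push the data down to the array

# Read everything back and compare. A healthy pool returns identical data;
# a broken RAID group like the one in this story would not.
bad = 0
with open(TEST_FILE, "rb") as f:
    for i in range(BLOCK_COUNT):
        block = f.read(BLOCK_SIZE)
        if hashlib.sha256(block).hexdigest() != expected[i]:
            bad += 1

print(f"{bad} of {BLOCK_COUNT} blocks came back corrupted")
```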

From the moment the disk pool failed to the moment the system was back up, about four hours passed, and all that time one of the customer's critical business systems was down. The financial losses were not that large; mostly it was a blow to the reputation.

In our practice this was the only failure of its kind. Although with RAID 5 it is not uncommon that when one disk dies, the load on the remaining disks grows, and another disk dies because of it. So here is a tip: update firmware promptly and avoid version mismatches.

An inexplicable connection


Another remarkable story - the first of its kind for us. The cast: the same arrays and platforms, again a main and a backup data center. In the backup data center the array glitches, hangs, and one of its controllers won't come back up. We reboot it - the controller comes up, the array works again. But at that very moment, I/O drops out on the physical Linux servers at the main site.

The surprising part is that the controller in the backup data center and the I/O in the main one are completely isolated from each other. The only thing connecting them is the FC routers at the perimeter, which simply wrap FC traffic in IP and shuttle it between the sites to replicate data between the disk arrays. In other words, the controller and the I/O share nothing at all - different SANs, different physical platforms, different physical servers.

It turned out that when the controller came back up at the backup site, it considered itself a SCSI initiator instead of a SCSI target. Our replication runs from the main data center to the backup one, so in theory the controller should be a SCSI target and all frames should come to it. Instead, it decided that a more active role in life suited it better and started trying to send data on its own.

At that point the multipathing driver on the Linux servers running Red Hat 7 misbehaved. It took those commands at face value. Even though it was the initiator itself, it saw another initiator and, just in case, shut down all the paths to the disks. And since those were boot disks, they simply dropped off. For literally four minutes. Then they came back, but the customer suffered a brief dip in business transactions. In other words, for four minutes the retailer could not sell its products anywhere in the country. With two thousand retail outlets, every minute of downtime translates into a decent amount of money, not to mention the reputational damage.
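We can't reproduce the driver's behaviour here, but as a rough illustration of the monitoring that would flag such a path loss immediately, here is a sketch that counts path states from `multipath -ll`. It assumes the usual device-mapper-multipath output, where each path line carries a status word such as "active" or "failed"; treat the parsing as an assumption, not a spec.

```python
import subprocess

def path_status_counts() -> dict:
    """Count path states reported by `multipath -ll` (needs root privileges)."""
    out = subprocess.run(
        ["multipath", "-ll"], capture_output=True, text=True, check=True
    ).stdout
    counts = {"active": 0, "failed": 0}
    for line in out.splitlines():
        # Path lines in dm-multipath output carry a status word such as
        # "active ready running" or "failed faulty offline".
        if " active " in line:
            counts["active"] += 1
        elif " failed " in line:
            counts["failed"] += 1
    return counts

if __name__ == "__main__":
    counts = path_status_counts()
    print(counts)
    if counts["active"] == 0:
        print("ALERT: no active paths left to any multipath device")
```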

Perhaps the cause of the incident was a combination of bugs. Or maybe the two technologies simply don't get along: the disk array and the multipath driver in Red Hat, which behaved strangely. After all, even if another SCSI initiator shows up, the driver should simply note that there is now another initiator and keep working, not shoot down the disks. That's not how it should behave.

That's all. Fewer bugs to you!

Alexander Marchuk, Chief Engineer of the Jet Info Systems Service Center

Source: https://habr.com/ru/post/351826/

