
Why backup? We have RAID, after all





It is customary to write success stories in corporate blogs: they have a positive effect on the company's image. Unfortunately, in an engineer's work not everything ends with a happy ending.

I have to admit that my colleagues have already started joking that I "attract" problems: one way or another, I have been involved in almost every problem case lately. So now I want to tell one instructive story from my practice.



The story began when I was asked to analyze the performance of a disk array whose slowdowns had paralyzed the work of an entire branch office. The initial situation was as follows:





According to the array's performance logs, the slowdowns had not been caused by increased load. I suspected the problem came from a failed controller on the virtualized HP EVA storage system. Usually performance problems are solved remotely, but in this case it was decided to send an engineer to the site (at that point no one suspected the trip would stretch out to two weeks).


And then, during the performance analysis, a poltergeist appeared: in the vSphere interface, the volumes from the array periodically showed the wrong size (anywhere from negative values to dozens of petabytes), which the customer took to be a problem with the array. At the same time, access to the consoles of some virtual machines was lost, and other troubles kept arising. Even I was starting to get nervous, to say nothing of the customer.



And this is where the real fireworks of problems began.



We found an ESXi bug that can cause incorrect volume sizes to be displayed. But it turned out there was no official VMware support contract: support was provided by a third-party company, and only on weekdays, while all of this was happening on a Saturday.







To top it all off, the firmware of two of the three servers and of the switches in the blade chassis lagged behind the firmware of the chassis management module, which can also lead to the most unexpected problems. And the cherry on the cake: the SAN switches were running different firmware versions, all of them a major version behind (6.x.x, while 8.0.x was already available).



Finally, it turned out that the MS SQL Server Express database had run out of free space, which was the cause of both the poltergeist with VM console availability in vSphere and the incorrectly displayed volume sizes. So while the administrators were solving the database problems, we tried to deal with the storage.
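Looking ahead to the conclusions: a check for this condition is trivial to automate. Below is a minimal sketch, assuming Python with the pyodbc package and an ODBC driver for SQL Server; the server name, database name, credentials, and the 10 GB Express data-file cap used as a threshold are illustrative assumptions, not details from this incident.

```python
# Minimal sketch: warn when a SQL Server (Express) database approaches its size cap.
# Assumes pyodbc and an ODBC driver for SQL Server are installed;
# connection details and the 10 GB threshold are illustrative.
import pyodbc

EXPRESS_CAP_MB = 10 * 1024   # recent Express editions cap each data file set at 10 GB
WARN_RATIO = 0.9             # warn at 90% of the cap

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=vcenter-db-host;DATABASE=VIM_VCDB;UID=monitor;PWD=secret"
)
cursor = conn.cursor()

# size and FILEPROPERTY(..., 'SpaceUsed') are reported in 8 KB pages
cursor.execute("""
    SELECT name,
           type_desc,
           size * 8 / 1024                             AS allocated_mb,
           FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024  AS used_mb
    FROM sys.database_files
""")

for name, type_desc, allocated_mb, used_mb in cursor.fetchall():
    print(f"{name} ({type_desc}): {used_mb} MB used of {allocated_mb} MB allocated")
    # Only data files (type_desc = 'ROWS') count toward the Express cap
    if type_desc == "ROWS" and used_mb and used_mb > EXPRESS_CAP_MB * WARN_RATIO:
        print(f"  WARNING: {name} is above {int(WARN_RATIO * 100)}% of the Express size cap")

conn.close()
```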



After some of our actions, the main volume suddenly went offline.



We remembered a bug in storage firmware versions 7.3, 7.4 and 7.5 that can cause corrupted blocks to appear on compressed volumes after a certain number of accesses (in this situation neither RAID redundancy nor mirroring of volumes to the neighboring array helps, because the error occurs at a higher level).



And here the most interesting detail came to light: it turned out that the customer's backup system had not been working for three months. That is, backups existed, but they were stale, and recovering from them would have been tantamount to losing the data.



We managed to bring the volume back online (via the array CLI), but at the host's very first attempt to write something, it went down again. We unmounted all the datastores on the servers and spent the next day in the office, barely breathing, copying the virtual machines to wherever we could: local server disks, USB drives, and PCs.



As a result, we managed to save all the data except for the VM on which we had started snapshot consolidation: the LUN went offline in the middle of the consolidation, and instead of the VM's data there was mush. By Murphy's law, this turned out to be the electronic document management VM. In addition, to eliminate various risks, we had to upgrade almost the entire infrastructure: VMware, Brocade, HP Blade, and so on.



Causes of disaster



What conclusions can the dear reader draw from this story in order not to end up in a similar situation?



  1. The storage system was designed incorrectly. A single ~12 TB volume will not work well on any classic storage system. Always split the total capacity into volumes on the order of 1-2 TB. Yes, you lose some usable capacity, but the chances of someone opening a ticket along the lines of "everything is slow here" drop considerably. UPDATE: on many block storage systems, performance problems arise when many hosts access a single LUN intensively (virtualization clusters, for example). Each host typically has a limited queue of outstanding commands per LUN, so in such situations it is better to split the capacity into smaller volumes so as not to run into the per-LUN command queue (a rough illustration follows this list).
  2. The firmware had never been updated. This is not the only story in which a bug in old firmware led to downtime or data loss. Yes, new firmware has bugs too, but nobody forces you to run the bleeding edge: use stable, recommended versions.
  3. Backups. The requests and recommendations to make and verify backups are beyond counting. I do not want to repeat myself, but ALWAYS MAKE BACKUPS AND CHECK THEM REGULARLY (a minimal freshness check is also sketched after this list). In this story, the downtime could have been cut at least in half if the backup system had been kept in working order.







  4. There was no vendor support for the equipment. We have excellent specialists with deep knowledge of the hardware, but there are situations in which only the vendor can help.
  5. Free space in the database was not monitored. Keep an eye on free space not only on disks, but also in databases (a sketch of such a check appears earlier in this post).
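To make point 1 about the per-LUN command queue concrete, here is a rough back-of-the-envelope illustration, not a sizing tool. The three hosts match this story, but the per-host, per-LUN queue depth of 32 is only a typical default and an assumption; real values depend on the HBA driver and array settings.

```python
# Rough illustration: aggregate outstanding-command capacity a small cluster
# gets from one big LUN vs. several smaller LUNs. Values are assumptions.
HOSTS = 3
PER_HOST_LUN_QUEUE_DEPTH = 32  # typical default, varies by HBA driver/settings

def cluster_queue_slots(num_luns: int) -> int:
    """Maximum commands the cluster can have in flight across these LUNs."""
    return HOSTS * PER_HOST_LUN_QUEUE_DEPTH * num_luns

print("1 x 12 TB LUN  :", cluster_queue_slots(1), "outstanding commands max")
print("8 x 1.5 TB LUNs:", cluster_queue_slots(8), "outstanding commands max")
```

The same total capacity spread over more LUNs simply gives the hosts more queue slots to work with, which is why a single huge datastore is the first thing to hit the wall under a virtualization cluster.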
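And for point 3, here is a minimal backup freshness check using only the Python standard library. The backup directory path and the 24-hour threshold are assumptions, and a real check must also verify that backups actually restore, not merely that they are recent.

```python
# Minimal backup freshness check: alert if the newest file in the backup
# directory is older than a threshold. Path and threshold are illustrative.
import sys
import time
from pathlib import Path

BACKUP_DIR = Path("/mnt/backups/vm")   # assumed backup target
MAX_AGE_HOURS = 24                     # assumed acceptable backup age

files = [p for p in BACKUP_DIR.rglob("*") if p.is_file()]
if not files:
    print(f"CRITICAL: no backup files found in {BACKUP_DIR}")
    sys.exit(2)

newest = max(files, key=lambda p: p.stat().st_mtime)
age_hours = (time.time() - newest.stat().st_mtime) / 3600

if age_hours > MAX_AGE_HOURS:
    print(f"CRITICAL: newest backup {newest} is {age_hours:.1f} h old")
    sys.exit(2)

print(f"OK: newest backup {newest} is {age_hours:.1f} h old")
```

Wired into cron or a monitoring system, even a check this crude would have flagged three months of silence from the backup system long before the LUN went offline.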


Thank you for your attention, and may everything run for you without failures.



Alexey Trifonov (Tomatos), Engineer, Jet Infosystems Service Center.

Source: https://habr.com/ru/post/335618/


