
IBM and a hectic weekend

I want to share a story about the failure of an IBM DS3000 disk system, connected to the server through two optical controllers, and an IBM EXP3000 expansion enclosure connected to the DS3000 by SAS cables.
It all began on Friday evening, when a burned-out power strip in one of the racks knocked as many as 5 drives out of the RAID 10 array on the EXP3000. At that point I mentally said goodbye to the stored data and hello to a working weekend.
Pulling and reseating the drives did not help. So I started digging into the problem and working through the options:
- Reset the controllers with the sysWipe command. After the first attempt, the optical link on the 1st controller did not come up, and the 2nd controller was no longer reachable through its management port. After the second attempt, the optical links came up, but both controllers dropped off their management ports.
- Pulled the batteries out of the controllers. The previous problems remained, and new ones appeared: the controllers stopped working together, with an error hanging on the second one after initialization, while one at a time they worked wonderfully (well, not wonderfully, but at least they reported no errors).
- Just in case, updated the server BIOS, the network card firmware, and the IBM software.
- Tried starting the enclosures without any drives. It helped! 8) Hallelujah! The enclosures booted and became reachable through the management ports, but as soon as even one hard drive was inserted, all the errors came back at once. Still, I was glad to have localized the problem!
By then it was already 8 pm on Sunday. I decided to give up on it for the moment and restore the database onto a test machine, so that on Monday the users could work, at least in emergency mode.
On Monday I discussed this with IBM support. They gave me the one key piece of advice that solved the trouble: insert one new drive, never used anywhere before, into the empty enclosure. The point is that an error had been recorded on the old drives, and it prevented the enclosure from booting properly.

In principle, it broke, so it broke; anything can happen. BUT! I don't understand how an enterprise-grade disk subsystem can go down completely because of a power failure, dragging the second enclosure down with it (after all, the DS3000 itself kept working until I reset it), and then, even after a full reset, keep crashing because of some error stored on the drives...


Source: https://habr.com/ru/post/75853/

