
Automating the integrity check of a RAID array on a Dell server

Hi, Habr reader!



A few months ago we ran into problems with a single virtual machine running on a Dell PowerEdge R720 server with ESXi 5.5. Rebooting this VM took a very long time and caused a severe drop in performance on the host itself.

The lifecycle log on the server was filled with messages like:

PDR47

A block on Disk 0 in Backplane 1 of Integrated RAID Controller 1 was punctured by the controller.

PDR64

An unrecoverable disk media error occurred on Disk 0 in Backplane 1 of Integrated RAID Controller 1.



Googling led to a disappointing conclusion: the RAID array was damaged and could not be recovered. Specifically, the data belonging to one block (strip) had been corrupted on several disks at once (a double fault).



Fortunately, Dell RAID controllers can keep working despite the inconsistent state of the array by using a puncture ( https://www.dell.com/support/Article/us/en/04/438291/EN#Unique-Hyphenated-Issue-Here-2 ), which lets you save at least the part of the data that is not damaged. This, of course, does not eliminate the need to replace the disks afterwards and rebuild the RAID array from scratch.

To prevent such situations, Dell recommends running an array integrity check at least once a month. Alas, we found out about it too late.


You can run this check either from the Dell OpenManage Server Administrator (OMSA) web interface ( http://www.dell.com/support/contents/us/en/19/article/Product-Support/Self-support-Knowledgebase/enterprise-resource-center/Enterprise-Tools/OMSA/ ) or with the omconfig / omreport utilities included in OMSA. If the developers at Dell had not “forgotten” to include these utilities in OpenManage for ESXi, automation would be no problem, since manually checking the integrity of the array on every server is clearly not the IT way. Not to mention that the OMSA interface is very slow and working with it is a dubious pleasure.

The guys from Dell “did their best” here as well: the check cannot be automated in a simple way (for example, by opening a pre-prepared link with cURL), because the web interface is generated dynamically and has no permanent links.



What to do?



I had to tinker a bit and write the verification utility myself. Meet the Consistency Check Task Automation Tool for Dell Servers with iDRAC (https://github.com/jazzl0ver/dell_raid_cc). The utility is written with the CasperJS framework, which lets you automate work with exactly this kind of dynamic site.



To use dell_raid_cc you need:

1. A server with OMSA installed (see the link above)

2. Download and install phantomjs (http://phantomjs.org/download.html)

3. Download and install casperjs (http://docs.casperjs.org/en/latest/installation.html)

4. Clone the utility from git:

git clone https://github.com/jazzl0ver/dell_raid_cc

5. Create a file with access parameters (for example, creds.txt):

export OMSAHOST=192.168.1.191

export OMSAPORT=1311

export USERNAME=root

export PASSWORD=password

export DELLHOST=192.168.1.30


6. Load the file into the shell, then run the utility or put its launch in crontab:

source creds.txt

casperjs --ignore-ssl-errors=true --cookies-file=/tmp/dell_raid_cc_cookie.jar dell_raid_cc.js
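
For a crontab launch, steps 5 and 6 can be wrapped in a small script. The following is only a sketch under assumptions: the function names check_creds and run_cc, the file locations in the commented call, and the idea of validating the variables before the launch are illustrative and not part of dell_raid_cc itself. The variable names and the casperjs command line are taken from the steps above.

```shell
#!/bin/sh
# Hypothetical cron wrapper around dell_raid_cc (a sketch, not part of the tool).

# Return 0 only if every variable expected from creds.txt is set and non-empty.
check_creds() {
    for name in OMSAHOST OMSAPORT USERNAME PASSWORD DELLHOST; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing variable: $name" >&2
            return 1
        fi
    done
    return 0
}

# Load the credentials file, validate it, then launch the check
# from the directory that holds dell_raid_cc.js.
run_cc() {
    creds_file=$1   # path to the file created in step 5 (absolute path advised)
    repo_dir=$2     # assumed location of the git clone from step 4
    [ -r "$creds_file" ] || { echo "cannot read $creds_file" >&2; return 1; }
    . "$creds_file"
    check_creds || return 1
    ( cd "$repo_dir" && casperjs --ignore-ssl-errors=true \
        --cookies-file=/tmp/dell_raid_cc_cookie.jar dell_raid_cc.js )
}

# Example call with assumed paths; uncomment and adjust for a real setup:
# run_cc /root/creds.txt /opt/dell_raid_cc
```

With the last line uncommented and the script saved as, say, /opt/dell_raid_cc/run_cc.sh (a hypothetical path), a crontab entry such as "0 3 1 * * /opt/dell_raid_cc/run_cc.sh" would start the check at 03:00 on the first day of each month, which matches the monthly interval Dell recommends.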



If everything is in order, the output will be something like this:

Found: Virtual Disk 0 [state: Ready; layout: RAID-10; size: 1,862.00GB]

CC for Virtual Disk 0 has been started

Found: Virtual Disk 1 [state: Ready; layout: RAID-1; size: 931.00GB]

CC for Virtual Disk 1 has been started



If you run it again, you can see the scan progress, for example:

Found: Virtual Disk 0 [state: Resynching; layout: RAID-6; size: 5,026.50GB]

CC for Virtual Disk 0 is still running, progress: 19% complete
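
If the output is collected by a script (for example, to send a notification), the state and the progress can be pulled out of these lines with sed. This is a minimal sketch that assumes the output format is exactly as shown in the samples above; the function names cc_state and cc_progress are made up for illustration.

```shell
# Print the state field from a "Found: ..." line; print nothing for other lines.
cc_state() {
    printf '%s\n' "$1" | sed -n 's/.*\[state: \([^;]*\);.*/\1/p'
}

# Print the percentage from a "still running" line; print nothing for other lines.
cc_progress() {
    printf '%s\n' "$1" | sed -n 's/.*progress: \([0-9][0-9]*\)% complete.*/\1/p'
}
```

Both functions take one output line as their argument, so they can be called in a loop over the lines printed by the utility.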



Note that the utility does not support multi-controller systems (I simply do not have any, so I have nothing to test on).



I hope the utility proves useful to others as well.



UPD. As colleagues suggested in the comments, the more correct approach is to schedule the integrity check with the MegaCli utility. For example:

./MegaCli -AdpCcSched -SetStartTime 20140822 04 -aALL
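
When the same schedule has to be set on several servers, the command line can be generated instead of typed by hand. The sketch below assumes that the two values after -SetStartTime are a date in yyyymmdd form and an hour in hh form, which is how the example above reads; the function name cc_sched_cmd is made up for illustration, and the function only prints the command rather than running it.

```shell
# Print the MegaCli scheduling command for a given start date and hour.
# Assumption: -SetStartTime takes "yyyymmdd hh", as in the example above.
cc_sched_cmd() {
    day=$1    # start date, eight digits
    hour=$2   # start hour, 00 to 23
    case $day in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "bad date: $day" >&2; return 1 ;;
    esac
    case $hour in
        [01][0-9]|2[0-3]) ;;
        *) echo "bad hour: $hour" >&2; return 1 ;;
    esac
    printf './MegaCli -AdpCcSched -SetStartTime %s %s -aALL\n' "$day" "$hour"
}
```

Calling cc_sched_cmd 20140822 04 prints the same command as in the example, and malformed arguments are rejected before anything reaches the controller.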


Instructions for installing on a server with CentOS / RedHat - here

CC schedule setting - here



Installing it under ESXi is easy as well. You can install the VIB directly, or build a bundle and deploy it as an update via vCenter.



UPD #2. PERC 5 controllers do not support scheduling via MegaCli:

cd /opt/lsi/MegaCLI; ./MegaCli -AdpCcSched -Info -aALL



Adapter 0: Scheduled Chceck Consistency is not supported.



Exit Code: 0x01


For these controllers, dell_raid_cc is the only way to automate the check.
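
On a mixed fleet, a script can decide between the two approaches by looking at the output of the -Info query. This is a rough sketch: the function name sched_unsupported is made up, and the match relies only on the phrase "is not supported" from the message quoted above, so it does not depend on the exact spelling of the rest of the line.

```shell
# Return 0 if the given "-AdpCcSched -Info" output says scheduling is unavailable.
sched_unsupported() {
    printf '%s\n' "$1" | grep -q 'is not supported'
}

# Example use (commented out, since it needs MegaCli on the host):
# info=$(./MegaCli -AdpCcSched -Info -aALL)
# if sched_unsupported "$info"; then
#     echo "no scheduler on this controller, use dell_raid_cc instead"
# fi
```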

Source: https://habr.com/ru/post/279613/


