
Server diagnostics at FirstDEDIC

Automatic diagnostics is the first thing we do before selling a dedicated server.

If the server is new, we verify that it works correctly and enter its configuration into the database.

If the server has already been in use, we check the health of its components and update the data in the database. The information on the site must match what we sell. Sometimes disks are replaced for the previous client, but this is not recorded anywhere and the tariff does not change. The next client then risks getting a 240 GB SSD instead of the advertised 4000 GB HDD.

We take these risks into account. Even if nobody updates the information manually, the system does it automatically for every new or released server: the server boots over the network with a Linux kernel and runs a diagnostic program.


What we check


CPU


For the CPU stress test, the mprime-bin program runs for 30 minutes.

    /usr/bin/timeout 30m /opt/mprime -t
    /bin/grep -i error /root/result.txt

Every minute the IPMI sensors are polled for the processor temperature; the allowed value is below 60C. The program looks for CPU architecture errors in /proc/kmsg and in the mprime results.txt file.
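
The temperature rule can be sketched as a small filter. This is only an illustration, not the production check: it assumes the readings arrive as pipe-separated lines (name | value | unit | ...), the layout `ipmitool sensor` prints, and that CPU sensors have "CPU" in their name. The function name `cpu_temp_ok` is made up for this example.

```shell
# Hypothetical helper: reads pipe-separated sensor lines on stdin and
# fails if any CPU temperature is at or above the limit (default 60 C).
cpu_temp_ok() {
    local limit="${1:-60}"
    awk -F'|' -v limit="${limit}" '
        # Only rows whose name mentions CPU and whose unit is degrees C.
        $1 ~ /CPU/ && $3 ~ /degrees C/ {
            if ($2 + 0 >= limit) { bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

It could be called once a minute during the stress test, for example `ipmitool sensor | cpu_temp_ok 60 || echo "CPU too hot"`.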

RAM

Some memory cells may be damaged, so every cell has to be checked. The classic Memtest86+ does not fit: the free version does not save its results, it only shows them on the screen. Therefore we use memtester. We run it from within the OS, so it checks the cells that are not occupied by the OS.

    memtester "$(awk '/MemFree/ {print $2-1024}' /proc/meminfo)k" 5

We look at the exit code: if the memory is working properly, the program returns 0.
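
A minimal sketch of how the size calculation and the exit-code check could be wrapped. The function names `memtest_size_kb` and `run_memtest` are hypothetical, and the 1024 kB reserve and 5 passes are simply the values from the command above.

```shell
# Hypothetical helper: print how many kB memtester should test,
# i.e. MemFree from a meminfo-style file minus a safety reserve.
memtest_size_kb() {
    local meminfo="${1:-/proc/meminfo}" reserve_kb="${2:-1024}"
    awk -v reserve="${reserve_kb}" \
        '$1 == "MemFree:" { print $2 - reserve }' "${meminfo}"
}

# Hypothetical wrapper: run memtester and turn its exit code into a verdict.
run_memtest() {
    local size_kb loops="${1:-5}"
    size_kb="$(memtest_size_kb)"
    if memtester "${size_kb}k" "${loops}" > /dev/null; then
        echo "RAM OK"
    else
        echo "RAM FAILED (memtester exit code $?)"
        return 1
    fi
}
```

Keeping the size calculation in its own function makes it possible to try it on a saved copy of /proc/meminfo without starting a real memory test.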

Storage

The program finds all devices matching /dev/sd? and /dev/cciss/c0d? and checks whether each of them is really a disk.

    hdlist() {
        HDLIST=$(ls /dev/sd?)
        HDLIST="${HDLIST} $(ls /dev/cciss/c0d? 2>/dev/null)"
        REAL_HDLIST=""
        for disk in ${HDLIST}; do
            # A device node counts as a real disk only if it can be opened for reading
            if head -c0 ${disk} 2>/dev/null; then
                REAL_HDLIST="${REAL_HDLIST} ${disk}"
            fi
        done
        echo "${REAL_HDLIST}"
    }

Now all the drives need to be checked.

HDD

First we completely wipe the previous user's data from the hard drive:

    # Quick wipe: new partition table, then zero the first sector
    for DISK in $(hdlist)
    do
        echo "Clearing ${DISK}"
        parted -s ${DISK} mklabel gpt
        dd if=/dev/zero of=${DISK} bs=512 count=1
    done
    # Optional full wipe of every disk
    if [ "${FULL_HDD_CLEAR}" = "YES" ]; then
        echo "Clearing disks full (very slow)"
        wget -O /dev/null -q --no-check-certificate "${STATEURL}&info=slowhddclear"
        for DISK in $(hdlist)
        do
            echo "Clearing ${DISK}"
            dd if=/dev/zero of=${DISK} bs=1M
        done
    fi


The program measures the read speed at three disk offsets: at the beginning, in the middle and at the end, 4 GB at each offset. This is enough to draw a general conclusion. For each offset we use this function:

    # Drop the page cache so that the reads really hit the disk
    sysctl -w vm.drop_caches=3 > /dev/null
    zcav -c 1 -s ${SKIP_COUNT} -r ${OFFSET} -l /tmp/zcav1.log -f ${DISK}
    if [ $? -ne 0 ]; then
        echo err
        exit
    fi
    # Average the speed column of the zcav log, skipping comment lines
    SPEED=$(awk '! /^#/ {speed+=$2; count+=1} END {print int(speed/count)}' /tmp/zcav1.log)
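
The article does not show how the three test windows are chosen, so here is one possible sketch. The function name `zone_offsets_gb` is hypothetical, the units are whole gigabytes, and translating the result into the SKIP_COUNT and OFFSET values passed to zcav is left out.

```shell
# Hypothetical helper: print the start offsets (in GB) of three test
# windows at the beginning, middle and end of a disk of the given size.
zone_offsets_gb() {
    local size_gb="$1" window_gb="${2:-4}"
    if [ "${size_gb}" -lt "${window_gb}" ]; then
        echo "disk is smaller than the test window" >&2
        return 1
    fi
    echo "0 $(( (size_gb - window_gb) / 2 )) $(( size_gb - window_gb ))"
}
```

For a 4000 GB disk, `zone_offsets_gb 4000` prints `0 1998 3996`. The disk size itself could come from `blockdev --getsize64`, divided down to gigabytes.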

SSD

We check the values of the SMART attributes:

Media_Wearout_Indicator is the remaining lifetime (wear) of the disk: 100 for a new disk, the minimum allowed value is 10.

Reallocated_Sector_Count is the number of reallocated sectors; it must be less than 100.
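
A minimal sketch of these two thresholds applied to the attribute table printed by `smartctl -A`. The assumptions: the wear indicator is taken from the normalized VALUE column (column 4), the reallocated sectors from RAW_VALUE (column 10), and the attribute appears under smartctl's short name Reallocated_Sector_Ct. The function name `ssd_smart_ok` is made up.

```shell
# Hypothetical helper: reads `smartctl -A` style attribute rows on stdin
# and fails if the SSD is too worn or has too many reallocated sectors.
ssd_smart_ok() {
    awk '
        # Normalized value: 100 for a new disk, below 10 is not acceptable.
        $2 == "Media_Wearout_Indicator" && $4 + 0 < 10    { bad = 1 }
        # Raw value: number of reallocated sectors, must stay below 100.
        $2 == "Reallocated_Sector_Ct"   && $10 + 0 >= 100 { bad = 1 }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage could look like `smartctl -A /dev/sda | ssd_smart_ok || echo "SSD rejected"`.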

RAID status

We identify the RAID controller model and check the status of the array. If the array is healthy, its status is "Optimal".

    detect_raid_type() {
        RAIDSTR=$(lspci | grep -i raid)
        if echo ${RAIDSTR} | grep -iq adaptec; then
            # This is Adaptec
            echo "adaptec"
        elif echo ${RAIDSTR} | grep -iqE 'lsi|megaraid'; then
            # This is LSI
            echo "lsi"
        elif echo ${RAIDSTR} | grep -iq '3ware'; then
            # This is 3ware
            echo "3ware"
        elif echo ${RAIDSTR} | grep -iqE 'Hewlett-Packard.*Smart'; then
            # This is HP Smart Array
            echo "HP-SmartArray"
        elif dmesg | grep -q cciss/ ; then
            echo cciss
        else
            echo "unknown"
        fi
    }

    raid_status_adaptec() {
        RSTATUS=$(arcconf getconfig 1 ld | awk -F: '/Status of logical device/ {print $2}')
        if ! echo "${RSTATUS}" | grep -q 'Optimal' ; then
            echo "${RSTATUS}"
            return 1
        fi
    }

    raid_status_3ware() {
        echo "We have not support 3ware yet"
        return 0
    }

    raid_status_lsi() {
        RSTATUS=$(megacli -LDInfo -Lall -aALL | awk -F: '$1 ~ /State/ {print $2}')
        if ! echo "${RSTATUS}" | grep -q 'Optimal' ; then
            echo "${RSTATUS}"
            return 1
        fi
    }

    raid_status_unknown() {
        echo "Unknown RAID"
        return 0
    }

    raid_status_cciss() {
        RSTATUS=$(cciss_vol_status /dev/cciss/c*d0)
        if ! echo ${RSTATUS} | grep -q "OK" ; then
            echo "${RSTATUS}"
            return 1
        fi
    }
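
The listing does not show how the detected type is tied to a status function. One possible sketch is a dispatcher that calls `raid_status_<type>` when such a function exists and falls back to `raid_status_unknown` otherwise, for example for "HP-SmartArray", which has no status function in the listing. The name `raid_status` is hypothetical, and the two status functions below are stand-ins so that the sketch runs on its own.

```shell
# Stand-in status checks so the sketch is self-contained; in the real
# script these would be the raid_status_* functions from the listing.
raid_status_lsi()     { return 0; }
raid_status_unknown() { echo "Unknown RAID"; return 0; }

# Hypothetical dispatcher: run raid_status_<type> if it is defined,
# otherwise fall back to raid_status_unknown.
raid_status() {
    local type="$1" fn="raid_status_$1"
    if command -v "${fn}" > /dev/null 2>&1; then
        "${fn}"
    else
        raid_status_unknown
    fi
}
```

In the full script it could be called as `raid_status "$(detect_raid_type)"`.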

Network

We check the download speed over the network: it should be more than 300 Mbps.

    curl -k --progress-bar -w "%{speed_download}" -o /dev/null "${CGI_MGR_URLv4}/speedtest_cgi?id=${AUTH_ID}&func=server.speedtest"
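
curl reports `%{speed_download}` in bytes per second, so the value has to be converted before it is compared with 300 Mbps. A minimal sketch of that comparison follows; the function name `speed_ok` is hypothetical, and 1 Mbps is taken as 1,000,000 bits per second.

```shell
# Hypothetical helper: succeed if a curl speed_download reading
# (bytes per second) is above the required rate in Mbps (default 300).
speed_ok() {
    local bytes_per_sec="$1" min_mbps="${2:-300}"
    awk -v bps="${bytes_per_sec}" -v min="${min_mbps}" \
        'BEGIN { exit !((bps * 8 / 1000000) > min) }'
}
```

For example, `speed_ok 50000000` succeeds (400 Mbps), while `speed_ok 12500000` fails (100 Mbps). awk is used for the arithmetic because some curl versions print the speed with a fractional part.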


Statistics

The diagnostic program checks an average of 323 servers per month; 124 of them fail the test, and we do not sell these servers. The data center engineers first replace the disks or repair the coolers. CPUs and RAM are usually replaced under warranty.

Let's look at the statistics for working HDDs. For the analysis we took 1800 reports for different disks, 103 models in total.
| Attribute name | Min | Mean | Max | Standard deviation | Description |
|---|---|---|---|---|---|
| Temperature_Celsius | 14 | 25.81 | 40 | 4.09 | 25C is a great temperature for a disk |
| Power_On_Hours | 407 | 24033 | 59363 | 12910 | Funny. Some disks have worked for 6 years |
| Reallocated_Sector_Ct | 0 | 92.3496 | 10728 | 496 | 100 is a good threshold |
| Raw_Read_Error_Rate | 0 | 32416965 | 4294967295 | 126899820.1 | All values are large. At the slightest problem a lot of errors are counted |
| SSD Power_On_Hours | 10 | 23159 | 918502 | 134915 | More than two years, not bad |

Excellent numbers. Now let's check how long an HDD works on average. To do this, we compiled statistics on failed disks, identified by their Raw_Read_Error_Rate.

| Attribute name | Min | Mean | Max | Standard deviation | Description |
|---|---|---|---|---|---|
| Power_On_Hours | 0 | 25040 | 57178 | 12030 | An HDD works 33 ± 16 months. The spread is large, so it is hard to draw conclusions |

Statistics are interesting, but they are not the main point. We run diagnostics not for the sake of numbers but for our customers: so that the data center has working servers and the site has up-to-date information. Then each client receives:

Source: https://habr.com/ru/post/322944/

