Server diagnostics at FirstDEDIC

Automatic diagnostics is the first thing we do before selling a dedicated server.

If the server is new, we check that it works correctly and enter its configuration into the database.

If the server has already been in use, we check the health of its components and update the data in the database. The information on the site must match what we sell. Sometimes disks were replaced for the previous client, but the change was not recorded anywhere and the tariff stayed the same. The next client then risks getting a 240 GB SSD instead of the advertised 4000 GB HDD.
We take these risks into account. If nobody updates the information manually, the system does it automatically for each new or released server. The server boots over the network with a Linux kernel and runs a diagnostic program that:

collects data about new disks and writes it to the database, from where it is published on the site;
detects server malfunctions.

What we check

CPU

CPU temperature;
correct operation of the processor.

For the CPU stress test, the mprime-bin program runs for 30 minutes.

    /usr/bin/timeout 30m /opt/mprime -t
    /bin/grep -i error /root/result.txt

Every minute the processor temperature is read from the IPMI sensors; the permissible value is below 60 °C. The program looks for CPU architecture errors in /proc/kmsg and in the mprime results file.
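As a rough illustration of the temperature rule, the following sketch checks a series of readings against the 60 °C limit. The function name `check_temps` and the one-reading-per-line input format are assumptions made for this example; how the readings are obtained from IPMI is not shown in the article and is left out here.

```shell
# Hypothetical helper (not from the original program).
# Reads one CPU temperature in degrees Celsius per line on stdin and
# fails on the first reading at or above the limit (default 60).
check_temps() {
    limit=${1:-60}
    while read -r temp; do
        # Drop any fractional part so the shell can compare integers
        if [ "${temp%.*}" -ge "$limit" ]; then
            echo "overheat: ${temp}C"
            return 1
        fi
    done
    return 0
}

# Example: three readings taken a minute apart
printf '41\n55.5\n48\n' | check_temps && echo "temperature ok"
```

Stopping at the first bad reading keeps the verdict simple: one reading at or above the limit is enough to fail the server.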

RAM

Some memory cells may be damaged, so each one needs to be checked. The classic Memtest86+ will not work for this: the free version does not save its results, it only displays them on the screen. Therefore we use memtester. We run it from within the OS, so it checks the cells that are not occupied by the OS.

 memtester `cat /proc/meminfo |grep MemFree | awk '{print $2-1024}'`k 5

We then check the exit code: if the memory is working properly, memtester returns 0.
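The sizing and the exit-code check can be sketched as follows. `memtest_size_kb` and `run_memtest` are names invented for this example; the 1024 kB margin and the 5 passes are taken from the command above.

```shell
# Hypothetical helpers, shown only to illustrate the idea.
# Print MemFree minus a safety margin (in kB) from /proc/meminfo-style
# text on stdin. $1 = margin in kB (default 1024)
memtest_size_kb() {
    awk -v margin="${1:-1024}" '/^MemFree:/ { print $2 - margin }'
}

# Run memtester on the free memory and report the verdict by exit code.
run_memtest() {
    size=$(memtest_size_kb < /proc/meminfo)
    if memtester "${size}k" 5; then
        echo "memory ok"
    else
        echo "memory test failed"
        return 1
    fi
}
```

Calling `run_memtest` on the booted diagnostic system would then give a single pass/fail result that can be reported back to the database.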

Storage

The program finds all devices matching /dev/sd? and /dev/cciss/c0d? and checks whether each one is actually a disk.

    hdlist() {
        HDLIST=$(ls /dev/sd?)
        HDLIST="${HDLIST} $(ls /dev/cciss/c0d? 2>/dev/null)"
        REAL_HDLIST=""
        for disk in ${HDLIST}; do
            if head -c0 ${disk} 2>/dev/null; then
                REAL_HDLIST="${REAL_HDLIST} ${disk}"
            fi
        done
        echo "${REAL_HDLIST}"
    }

Next, every drive has to be checked.

HDD

completely erase the previous user's data from the hard drive:

    for DISK in $(hdlist)
    do
        echo "Clearing ${DISK}"
        parted -s ${DISK} mklabel gpt
        dd if=/dev/zero of=${DISK} bs=512 count=1
    done
    if [ "${FULL_HDD_CLEAR}" = "YES" ]; then
        echo "Clearing disks full (very slow)"
        wget -O /dev/null -q --no-check-certificate "${STATEURL}&info=slowhddclear"
        for DISK in $(hdlist)
        do
            echo "Clearing ${DISK}"
            dd if=/dev/zero of=${DISK} bs=1M
        done
    fi

check the value of the SMART attribute Reallocated Sectors Count, which must be no more than 100;
check the speed of the disk.

The program measures the speed at three positions on the disk: the beginning, the middle, and the end, with a 4 GB region at each. This is enough to draw a general conclusion. For each offset we use this function:

    sysctl -w vm.drop_caches=3 > /dev/null
    zcav -c 1 -s ${SKIP_COUNT} -r ${OFFSET} -l /tmp/zcav1.log -f ${DISK}
    if [ $? -ne 0 ]; then
        echo err
        exit
    fi
    SPEED=$(cat /tmp/zcav1.log | awk '! /^#/ {speed+=$2; count+=1}END{print int(speed/count)}')
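One way to pick the three positions is sketched below. The helper name `test_offsets` and the choice of MiB as the unit are assumptions for illustration; the article does not show how the original program computes its offsets or which units it passes to zcav.

```shell
# Hypothetical helper: print the start offsets (in MiB) of three test
# regions of equal size at the beginning, middle and end of a disk.
# $1 = disk size in MiB, $2 = region size in MiB (default 4096, about 4 GB)
test_offsets() {
    size=$1
    region=${2:-4096}
    if [ "$size" -lt "$region" ]; then
        echo "disk smaller than test region" >&2
        return 1
    fi
    echo 0
    echo $(( (size - region) / 2 ))
    echo $(( size - region ))
}

# Example: a 4000 GB disk is roughly 3814697 MiB
test_offsets 3814697
```

On Linux the disk size in bytes can be read with `blockdev --getsize64` and converted to MiB before calling the helper.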

SSD

We check the values of these SMART attributes:

Media_Wearout_Indicator is the remaining lifetime (wear) of the disk: 100 for a new disk, and the minimum allowable value is 10.

Reallocated_Sector_Count is the number of reassigned sectors; it must be less than 100.
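These thresholds can be applied to `smartctl -A` output roughly as follows. This is a sketch: the function names are invented, and it assumes that the reallocated sector count is judged by its raw value (column 10, reported by smartctl as Reallocated_Sector_Ct) while Media_Wearout_Indicator is judged by its normalized value (column 4), which the article does not state explicitly.

```shell
# Hypothetical helpers for illustration.
# Print one column of a named attribute from `smartctl -A` output on stdin.
# $1 = attribute name, $2 = column number (4 = VALUE, 10 = RAW_VALUE)
smart_field() {
    awk -v name="$1" -v col="$2" '$2 == name { print $col + 0 }'
}

# Succeed if the SSD attributes in the given report text pass both limits.
# A missing attribute counts as a failure.
ssd_smart_ok() {
    report=$1
    wear=$(printf '%s\n' "$report" | smart_field Media_Wearout_Indicator 4)
    realloc=$(printf '%s\n' "$report" | smart_field Reallocated_Sector_Ct 10)
    [ "${wear:-0}" -ge 10 ] && [ "${realloc:-100}" -lt 100 ]
}
```

A call such as `ssd_smart_ok "$(smartctl -A /dev/sda)"` would then give a pass/fail result for one drive.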

RAID status

We identify the RAID controller by model and check the status of the array. If the array is healthy, its status is “Optimal”.

    detect_raid_type() {
        RAIDSTR=$(lspci | grep -i raid)
        if echo ${RAIDSTR} | grep -iq adaptec; then
            # This is Adaptec
            echo "adaptec"
        elif echo ${RAIDSTR} | grep -iqE 'lsi|megaraid'; then
            # This is LSI
            echo "lsi"
        elif echo ${RAIDSTR} | grep -iq '3ware'; then
            # This is 3ware
            echo "3ware"
        elif echo ${RAIDSTR} | grep -iqE 'Hewlett-Packard.*Smart'; then
            # This is HP Smart Array
            echo "HP-SmartArray"
        elif dmesg | grep -q cciss/ ; then
            echo cciss
        else
            echo "unknown"
        fi
    }

    raid_status_adaptec() {
        RSTATUS=$(arcconf getconfig 1 ld | awk -F: '/Status of logical device/ {print $2}')
        if ! echo "${RSTATUS}" | grep -q 'Optimal' ;then
            echo "${RSTATUS}"
            return 1
        fi
    }

    raid_status_3ware() {
        echo "We have not support 3ware yet"
        return 0
    }

    raid_status_lsi() {
        RSTATUS=$(megacli -LDInfo -Lall -aALL |awk -F: '$1 ~ /State/ {print $2}')
        if ! echo "${RSTATUS}" | grep -q 'Optimal' ;then
            echo "${RSTATUS}"
            return 1
        fi
    }

    raid_status_unknown() {
        echo "Unknown RAID"
        return 0
    }

    raid_status_cciss() {
        RSTATUS=$(cciss_vol_status /dev/cciss/c*d0)
        if ! echo ${RSTATUS} | grep -q "OK" ; then
            echo "${RSTATUS}"
            return 1
        fi
    }
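The functions above are presumably tied together by calling the status function that matches the detected type; the article does not show that step. A minimal sketch of such a dispatcher, with the hypothetical name `raid_check`:

```shell
# Hypothetical dispatcher (not from the original program).
# Calls raid_status_<type> if such a function exists; otherwise reports
# the type as unhandled and, like raid_status_unknown, does not fail.
raid_check() {
    handler="raid_status_$1"
    if command -v "$handler" >/dev/null 2>&1; then
        "$handler"
    else
        echo "No status handler for RAID type: $1"
        return 0
    fi
}

# Demo with a stand-in handler (the real ones are defined in the article)
raid_status_demo() { echo "Degraded"; return 1; }
raid_check demo || echo "array needs attention"
```

With the article's functions loaded, the call would be `raid_check "$(detect_raid_type)"`. The article defines no status function for the "HP-SmartArray" type, so in this sketch that type takes the fallback branch.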

Network

We check the download speed over the network; it should be more than 300 Mbps.

 curl -k --progress-bar -w "%{speed_download}" -o /dev/null "${CGI_MGR_URLv4}/speedtest_cgi?id=${AUTH_ID}&func=server.speedtest"
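curl reports `%{speed_download}` in bytes per second, so the value has to be converted before it is compared with the 300 Mbps limit. A small sketch, with the hypothetical helper name `speed_ok`:

```shell
# Hypothetical helper: succeed if a download speed given in bytes per
# second exceeds the minimum in Mbps (default 300, 1 Mbps = 10^6 bit/s).
speed_ok() {
    awk -v bps="$1" -v min="${2:-300}" \
        'BEGIN { exit !(bps * 8 / 1000000 > min) }'
}

# Example: 50000000 B/s is 400 Mbps
speed_ok 50000000 && echo "network ok"
```

Doing the comparison in awk rather than in shell arithmetic means the check also works when curl prints the speed with a fractional part.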

Statistics

The diagnostic program checks an average of 323 servers per month; 124 of them fail the test, and we do not sell these servers. The data center engineers first replace disks and repair coolers. CPUs and RAM are usually replaced under warranty.

Let's look at the statistics for working HDDs. For the analysis we took 1800 reports for different disks, 103 models in total.

Attribute name	Min	Mean	Max	Standard deviation	Comment
Temperature_Celsius	14	25.81	40	4.09	25 °C is a great temperature for a disk
Power_On_Hours	407	24033	59363	12910	Funny: some disks have been running for 6 years
Reallocated_Sector_Ct	0	92.3496	10728	496	100 is a good threshold
Raw_Read_Error_Rate	0	32416965	4294967295	126899820.1	All values are large. At the slightest problem the counter shows a lot of errors.
SSD Power_On_Hours	10	23159	918502	134915	More than two years, not bad
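The four numeric columns can be reproduced from a list of raw attribute values with a few lines of awk. This sketch assumes the population standard deviation; the article does not say which variant was used, and `attr_stats` is a name invented here.

```shell
# Hypothetical helper: read one numeric value per line on stdin and
# print "min mean max stddev" (population standard deviation).
attr_stats() {
    awk '
        NR == 1 { min = $1; max = $1 }
        {
            if ($1 < min) min = $1
            if ($1 > max) max = $1
            sum += $1
            sumsq += $1 * $1
        }
        END {
            if (NR == 0) exit 1
            mean = sum / NR
            var = sumsq / NR - mean * mean
            if (var < 0) var = 0   # guard against rounding error
            printf "%d %.2f %d %.2f\n", min, mean, max, sqrt(var)
        }'
}

# Example with eight made-up readings
printf '2\n4\n4\n4\n5\n5\n7\n9\n' | attr_stats
```

Feeding it one column of values per attribute, for example every Power_On_Hours reading from the collected reports, would produce one table row.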

Excellent numbers. Now let's check how long an HDD works on average. To do this, we compiled statistics on failed disks, using the Raw Read Error Rate as the failure indicator.

Attribute name	Min	Mean	Max	Standard deviation	Comment
Power_On_Hours	0	25040	57178	12030	An HDD works for 33 ± 16 months. The spread is large, so it is difficult to draw conclusions

Statistics are interesting, but they are not the main point. We run diagnostics not for the sake of numbers but for customers: so that the data center has working servers and the site has up-to-date information. Then each client receives:

a server of the required capacity, paid for according to the tariff;
reliable equipment, with no interruptions to their projects.

Source: https://habr.com/ru/post/322944/
