Automatic diagnostics is the first thing we do before selling a dedicated server.
If it is a new server, we check that it works correctly and enter its configuration into the database.
If the server has already been in operation, we check the health of the components and update the data in the database. The information on the site must correspond to what we sell. It happens that disks were replaced for the previous client, but this was not recorded anywhere and the tariff did not change. Then the next client runs the risk of getting a 240 GB SSD instead of the stated 4000 GB HDD.
We take these risks into account. Even if nobody updates the information manually, the system does it automatically for every new or released server.
The server boots over the network with a Linux kernel and runs a diagnostic program that:
- collects data about new disks and enters it into the database, from where it is uploaded to the site;
- detects server malfunctions.
What we check
CPU
- CPU temperature;
- correct operation of the processor.
For the CPU stress test, the mprime-bin program runs for 30 minutes.
    /usr/bin/timeout 30m /opt/mprime -t
    /bin/grep -i error /root/result.txt
Every minute the program reads the processor temperature from the IPMI sensors; the permissible value is below 60 °C. The program also looks for CPU architecture errors in /proc/kmsg and in the mprime results.txt file.
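The temperature check can be sketched roughly as follows. This is a minimal sketch, not the article's actual code: the function names `max_cpu_temp` and `cpu_temp_ok` are hypothetical, and the input is assumed to be the `|`-separated rows printed by `ipmitool sdr type Temperature`, with "CPU" somewhere in the sensor name. Sensor names differ between vendors, so the filter would likely need adjusting.

```shell
# Print the highest CPU temperature found in sensor rows read from stdin.
# Assumed row format (ipmitool sdr type Temperature):
#   CPU1 Temp | 01h | ok | 3.1 | 45 degrees C
# Returns 1 if no CPU temperature row is found.
max_cpu_temp() {
    awk -F'|' '
        tolower($1) ~ /cpu/ && $5 ~ /degrees C/ {
            split($5, parts, " ")
            temp = parts[1] + 0
            if (!seen || temp > max) { max = temp; seen = 1 }
        }
        END { if (!seen) exit 1; print max }
    '
}

# Succeed when the temperature ($1) is below the limit ($2, default 60,
# the threshold named in the article). Integer values are assumed.
cpu_temp_ok() {
    [ "$1" -lt "${2:-60}" ]
}
```

On a live server this could be called once a minute, for example as `temp=$(ipmitool sdr type Temperature | max_cpu_temp) && cpu_temp_ok "${temp}"`.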
RAM
Some memory cells may be damaged, so each one needs to be checked. The classic Memtest86+ will not work here: in the free version it does not save the results, it only displays them on the screen. Therefore, we use memtester. We run it from within the OS, so the cells that are not occupied by the OS are checked.
    # Test all free memory except 1 MB (MemFree is reported in kB), 5 passes
    memtester "$(awk '/MemFree/ {print $2 - 1024}' /proc/meminfo)k" 5
We look at the exit code: if the memory is working properly, memtester returns 0.
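The article only distinguishes zero from non-zero, but the exit code carries a bit more detail. According to the memtester man page, a non-zero code is a bitwise OR of 1 (allocation, locking, or invocation error), 2 (stuck address test failed), and 4 (another test failed). A small sketch that decodes it, with the hypothetical helper name `memtester_verdict` and labels of my own choosing:

```shell
# Translate a memtester exit status ($1) into a short verdict.
# Prints "ok" and returns 0 for status 0, otherwise prints the
# decoded failure bits and returns 1.
memtester_verdict() {
    status=$1
    if [ "${status}" -eq 0 ]; then
        echo "ok"
        return 0
    fi
    verdict=""
    if [ $((status & 1)) -ne 0 ]; then verdict="${verdict} setup-error"; fi
    if [ $((status & 2)) -ne 0 ]; then verdict="${verdict} stuck-address"; fi
    if [ $((status & 4)) -ne 0 ]; then verdict="${verdict} pattern-test"; fi
    if [ -z "${verdict}" ]; then verdict=" unknown"; fi
    echo "${verdict# }"
    return 1
}
```

It would be called right after the test, for example `memtester 1024k 5; memtester_verdict $?`.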
Storage
The program finds all devices matching /dev/sd? and /dev/cciss/c0d? and checks whether each one is really a disk.
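    hdlist() {
        # Candidate block devices: SCSI/SATA disks and HP cciss volumes
        HDLIST=$(ls /dev/sd?)
        HDLIST="${HDLIST} $(ls /dev/cciss/c0d? 2>/dev/null)"
        REAL_HDLIST=""
        for disk in ${HDLIST}; do
            # Keep only the devices that can actually be opened for reading
            if head -c0 ${disk} 2>/dev/null; then
                REAL_HDLIST="${REAL_HDLIST} ${disk}"
            fi
        done
        echo "${REAL_HDLIST}"
    }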
Now all the drives need to be checked.
HDD
- completely wipe the previous user's data from the hard drive:
    # Quick wipe: write a new GPT label and zero the first sector
    for DISK in $(hdlist)
    do
        echo "Clearing ${DISK}"
        parted -s ${DISK} mklabel gpt
        dd if=/dev/zero of=${DISK} bs=512 count=1
    done
    # Optional full wipe of every disk
    if [ "($FULL_HDD_CLEAR)" = "YES" ]; then
        echo "Clearing disks full (very slow)"
        wget -O /dev/null -q --no-check-certificate "${STATEURL}&info=slowhddclear"
        for DISK in $(hdlist)
        do
            echo "Clearing ${DISK}"
            dd if=/dev/zero of=${DISK} bs=1M
        done
    fi
- check the value of the SMART attribute Reallocated Sectors Count: it must be at least 100;
- check the speed of the disk.
The program measures the speed at three offsets on the disk: the beginning, the middle, and the end, 4 GB at each offset. This is enough to draw a general conclusion. For each offset we use this code:
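    # Drop the page cache so that reads really hit the disk
    sysctl -w vm.drop_caches=3 > /dev/null
    zcav -c 1 -s ${SKIP_COUNT} -r ${OFFSET} -l /tmp/zcav1.log -f ${DISK}
    if [ $? -ne 0 ]; then
        echo err
        exit
    fi
    # Average the speed column over all non-comment lines of the log
    SPEED=$(cat /tmp/zcav1.log | awk '! /^#/ {speed+=$2; count+=1}END{print int(speed/count)}')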
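The article does not show how the three offsets are chosen or how they map to the zcav options, so the following is only a sketch of the arithmetic under my own assumptions: sizes are in MiB, the 4 GB window is taken as 4096 MiB, and `sample_offsets` and `avg_speed` are hypothetical helper names. `avg_speed` repeats the averaging from the snippet above, with a guard for an empty log.

```shell
# Print the start offsets (MiB) of three test windows: beginning, middle, end.
#   $1 = disk size in MiB, $2 = window size in MiB (default 4096)
sample_offsets() {
    size=$1
    window=${2:-4096}
    last=$((size - window))
    if [ "${last}" -lt 0 ]; then last=0; fi
    echo "0 $((last / 2)) ${last}"
}

# Average the second column of a zcav-style log ($1), skipping "#" lines.
# Returns 1 if the log has no data lines (avoids division by zero).
avg_speed() {
    awk '! /^#/ && NF >= 2 { speed += $2; count += 1 }
         END { if (count == 0) exit 1; print int(speed / count) }' "$1"
}
```

For a disk of 100000 MiB, `sample_offsets 100000` prints `0 47952 95904`.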
SSD
Check the values of the SMART attributes:
- Media_Wearout_Indicator is the remaining lifetime (wear) of the disk: 100 for a new disk, and the minimum allowable value is 10;
- Reallocated_Sector_Count is the number of reassigned sectors: it must be less than 100.
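A minimal sketch of how these two checks might look when parsing `smartctl -A` output. Several things here are assumptions rather than the article's code: the helper names `smart_field` and `ssd_smart_ok`, the attribute spelling `Reallocated_Sector_Ct` (as smartctl usually prints it), and the reading that the wear limit applies to the normalized VALUE column (column 4) while the reallocated limit applies to RAW_VALUE (column 10).

```shell
# Print one column of a SMART attribute row read from stdin.
#   $1 = attribute name, $2 = column (4 = normalized VALUE, 10 = RAW_VALUE)
# Returns 1 if the attribute is not present.
smart_field() {
    awk -v name="$1" -v col="$2" '
        $2 == name { print $col + 0; found = 1 }
        END { exit !found }
    '
}

# Read `smartctl -A` output from stdin and apply the SSD thresholds.
# Returns 0 if healthy, 1 if a threshold is violated, 2 if an attribute is missing.
ssd_smart_ok() {
    report=$(cat)
    wear=$(printf '%s\n' "${report}" | smart_field Media_Wearout_Indicator 4) || return 2
    realloc=$(printf '%s\n' "${report}" | smart_field Reallocated_Sector_Ct 10) || return 2
    [ "${wear}" -ge 10 ] && [ "${realloc}" -lt 100 ]
}
```

Typical use would be `smartctl -A /dev/sda | ssd_smart_ok`. Attribute names vary between SSD vendors, so the names passed to `smart_field` may need to change.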
RAID status
The program identifies the RAID controller model and checks the status of the array. If the array is healthy, its status is “Optimal”.
    detect_raid_type() {
        RAIDSTR=$(lspci | grep -i raid)
        if echo ${RAIDSTR} | grep -iq adaptec; then
            # This is Adaptec
            echo "adaptec"
        elif echo ${RAIDSTR} | grep -iqE 'lsi|megaraid'; then
            # This is LSI
            echo "lsi"
        elif echo ${RAIDSTR} | grep -iq '3ware'; then
            # This is 3ware
            echo "3ware"
        elif echo ${RAIDSTR} | grep -iqE 'Hewlett-Packard.*Smart'; then
            # This is HP Smart Array
            echo "HP-SmartArray"
        elif dmesg | grep -q cciss/ ; then
            echo cciss
        else
            echo "unknown"
        fi
    }

    raid_status_adaptec() {
        RSTATUS=$(arcconf getconfig 1 ld | awk -F: '/Status of logical device/ {print $2}')
        if ! echo "${RSTATUS}" | grep -q 'Optimal' ;then
            echo "${RSTATUS}"
            return 1
        fi
    }

    raid_status_3ware() {
        echo "We have not support 3ware yet"
        return 0
    }

    raid_status_lsi() {
        RSTATUS=$(megacli -LDInfo -Lall -aALL |awk -F: '$1 ~ /State/ {print $2}')
        if ! echo "${RSTATUS}" | grep -q 'Optimal' ;then
            echo "${RSTATUS}"
            return 1
        fi
    }

    raid_status_unknown() {
        echo "Unknown RAID"
        return 0
    }

    raid_status_cciss() {
        RSTATUS=$(cciss_vol_status /dev/cciss/c*d0)
        if ! echo ${RSTATUS} | grep -q "OK" ; then
            echo "${RSTATUS}"
            return 1
        fi
    }
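One detail of the status functions is worth noting: `grep -q 'Optimal'` succeeds as soon as any line contains "Optimal", so a controller with one healthy and one degraded logical device would still pass. A stricter variant could look like the sketch below. The helper name `all_optimal` is hypothetical, and it assumes one state per line on stdin, which is what the `awk` filters in the functions above produce.

```shell
# Succeed only if stdin contains at least one non-empty line and
# every non-empty line contains the word "Optimal".
all_optimal() {
    awk '
        NF { total++; if ($0 !~ /Optimal/) bad++ }
        END { exit ((total == 0 || bad > 0) ? 1 : 0) }
    '
}
```

In `raid_status_lsi`, for example, the condition could become `if ! echo "${RSTATUS}" | all_optimal; then`. Treating empty output as a failure is a design choice: it reports a problem when the vendor tool prints nothing, rather than passing silently.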
Network
We check the download speed over the network: it should be more than 300 Mbps.
    curl -k --progress-bar -w "%{speed_download}" -o /dev/null "($CGI_MGR_URLv4)/speedtest_cgi?id=($AUTH_ID)&func=server.speedtest"
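curl reports `%{speed_download}` in bytes per second, so the value has to be converted before it is compared with the 300 Mbps threshold. A sketch of that conversion, assuming decimal megabits (1 Mbps = 1,000,000 bits per second); the helper names `to_mbps` and `speed_ok` are hypothetical:

```shell
# Convert bytes per second ($1) to whole megabits per second.
# awk is used because some curl versions print a fractional value.
to_mbps() {
    awk -v bps="$1" 'BEGIN { printf "%d\n", bps * 8 / 1000000 }'
}

# Succeed when the speed ($1, bytes per second) is strictly above
# the limit in Mbps ($2, default 300 as in the article).
speed_ok() {
    mbps=$(to_mbps "$1")
    [ "${mbps}" -gt "${2:-300}" ]
}
```

For example, 50000000 bytes per second is 400 Mbps and passes, while 37500000 bytes per second is exactly 300 Mbps and does not, because the text asks for more than 300.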
Statistics
The diagnostic program checks an average of 323 servers per month; 124 of them fail the test, and we do not sell these servers as they are. First, the data center engineers replace disks and repair coolers. CPUs and RAM are usually replaced under warranty.
Let's look at the statistics on working HDDs. For the analysis, we took 1800 reports for different disks, covering 103 models in total.
Attribute name | Min | Mean | Max | Standard deviation | Description |
---|---|---|---|---|---|
Temperature_Celsius | 14 | 25.81 | 40 | 4.09 | 25 °C is a great temperature for a disk |
Power_On_Hours | 407 | 24033 | 59363 | 12910 | Funny: some disks have been running for more than 6 years |
Reallocated_Sector_Ct | 0 | 92.3496 | 10728 | 496 | 100 is a good threshold |
Raw_Read_Error_Rate | 0 | 32416965 | 4294967295 | 126899820.1 | All values are large. At the slightest problem, the counter registers a lot of errors. |
SSD Power_On_Hours | 10 | 23159 | 918502 | 134915 | More than two years: not bad |
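
The article does not say how the columns were calculated. As an illustration only, the minimum, expected value (mean), maximum, and standard deviation of one attribute could be computed from a list of collected values like this. The helper name `column_stats` is hypothetical, and the population standard deviation (dividing by n) is my assumption.

```shell
# Read one numeric value per line from stdin and print
# min, mean, max and population standard deviation.
# Returns 1 if there is no input.
column_stats() {
    awk '
        NF {
            value = $1 + 0
            if (count == 0 || value < lo) lo = value
            if (count == 0 || value > hi) hi = value
            sum += value
            sumsq += value * value
            count++
        }
        END {
            if (count == 0) exit 1
            mean = sum / count
            var = sumsq / count - mean * mean
            if (var < 0) var = 0    # guard against rounding error
            printf "min=%s mean=%.2f max=%s std=%.2f\n", lo, mean, hi, sqrt(var)
        }
    '
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 it prints `min=2 mean=5.00 max=9 std=2.00`.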
Excellent numbers. Now let's check how long an HDD works on average. To do this, we compiled statistics on failed disks, focusing on the Raw Read Error Rate.
Attribute name | Min | Mean | Max | Standard deviation | Description |
---|---|---|---|---|---|
Power_On_Hours | 0 | 25040 | 57178 | 12030 | An HDD works 33 ± 16 months. The spread is large, so it is difficult to draw conclusions |
Statistics are interesting, but they are not the main point. We run diagnostics not for the sake of numbers but for customers: so that the data center has working servers and the site has up-to-date information. Then each client receives:
- a server of the required capacity, paid for according to the tariff;
- reliable equipment, with no interruptions in the work of their projects.