
Is your computer reliable?

Here is a translation of Jeff Atwood's article about testing new computers. I have not seen another article of similar quality on this topic: it is well structured and contains all the necessary information and nothing superfluous. I hope you like it too.

Jeff is a co-founder of Stack Overflow. He is now working on the Discourse project.

Original article: Is Your Computer Stable?
Disclaimer: although the article is titled “Is your computer reliable?”, it is not about reliability in the strict sense of the term, but rather about stability. It describes how the author tests new computers for stability and durability.


If memory serves, I have built about a hundred computers over the past twenty years. It is not that difficult, and in fact it only gets easier over time as computers become more and more compatible.

For example, here is all it takes to build a Scooter Computer:

  1. Apply a little thermal paste to the top of the case.
  2. Place the motherboard in the case.
  3. Screw the motherboard to the case.
  4. Insert the SSD.
  5. Insert the RAM module.
  6. Connect external power.
  7. Boot.

That's all.



It's ridiculously easy. My six-year-old son and I have built Lego sets that were much more complicated. A traditional PC build differs only by a couple of extra steps: installing the CPU and the heatsink, and connecting cables. Finally, a server build adds a few minor steps, possibly with restrictions on the build size. A mini PC, a regular desktop, or a server: if you can assemble one of them, you can assemble them all.

Everyone breathes a sigh of relief when a newly built computer boots for the first time, no matter how many machines they have built before. But booting is just the beginning. It is great that it boots, but that will not surprise anyone. What we really need to know is whether this computer is reliable.

Computer components become more reliable every year, and manufacturers run numerous tests before shipping, but there is no guarantee that all the parts will work reliably together, specifically under your conditions. And there is always a chance, however small, that you will get parts with subtle internal defects.

Since we are scientists, after all, we test things under controlled conditions and collect data to show that our computer is stable. So after the first boot, we start testing.

Memory


I like to start with memory testing, since it does not require an installed OS and works the same on all x86 computers. Memtest86 is the "great-grandfather" of all memory testers. I'm not sure why it and Memtest86+ split, but they work almost identically. The PassMark version is newer, so I recommend it.

Download the version that suits you, write it to a bootable USB flash drive, insert it into the new computer, boot, and let the program do its work. Everything runs automatically: just boot and watch the test run.

image
(If your computer supports UEFI boot, you will get the newer 6.x version; otherwise you will get version 4.2, which is shown in the screenshot.)

I recommend at least one full pass of memtest, and if you need to be extra sure of the computer's stability, let it run overnight. If you have a lot of memory, be patient. On our servers with 128 GB of memory, testing took about 3 hours.

The “Pass” percentage at the top of the screen should reach 100%, and the “Pass” count in the table should be greater than one. If you see any errors, or anything at all other than a clean 100% pass, your computer is not reliable. In that case, start removing memory modules one at a time to find the faulty one.

Operating system


All the following tests need an installed OS, and the most important reliability test of all is whether you can install an operating system on the computer in the first place. Pick your favorite free OS and run a standard installation. I recommend Ubuntu Server LTS x64, since it makes far fewer demands on your video hardware. Download the ISO, write it to a bootable USB flash drive, and boot from it.

image
(Hey, look, there is an option to test the memory! How convenient!)


If anything happens during installation that prevents it from completing ... your computer is not reliable. I know this does not tell you much about the cause of the problem, but installing an OS is a good broad test of the whole system.

In any case, we will need an installed OS for the following tests. From here on, I assume that you have installed Ubuntu, but in reality any Linux distribution will work.

CPU


Now let's make sure our computer's brain is in order. Honestly, if you have made it this far and the memory and OS tests passed, the chance that your computer is faulty is close to zero. But we need to be sure, and the best way to do that is to turn to our old friend, Marin Mersenne.

Mersenne numbers are numbers of the form M_n = 2^n - 1, where n is a positive integer. They are notable in part because some of them are prime. They are named after the French mathematician Marin Mersenne, who studied their properties in the 17th century.
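
For intuition, here is a minimal shell sketch of the Lucas-Lehmer test, the classic primality test for Mersenne numbers. The function name lucas_lehmer is made up for this illustration, and the sketch only handles odd prime exponents up to 31, because it relies on 64-bit shell arithmetic. Real tools work on numbers with millions of digits and use far more sophisticated multiplication.

```shell
# Lucas-Lehmer test for M_p = 2^p - 1 (illustrative sketch only).
# Valid for odd prime exponents p; limited to p <= 31 so that
# s * s still fits into 64-bit shell arithmetic.
lucas_lehmer() {
    p=$1
    m=$(( (1 << p) - 1 ))    # the Mersenne number M_p
    s=4                       # starting value of the sequence
    i=0
    while [ "$i" -lt $(( p - 2 )) ]; do
        # s <- (s^2 - 2) mod M_p; adding m keeps the value non-negative
        s=$(( (s * s - 2 + m) % m ))
        i=$(( i + 1 ))
    done
    if [ "$s" -eq 0 ]; then
        echo "prime"
    else
        echo "composite"
    fi
}

lucas_lehmer 7     # M_7 = 127, prints: prime
lucas_lehmer 11    # M_11 = 2047 = 23 * 89, prints: composite
```

Every iteration depends on the exact result of the previous one, which is why a single arithmetic error anywhere in the chain is enough to produce a wrong final answer.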

I usually use Prime95 and mprime, programs that test huge numbers to determine whether they are prime. Here is how to download and install mprime on our freshly installed Ubuntu Server:

mkdir mprime
cd mprime
wget mersenne.org/gimps/p95v287.linux64.tar.gz
tar xzvf p95v287.linux64.tar.gz
rm p95v287.linux64.tar.gz

(You may have to replace the version number in the commands with the current latest version from www.mersenne.org/download, but at the time of writing, the version I gave is the latest.)

Now run mprime with ./mprime


Answer N to the first question (whether to join GIMPS).

Next you will be asked how many torture test threads to run. The program is smart and defaults to the number of logical cores, so just press Enter: we want to test all processors and cores fully. Then select the type of test:

  1. Small FFTs (maximum heat and FPU stress, data fits in the L2 cache, RAM is barely tested).
  2. In-place large FFTs (maximum power consumption, some RAM tested).
  3. Blend (tests a bit of everything, lots of RAM tested).

I should note that they are not kidding when they say "maximum power consumption." Select 2, then Y to start torturing your processor. Now watch it writhe in pain.

Accept the answers above? (Y):
[Main thread Feb 14 05:48] Starting workers.
[Worker #2 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Setting affinity to run worker on logical CPU #2
[Worker #4 Feb 14 05:48] Worker starting
[Worker #2 Feb 14 05:48] Setting affinity to run worker on logical CPU #3
[Worker #1 Feb 14 05:48] Worker starting
[Worker #1 Feb 14 05:48] Setting affinity to run worker on logical CPU #1
[Worker #4 Feb 14 05:48] Setting affinity to run worker on logical CPU #4
[Worker #2 Feb 14 05:48] Beginning a continuous self-test on your computer.
[Worker #4 Feb 14 05:48] Test 1, 44000 Lucas-Lehmer iterations of M7471105 using FMA3 FFT length 384K, Pass1=256, Pass2=1536.

Now is a good time to pull out your Kill-a-Watt or a similar power meter. If you have one, you can measure the maximum power draw of the processor. In most systems the CPU is the only significant power consumer, unless you have a powerful gaming graphics card.

I also advise you to run i7z in another terminal: this way you can monitor the core temperature and frequency, while mprime is doing its job.

sudo apt-get install i7z
sudo i7z

Let mprime run overnight in maximum heat mode. All calculations are thoroughly verified, so if an error occurs anywhere, the whole process stops and prints an error to the console. In general, if mprime fails ... your computer is not reliable.


Watch the CPU temperature! Besides the absolute temperature of the processor, you also need to keep an eye on the overall heat output of the system. The fans should spin up, and the temperature of the whole system should stay within acceptable limits; otherwise you will end up with a faulty, overheating computer.
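
If i7z does not support your CPU, a rough alternative is to read the kernel's thermal zones directly. The sketch below assumes a Linux system that exposes temperatures in millidegrees Celsius under /sys/class/thermal/thermal_zone*/temp. The function name max_temp_c and the 85-degree limit are arbitrary choices for illustration, not values from the article.

```shell
# Print the highest temperature (whole degrees C) found under a
# thermal sysfs directory. The directory defaults to the usual
# Linux location; zone files are assumed to hold millidegrees C.
max_temp_c() {
    dir=${1:-/sys/class/thermal}
    max=""
    for f in "$dir"/thermal_zone*/temp; do
        [ -r "$f" ] || continue               # no zones, or unreadable
        t=$(cat "$f" 2>/dev/null) || continue
        [ -n "$t" ] || continue
        t=$(( t / 1000 ))
        if [ -z "$max" ] || [ "$t" -gt "$max" ]; then
            max=$t
        fi
    done
    [ -n "$max" ] && echo "$max"
}

# Example: warn when the hottest zone exceeds a hypothetical limit.
limit=85
hottest=$(max_temp_c) || hottest=""
if [ -n "$hottest" ] && [ "$hottest" -gt "$limit" ]; then
    echo "WARNING: ${hottest} C exceeds ${limit} C"
fi
```

Running the check every few seconds while mprime is working gives a simple log of how hot the system gets under load.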

The bad news is that in practice computers almost never experience loads like this. The good news is that if your system survives a night in this mode, it is ready for any task and any overload.

Disk


Disks are probably the easiest component to replace, but they are also the most likely to fail. We know the disk cannot be completely broken, since we just installed an OS on it, but an extra test will not hurt.

Let's start with a check for bad blocks using badblocks:

sudo badblocks -sv /dev/sda

This tests the entire disk (in safe read-only mode). It goes without saying that any errors should make you doubt the health of your disk.

Checking blocks 0 to 125034839
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
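
To make this check scriptable, the summary line can be parsed. This is a minimal sketch that assumes the summary has the form shown above; the function name bad_block_count is made up for this example.

```shell
# Read badblocks output on stdin and print the number of bad blocks
# reported in the "Pass completed, N bad blocks found." summary line.
bad_block_count() {
    sed -n 's/.*Pass completed, \([0-9][0-9]*\) bad blocks found.*/\1/p' | tail -n 1
}

# Example with the summary line shown above. With a real disk, pipe in
# the tool's output instead (2>&1 merges stderr in case the summary
# is written there):
#   sudo badblocks -sv /dev/sda 2>&1 | bad_block_count
echo "Pass completed, 0 bad blocks found. (0/0/0 errors)" | bad_block_count
```

Anything other than 0 here is a reason to distrust the disk.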

Now let's check the SMART data for our disk.

sudo apt-get install smartmontools
sudo smartctl -i /dev/sda

The above command will let you know if your drive supports SMART. If yes, let's activate it:

sudo smartctl -s on /dev/sda

Now we are ready to run SMART tests. But first, let's find out how long different tests will run:

sudo smartctl -c /dev/sda

Run the long test if you have time, or the short one if not.

sudo smartctl -t long /dev/sda

Tests are performed asynchronously; after the specified time has elapsed, open the SMART test report and make sure that everything went well:

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 100 -
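
The report itself is printed by smartctl -l selftest /dev/sda. As a rough sketch, the most recent entry can be checked automatically. The function name last_selftest_ok is hypothetical, and the parsing assumes the log layout shown above, where the newest entry is numbered # 1.

```shell
# Succeed if the newest self-test entry ("# 1") in a smartctl
# self-test log read from stdin says "Completed without error".
last_selftest_ok() {
    grep -E '^# *1 ' | grep -q 'Completed without error'
}

# Example with the log entry shown above. On a real system:
#   sudo smartctl -l selftest /dev/sda | last_selftest_ok
if echo "# 1 Extended offline Completed without error 00% 100 -" | last_selftest_ok; then
    echo "self-test passed"
else
    echo "self-test FAILED or missing"
fi
```

A missing entry is treated as a failure on purpose: if the test has not finished yet, it is safer to check again later than to assume the disk is fine.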

Next, run a simple benchmark to make sure that disk performance is roughly what you expect:

dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
sudo hdparm -Tt /dev/sda

On a system with a typical SSD, you should get at least the following results, and most likely much better:

536870912 bytes (537 MB) copied, 1.52775 s, 351 MB/s
Timing cached reads: 11434 MB in 2.00 seconds = 5720.61 MB/sec
Timing buffered disk reads: 760 MB in 3.00 seconds = 253.09 MB/sec
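
The throughput figure that dd reports is simply bytes divided by seconds. Here is a small sketch that recomputes it from the summary line, assuming the dd output format shown above; the function name dd_mb_per_sec is invented for illustration.

```shell
# Recompute throughput in MB/s (1 MB = 1,000,000 bytes) from a
# dd summary line read on stdin, truncated to a whole number.
dd_mb_per_sec() {
    awk '/ copied, / {
        for (i = 2; i <= NF; i++)
            if ($i == "s,") secs = $(i - 1)   # elapsed seconds precede "s,"
        if (secs > 0) printf "%d\n", $1 / secs / 1000000
    }'
}

# Example with the line shown above; prints 351. dd writes its summary
# to stderr, so with a real run use:
#   dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync 2>&1 | dd_mb_per_sec
echo "536870912 bytes (537 MB) copied, 1.52775 s, 351 MB/s" | dd_mb_per_sec
```

A script can compare this number against whatever minimum you consider acceptable for your drive.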

Finally, we will run a more intensive test using bonnie++:

sudo apt-get install bonnie++
bonnie++ -f

The actual numbers are not very important to us; what matters is that the test completes without errors. If you get errors at any point in the steps above ... your computer is not reliable.

(I think the tests above are fine for everyday use, especially for drives in a RAID array. If you want to test your drives even more thoroughly, a good resource is the FreeNAS "how to burn in hard drives" guide.)

Network


Honestly, I don't have much experience with network problems. But I believe in the importance of bandwidth, and this is exactly the thing that can be checked.

You will need two computers to test with iperf. Suppose our server has the address 10.0.0.1; here are the commands to run on it:

sudo apt-get install iperf
iperf -s

And here is the client, which connects to the server and measures how quickly we can transfer data between the machines:

sudo apt-get install iperf
iperf -c 10.0.0.1

------------------------------------------------------------
Client connecting to 10.0.0.1, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.2 port 43220 connected with 10.0.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec

You should see about 120 megabytes/sec (960 megabits/sec) on a single gigabit Ethernet connection. If you are lucky enough to have a 10 gigabit connection, great, congratulations on your 1.2 gigabytes/sec.
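
Converting between the two units is just a division by eight. Here is a minimal sketch that pulls the bandwidth out of the iperf summary line and converts it, assuming the Mbits/sec output format shown above; the function name iperf_mbytes_per_sec is made up for this example.

```shell
# Extract the bandwidth in Mbits/sec from iperf output on stdin and
# print it in megabytes per second (8 bits per byte).
iperf_mbytes_per_sec() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "Mbits/sec") mbits = $(i - 1)
    }
    END { if (mbits != "") printf "%.1f\n", mbits / 8 }'
}

# Example with the result line shown above; prints 116.6.
# With a real run:  iperf -c 10.0.0.1 | iperf_mbytes_per_sec
echo "[ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec" | iperf_mbytes_per_sec
```

The measured 933 megabits/sec works out to roughly 117 megabytes/sec, a little under the theoretical figure because of protocol overhead.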

Video card


I do not cover this topic, because very few of the computers I build need anything more than the GPU built into the processor. By the way, integrated GPUs are surprisingly good these days.

But you're a gamer, right? Then you need to boot into Windows and try something like FurMark. And you do have to test the video card, because video cards, especially gaming ones, are often the most powerful and complex component in the system and draw a huge number of watts. And yes, watch the temperature.

Okay, maybe your computer is reliable


I apply everything above to every computer I build, and it does its job well. This is how I find faulty processors, RAM, disks, and cooling systems before they cause problems in real work. None of this means the computer will never break down, but I have done everything I can to make sure my computers live long lives.

Who knows, maybe you will get lucky and become known as the person whose server had 16 years of uptime before it was decommissioned.


All of these tests are just a starting point. Tell us, what techniques do you use to make sure your computers are stable and reliable? Based on your experience, how would you improve the tests I proposed?

Source: https://habr.com/ru/post/390499/

