Here is a translation of Jeff Atwood's article about testing new computers. I have not seen another article of comparable quality on this topic: it contains all the necessary information and nothing superfluous, and the material is well structured. I hope you like it too.
Original article: Is Your Computer Stable?
Disclaimer: although the article is titled "Is your computer reliable?", it is not about reliability as a technical term but rather about stability. It describes how the author tests new computers for stability and durability.
If memory serves, I have built about a hundred computers over the past twenty years. It is not that difficult, and in fact it only gets easier over time as computer components become more and more compatible.
For example, here is what it takes to build a Scooter Computer:
Apply a little thermal paste to the top of the case.
Place the motherboard in the case.
Screw the motherboard to the case.
Insert the SSD.
Insert the RAM module.
Connect external power.
Boot.
That's all.
It's ridiculously easy. My six-year-old son and I have built Lego sets that were much more complicated. A traditional PC build differs only by a couple of additional steps: install the processor, attach the heatsink, connect the cables. A server build adds a couple of minor steps on top of that, possibly with constraints on the size of the build. A mini-computer, a regular PC, or a server: if you can assemble one of them, you can assemble them all.
Every one of us breathes a sigh of relief when a newly built computer boots for the first time, no matter how many machines we have built before. But booting is just the beginning. It is great that it boots, but that will not surprise anyone. What we really need to know is whether this computer is reliable. Although computer components become more reliable every year, and manufacturers run numerous tests before shipping, there is no guarantee that all the parts will work reliably together, specifically in your conditions. And there is always a chance, however small, that you will get parts with subtle internal defects.
Since we are scientists after all, we test things under proper conditions and collect data to prove that our computer is stable. So after the first boot, we start testing.
Memory
I like to start with memory testing, since it does not require an installed OS and works the same on all x86 computers. Memtest86 is the "great-grandfather" of all memory testers. I'm not sure why it and Memtest86+ are separate projects, but they work almost identically. The version from PassMark is newer, so I recommend it.
Download the version that suits you, write it to a bootable USB flash drive, insert it into the new computer, boot, and let the program do its work. Everything is automatic: just boot and watch the test run.
(If your computer supports UEFI boot, you can use the newer 6.x version; otherwise use version 4.2, which is shown in the screenshot.)
I recommend at least one full pass of memtest, and if you need to be extra confident in the stability of the computer, leave it testing overnight. If you have a lot of memory, be patient. For our servers with 128 GB of memory, testing took about 3 hours.
The "Pass" value at the top of the screen should reach 100%, and the "Pass" value in the table should be greater than one. If you see any errors, or anything at all other than a clean 100%, your computer is not reliable. In that case, start removing the memory modules one at a time to find the faulty one.
Operating system
All subsequent tests need an installed OS, and the most important reliability test of all is whether an operating system can be installed on the computer. Choose your favorite free OS and start a regular installation. I recommend Ubuntu Server LTS x64, as it makes far fewer demands on your video hardware. Download the ISO and write it to a bootable USB flash drive, then boot from it.
(Hey, look, there is an option to test the memory! How prudent!)
Make sure that you have a stable Internet connection with DHCP. This will allow the installation to go faster.
In general, you will press Enter many times, accepting all the default settings. Yes, I know, I know that we are installing Linux, but believe it or not, they made the installation process very friendly.
As for the login and password of the default account, I recommend jeff and password, since I am one of the most prominent experts in computer security.
If you are installing the OS from a USB flash drive and you get a message about a missing CD, simply remove the flash drive and insert it again. I do not know why this works either, but it works.
If anything happens during installation that prevents it from completing ... your computer is not reliable. I know that this does not tell you much about the problem, but installing the OS is a good, broad test of the entire system.
In any case, we need an installed OS for the following tests. From here on I assume that you have installed Ubuntu, but in reality any Linux distribution will work.
CPU
Now let's make sure that our computer's brain is fine. Honestly, if you have reached this point and the memory and OS tests completed successfully, the chance that you have a faulty computer is almost zero. But we need to be confident, and the best way to get there is to turn to our old friend, Marin Mersenne.
Mersenne numbers are numbers of the form Mn = 2^n - 1, where n is a positive integer. Numbers of this form are remarkable for several reasons, one being that some of them are prime. They are named after the French mathematician Marin Mersenne, who studied their properties in the 17th century.
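As a toy illustration of the kind of arithmetic involved, here is a sketch of the Lucas-Lehmer primality test for small Mersenne numbers in bash. This is not how mprime is implemented; the function name is made up for this example, and it only handles exponents from 3 to 31 because it relies on 64-bit shell arithmetic.

```shell
#!/usr/bin/env bash
# Lucas-Lehmer test: for an odd prime p, M_p = 2^p - 1 is prime
# exactly when s_(p-2) is 0 mod M_p, where s_0 = 4 and
# s_(i+1) = (s_i^2 - 2) mod M_p.
# Hypothetical helper; valid for 3 <= p <= 31 so s*s fits in 64 bits.
is_mersenne_prime() {
    local p=$1
    local m=$(( (1 << p) - 1 ))
    local s=4 i
    for (( i = 0; i < p - 2; i++ )); do
        s=$(( (s * s - 2) % m ))
    done
    [ "$s" -eq 0 ]
}

# Try every prime exponent in the supported range.
for p in 3 5 7 11 13 17 19 23 29 31; do
    if is_mersenne_prime "$p"; then
        echo "M$p = $(( (1 << p) - 1 )) is prime"
    else
        echo "M$p = $(( (1 << p) - 1 )) is composite"
    fi
done
```

The real programs run the same kind of iteration on numbers with millions of digits, which is what makes them such an effective CPU stress test.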
I usually use Prime95 and mprime, programs that crunch through a huge number of giant numbers to determine whether they are prime. Here is how to download and install mprime on our newly installed Ubuntu Server:
mkdir mprime
cd mprime
wget mersenne.org/gimps/p95v287.linux64.tar.gz
tar xzvf p95v287.linux64.tar.gz
rm p95v287.linux64.tar.gz
(You may have to replace the version number in the commands with the current latest version from www.mersenne.org/download, but at the time of writing the version given here is the latest.)
Now run mprime with ./mprime
Answer N to the question it asks at startup.
Next you will be asked how many test threads to run. The program is smart and by default picks a number of threads equal to the number of logical cores, so just press Enter: we want full testing of all processors and cores. Next, select the type of test:
1. Small FFTs (maximum heat and FPU stress, data fits in the L2 cache, RAM is barely tested).
2. In-place large FFTs (maximum power consumption, some RAM tested).
3. Blend (a bit of everything, lots of RAM tested).
I should warn you that they are not joking when they say "maximum power consumption." Select 2, then Y to start torturing your processor. Now watch it writhe in pain.
Accept the answers above? (Y):
[Main thread Feb 14 05:48] Starting workers.
[Worker #2 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Worker starting
[Worker #3 Feb 14 05:48] Setting affinity to run worker on logical CPU #2
[Worker #4 Feb 14 05:48] Worker starting
[Worker #2 Feb 14 05:48] Setting affinity to run worker on logical CPU #3
[Worker #1 Feb 14 05:48] Worker starting
[Worker #1 Feb 14 05:48] Setting affinity to run worker on logical CPU #1
[Worker #4 Feb 14 05:48] Setting affinity to run worker on logical CPU #4
[Worker #2 Feb 14 05:48] Beginning a continuous self-test on your computer.
[Worker #4 Feb 14 05:48] Test 1, 44000 Lucas-Lehmer iterations of M7471105 using FMA3 FFT length 384K, Pass1=256, Pass2=1536.
Now is the right time to pull out your Kill-a-Watt or a similar power meter. If you have one, you can measure the maximum power consumption of the processor. In most systems the CPU is the only significant power consumer, unless you have a powerful gaming graphics card.
I also advise you to run i7z in another terminal: this way you can monitor the core temperature and frequency, while mprime is doing its job.
sudo apt-get install i7z
sudo i7z
Let mprime run all night in maximum heat generation mode. All calculations are thoroughly checked, so if an error occurs anywhere, the whole process stops and prints an error to the console. In short, if mprime stops ... your computer is not reliable.
Keep an eye on the CPU temperature! Besides the absolute temperature of the processor, you also need to watch the overall heat output of the system. The fans should speed up and the temperature of the whole system should stay within acceptable limits; otherwise you will end up with a faulty, overheating computer.
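If you want a second way to watch temperatures, here is a rough sketch that polls the kernel's thermal zones. It assumes the machine exposes them under /sys/class/thermal with readings in millidegrees Celsius, which is not the case on every system; the function names are made up for this example.

```shell
#!/usr/bin/env bash
# Convert a millidegree reading (as found in sysfs) to degrees Celsius.
millideg_to_c() {
    LC_ALL=C awk -v t="$1" 'BEGIN { printf "%.1f\n", t / 1000 }'
}

# Print one line per readable thermal zone; prints nothing if none exist.
print_temps() {
    local zone temp name
    for zone in /sys/class/thermal/thermal_zone*; do
        temp=$(cat "$zone/temp" 2>/dev/null) || continue
        [ -n "$temp" ] || continue
        name=$(cat "$zone/type" 2>/dev/null) || name="unknown"
        echo "$name: $(millideg_to_c "$temp") C"
    done
}

print_temps
```

Saved as a script (for example temps.sh, a name chosen here for illustration), it can be rerun every few seconds with watch -n 5 ./temps.sh while mprime is working.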
The bad news is that in practice computers almost never experience such loads. The good news is that if your system survives a night in this mode, it is 100% ready for any task and any overload.
Disk
Disks are probably the easiest component to replace, but they are also the most likely to fail. We know the disk cannot be completely broken, since we have just installed an OS on it, but an extra test will not hurt.
So we test the entire disk (in safe read-only mode). It goes without saying that any errors should make you doubt the health of your disk.
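The output that follows has the form printed by badblocks in its default read-only mode. A minimal sketch of such a scan is below; the device name /dev/sda and the helper function are assumptions made for this example, so confirm the device with lsblk before running anything for real.

```shell
#!/usr/bin/env bash
# Read-only surface scan of one disk with badblocks (part of e2fsprogs).
# -s shows progress and -v is verbose; with neither -n nor -w given,
# badblocks only reads from the disk.
# Hypothetical helper: set DRY_RUN=1 to print the command instead of running it.
scan_disk() {
    ${DRY_RUN:+echo} sudo badblocks -sv "$1"
}

# /dev/sda is an assumed device name; override it with DISK=/dev/...
DRY_RUN=1 scan_disk "${DISK:-/dev/sda}"
```

As written, the last line only prints the command. Drop DRY_RUN=1 to start the actual scan, which can take hours on a large disk.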
Checking blocks 0 to 125034839
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
Finally, we run a more intensive test using bonnie++:
sudo apt-get install bonnie++
bonnie++ -f
The numbers themselves are not very important to us; what matters is that the test completes without errors. If you get errors during any of the steps above ... your computer is not reliable.
(I believe that the tests I gave are great for everyday use, in particular for drives in a RAID. However, if you want to test your drives even more thoroughly, I suggest a good resource: FreeNAS "how to burn in hard drives" )
Network
Honestly, I don't have much experience with network problems. But I believe in the importance of bandwidth, and this is exactly the thing that can be checked.
You will need two computers to test with iperf. Suppose our server has the address 10.0.0.1; here are the commands for it:
sudo apt-get install iperf
iperf -s
And here is our client, which connects to the server and measures how quickly we can transfer data between the machines:
sudo apt-get install iperf
iperf -c 10.0.0.1
------------------------------------------------------------
Client connecting to 10.0.0.1, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.2 port 43220 connected with 10.0.0.1 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.09 GBytes 933 Mbits/sec
You should see about 120 megabytes/sec (960 megabits/sec) for a single gigabit Ethernet connection. If you are lucky enough to have a 10 gigabit connection, great, congratulations on your 1.2 gigabytes/sec.
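Since iperf reports megabits while the target above is given in megabytes, a small conversion helps when comparing the two. The helper below is a sketch written for this example (the function name is made up); it simply divides by 8.

```shell
#!/usr/bin/env bash
# Convert a bandwidth in megabits per second to megabytes per second.
mbits_to_mbytes() {
    LC_ALL=C awk -v mbits="$1" 'BEGIN { printf "%.1f\n", mbits / 8 }'
}

mbits_to_mbytes 933   # the figure from the iperf output above
mbits_to_mbytes 960   # the target for gigabit Ethernet
```

By this measure the 933 Mbits/sec in the sample output is a little under 117 megabytes/sec, slightly below the 120 megabytes/sec target.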
Video card
I do not cover this topic, because very few of the computers I build need anything more than the GPU built into the processor. By the way, integrated GPUs are surprisingly good.
But you're a gamer, right? Then you need to boot into Windows and try something like furmark. And you do have to test the video card, because a video card, especially a gaming one, is often the most powerful and complex component in the system and draws a huge number of watts. And yes, watch the temperature.
Okay, maybe your computer is reliable
I apply all of the above to every computer I build, and it does the job well. This way I find faulty processors, RAM, disks, and cooling systems before they cause problems in real work. None of this means the computer will never break down, but I have done everything I can to be sure that my computers will live long.
Who knows, maybe luck will be on your side and you will become known as the guy whose server had 16 years of uptime before it was decommissioned.
All of these tests are just a starting point. Tell us, what techniques do you use to make sure your computers are stable and reliable? How would you improve the tests I proposed according to your experience?