
Factory Testing: The Black Box and Short Testing Cycle

Hello everyone, my name is Alexander, and I work in QA (quality assurance) for the products we develop. Our partner factories in Southeast Asia, especially the Chinese ones, are sharp, nimble, and ready to do a lot quickly, but not always with quality. How we deal with this, saving the company money along the way, is described under the cut.

Chinese workers

Fabricating and assembling our printed circuit boards in our own country is far from always realistic or sensible, so production often takes place where it is easier, cheaper and faster, that is, in China and Taiwan. For large batches and mass production, the choice of production site becomes all the more obvious. However, what is a large batch for us is not that large for the Chinese factories, so quality and its assurance get less attention. Our Chinese colleagues have no desire to develop their own standards, test methods and test benches for every small customer.

An example from personal experience: a factory once produced, for one of the companies where I worked, fifty server motherboards with identical MAC addresses on their network interfaces, which led to interesting effects when the boards ran on the same network. Since no one expected such a surprise, the reason why the cluster assembled from these boards refused to work was not found immediately, and the investigation took considerable time.

Of course, you can file a claim and send all the “bad” boards back to the factory, but first, it takes a long time; second, it is expensive; and third, nothing guarantees that the replacement boards will not have some other defect. All these problems could have been avoided if someone had checked the uniqueness of the MAC addresses before the boards reached the customer.

It also often happens that MAC addresses are simply not programmed, that chips on the board are unsoldered or damaged, and sometimes that a component is not installed at all. In my experience, Chinese factories have a peculiar relationship with the firmware versions they ship boards with. Depending on the mood in production, the same device can come with as many as 3-4 different firmware versions, and, of course, not all of them are release builds or even working ones.

Stop this disgrace!


The issue of ensuring the quality of serial products can be solved by sending a staff member on a business trip to the factory. But this is a bad option, given that there may be more than one factory (in our case, there are five) and that setting up production may take considerable time. Teaching the Chinese our quality control methods is also not an option, because they have their own methods, which, however, most often do not suit us.

Our solution for quality control of serial products is the “black box”: a device that is as non-interactive as possible during the test, takes up little space and requires minimal handling by production personnel. Roughly speaking, the box is connected to the board under test, and after some time a green or red light comes on. Green: everything is OK, the board works and can be packed and shipped. Red: not OK, the board has problems that must be fixed, after which it is tested again, and so on until it passes.

Of course, we appreciate the ingenuity of our Chinese colleagues, so we provided several levels of protection against manipulation of the results, and reporting 146% of tests as passed will not work. To this end, every test run is logged; the logs are stored in the “black box” in encrypted form and sent to us. For example, if a board was not tested with the “black box” at the factory, we find out by looking at the logs and the database records. No record means the board was not tested and should not have been sent to us. There is also protection against manually “generating” logs and test results. We will not describe it, of course.

What's in the black box?


The black box itself is a compact computer such as an Intel NUC or Gigabyte BRIX, connected to the network via Wi-Fi or Ethernet and to the board under test via USB adapters. A monitor can be connected to the computer to display the testing progress and, in case of errors, the errors themselves. We are also considering a switch to the Mini-ITX form factor so that the configuration can be varied more widely.

The computer's disk contains a Linux-based image with the set of test utilities needed for a specific type of motherboard. For each board type we prepare a separate image and a separate set of test scripts. For server motherboards specifically, in addition to the test server image, the computer also holds the image of the test PXE client.

The motherboard loads the test image over PXE. It is connected via Ethernet to the “black box”, and its BMC receives an IP address via DHCP. The board is powered on through the BMC, the test image boots, and the tests run. The results are displayed on the monitor connected to the “black box”, and then the board's power is automatically turned off via the BMC. After that, the production worker can disconnect the board from the test bench and test the next one. This continues until the required number of boards has been tested.
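
The cycle above can be sketched roughly as follows. This is only an illustration in Python, not the actual black-box software: `power_on`, `run_tests` and `power_off` are hypothetical hooks standing in for the BMC power commands and for waiting until the PXE-booted image reports its results. The one point the sketch makes is that power is cut even when the test run itself crashes.

```python
from typing import Callable, List


def run_test_cycle(
    power_on: Callable[[], None],
    run_tests: Callable[[], List[str]],
    power_off: Callable[[], None],
) -> List[str]:
    """Run one board through the cycle and return the names of failed tests.

    All three callables are hypothetical hooks: power_on/power_off would
    talk to the BMC, run_tests would collect results from the test image.
    """
    power_on()
    try:
        return run_tests()
    finally:
        # Runs on success and on exceptions alike, so the board is always
        # powered down before the operator disconnects it.
        power_off()
```

With stub hooks, `run_test_cycle(lambda: None, lambda: ["BMC version"], lambda: None)` returns the list of failed tests produced by `run_tests`.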

YADRO Blackbox PXE boot menu

Our “black box” is connected via Ethernet not only to the board under test but also to the Internet: it periodically “phones home” to our server and sends the logs of the tests performed, for example, over the past day. This way we see how many boards were tested and how many were defective, and we can roughly predict when we will receive the long-awaited batch of boards.
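
As a rough illustration of what can be computed from such logs, here is a minimal Python sketch. The record layout (`board_id`, `failed`) is an assumption made for the example, not the real log format, and so is the rule that a board retested after repair counts once, by its latest result.

```python
from typing import Dict, Iterable, List, Tuple


def summarize(records: Iterable[Dict]) -> Tuple[int, int]:
    """Return (boards_tested, boards_defective) for a set of log records.

    Assumed record shape: {"board_id": str, "failed": list of test names}.
    Records are assumed to arrive in time order, so a later record for the
    same board replaces an earlier one.
    """
    latest: Dict[str, List[str]] = {}
    for record in records:
        latest[record["board_id"]] = record["failed"]
    defective = sum(1 for failed in latest.values() if failed)
    return len(latest), defective
```

For example, a day with board A failing and then passing after rework, plus board B passing, gives 2 boards tested and 0 defective.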

What is checked and how?


A little about the tests themselves. Since the production line is not ours and time is limited, only a small amount of it is allotted for testing: about 5 minutes per board. For motherboards, in that time our test software manages to boot the operating system on the board and check the uniqueness of MAC addresses and other unique IDs, the firmware versions of the microcontrollers and the BMC, sensor readings and operation, the operation of all expansion slots and processor sockets, whether all processors and memory modules are visible to the operating system, and so on.
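
The MAC uniqueness check in particular is simple to sketch. The Python below is an illustration only (the real tests are mostly bash); the function names are hypothetical, and treating all-zero or all-FF addresses as "not programmed" is an assumption about how an unflashed interface typically looks.

```python
from typing import Iterable, List, Set


def normalize_mac(mac: str) -> str:
    """Reduce a MAC address to 12 lowercase hex digits."""
    digits = mac.replace(":", "").replace("-", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def find_mac_problems(board_macs: Iterable[str], seen: Set[str]) -> List[str]:
    """Return problems with one board's MACs and remember the good ones.

    `seen` holds every normalized MAC accepted so far in the batch.
    """
    problems = []
    for mac in board_macs:
        key = normalize_mac(mac)
        if key in ("000000000000", "ffffffffffff"):
            # Assumption: a blank or erased address means it was never flashed.
            problems.append(f"{mac}: not programmed")
        elif key in seen:
            problems.append(f"{mac}: duplicate")
        else:
            seen.add(key)
    return problems
```

Keeping `seen` across the whole batch is what would have caught the fifty identical boards: the first one passes, and every later one is reported as a duplicate.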

A short cycle of load tests is also run to “warm up” the board and expose assembly-related problems under load. In other words, these tests are a kind of smoke test for quickly spotting obvious problems. For ease of support and development, the tests are mostly written in bash and run under Linux.

The test results shown to the factory employee on the monitor look something like this (they vary with the type of board under test):

PASSED 34 of 36 tests
FAILED: CPU2 temp; BMC version


This is an example of a run with errors. The error codes and their descriptions are usually supplied as a separate file or printed on a sheet of paper, so that a factory employee can understand what exactly the test “disliked”.

The list of tests and the expected values are kept in a separate configuration file. It specifies, for example, the minimum and maximum temperature readings for the sensors of a given board, or the number of devices that should be visible on the PCI bus or on USB. During the test, the measured values are compared with those in the configuration file. A test passes or fails depending on whether the measured values fall within the configured limits.
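
A minimal sketch of this comparison, in Python for illustration. The limit values, the test names and the in-code `LIMITS` table are made up for the example; the real configuration file format is not described here. The report mimics the on-screen summary shown earlier.

```python
from typing import Dict, List, Tuple

# Hypothetical limits: test name -> (minimum, maximum), both inclusive.
LIMITS: Dict[str, Tuple[float, float]] = {
    "CPU1 temp": (20, 85),
    "CPU2 temp": (20, 85),
    "PCI devices": (12, 12),  # an exact device count is expected
}


def evaluate(measured: Dict[str, float], limits=LIMITS) -> List[str]:
    """Return the names of failed tests; a missing reading also fails."""
    failed = []
    for name, (low, high) in limits.items():
        value = measured.get(name)
        if value is None or not (low <= value <= high):
            failed.append(name)
    return failed


def report(measured: Dict[str, float], limits=LIMITS) -> str:
    """Format the result the way the operator sees it on the monitor."""
    failed = evaluate(measured, limits)
    lines = [f"PASSED {len(limits) - len(failed)} of {len(limits)} tests"]
    if failed:
        lines.append("FAILED: " + "; ".join(failed))
    return "\n".join(lines)
```

Expressing an exact count as a range with equal bounds keeps a single comparison rule for every test, which is one way to keep the configuration file uniform.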

To make each tested board identifiable, every board is marked with a unique ID; during the test the ID is read and written to the database. A board that has not been tested before has no ID yet, so a new one is generated and written to it. On different boards the ID is stored in different places, depending on which chips with non-volatile memory the board has. That is, the location of the ID depends on the architecture of the particular board.
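
The read-or-generate logic can be sketched like this, again in Python purely for illustration. `read_id` and `write_id` are hypothetical hooks for whatever non-volatile memory a given board offers, a plain set stands in for the database, and using a random UUID as the ID format is an assumption.

```python
import uuid
from typing import Callable, Optional, Set


def ensure_board_id(
    read_id: Callable[[], Optional[str]],
    write_id: Callable[[str], None],
    database: Set[str],
) -> str:
    """Return the board's ID, generating and storing one on first test."""
    board_id = read_id()
    if board_id is None:
        # First time on the bench: assign a fresh ID and persist it on the board.
        board_id = uuid.uuid4().hex
        write_id(board_id)
    database.add(board_id)  # record that this board has been through testing
    return board_id
```

Because the storage location differs from board to board, passing the read and write operations in as functions keeps this logic the same for every board type.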

Using the “black box” solved several problems for us at once:


All the testing procedures described above are performed by QC (quality control) specialists at the partner factories. These quick checks filter out the most common problems. Boards that pass this basic filter come to us, and we check them in more detail. After that there is one more verification stage, this time as part of the final product.

QA and QC


Just in case, a few words about the difference between QA and QC. It would seem that both do testing, but there are significant differences. One of the goals of QA is to develop and maintain quality control procedures; this is how the quality of serial products manufactured by the factories is ensured. For this to be possible, a QA engineer must know well enough what is being tested, what results must be obtained, and what quality criteria are built into the test procedures.

As a rule, a QC specialist does not know, and does not have to know, how a particular board works. His task is to follow the instructions and, roughly speaking, to watch the color of the light on the test bench issued to him. To QC it makes no difference what is being checked, whether complex hardware systems or the brightness of LEDs. He follows the instructions and, accordingly, gets a pass or fail result for the quality checks.

After QC has checked the entire batch and the customer, that is, we, has received it, QA comes into play again, as asmolenskiy colorfully described in one of our previous articles. Individual samples from the received batch are tested again, both on their own and together with other boards and devices. But QA checks may result in a new revision of one or more boards, and then everything starts over.

Source: https://habr.com/ru/post/336822/

