
Experience in ensuring the reliability of computing equipment during long-term operation

The tenth year of operation of a small series of computerized systems designed under my leadership is coming to an end. Without claiming that the conclusions are universal, the anniversary is a good occasion to sum up some results on the reliability of computing equipment over long periods of operation.

The product whose operating experience we are considering performs measurements in real time. It consists of a set of electronic modules of our own design and a hierarchically organized group of computers: a top-level industrial workstation of the ICP PPC-5150 type running Windows, an industrial control computer of the ICP WS-855 type with a single Rocky-C800 processor card running DOS, and an embedded Fastwel CPU-188 computer running DOS. The unit of operation at a site is a group of two products that back each other up, plus a shared kit of spare parts and accessories. In total there are about 10 operating sites in various parts of Russia (accordingly 20 products, or 80 computers counting those in the spare parts kits). The warranty period for the products is 10 years, and the intended service life is 20 years.

Overall, the 10 years of operation were successful. Thanks to a well-chosen redundancy policy (full hot standby, plus the a priori least reliable units and modules in the spare parts kit), there was not a single case in which a product could not be used for its intended purpose.
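The effect of full hot standby can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the system described here: it assumes that the two units of a pair fail independently of each other, and the per-unit availability of 0.99 is a hypothetical figure chosen only for illustration.

```python
def pair_unavailability(unit_availability: float) -> float:
    """Probability that both units of a hot-standby pair are down at once.

    Assumes the two units fail independently (a simplification).
    """
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - unit_availability) ** 2


# Hypothetical per-unit availability, for illustration only.
UNIT_AVAILABILITY = 0.99

single_down = 1.0 - UNIT_AVAILABILITY
pair_down = pair_unavailability(UNIT_AVAILABILITY)
print(f"single unit down: {single_down:.4f}, both units down: {pair_down:.6f}")
```

Independence is a strong assumption: a cause that affects a whole site, such as poor power quality, would hit both units at once and make the real figure worse than this estimate.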

Below are the main reliability conclusions for developers of computer systems, drawn from the operation of this series and, in part, of other products. What makes the collected statistics useful, and distinguishes them from the broader data of repair centers, is that the hardware configuration, the software, and the target task are identical for the products installed at the various sites. So, the conclusions:
1. A significant share of product failures (about 50% in our case) is caused by the failure of mass-produced computer components. This came as a surprise to us: we did not economize on components, and a priori we expected our own electronics to be the less reliable part because they were less mature. Across the series of products described above, we received on average one complaint about computer components per year.

2. There is an initial period of operation (several months) during which latent defects show up that did not have time to appear during the manufacturer's testing. The failure statistics of this initial period are apparently tied to undetected factory defects and differ significantly from the statistics of the later period (after a year and beyond), which are tied to the degradation of characteristics during operation. Most of the faults detected in the initial period do not recur.

3. If computer components fail in the second or third year of operation, it is likely that components of the same type will fail again later. It therefore makes sense to build up an additional repair stock based on the results of the first two or three years of operation, while components of this type are still in production.

4. The PPC-5150 computers and their components failed several times, while the WS-855 and CPU-188 and their components never failed after the initial period. Presumably this is due to the higher degree of integration, higher clock frequency, and higher temperature in the PPC-5150.

5. The probability of failure of electronic modules depends very strongly on the site where they operate. This dependence cannot be reduced to the human factor: while our sample was being collected, the operating personnel at the sites changed, but the pattern of the statistics did not. Presumably the causes come down to the quality of the power supply or to climatic conditions.

6. In compact system units, manufacturers tend to choose non-standard design solutions and to change them as needed. As a result, the nominally single ICP PPC-5150 series, for example, breaks up into several mechanically incompatible models. If the motherboard fails in an old PPC-5150, a board from a new PPC-5150 cannot be installed in its place (at least not without a jigsaw and epoxy resin), and the entire system unit has to be replaced. [A mention of Apple could go here.]

7. When designing products with a long service life, special attention should be paid to built-in power sources (batteries), whose service life is limited. For standard batteries, such as the CR2032 on a motherboard, replacement after several years can easily be planned for. Integrated modules and microassemblies that contain a battery in a non-separable package, however, can cause serious problems once they are discontinued.

8. A complete computer system unit kept in the spare parts kit can be very useful. It is much easier for the operating personnel to replace the entire system unit and then localize the fault together with the manufacturer than to try to find the defective part on the spot. In general, practice shows that the only kind of repair work that should be assigned to operating personnel without special qualifications is the replacement of defective units or modules as a whole.

9. Information recorded on DVD-R/RW survives for more than a few years only with a certain amount of luck. Long-term archival storage of information, however, deserves a separate article.
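The figures quoted in this article allow a rough order-of-magnitude estimate. The sketch below uses only two numbers from the text (about 80 computers, about one computer-component complaint per year) and assumes, as a simplification, a constant failure rate spread evenly over all computers, spares included. The conclusions above show that this assumption does not really hold (early failures, the PPC-5150 versus the other models, the dependence on the site), so the result is a fleet-wide average, not a prediction for any particular unit.

```python
import math

# Figures taken from the article (both are approximate).
COMPUTERS_IN_FLEET = 80      # computers, counting those in spare parts kits
COMPLAINTS_PER_YEAR = 1.0    # computer-component complaints per year, fleet-wide
YEARS_OF_OPERATION = 10


def annual_rate_per_computer(complaints_per_year: float, fleet_size: int) -> float:
    """Average number of complaints per computer per year."""
    return complaints_per_year / fleet_size


def prob_no_complaint(rate_per_year: float, years: float) -> float:
    """Probability of zero complaints for one computer over `years`.

    Assumes a constant failure rate (exponential model), which is a
    simplification and not something the article itself claims.
    """
    return math.exp(-rate_per_year * years)


rate = annual_rate_per_computer(COMPLAINTS_PER_YEAR, COMPUTERS_IN_FLEET)
p_clean = prob_no_complaint(rate, YEARS_OF_OPERATION)
print(f"average rate: {rate:.4f} complaints per computer per year")
print(f"chance of a complaint-free {YEARS_OF_OPERATION} years: {p_clean:.3f}")
```

Under these assumptions the average works out to roughly one complaint per 80 computer-years. An estimate of this kind is mainly useful for sizing a repair stock, as suggested in conclusion 3.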

Source: https://habr.com/ru/post/281945/

