On the issue of availability

This article explains several points about meeting availability requirements for computer systems in practice.

Please note: the article is intended for ordinary IT specialists and IT managers who need to meet formal availability requirements; it contains nothing new for specialists in reliability.

When building systems that are subject to reliability and fault-tolerance requirements, domestic engineering practice often uses two concepts: the availability factor Kg and the operational availability factor Koper. According to GOST 27.002-89,

Kg (t) = Tispr (t) / (Tispr (t) + Tprost (t)),
that is, the ratio of uptime to the sum of uptime and downtime over the service life t;

Koper (t, tau) = Kg (t) * P (tau),

where P (tau) is the probability of failure-free operation over the interval tau, that is, the probability that a system which has been operational up to a given moment will not fail during the next tau units of time.

The operational availability factor matters mainly for products whose service life is consumed intensively during operation: tools for machining hard materials, firearms, high-power lasers, and other technical systems that wear out in use. For sufficiently reliable and long-lived devices, which include computer equipment, the probability P (tau) over an interval of a few hours, typical of a work session, is close to one, so Koper usually differs very little from Kg.
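
To illustrate how close Koper is to Kg for long-lived equipment, here is a minimal sketch. It assumes an exponential failure law, P (tau) = exp(-tau / MTBF), which the article does not state; the function names and the sample figures (Kg = 0.99, an 8-hour session, a mean time between failures of 50,000 hours) are hypothetical and chosen only for illustration.

```python
import math


def survival_probability(tau_hours: float, mtbf_hours: float) -> float:
    """P(tau) under an ASSUMED exponential failure law (not from the article)."""
    return math.exp(-tau_hours / mtbf_hours)


def operational_availability(kg: float, tau_hours: float, mtbf_hours: float) -> float:
    """Koper(t, tau) = Kg(t) * P(tau), following the definition in the text."""
    return kg * survival_probability(tau_hours, mtbf_hours)


if __name__ == "__main__":
    kg = 0.99  # hypothetical availability factor
    koper = operational_availability(kg, tau_hours=8, mtbf_hours=50_000)
    print(f"Kg = {kg}, Koper = {koper:.6f}, difference = {kg - koper:.6f}")
```

With inputs of this order, the difference between the two factors appears only in the fourth decimal place.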

For computing systems, the main difficulty is usually achieving the target value of the availability factor Kg. The calculation of Kg can be approached formally or thoughtfully.

The formal approach assumes that the product can always be repaired by replacing the failed unit with one taken from the spare parts kit, provided the kit contains such a unit. Dedicated methods and ready-made software tools exist for sizing the spare parts kit from the specified reliability figures of the units, and they make it relatively easy to obtain the desired result. From a probabilistic point of view, however, the problem is that the adopted reliability model treats failures of different units as independent events, which does not hold for computing equipment over long time intervals: units in operation and units in storage often fail at about the same time.

The thoughtful approach obliges us to consider the case where the unit taken from the spare parts kit also turns out to be faulty. This is quite likely, because the degradation of computing equipment often depends more on the age of the device than on how intensively it is used. A variant of this case is that the required unit was never in the kit, because its reliability was initially estimated too optimistically. Downtime then consists of the time needed for the operating personnel to notify those responsible for the repair, for the repair organization or department to receive the faulty unit, to find and purchase a new equivalent unit (or, less fortunately, to decide on a change to the product design), to test it, configure it, ship it to the operating organization, and install it. Practice shows that for one-off units that are not stocked in quantity at a repair warehouse, downtime in this case can hardly be brought below two months, given that the procurement lead time alone can reach 60 days or more for some components.

Of course, when purchasing equipment for critical applications, it is preferable to sign a service contract with the manufacturer that guarantees quick replacement of failed components. However, such contracts are rarely available for more than 5 years, which is often shorter than the planned service life of industrial systems.

Let us solve the simple proportions that follow from the availability formula:

Tispr1 / (Tispr1 + 2 months) = 0.95

and

Tispr2 / (Tispr2 + 2 months) = 0.99

for the typical availability factor values 0.95 and 0.99.

We get: Tispr1 = 38 months (just over 3 years) and Tispr2 = 198 months (about 16.5 years).
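
These proportions can be checked numerically. Solving Tispr / (Tispr + D) = Kg for Tispr gives Tispr = Kg * D / (1 - Kg), where D is the downtime per failure. The sketch below uses that closed form; the function name required_uptime is hypothetical, and the two-month downtime is the figure taken from the text.

```python
def required_uptime(kg: float, downtime: float) -> float:
    """Uptime needed between failures to reach availability kg.

    Solves uptime / (uptime + downtime) = kg for uptime.
    The result is in the same units as downtime.
    """
    if not 0.0 < kg < 1.0:
        raise ValueError("kg must be strictly between 0 and 1")
    return kg * downtime / (1.0 - kg)


if __name__ == "__main__":
    for kg in (0.95, 0.99):
        months = required_uptime(kg, downtime=2.0)  # 2 months, as in the text
        print(f"Kg = {kg}: Tispr = {months:.0f} months ({months / 12:.1f} years)")
```

Because 1 - Kg sits in the denominator, each step closer to one raises the required uptime sharply, which is why 0.99 is so much harder to reach than 0.95.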

Thus, to ensure an availability factor of Kg = 0.95, the units in the operated product (and in its spare parts kit) must have an expected service life of at least 3 years, and the spare parts kit must be replenished within 2 months. These conditions look realistic, and restoring the product by replacing failed units from the spare parts kit is an adequate strategy in this case.

The picture is different for Kg = 0.99. To achieve an availability factor of 0.99, one of three conditions must hold: the expected service life of every unit exceeds about 16 years; repairs are completed in less than 2 months when no spare is on site; or a serviceable spare for every unit is always on site throughout those 16 years. Given the current state of affairs, the first two conditions look unrealistic. The third cannot be met by passively storing spare parts, because after 16 years it is likely that when a unit fails, its replacement from the kit will also be out of order. The only way to meet it is to monitor the condition of all units continuously, spares included, and to replace any that fail. A hot spare strategy provides exactly this kind of monitoring.
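
The repair-time condition can be quantified by inverting the availability formula: for a given uptime between failures Tispr, the downtime per failure must not exceed D = Tispr * (1 - Kg) / Kg. A minimal sketch follows; the function name max_downtime and the 60-month (5-year) uptime used in the example are assumptions made for illustration, not figures from the article.

```python
def max_downtime(kg: float, uptime: float) -> float:
    """Largest downtime per failure that still yields availability kg.

    Solves uptime / (uptime + downtime) = kg for downtime.
    The result is in the same units as uptime.
    """
    if not 0.0 < kg <= 1.0:
        raise ValueError("kg must be in the interval (0, 1]")
    return uptime * (1.0 - kg) / kg


if __name__ == "__main__":
    uptime_months = 60.0  # ASSUMED uptime between failures, for illustration
    for kg in (0.95, 0.99):
        d = max_downtime(kg, uptime_months)
        print(f"Kg = {kg}: downtime per failure must stay under {d:.2f} months")
```

With the assumed 60 months of uptime, an availability factor of 0.99 leaves well under one month per failure, far less than the two-month repair cycle described above.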

Conclusions:

1. For modern computer equipment under typical operating and maintenance conditions, an availability factor of 0.95 can be achieved with a recovery strategy based on replacing failed units from the spare parts kit.

2. For modern computer equipment under typical operating and maintenance conditions, an availability factor of 0.99 cannot be achieved for one-off products with spare parts alone; it requires hot standby or another method of continuously monitoring all units, including spares.

Source: https://habr.com/ru/post/281723/
