Real-world problems: how is system reliability calculated in practice (reliability, MTTF, failure rate)?

In the previous article, we looked at the terminology and mathematical basis for calculating the fault tolerance of various systems. We found that in practice, estimates of MTTF (Mean Time To Failure) and other reliability characteristics usually assume that failures follow a Poisson model. Accordingly, their probabilistic description is based on the exponential distribution.

This article is devoted to the practical use of this model. It is worth noting right away that the model is widely used not only in electronics but in many other fields: for example, in risk assessment in the aviation and nuclear industries, forecasting in the automotive industry, assessing the reliability of cloud services on the Internet, and so on. The common assumption, I repeat, is that the failure rate λ is constant. As we saw in the previous article, λ is the reciprocal of the mean time to failure: MTTF = 1 / λ.
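
As a minimal sketch of the relation MTTF = 1 / λ, the following Python snippet converts an MTTF into a failure rate and evaluates the probability of failure-free operation exp(−λt). The function names and the 2-year MTTF are illustrative assumptions, not part of any particular tool.

```python
import math

def failure_rate(mttf):
    """Constant failure rate λ for a given MTTF (exponential model)."""
    return 1.0 / mttf

def survival_probability(mttf, t):
    """P(t) = exp(-λ t): probability of no failure up to time t."""
    return math.exp(-failure_rate(mttf) * t)

mttf_years = 2.0  # illustrative value, in years
print(failure_rate(mttf_years))                      # failures per year
print(survival_probability(mttf_years, mttf_years))  # equals exp(-1)
```

Note that under this model the probability of surviving up to t = MTTF is exp(−1), about 37%, so the MTTF should not be read as a guaranteed service life.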

Let's start with a very simple example: a device consisting of two elements with known failure rates λ₁ and λ₂. The failure of either element causes the failure of the device as a whole. For example, a computer can (as a simplification) be represented as a system consisting of a processor and a motherboard. Let their mean times to failure (MTTF) be 2 and 3 years, so that λ₁ = 1/2 year⁻¹ and λ₂ = 1/3 year⁻¹. What is the MTTF of the computer as a whole? And what is the probability that the computer fails within 1 year of the start of operation?

First, recall that under our model the probability of failure of each component by time t is
Q₁(t) = 1 − exp(−λ₁t),
Q₂(t) = 1 − exp(−λ₂t).
The probability of failure-free operation of each component is therefore Pᵢ(t) = 1 − Qᵢ(t) = exp(−λᵢt). The computer works only while both components work, so, assuming the components fail independently, its probability of failure-free operation is
P(t) = P_computer(t) = P_processor(t) · P_motherboard(t),
or, denoting the failure rate of the computer by λ(t):
exp(−λ(t)t) = exp(−λ₁t) · exp(−λ₂t) = exp(−(λ₁ + λ₂)t),
whence
λ(t) = λ₁ + λ₂.
That is, we arrive at an important conclusion: the failure rate of the system equals the sum of the failure rates of its components and does not depend on time (for an exponential distribution, of course).
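
This conclusion can be checked numerically. The sketch below is a simple Monte Carlo experiment, assuming independent exponentially distributed component lifetimes: it draws a lifetime for each component, takes the smallest one as the lifetime of the system, and compares the average with 1 / (λ₁ + λ₂). The function name, the number of trials and the seed are arbitrary choices made for illustration.

```python
import random

def simulate_series_mttf(rates, trials=200_000, seed=1):
    """Estimate the MTTF of a system that fails when its first component fails."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The system lifetime is the shortest of the component lifetimes.
        total += min(rng.expovariate(rate) for rate in rates)
    return total / trials

rates = [1 / 2, 1 / 3]    # failures per year: processor, motherboard
estimate = simulate_series_mttf(rates)
analytic = 1 / sum(rates)  # 1 / (λ1 + λ2)
print(estimate, analytic)
```

The simulated mean should land close to the analytic value, with the remaining gap being ordinary sampling noise.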

In our numerical example, λ = 1/2 + 1/3 = 5/6 year⁻¹, whence MTTF = 1 / λ = 1.2 years. Knowing λ, it is easy to calculate the probability that the computer as a whole fails during the first year:
Q(t = 1 year) = 1 − exp(−5/6) ≈ 57%,
and during the first two years:
Q(t = 2 years) = 1 − exp(−5/3) ≈ 81%.

In the same way, by simply summing the failure rates, one can calculate the MTTF of a system consisting of a larger number of components.
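
The summation rule can be sketched in a few lines of Python for any number of components. The function names are illustrative. The exponent uses the total failure rate λ = 5/6 per year, not the MTTF of 1.2 years.

```python
import math

def series_failure_rate(mttfs):
    """Total failure rate: the sum of the component rates 1 / MTTF_i."""
    return sum(1.0 / m for m in mttfs)

def series_mttf(mttfs):
    """MTTF of a system in which any component failure is a system failure."""
    return 1.0 / series_failure_rate(mttfs)

def failure_probability(mttfs, t):
    """Q(t) = 1 - exp(-λ t) for the system as a whole."""
    return 1.0 - math.exp(-series_failure_rate(mttfs) * t)

computer = [2.0, 3.0]  # MTTF in years: processor, motherboard
print(series_mttf(computer))               # about 1.2 years
print(failure_probability(computer, 1.0))  # about 0.565
print(failure_probability(computer, 2.0))  # about 0.811
```

Adding a component only requires appending its MTTF to the list. Every extra component raises the total failure rate and so lowers the MTTF of the system.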

We emphasize once again that we are talking about what is called a series connection of elements (one without redundancy), in which the failure of any element leads to the failure of the system as a whole. In this case, the system is usually divided into assemblies, and the failure rate is calculated for each of them.

The following screenshots show an example of using Windchill Quality Solutions (Relex), a professional software package for reliability and risk calculations. In practice, two situations are typical:

the MTTF of a component is known, for example from its datasheet (highlighted with a blue frame);
the MTTF of a component is unknown, in which case it has to be estimated from classifiers and reference handbooks (red frame in the screenshot).
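
These two situations can be sketched in Python as follows. This is an illustration only, not a depiction of how Windchill Quality Solutions works internally: the table HANDBOOK_RATES, its values and the part records are hypothetical placeholders, not data from any real standard.

```python
# Hypothetical handbook of typical failure rates (failures per year) by part class.
# The numbers are placeholders for illustration, not values from a real standard.
HANDBOOK_RATES = {"resistor": 0.001, "capacitor": 0.002}

def component_rate(part):
    """Use the datasheet MTTF when it is known, else fall back to the handbook."""
    if part.get("mttf_years") is not None:
        return 1.0 / part["mttf_years"]
    return HANDBOOK_RATES[part["kind"]]

assembly = [
    {"name": "CPU", "kind": "processor", "mttf_years": 2.0},  # MTTF from datasheet
    {"name": "R1", "kind": "resistor", "mttf_years": None},   # MTTF unknown
]

rate = sum(component_rate(part) for part in assembly)
print(rate, 1.0 / rate)  # total failure rate and MTTF of the assembly
```

Keeping the two sources separate makes it easy to see which part of the final estimate rests on manufacturer data and which on generic handbook values.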

Another important point is the dependence of the failure rate λ on operating conditions (heating, radiation exposure, pressure, etc.). In particular, the failure rate of electronic components increases with temperature. Component reliability data are governed by standards that differ between countries and industries. This information is usually collected in reference handbooks and presented in the form of interpolation formulas (for resistors, for example).
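
Independently of any particular handbook formula, the growth of the failure rate with temperature can be illustrated with the Arrhenius acceleration model, which is widely used in reliability engineering. The sketch below is an assumption-laden illustration and not the resistor formula from the standards mentioned above: the activation energy of 0.7 eV, the temperatures and the base failure rate are placeholder values.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_factor(t_ref_c, t_use_c, activation_energy_ev):
    """Ratio of the failure rate at t_use_c to the rate at t_ref_c (Arrhenius model)."""
    t_ref = t_ref_c + 273.15  # convert Celsius to kelvin
    t_use = t_use_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref - 1.0 / t_use)
    return math.exp(exponent)

base_rate = 0.5  # failures per year at 40 °C (hypothetical)
factor = arrhenius_factor(40.0, 60.0, 0.7)  # 0.7 eV is a placeholder value
print(factor, base_rate * factor)  # acceleration factor and rate at 60 °C
```

The factor is greater than 1 whenever the operating temperature exceeds the reference temperature, which matches the qualitative statement that heating raises the failure rate.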

In conclusion, let me reiterate that we have considered the simplest case: a constant failure rate and a series configuration. In the following articles, we will look at how to increase the reliability of a system by using redundant components.

Source: https://habr.com/ru/post/255407/
