An example of calculating the "availability factor" of an IT system

Task: the Technical Specification for an integrated IT system contained the clause “calculate the system availability factor”.

Solution: use the GOST standards, request additional data on the equipment items from vendors, and perform the final calculation with simple math.

Normative references:

GOST R 27.002-2009 Reliability in engineering (SSNT). Terms and definitions

GOST R 27.003-2011 Reliability in engineering (SSNT). Reliability management. Reliability specification guide

GOST 27.002-89 Reliability in engineering (SSNT). Basic concepts. Terms and definitions

According to GOST R 27.002-2009, the availability factor (in the field of reliability in engineering) is the probability that a product is in working condition at a given moment in time, determined in accordance with the design under the specified operating and maintenance conditions.

Thus, availability reflects the ability of the system to perform its functions continuously.

More generally, for information and computing devices, availability is the probability that a computer system is in working condition at an arbitrary moment in time.

The availability factor (K) is determined by the formula:

K = MTBF / (MTBF + MTTR),

where:
- MTBF (Mean Time Between Failures) is the average operating time between failures;
- MTTR (Mean Time To Repair) is the average time needed to restore the system after a failure.

Unlike reliability, which is determined by the MTBF alone, availability also depends on the time required to return the system to a working state.
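The effect of the recovery time is easy to check numerically. Below is a minimal Python sketch; the function name `availability` and the sample figures (an MTBF of 50,000 hours, recovery times of 4 and 24 hours) are illustrative assumptions, not values from the calculation in this article.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Availability factor K = MTBF / (MTBF + MTTR).

    Both arguments must use the same time unit (for example, hours).
    """
    if mtbf <= 0 or mttr < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf / (mtbf + mttr)


# Hypothetical device: the same MTBF, two different recovery times.
k_fast = availability(mtbf=50_000, mttr=4)
k_slow = availability(mtbf=50_000, mttr=24)
print(f"MTTR  4 h: K = {k_fast:.6f}")
print(f"MTTR 24 h: K = {k_slow:.6f}")
```

With the MTBF unchanged, the shorter recovery time gives the higher availability, which is exactly the dependence described above.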

So, we have a certain IT system (rack-mount servers, blade servers, a data storage system).

Fault tolerance at the hardware level allows the services of such an IT system to keep running when individual components of the server hardware, data storage system, or infrastructure fail.

Fault tolerance of the internal components of the IT system is achieved through the following measures:

redundant power supply units in the server equipment and data storage systems;
redundant server network adapters;
redundant server optical adapters;
redundant cable lines connecting the servers to the data networks and storage networks;
duplicated blade chassis modules: power supplies, control modules, fans, switching modules;
placement of data on the disk storage systems in fault-tolerant disk groups (RAID).

As a result, all the main components of the IT system equipment (servers, power supplies, disk drives, network adapters, switches) are redundant and hot-swappable.

The IT system equipment is powered from two independent sources. Its connections to the external data networks and storage networks are duplicated as well.

All subsystems of the IT system are redundant, so if any single element fails, the equipment as a whole remains operational. Moreover, the failed element can be replaced without stopping the equipment.

The probability (P) that a single component fails within one year is:
P = 1 / MTBF,
where MTBF is expressed in years.
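This expression can be read as a first-order approximation. Assuming a constant failure rate (an exponential failure model, which the text does not state explicitly) and an MTBF expressed in years, the probability of at least one failure within a year is:

```latex
P = 1 - e^{-1/\mathrm{MTBF}} \approx \frac{1}{\mathrm{MTBF}},
\qquad \mathrm{MTBF} \gg 1\ \text{year}.
```

For an MTBF of 10 years, for instance, the exact value under this assumption is about 0.095, against 0.1 from the simplified formula.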

A failure of a duplicated component leads to equipment failure only if the duplicate also fails during the time needed to hot-swap the component that failed first. If the guaranteed component replacement time is 24 hours (1/365 of a year), which matches current practice in servicing server hardware, then the probability of such an event during a year is:
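One possible reading of this condition, assuming independent failures, is to multiply the yearly failure probability of one unit (P) by the probability that its duplicate fails within the replacement window (P × 1/365). The Python sketch below follows that reading. The function name and the sample MTBF of 10 years are hypothetical, and the exact formula used in the original calculation may differ.

```python
def redundant_pair_failure_probability(mtbf_years: float,
                                       replacement_years: float = 1 / 365) -> float:
    """Approximate yearly probability that a duplicated component fails.

    Assumes independent failures: the first unit fails during the year
    (P = 1 / MTBF), and the duplicate fails within the replacement
    window (P * replacement_years).
    """
    p = 1 / mtbf_years
    return p * (p * replacement_years)


# Hypothetical component: MTBF of 10 years, 24-hour replacement window.
p_single = 1 / 10
p_pair = redundant_pair_failure_probability(10)
print(f"single unit: {p_single:.4f}, duplicated pair: {p_pair:.8f}")
```

Under these assumptions, duplication lowers the yearly failure probability by several orders of magnitude, which is why non-duplicated elements tend to dominate the final result.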

Having calculated the failure probability of each of the N components of the IT system equipment, we can obtain the probability that the equipment fails within one year by summing the individual probabilities:
Ps = P1 + P2 + ... + PN.

Since component failures are usually spread evenly over time, the probability that the IT system equipment fails during a year gives the mean time between failures (in years):
MTBFs = 1 / Ps.

The availability factor of the IT system equipment is then:
Kit = MTBFs / (MTBFs + MTTR).
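Put together, the steps above can be sketched in Python as follows. The component names and MTBF values are invented for illustration and are not the figures from the article's table. The treatment of duplicated components (P × P × replacement window) is an assumption based on the description of hot-swap redundancy, and a year is taken to be 8,760 hours.

```python
HOURS_PER_YEAR = 8760                 # assumption: 365-day year
MTTR_HOURS = 24                       # guaranteed replacement time
REPLACEMENT_YEARS = MTTR_HOURS / HOURS_PER_YEAR   # 24 h = 1/365 year

# (name, MTBF in hours, duplicated?): hypothetical values
components = [
    ("power supply", 200_000, True),
    ("network adapter", 500_000, True),
    ("storage controller", 300_000, False),
]


def yearly_failure_probability(mtbf_hours: float, duplicated: bool) -> float:
    p = HOURS_PER_YEAR / mtbf_hours   # P = 1 / MTBF with MTBF in years
    if duplicated:
        # Assumed model: the duplicate must fail within the replacement window.
        return p * p * REPLACEMENT_YEARS
    return p


# Ps: sum of the per-component yearly failure probabilities
p_system = sum(yearly_failure_probability(m, d) for _, m, d in components)
# MTBFs = 1 / Ps, converted from years to hours
mtbf_system_hours = HOURS_PER_YEAR / p_system
# Kit = MTBFs / (MTBFs + MTTR)
k_system = mtbf_system_hours / (mtbf_system_hours + MTTR_HOURS)

print(f"Ps    = {p_system:.6f}")
print(f"MTBFs = {mtbf_system_hours:.0f} h")
print(f"Kit   = {k_system:.4%}")
```

In this made-up data set, the single non-duplicated component accounts for almost all of Ps, while the duplicated ones contribute only a few millionths.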

Let us calculate the availability of IT system equipment made up of 26 components (each component consists of several elements).

The main difficulty with the table below is obtaining actual MTBF data for each component. Vendors are very reluctant to provide these data, and it is often necessary to correspond with vendor representatives to request and clarify them.

The calculation in the table below is for an “outdated” IT system. It has been running in production for nearly five years without a single component failure, yet the Customer is already planning to migrate to new components without waiting for the calculated time to failure to elapse.

(*) The baseline MTBF data are estimates provided for the manufacturer’s equipment items or their equivalents.

The resulting calculated figures for the equipment of our system:

probability of system equipment failure during a year: 0.0966;
MTBF of the system equipment (years): 10.35 (90666 hours);
average recovery time (hours): 24;
availability factor of the system equipment (%): 99.97;
average downtime per year (hours): 2.61 (156 minutes).
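The MTBF and availability figures in this list follow from the failure probability through the formulas for MTBFs and Kit. Here is a short Python check, taking Ps = 0.0966 and MTTR = 24 hours from the list and assuming a year of 8,760 hours (the variable names are illustrative):

```python
HOURS_PER_YEAR = 8760   # assumption: 365-day year

p_system = 0.0966       # probability of system equipment failure per year
mttr_hours = 24         # average recovery time

mtbf_years = round(1 / p_system, 2)                # MTBFs = 1 / Ps
mtbf_hours = mtbf_years * HOURS_PER_YEAR           # years to hours
k_system = mtbf_hours / (mtbf_hours + mttr_hours)  # Kit = MTBFs / (MTBFs + MTTR)

print(f"MTBF: {mtbf_years} years ({mtbf_hours:.0f} hours)")
print(f"Availability: {k_system:.2%}")
```

This reproduces the 10.35 years (90666 hours) and 99.97% listed above.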

The summary rows of the table show that the system contains non-duplicated storage elements, and this has a large effect on the calculated figures. If possible, duplicate these elements (as a recommendation) or use a different storage layout.

This calculation is, of course, only a rough estimate. Still, it gives a basic understanding of whether the system is adequate as designed or needs additional elements.

In practice, these calculation tables are included in the relevant section of the project documentation and handed over to the Customer.

It would be interesting to perform the same calculation for a set of network equipment (broken down as finely as possible, to the level of SFP modules and power supply units) and compare the results across vendors.

Source: https://habr.com/ru/post/418769/
