
"Reliability and dependability as in Google" - and not only: translation of the article "Calculation of reliability of service"




The main task of commercial (and non-commercial) services is to always be accessible to users. Failures happen to everyone; the question is what the IT team does to minimize them. We have translated "Calculation of Service Reliability" by Ben Treynor, Mike Dahlin, Vivek Rau, and Betsy Beyer, which uses Google as an example to explain why 100% is the wrong target for a reliability metric, what the "four nines" mean in practice, and how to mathematically reason about the acceptable number of large and small outages of a service and/or its critical components: the expected amount of downtime, the failure detection time, and the service recovery time.



Calculation of service reliability



Your system is only as reliable as its components.



Ben Treynor, Mike Dahlin, Vivek Rau, Betsy Beyer



As described in the book "Site Reliability Engineering: How Google Runs Production Systems" (hereinafter, the SRE book), Google product and service development can sustain a high rate of releasing new features while maintaining aggressive SLOs (service-level objectives) for high reliability and responsiveness. SLOs require that the service be in good condition almost always and be fast almost always. At the same time, SLOs specify exactly what "almost always" means for a particular service. SLOs are based on the following observations:



In general, for any software service or system, 100% is the wrong target for a reliability metric, since no user can tell the difference between 100% and 99.999% availability. Between the user and the service there are many other systems (their laptop, home Wi-Fi, ISP, power grid, and so on), and all of those systems are available far less than 99.999% of the time. Therefore, the difference between 99.999% and 100% is lost in the noise of random factors caused by the unavailability of other systems, and the user gains nothing from the enormous effort spent on achieving that last fraction of a percent of availability. Serious exceptions to this rule include anti-lock braking systems and pacemakers!



For a detailed discussion of how SLOs relate to SLIs (service-level indicators) and SLAs (service-level agreements), see the "Service Level Objectives" chapter of the SRE book. That chapter also describes in detail how to select metrics that are relevant to a particular service or system, which in turn determines the choice of an appropriate SLO for that service or system.



This article extends the topic of SLOs to focus on the component parts of services. In particular, we will look at how the reliability of critical components affects the reliability of a service, as well as how to design systems to mitigate the impact of critical components or reduce their number.



Most of the services offered by Google aim to provide 99.99% availability (sometimes called "four nines") to their users. For some services, the user agreement specifies a lower number, but internally the 99.99% target is retained. This higher bar is an advantage in situations where users express dissatisfaction with service performance long before the agreement is breached, since the SRE team's goal number one is to keep users happy with the services. For many services, the internal target of 99.99% is a "sweet spot" that balances cost, complexity, and reliability. For some services, notably global cloud services, the internal target is 99.999%.



99.99% reliability: observations and conclusions



Let's look at a few key observations and conclusions about designing and operating a service with 99.99% reliability, and then move on to practice.



Observation number 1: Causes of failure



Failures occur for two main reasons: problems with the service itself and problems with critical components of the service. A critical component is a component that, in the event of a failure, causes a corresponding failure of the entire service.



Observation number 2: Mathematics of reliability



Reliability depends on the frequency and duration of downtime. It is measured in terms of:

- the frequency of downtime (or its inverse, the mean time between failures);
- the duration of downtime: the time it takes to detect a failure plus the time it takes to recover from it.
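
To make these numbers concrete, here is a quick helper (a sketch added for illustration, not part of the original article) that translates an availability target into the downtime it permits per year and per 30 days:

```python
# Translate availability targets into allowed downtime.

MINUTES_PER_YEAR = 365 * 24 * 60      # 525,600
MINUTES_PER_30_DAYS = 30 * 24 * 60    # 43,200

for target in (0.999, 0.9995, 0.9999, 0.99999):
    per_year = (1 - target) * MINUTES_PER_YEAR
    per_month = (1 - target) * MINUTES_PER_30_DAYS
    print(f"{target:.3%} -> {per_year:6.1f} min/year, {per_month:5.1f} min/30 days")

# 99.990% -> 52.6 min/year, 4.3 min/30 days: the "four nines" budget
```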





Conclusion number 1: the rule of additional nines



A service cannot be more reliable than all of its critical components combined. If your service aims for 99.99% availability, then all of its critical components must be available significantly more than 99.99% of the time.

Inside Google, we use the following rule of thumb: critical components must provide one additional nine relative to your service's stated reliability (in the example above, 99.999% availability), because any service has several critical components as well as problems of its own. This is called the "rule of additional nines."

If you have a critical component that does not provide enough nines (a fairly common problem!), you must minimize the negative consequences.
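
A rough sketch of the arithmetic behind this rule (the specific figures below are illustrative assumptions): in the simplest model, a service's unavailability is at least its own intrinsic unavailability plus that of every unique critical component, which is why the components need to be roughly ten times more reliable than the service itself.

```python
# Simplified model: total unavailability is approximately the sum of the
# service's own unavailability and that of each critical component.
# All numbers below are hypothetical.

own_unavailability = 0.00005        # the service's own failures: 0.005%
components = [0.00001] * 5          # five critical components at 99.999% each

total_unavailability = own_unavailability + sum(components)
print(f"Best-case service availability: {1 - total_unavailability:.4%}")  # 99.9900%
```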



Conclusion number 2: Mathematics of frequency, detection time and recovery time



A service cannot be more reliable than the product of its incident frequency and its detection-plus-recovery time allows. For example, three complete outages of 20 minutes each per year add up to 60 minutes of downtime. Even if the service worked perfectly during the rest of the year, 99.99% reliability (no more than about 53 minutes of downtime per year) would be impossible.

This is a simple mathematical observation, but it is often overlooked.



Conclusion from conclusions number 1 and number 2



If the level of reliability your service relies on cannot be achieved, you must make an effort to correct the situation, either by increasing the availability of your service or by minimizing the negative consequences as described above. Lowering expectations (that is, the stated reliability) is also an option, and often the right one: make it clear to the services that depend on yours that they must either rebuild their systems to compensate for the uncertainty in your service's reliability or lower their own service-level objectives. If you do not correct the mismatch, a sufficiently long outage will inevitably force corrections.



Practical use



Let's consider an example service with a reliability target of 99.99% and work out the requirements for both its components and its handling of outages.



Numbers



Suppose your 99.99% available service has the following characteristics:





The mathematical calculation of reliability will be as follows:
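
Here is a minimal sketch of that calculation in code. The concrete outage counts, detection and recovery times, and component figures below are illustrative assumptions, not numbers taken from the original example.

```python
# Hypothetical inputs for a 99.99% service; every concrete number here is an
# assumption chosen only to illustrate the calculation.

MINUTES_PER_YEAR = 365 * 24 * 60                # 525,600 minutes

target = 0.9999
downtime_budget = (1 - target) * MINUTES_PER_YEAR
print(f"Annual downtime budget: {downtime_budget:.1f} min")          # ~52.6 min

# Outage response: how much budget does an assumed incident profile consume?
outages_per_year = 3                            # hypothetical full outages
detect_min, recover_min = 5, 10                 # hypothetical detection / recovery times
spent = outages_per_year * (detect_min + recover_min)
print(f"Spent on outages: {spent} min, remaining: {downtime_budget - spent:.1f} min")

# Component requirements under the rule of additional nines: each critical
# component should be roughly 10x more reliable than the service itself.
component_target = 1 - (1 - target) / 10
print(f"Each critical component: at least {component_target:.4%} availability")
```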



Component Requirements





Outage Response Requirements





Conclusion: levers for increasing service reliability



It is worth looking closely at the figures presented above, because they highlight a fundamental point: there are three main levers for increasing the reliability of a service:

- reducing the frequency of outages;
- reducing the duration of outages (the time needed to detect them plus the time needed to recover);
- reducing the number of critical components, and holding the remaining ones to the rule of additional nines.





Clarifying the "rule of additional nines" for nested components



A casual reader might conclude that each additional link in a chain of dependencies requires an additional nine, so that second-order dependencies need two additional nines, third-order dependencies need three, and so on.



This is the wrong conclusion. It is based on a naive model of the component hierarchy as a tree with a constant branching factor at each level. In such a model, as shown in Fig. 1, there are 10 unique first-order components, 100 unique second-order components, 1,000 unique third-order components, and so on, for a total of 1,111 unique services even if the architecture is limited to four layers. An ecosystem of highly reliable services with that many independent critical components is clearly unrealistic.



Fig. 1 - Component hierarchy: an incorrect model



A critical component by itself can cause a failure of the entire service (or a segment of it), no matter where it sits in the dependency tree. Therefore, if a given component X appears as a dependency of several first-order components, X should be counted only once, since its failure will ultimately lead to a service failure regardless of how many intermediate services are also affected.
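
A small sketch of this counting rule (the dependency graph below is hypothetical): collecting the transitive critical dependencies into a set makes a shared component such as X count exactly once, however many paths lead to it.

```python
# Hypothetical dependency graph: service -> its critical dependencies.
deps = {
    "A":  ["B1", "B2"],
    "B1": ["X", "C1"],
    "B2": ["X", "C2"],
}

def critical_components(service, deps):
    """Return the set of unique critical components reachable from `service`."""
    seen, stack = set(), [service]
    while stack:
        for dep in deps.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

print(sorted(critical_components("A", deps)))   # ['B1', 'B2', 'C1', 'C2', 'X'], X counted once
```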



The correct reading of the rule is as follows:





Fig. 2 - Components in the hierarchy



For example, consider a hypothetical Service A with an error budget of 0.01%. The service owners are willing to spend half of this budget on their own bugs and losses, and half on critical components. If the service has N such components, each of them receives 1/N of the remaining error budget. Typical services often have 5 to 10 critical components, so each of them may consume only one-tenth to one-twentieth of Service A's error budget. Therefore, as a general rule, the critical parts of a service must have one additional nine of reliability.
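
The same allocation in code; only the half-and-half split from the paragraph above and the illustrative component counts are assumed.

```python
service_slo  = 0.9999
error_budget = 1 - service_slo          # 0.01% total
own_share    = error_budget / 2         # reserved for Service A's own failures
dep_share    = error_budget / 2         # spread across critical components

for n in (5, 10):
    per_component = dep_share / n
    print(f"{n} components -> each may consume {per_component:.4%} "
          f"(i.e. ~{1 - per_component:.4%} availability)")

# 5 components  -> 0.0010% each, roughly one additional nine
# 10 components -> 0.0005% each
```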



Error budgets



The concept of error budgets is covered in some detail in the SRE book, but it deserves a mention here. Google's SR engineers use error budgets to balance reliability against the pace of updates. The budget defines the acceptable level of failure for a service over a certain period (usually a month). An error budget is simply 1 minus the service's SLO, so the 99.99% available service discussed earlier has a 0.01% "budget" for unreliability. As long as the service has not used up its error budget within the month, the development team is free (within reason) to launch new features, updates, and so on.



If the error budget is used up, changes to the service are suspended (with the exception of urgent security fixes and changes aimed at whatever caused the violation in the first place) until the service earns back some of the budget or until the month rolls over. Many Google services use a sliding-window SLO, so that the error budget is restored gradually. For mature services with an SLO above 99.99%, it makes sense to reset the budget quarterly rather than monthly, since the amount of allowable downtime is small.
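
A toy sketch of how a sliding-window budget behaves; the window length, data structure, and numbers here are assumptions, since the article does not describe Google's internal tooling.

```python
from collections import deque

WINDOW_DAYS = 30
SLO = 0.9999
BUDGET_MIN = (1 - SLO) * WINDOW_DAYS * 24 * 60      # ~4.3 minutes per window

class SlidingErrorBudget:
    def __init__(self):
        self.outages = deque()                      # (day, downtime_minutes)

    def record(self, day, downtime_minutes):
        self.outages.append((day, downtime_minutes))

    def remaining(self, today):
        # Outages older than the window slide out and their cost is restored.
        while self.outages and self.outages[0][0] <= today - WINDOW_DAYS:
            self.outages.popleft()
        return BUDGET_MIN - sum(m for _, m in self.outages)

budget = SlidingErrorBudget()
budget.record(day=3, downtime_minutes=3.0)
print(f"Day 10: {budget.remaining(today=10):.1f} min left")   # ~1.3
print(f"Day 40: {budget.remaining(today=40):.1f} min left")   # fully restored, ~4.3
```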



Error budgets remove the interdepartmental tension that might otherwise arise between SR engineers and product developers by giving them a common, data-driven mechanism for assessing product risk. They also give SR engineers and development teams a shared goal: developing practices and technologies that allow them to innovate faster and launch products without "blowing the budget."



Strategies to reduce and mitigate critical components



So far, this article has established what might be called the "Golden Rule of Component Reliability": the reliability of any critical component must be 10 times higher than the target reliability of the whole system, so that its contribution to the system's unreliability stays at the noise level. It follows that, ideally, the goal is to make as many components as possible non-critical. This means those components can adhere to a lower reliability bar, giving developers room to innovate and take risks.



The simplest and most obvious strategy for reducing critical dependencies is to eliminate single points of failure (SPOFs) wherever possible. The larger system must be able to operate acceptably without any given component that is not a critical dependency or a SPOF.

In reality, you most likely cannot get rid of all critical dependencies, but you can follow some system-design recommendations to optimize reliability. Although it is not always possible, it is easier and more effective to achieve high system reliability if you build reliability in during the design and planning stages rather than after the system is running and affecting real users.



Evaluating the system design



When planning a new system or service, as well as when redesigning or improving an existing one, an architecture or design review can reveal shared infrastructure as well as internal and external dependencies.



Shared infrastructure



If your service uses shared infrastructure (for example, a core database service used by several user-facing products), consider whether that infrastructure is being used correctly. Clearly identify the owners of the shared infrastructure as additional stakeholders in the project. In addition, beware of overloading components; to avoid this, carefully coordinate your launch process with the owners of those components.



Internal and external dependencies



Sometimes a product or service depends on factors beyond your company's control, such as software libraries or third-party services and data. Identifying these factors allows you to minimize the unpredictable consequences of relying on them.



Plan and design systems carefully

When designing your system, pay attention to the following principles:



Redundancy and isolation



You can try to reduce the impact of a critical component by creating several independent instances of it. For example, if storing data in a single instance provides 99.9% availability of that data, then storing three copies in three widely dispersed instances provides, in theory, an availability of 1 - 0.001³, or nine nines, if instance failures are completely uncorrelated.



In the real world, the correlation is never zero (consider backbone failures that affect many cells at once), so the actual reliability will never approach nine nines, but it will far exceed three nines.
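
The arithmetic behind this example, under the stated (and unrealistic) assumption of completely uncorrelated instance failures:

```python
single_instance_availability = 0.999
copies = 3

# The data is unavailable only if all copies are down at the same time.
p_all_down = (1 - single_instance_availability) ** copies
print(f"Theoretical availability: {1 - p_all_down:.9%}")   # 99.999999900%, i.e. nine nines
```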



Similarly, sending an RPC (remote procedure call) to one server pool in one cluster may provide 99% availability of results, while sending three simultaneous RPCs to three different server pools and accepting the first response to arrive helps achieve availability higher than three nines (see above). This strategy can also shorten the tail of the response-time distribution if the server pools are roughly equidistant from the RPC sender. (Since the cost of sending three RPCs at once is high, Google often staggers these calls in time: most of our systems wait a portion of the allotted time before sending the second RPC and a bit longer before sending the third.)
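
A sketch of such staggered ("hedged") requests in asyncio form. The `call` coroutine, the pool objects, and the stagger delay are all assumptions for illustration; the article does not specify Google's implementation.

```python
import asyncio

async def hedged_call(call, pools, stagger_s=0.05):
    """Send the RPC to the first pool; after each stagger interval, add a
    backup request to the next pool; return the first response to arrive."""
    tasks = []
    try:
        for pool in pools:
            tasks.append(asyncio.create_task(call(pool)))
            done, _ = await asyncio.wait(tasks, timeout=stagger_s,
                                         return_when=asyncio.FIRST_COMPLETED)
            if done:                       # a pool answered before the timeout
                return done.pop().result()
        # All backups sent; wait for whichever pool responds first.
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for t in tasks:                    # cancel the requests we no longer need
            t.cancel()
```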



Fallbacks and their use



Configure software launch and failover so that systems keep working when individual parts fail (fail safe) and isolate themselves when problems arise. The basic principle here is that by the time a human is brought in to switch to the fallback, you will probably have already exceeded your error budget.



Asynchrony



To keep components from becoming critical, design them to be asynchronous wherever possible. If a service waits for an RPC response from one of its non-critical parts and that part exhibits a sharp slowdown in response time, the slowdown will needlessly degrade the parent service's performance. Making the RPC to a non-critical component asynchronous decouples the parent service's response time from that component's performance. And although asynchrony can complicate the service's code and infrastructure, the trade-off is worth it.
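
A minimal sketch of the idea (the `backend` and `analytics` objects and their methods are hypothetical): the non-critical call is scheduled without being awaited, so its latency cannot stretch the parent request.

```python
import asyncio

async def handle_request(request, backend, analytics):
    # Critical path: the user-visible response depends on this call.
    result = await backend.fetch(request)

    # Non-critical path: schedule the RPC and move on; a slow analytics
    # service no longer affects this handler's response time.
    asyncio.create_task(analytics.record(request))

    return result
```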



Resource planning



Make sure all components are provisioned with sufficient resources. When in doubt, err on the side of extra capacity, but without inflating costs.



Configuration



Where possible, standardize the configuration of components to minimize discrepancies between subsystems and avoid one-off failure and error-handling modes.



Detection and troubleshooting



Make detecting, troubleshooting, and diagnosing problems as simple as possible. Effective monitoring is the most important factor in detecting problems in time. Diagnosing a system with deeply nested components is extremely difficult. Always have on hand a way to mitigate errors that does not require deep intervention by the on-call engineer.



Fast and reliable rollback to the previous state



Including manual work by on-call engineers in the failure-mitigation plan significantly reduces the chances of meeting tight SLO targets. Build systems that can roll back to a previous state easily, quickly, and smoothly. As your system matures and confidence in your monitoring grows, you can lower your MTTR by building automation that triggers safe rollbacks.



Systematically check all possible failure modes



Examine each component and determine how a failure in its operation can affect the entire system. Ask yourself the following questions:





Do thorough testing



Develop and implement a robust testing environment that ensures each component is covered by tests, including the main ways other components of the environment use it. Here are some recommended strategies for such testing:





Plan for the future



Expect scale-related changes: a service that starts out as a relatively simple binary on a single machine may acquire many obvious and not-so-obvious dependencies when deployed at larger scale. Each order of magnitude of scale will expose new constraints, not only on your service but also on your dependencies. Consider what happens if your dependencies cannot scale as fast as you need them to.

Also keep in mind that system dependencies evolve and the list of dependencies may grow over time. When it comes to infrastructure, Google's typical recommendation is to build a system that scales to 10 times the initial target load without significant changes to the architecture.



Conclusion



Although readers are probably familiar with some or many of the concepts described in this article, concrete examples of their use help make their essence clearer and easier to share with others. Our recommendations are not easy, but neither are they unattainable. A number of Google services have repeatedly demonstrated reliability above four nines, not through superhuman effort or intelligence, but through the thoughtful application of principles and best practices refined over many years (see the SRE book, Appendix B: A Collection of Best Practices for Production Services).




Source: https://habr.com/ru/post/435662/


