Why SRE Documentation Is Important. Part 2

Good evening everyone!

With only one day left before the launch of the course “DevOps practices and tools”, it is time to finish publishing the remaining parts of the article “Why SRE Documentation Is Important”.

We continue.
Documents for Onboarding New Services

SRE conducts a production readiness review (PRR) to verify that a service meets the standards of operational readiness, and to make sure that the service owners understand how to draw on SRE knowledge about running large systems.

A service must pass this review before it first runs in production. (At this stage the service is supported by the development team itself, not by SRE.) The goal of this pre-launch PRR is to make sure that the service meets minimum reliability standards at the time of launch.

A second PRR takes place before the service is handed over to SRE, which may be long after launch. When an SRE team decides to take on a new service, it carries out a thorough review of the service's production state and operating practices. The goal is to make the handover smoother in terms of reliability and operational sustainability, and to give the SRE team the knowledge it needs to operate the service.

In a PRR before a handover, SRE may ask more questions and apply higher standards of reliability and operability than in a pre-launch PRR. The pre-launch PRR can be kept “lightweight” so as not to slow down the development team.

In Zoe's story, the team had no standardized process or PRR checklist, so it could overlook important questions when taking over a service. As a result, it risked running into many problems that could easily have been anticipated and solved before it took responsibility for the service.

The PRR and service handover process requires PRR templates and a process document (process doc) describing how SRE teams will work with a new service and use the templates. The templates used for a handover may be more thorough than those used at launch.

A PRR template covers several areas and ensures that the critical questions in each area are answered. Table 1 shows some of the areas and related questions that a template covers.

Area	Questions
Architecture and dependencies	What is the request path from the client to the frontend and then to the backend? Are there different types of requests with different latency requirements?
Capacity planning	What traffic volume and growth rate do you expect during and after launch? Do you have the computing capacity needed to support this traffic?
Types of failures	Are there single points of failure in your architecture? How do you mitigate the unavailability of your dependencies?
Processes and automation	Are any manual processes needed to support the service?
External dependencies	What third-party data, code, services, or events does the service or the launch depend on? Do any of your partners depend on the service? If so, do they need to be notified about the launch?

Table 1. Example of PRR Template Areas
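
A template like the one in Table 1 can also be kept in machine-readable form, so that unanswered questions are easy to spot before a launch or handover. The sketch below is only an illustration: the names PRR_TEMPLATE and unanswered_questions, and the shape of the answers dictionary, are assumptions made here and not part of any real PRR tooling.

```python
# Hypothetical PRR template: some areas from Table 1 mapped to their questions.
PRR_TEMPLATE = {
    "Architecture and dependencies": [
        "What is the request path from the client to the frontend and backend?",
        "Are there request types with different latency requirements?",
    ],
    "Capacity planning": [
        "What traffic volume and growth rate are expected during and after launch?",
    ],
    "Types of failures": [
        "Are there single points of failure in the architecture?",
    ],
}


def unanswered_questions(template, answers):
    """Return (area, question) pairs that have no non-empty answer."""
    missing = []
    for area, questions in template.items():
        for question in questions:
            if not answers.get(question, "").strip():
                missing.append((area, question))
    return missing


# Example: the development team has answered only one question so far.
answers = {
    "Are there single points of failure in the architecture?": "Yes: one database primary.",
}
for area, question in unanswered_questions(PRR_TEMPLATE, answers):
    print(f"[{area}] {question}")
```

Keeping the questions as plain data means the launch template and the more thorough handover template can share one checking function.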

Process documentation should also list the documents that SRE requests from the product development team as prerequisites for the handover. For example, SRE may ask the development team to create a playbook for standard problems.

In addition, an SRE organization needs an overview document that explains the role and responsibilities of SRE to the development team in general terms. This is necessary to set realistic expectations. This introductory document should explain what SRE is and cover the topics discussed in the previous part and at the beginning of this one, including core functions, the service life cycle, and support and maintenance duties. Its main purpose is to make sure that developers do not confuse SRE with an Ops team and do not regard answering the pager as SRE's only duty. As the story described earlier showed, if developers do not fully understand what SRE does at the time of the handover, communication problems and misunderstandings follow.

An engagement model document is also needed to clarify expectations and explain how the SRE team works with the product development team during and after the service handover. It may cover the following topics:

Service handover criteria and the PRR process.
The process for negotiating service-level objectives (SLOs) and error budgets.
New launch criteria and launch freeze policy (if applicable).
Content and frequency of service status reports from the SRE team.
SRE staffing requirements.
The feature roadmap planning process and the priority of reliability features (requested by SRE) versus new product features.

Service documentation

To support a service, SRE teams rely primarily on core operational documentation: a service overview, playbooks and procedures, postmortems, directives, and SLAs. (Note: this section appeared in the chapter “Do Docs Better” in Seeking SRE.)

Service overview

A service overview is critical for SREs to understand the service they support. SREs need to know the system architecture, its components and dependencies, and the contacts for the service and its owners. The overview is a joint effort of the development team and the SRE team; it is written to guide and prioritize SRE work and to identify areas for further study. Overviews usually come out of a PRR and should be updated as the service changes (for example, when new dependencies appear).

A basic overview gives SREs enough information to explore the service further. A full overview thoroughly describes the service and how it interacts with the world around it, and links to dashboards, metrics, and all the other information SREs need to solve unforeseen problems.

Playbook

A playbook, sometimes called a runbook, is a fundamental document that lets engineers respond to alerts from the service monitoring system while on call. For example, if Zoe's team had had a playbook explaining what the “Job Ragnarok Lean” alert meant and what to do when it fired, the incident would have been resolved in a matter of minutes. Playbooks reduce the time it takes to mitigate an incident, and they provide useful links to consoles and procedures.

Playbooks contain instructions for verifying, troubleshooting, and escalating each alert generated by network monitoring processes. The alert names in a playbook usually match the names the monitoring system generates. Playbooks contain commands and steps that must be tested for accuracy. They need regular updates when new ways of solving problems are found, when new failure modes are discovered, and when dependencies are added.
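
Because playbook entries are keyed by the alert names the monitoring system emits, finding the right entry can be a direct lookup. This is a minimal sketch, assuming a hypothetical in-memory PLAYBOOK dictionary; the field names and the entry contents are placeholders invented for illustration, not steps from a real playbook.

```python
from dataclasses import dataclass, field


@dataclass
class PlaybookEntry:
    """One playbook entry; the field names are illustrative assumptions."""
    meaning: str
    verify: list = field(default_factory=list)
    mitigate: list = field(default_factory=list)
    escalate_to: str = ""


# Keys match the alert names exactly as the monitoring system generates them.
PLAYBOOK = {
    "Job Ragnarok Lean": PlaybookEntry(
        meaning="(placeholder) what the alert means",
        verify=["(placeholder) check the job dashboard"],
        mitigate=["(placeholder) documented mitigation step"],
        escalate_to="(placeholder) owning team",
    ),
}


def lookup(alert_name):
    """Return the entry for an alert, or None so the gap can be reported."""
    return PLAYBOOK.get(alert_name)


entry = lookup("Job Ragnarok Lean")
print(entry.meaning if entry else "No playbook entry: file a documentation bug")
```

Returning None for an unknown alert makes missing entries visible, which is a natural trigger for updating the playbook.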

Playbooks are not limited to alerts. They may also contain production procedures for releases, monitoring, and troubleshooting. Other examples of production procedures include turning a service up or down, service maintenance, and emergencies and escalation.

Postmortem

SREs work with large-scale, complex, distributed systems, and they also improve services by adding new features and new systems. Given this scale and rate of change, incidents and outages are inevitable. The postmortem is an important SRE tool: it formalizes the process of learning from mistakes. In Zoe's hypothetical story, the team had no formal postmortem procedure, so there was no formal way to record the lessons of an incident and prevent its recurrence. Such a team is doomed to repeat the same mistakes again and again.

SRE teams need to create a standardized postmortem template with sections that capture the important information about an outage. Ideally, the template is structured so that it can be easily parsed by data analysis tools that report on outage trends, using postmortems as a data source. Each postmortem based on this template describes a production outage and includes at least the following information:

Timeline of events.
Description of the user impact.
Root cause.
Action items and lessons learned.

A postmortem is written by a member of the team that experienced the outage, ideally someone who took part in resolving it and can take responsibility for the follow-up. A postmortem should be written in a blameless manner. It should contain the information needed to understand what happened, as well as a list of action items that reduce the likelihood of recurrence, lessen the impact, and/or simplify recovery.
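
A template that tools can parse might store each postmortem as a structured record with the minimum fields listed above. The sketch below assumes a hypothetical JSON layout and made-up sample data; it counts outages per root cause, which is one simple kind of trend report.

```python
import json
from collections import Counter

# Made-up sample records; field names follow the minimum template fields.
POSTMORTEMS_JSON = """
[
  {"id": "pm-1", "timeline": ["10:00 alert fired", "10:20 mitigated"],
   "user_impact": "Elevated error rate for 20 minutes",
   "root_cause": "bad config push", "action_items": ["Add config validation"]},
  {"id": "pm-2", "timeline": ["02:00 alert fired", "03:10 mitigated"],
   "user_impact": "Service unavailable for 70 minutes",
   "root_cause": "capacity shortage", "action_items": ["Revise capacity plan"]},
  {"id": "pm-3", "timeline": ["15:00 alert fired", "15:05 mitigated"],
   "user_impact": "Slow responses for 5 minutes",
   "root_cause": "bad config push", "action_items": []}
]
"""

REQUIRED_FIELDS = ("timeline", "user_impact", "root_cause", "action_items")


def load_postmortems(text):
    """Parse postmortems and reject records that miss a required field."""
    records = json.loads(text)
    for record in records:
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"{record.get('id')}: missing {missing}")
    return records


def root_cause_counts(records):
    """Count outages per root cause to show recurring problems."""
    return Counter(record["root_cause"] for record in records)


records = load_postmortems(POSTMORTEMS_JSON)
print(root_cause_counts(records).most_common())
```

Rejecting incomplete records at load time keeps the trend report honest: a postmortem without a root cause is flagged instead of being silently dropped.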

Directives

Policy documents set out specific technical and non-technical rules and directives for production. Technical rules can cover, for example, production change logging, log retention, internal service naming (the naming conventions engineers must follow when implementing services), and the use of and access to emergency credentials.

Directives can also cover processes. Escalation rules help engineers classify failures as emergencies or non-emergencies and understand what action to take in each case; on-call expectations directives describe the team structure and the responsibilities of each member.

Service Level Agreement

A service level agreement (SLA) is a formal agreement with a customer about the performance the service commits to provide and the actions taken if that commitment is not met. SRE teams document the availability and latency of the service, and monitor service performance against the SLA.

Documenting and publishing the SLA, and carefully measuring the user experience and comparing it with the SLA, lets SRE teams innovate faster without sacrificing the user experience. SREs who support services with a clear SLA notice outages faster and therefore resolve them faster. An SLA also reduces friction between SRE and software engineering (SWE) teams, because the teams can discuss goals and results objectively instead of making subjective judgments about risk.

Note that an external, legally binding agreement may not apply to most SRE teams. In such cases, SRE teams can work with service-level objectives (SLOs) instead. An SLO defines the target level of service for a specific metric, such as availability or latency.
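
To make the idea of an SLO concrete, the sketch below checks measured availability against a target. The 99.9 percent target and the request counts are invented numbers, and availability is assumed here to be the fraction of successful requests, which is only one common way to define it.

```python
def availability(successful_requests, total_requests):
    """Fraction of requests served successfully (assumed definition)."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the objective as met
    return successful_requests / total_requests


def meets_slo(successful_requests, total_requests, target):
    """True if measured availability is at or above the SLO target."""
    return availability(successful_requests, total_requests) >= target


# Invented example: 1,000,000 requests, 1,200 of them failed.
TARGET = 0.999
ok = meets_slo(1_000_000 - 1_200, 1_000_000, TARGET)
print("SLO met" if ok else "SLO missed")
```

A latency SLO could be checked the same way by counting the requests that were served within a chosen time limit instead of the requests that succeeded.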

Product documentation

SRE teams aim to spend 50 percent of their time on project work: developing software that automates manual work or improves service reliability. This section describes the documents related to the products and tools that SREs develop.

These documents matter because they help users understand whether a product suits them, how to get started with it, and how to get support. They also support a good user experience and make the product easier to adopt.

“About” page

The “About” page helps SREs and product development engineers understand what a product or tool is, what it does, and how to use it.

Concept guide

A concept guide or glossary defines all the terms unique to a product. Defining terms helps maintain consistency across the documentation and in UI elements, the API, and the CLI (command-line interface).

Getting Started Guide

The purpose of a getting started guide is to get engineers up and running with a product with minimal delay. It is useful for new users who want to try the product.

Codelab

Engineers can use these tutorials, which combine theoretical explanation, code examples, and exercises, to become familiar with a product quickly. Codelabs also provide in-depth scenarios that lead engineers step by step through a series of tasks. Such tutorials are usually longer than getting started guides. They may cover more than one product or tool if the products are related.

How-to guide

This guide is for users who want to accomplish a specific task. Such guides usually take the form of step-by-step instructions.

FAQ

On the FAQ page, users can find answers to the most common questions, learn about the caveats and limitations they may encounter, and find links to documents and other pages with more detailed information.

Support

On the support page, engineers can learn how to get help with the problem they are facing. The page also provides an escalation plan, troubleshooting information, links to groups, dashboards and SLOs, and on-call information.

API description

This guide describes functions, classes, and methods, usually with a minimum of accompanying text. Such documentation is most often generated from comments in the code and is sometimes written by technical writers.

Developer's Guide

Developers use this guide to learn how to program against a product's API. Such guides are usually needed when SREs build products that expose APIs to developers, which makes it possible to create composite tools that call each other's APIs to perform more complex tasks.

Documents for service status reporting

This section describes the documents that the SRE team creates to report on the state of the services it supports.

Quarterly service review

Service status information is presented in two formats: a quarterly report reviewed by the SRE lead and shared across the SRE organization, and a presentation for the product development lead and team.

SRE leads are interested in quarterly reports because they provide insight into the following:

Support load (on-call, tickets, postmortems). SRE leads know that if support work starts to consume more than 50 percent of the SRE team's resources, they need to respond and change priorities. The goal is to spot the problem early.
SLA performance. SRE leads want to know whether the SLA is being met and whether any unhealthy components in the ecosystem pose a threat.
Risks. SRE leads want to know about the risks known to SRE that could interfere with product and business objectives.
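
The 50 percent threshold for support work is easy to check once time is tracked. This is a minimal sketch, assuming hypothetical per-quarter hour totals for on-call work, tickets, and postmortems; the numbers and the function name are illustrative only.

```python
def support_load(support_hours, total_hours):
    """Share of the team's time spent on support work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return sum(support_hours.values()) / total_hours


# Invented quarterly numbers for one SRE team, in hours.
hours = {"on-call": 900, "tickets": 500, "postmortems": 150}
load = support_load(hours, total_hours=2880)

THRESHOLD = 0.5  # the 50 percent limit mentioned above
if load > THRESHOLD:
    print(f"Support load {load:.0%}: reprioritize")
else:
    print(f"Support load {load:.0%}: within limit")
```

Tracking this ratio every quarter, rather than only when the team feels overloaded, is what allows the trend to be caught early.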

Quarterly reports give the SRE team the opportunity to:

Highlight the benefits SRE brings to the product development team, as well as the work of the SRE team.
Request prioritization of problems that hinder the SRE team (sustainability).
Request feedback on the SRE team's priorities.
Highlight the team's broader contributions.

Best practices review

This review helps SRE teams adopt best practices and reach a stable state in which operations take a minimum of time. To prepare for such reviews, SRE teams provide the team website and charter, details of on-call health, projects versus interrupts, capacity planning, and so on.

A best practices review helps the SRE team compare itself with the rest of the SRE organization and improve in key areas: on-call health, projects versus interrupts, SLOs, and capacity planning.

This concludes the second part.

Read the next part.

As usual, we look forward to your questions and comments.

Source: https://habr.com/ru/post/431436/
