Site Reliability Engineering: an anthology of Google's wisdom or a new word in DevOps

Hello, dear readers!

We believe we are not the only ones interested in the book “Site Reliability Engineering”, written by a large team of authors from Google. Not only does it continue to top all kinds of Amazon rankings; most interestingly, it gives genuinely accessible and comprehensive information about keeping systems of any complexity running smoothly.

In addition, we are interested in a more general overview book on the DevOps methodology, and we are looking forward to its release.

Since we are practically convinced that the monitor lizard and the bull will make a perfect pair, we can only hope that readers are no less interested in SRE and DevOps. We invite you to read a slightly abridged review of the book “Site Reliability Engineering”. The author of the article, Mike Dougherty, is one of the book's co-authors and also reviewed part of it.

Google worked on the book “Site Reliability Engineering” for about two years. SRE is a distinct discipline and way of organizing work, with which Google ensures the smooth and uninterrupted operation of all its colossal systems. The book has just been published in the original. It runs to more than 500 pages and describes in detail exactly how Google works. The authors write extremely openly: they do not hide the names of projects and systems, and they explain how these systems function. You will not find source code on its pages, but you will be able to adopt a great many techniques that are useful outside Google as well. The book will be very interesting to start-up employees who want to grow a large company, as well as to employees of small and medium-sized technology companies who want to increase the reliability of their services.

I should admit at once that I worked as an SRE at Google and contributed a small fragment to chapter 33, which describes how the principles of SRE can be applied outside the technology field.

Of course, this book was no revelation to me, because I worked in the very organization where it was born. But I am interested in what Google is trying to convey to the rest of the technical community. Google has long been trying to describe clearly what SRE is. It was precisely this lack of clarity that puzzled me when I took a job at Google in this specialty. But Google does an exceptionally good job of operating giant systems, and SRE is the set of practices and methods that made it possible to form the engineering culture behind that efficiency.

The good news: even if you don’t have systems like Borg or Chubby, you can still do many of the things that SREs at Google do. The book contains a lot of practical advice on how to organize this work properly, what to do and what not to do (incidentally, the book is pleasantly surprising in how sensibly it examines the mistakes that were made).

As far as I know, all the technologies mentioned in the book have already been described in public sources. In recent years there have been papers and talks on Piper, Borg, Maglev and others, so the authors also talk about them freely. Specific technologies are interesting as material for case studies, but the most interesting part is not the individual products or systems as such; it is the account of how Google implemented these projects in accordance with the principles of SRE. So this is a book about SRE, not about specific systems. Most of the material is devoted not to ready-made systems but to principles and practices that the reader can use. Admittedly, these principles and practices work better as a coherent whole than individually. Fortunately, the book has good advice for different kinds of readers; I will describe the target audience in more detail in the final part of this review.

OVERVIEW

The book is divided into five parts: Introduction, Principles, Practices (the largest section, accounting for about 60% of the book), Management and Conclusion. I would like to talk briefly about the individual chapters that I find particularly interesting or valuable to the reader. If this part doesn’t interest you, you can skip it and go straight to the “Reflections” section, where I discuss how the SRE book can be useful to particular kinds of readers.

The “Introduction” is important to read because it sets the context for all the topics that follow, so I strongly recommend that you do not skip it. The first chapter explains what SRE is and points out the differences between SRE, system administration, and DevOps. The second chapter outlines how Google’s production environment is built, from Borg to storage systems, networks, and development environments.

The Principles section builds on the material in Chapter 1 and begins with the topic of risk management. This material is essential for understanding where the strength and resilience of systems run by SRE come from. If we wanted 100% stability, we would simply not allow the developers to change anything. But that kills the business. In practice, we learn to manage the level of risk we take on while still moving as quickly as possible. Failures are acceptable as long as they fit within our error budget.
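The arithmetic behind an error budget can be sketched in a few lines. This is only an illustration under assumptions that are not in the book review: the 99.9% availability target, the request counts, and the function names (error_budget, budget_remaining, release_allowed) are all hypothetical, and a real budget policy involves far more than one comparison.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of requests that may fail in the window without violating the SLO."""
    return round((1.0 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # A 100% target leaves no budget at all, so nothing may ever change.
        return 0.0
    return (budget - failed_requests) / budget


def release_allowed(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical gate: keep releasing only while some budget is left."""
    return budget_remaining(slo, total_requests, failed_requests) > 0.0


# Assumed example: a 99.9% target over one million requests.
print(error_budget(0.999, 1_000_000))
print(budget_remaining(0.999, 1_000_000, 250))
print(release_allowed(0.999, 1_000_000, 1_200))
```

With these assumed numbers the service may fail 1,000 requests in the window; after 250 failures three quarters of the budget is left, and after 1,200 failures the gate would tell the team to slow down releases.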

Next, I’ll mention chapters 6 and 10, which explain in detail how Google monitors its colossal systems and how we are alerted to problems that arise (they also explain what “wrong” actually means, which is very important). Monitoring is probably no less complicated a problem than building the very systems that need to be monitored, and solving it is something of an art of (systems) programming.

Chapter 7 discusses how large a role automation plays in SRE. In systems as large as ours, the value of automation cannot be overstated, but as Google continues to grow, we strive to build new systems whose capabilities go far beyond automation. Only in this way can we hope to cope with operating our largest current systems, as well as the systems that will appear in the future.

One of the most important aspects of SRE is the right culture. This topic comes up throughout the book, but the most important part of it is the “postmortem culture”. Chapter 15 explains what that is and why postmortems must be blameless.

Chapter 17 discusses testing for reliability. This is one of the few chapters that disappointed me. Although it covers such important topics as load testing and canarying, these topics are not discussed in detail. Maybe the authors simply did not want to go into details, or the material had to be cut (if so, I would rather have shortened some other fragment), but either way it left an unpleasant aftertaste.

This is followed by four chapters explaining how Google organizes load balancing at various levels (chapters 19 and 20) and how to handle overload and avoid cascading failures (chapters 21 and 22). All these topics are closely interconnected, and they certainly deserve the 60 pages devoted to them. The chapters cover standard server and client implementations of backpressure, weighted round-robin load balancing, backend subsetting, request priority and criticality, as well as load shedding, query cost and much more. All these mechanisms are important for avoiding overload and cascading failures, so it is better to learn them from a book like this than from your own mistakes.
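To give a feel for one of the mechanisms listed above, here is a minimal sketch of weighted round-robin balancing. It is not Google's implementation: it uses the simple "smooth" variant with static weights, and the Backend class, the pick function, and the backend names and weights are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    weight: int       # static share of traffic (assumed, not measured)
    current: int = 0  # running score used by the smooth algorithm


def pick(backends: List[Backend]) -> Backend:
    """Choose the next backend using smooth weighted round robin."""
    total = sum(b.weight for b in backends)
    # Every backend earns its weight on each round...
    for b in backends:
        b.current += b.weight
    # ...the one with the highest score is chosen (ties go to the first)...
    chosen = max(backends, key=lambda b: b.current)
    # ...and pays back the total, so heavy backends are spread out in time.
    chosen.current -= total
    return chosen


# Hypothetical pool: "a" should receive 5 of every 7 requests.
backends = [Backend("a", 5), Backend("b", 1), Backend("c", 1)]
order = [pick(backends).name for _ in range(7)]
print(order)
```

Over every seven picks the heavy backend is chosen five times, but its turns are interleaved with the lighter ones rather than sent as one burst.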

The next two chapters, 23 and 24, deal with distributed consensus systems and with Borgcron, Google’s distributed cron service, which is built on top of such consensus. Working with distributed cron is harder than it seems, so the reader gets an instructive tour of the layers that have to be built to turn cron on a single machine into Borgcron.

Part 4 is about managing SRE teams. Since this material was not as interesting to me as the technical part of the book, I will skip straight to Chapter 32 and its discussion “Evolving Services Development: Frameworks and SRE Platform.” We believe that this kind of platform standardization work is critical to scaling SRE, and therefore Google's systems.

Part 5 tells how high reliability is achieved in other industries. This is the chapter I was asked to review before publication. I'm not sure how much it enriches the book, but it is still important to trace the common features of reliability work across different kinds of systems; in this way the authors show that Google SRE is developing in the right direction.

REFLECTIONS

So, after a fairly brief overview of the most important parts of the book, I would like to talk about where its particular value lies. Is Google just boasting, or will the reader take something away from this book? Will the techniques it describes be useful to employees of small companies? I believe so. Here you will find a lot of practical advice for open source projects and for work in small and large technology companies that lack an SRE organization as mature as Google's.

Many of the development and testing techniques described in the book are easy to apply in open source projects. Systems need to be designed with backpressure and graceful degradation in mind, with transparent monitoring, extensive testing that is not limited to unit tests, and so on.
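Two of these design ideas fit into a very small sketch. The code below is an assumption-laden illustration, not something taken from the book: the function names submit and handle, the queue size, and the "stale" fallback value are all hypothetical. Backpressure is modeled as a bounded queue that refuses new work when full, and graceful degradation as falling back to a cached answer when the fresh one cannot be produced.

```python
import queue
from typing import Any, Callable, Tuple


def submit(q: "queue.Queue[Any]", request: Any) -> bool:
    """Backpressure: accept the request only if the bounded queue has room."""
    try:
        q.put_nowait(request)
        return True
    except queue.Full:
        # Tell the caller to back off instead of queueing without limit.
        return False


def handle(fetch_fresh: Callable[[], Any], cached_value: Any) -> Tuple[Any, bool]:
    """Graceful degradation: return (value, degraded) instead of failing outright."""
    try:
        return fetch_fresh(), False
    except Exception:
        # The dependency is unavailable; serve the older cached answer.
        return cached_value, True


# Hypothetical usage: a queue of two slots and a dependency that fails.
q = queue.Queue(maxsize=2)
accepted = [submit(q, n) for n in range(3)]
result = handle(lambda: 1 / 0, "stale")
print(accepted, result)
```

The caller sees an explicit refusal for the third request and a flagged, degraded answer from the handler, which makes both conditions easy to monitor.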

How can such a book be useful in small companies where only a few engineers, typically system administrators, are responsible for operating the systems? First, it may show a different path. The point is not to implement everything described; that task seems overwhelming to me. But it is important to learn the principles themselves. The hiring and training of sysadmins should be changed so that these employees begin writing the pieces of software that are missing. A sysadmin must write blameless postmortems and fix every component that turned out not to work when the system failed. When the error budget begins to run out, the pace of releases must be slowed down. In particular, pay attention to chapter 30, “Embedding an SRE to Recover from Operational Overload”. See also chapters 1 and 28.

In larger companies, where there are enough talented engineers but the SRE process is not properly organized, the book can be useful in different ways; it all depends on where, in your opinion, the organization falls short from an engineering point of view. Maybe you still have a separate operations department, maybe you are already doing DevOps, but there may come a time when the need to change the engineering structure of the organization is ripe. Suppose you have no standards for dealing with overload and cascading failures: then invest in development work that will let you build such an infrastructure. If there are problems with load balancing, read how it is done at Google.

Finally, I must mention that the book is very easy to read.

I highly recommend this book to everyone working in DevOps, support, reliability, and development on large-scale software projects. The book not only offers a look behind the scenes at Google, which is priceless in itself, but also contains good first-hand advice from Google experts. Believe me, all the ideas set out in it are entirely achievable.

Source: https://habr.com/ru/post/281673/
