For my first ten years at Google, I worked as an ordinary engineer: I launched public transit on Maps, improved Search, and caught spam on YouTube. At some point, it turned out that next door to the SWE (Software Engineer) teams lived some mysterious SREs (Site Reliability Engineers), who spend their days in production and know everything about infrastructure, configs and monitoring. Usually they came to us with incomprehensible graphs and strongly recommended rewriting something in our service so that it would explode neatly and in pieces, rather than all at once and together with its neighbors. Or they built some piece of infrastructure that magically solved all our problems once and for all. Or they reported that there would be no second release this week, because one data center had been washed away by a hurricane, and next to another someone had buried a horse and cut the trunk cable. After a while, it became clear that you could come to these people with a wide variety of problems and leave with solutions found a couple of levels of abstraction lower than you would expect from your own product ("you did, of course, pay for the right amount of traffic, but it simply does not fit through the switch at the top of the rack").
As a result, I got curious about what all this SRE business looks like from the inside, and I signed up for Mission Control, a rotation program that lets you spend six months in the SRE role, gain valuable production experience and, if you wish, return to your previous team to share what you have learned. I stayed instead, as did two-thirds of my current Video Processing SRE colleagues, who also retrained from ordinary engineers. Now I am the one who scares SWEs with incomprehensible graphs and evacuates YouTube videos from burning data centers, with breaks for peaceful creative coding. It turned out that over fifteen years a healthy and effective SRE organization, with its own practices, principles and methods, had grown up within Google, but nobody knew about it, because no one who went in had ever come back.
The solution to this problem, with knowledge about on-call duty, SLOs and postmortems disappearing into the black hole of Google SRE, was the book “Site Reliability Engineering”, which describes in detail how this SRE of ours actually works. In fact, this whole post was written to share two pieces of news:
- Two weeks ago, a Russian translation of the aforementioned SRE book was released. If you are interested in how to build healthy DevOps practices in your company, this book is for you. If you suspect you have SRE inclinations, it is for you even more.
- Following the first book, The Site Reliability Workbook has just been released (so far in English only), with practical examples from the life of Google Cloud Platform. I also recommend it in every possible way.