
7 habits of successful Site Reliability Engineers (according to New Relic)

Translator's note: This is a translation of an article from the New Relic blog, which publishes similar material throughout the year on various IT specializations related to software development and operations. The author is Kevin Casey, a freelance journalist and Azbee Award winner who writes for various publications and companies (including Red Hat).



In a recent publication, we looked at the rise of the Site Reliability Engineer in modern software organizations. But holding the SRE title is one thing; we also wanted to know what it takes to succeed in the role.

So we decided to explore the characteristics and habits that truly successful SREs have in common. As with most development and operations positions, first-class technical skills are obviously critical. For an SRE, the specific skills depend on how a particular organization defines or applies the role: Google's approach to Site Reliability Engineering may require more experience in software engineering and writing code, while at another company skills in operations or quality assurance (QA) may be of greater value. However, when we looked at what makes development and operations specialists successful, it turned out that what separates the “greats” from the “good enough” is often a combination of habits and characteristics that complement technical expertise.

The seven habits presented below are based on detailed interviews with two New Relic employees: Beth Long (Software Engineer) and Jason Qualman (Site Reliability Engineer). Let's take a look:

Habit 1: You analyze every change in the context of the (much) larger picture.


Successful software developers understand how their code helps the entire business work. SREs have their own version of this trait. “You need someone who really thinks not only about everyday tasks, but also about the bigger picture. A successful SRE can understand and explain things at a higher level,” says Jason. Inside New Relic, we describe such people as “those who constantly analyze, in every change, the possible risks and its impact on the future, not just for today.” What does this mean for a large infrastructure?

Habit 2: You are pragmatic and farsighted in your analysis


The best SREs take a pragmatic approach and assess how their work will affect the rest of the system or team. This approach minimizes the likelihood that a change is "thrown over the wall without understanding how it can affect the person sitting on the other side."

“We make decisions at a very low level of the entire stack. Sometimes they can hurt everyone above. You need to understand how solving a specific problem will affect everyone else you meet along the way,” says Jason.

Habit 3: You are willing to move on when something is not helping


Part of a pragmatic approach for an SRE is the willingness to discard processes and operations that may seem appropriate but in reality are not effective. Beth recalls an example of New Relic changing its reliability practices:

“A few years ago, we went through a stage of active growth and, in order to prevent any instability associated with it, we implemented the Change Acceptance Board (CAB) process [apparently a change advisory board is meant; translator's note]. It was designed to help us evaluate releases before they launched in production, in order to protect against changes that break something and cause incidents in the future. The irony was that as the release cycle slowed down, we began to accumulate more and more changes, and the effect was the complete opposite of what was intended. These larger changes increased the risk of each release.”

In the end, the CAB process was dropped in favor of more frequent and smaller releases, which led to much better results.
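The arithmetic behind Beth's observation can be sketched in a few lines of Python. This is only an illustration under a deliberately simple assumption: every change independently has the same small chance of causing an incident. The function name and the 2% figure are hypothetical and are not New Relic data.

```python
def release_incident_probability(per_change_risk: float, changes_in_release: int) -> float:
    """Probability that a release contains at least one incident-causing change.

    Assumes, for illustration only, that each change independently
    causes an incident with the same probability.
    """
    return 1.0 - (1.0 - per_change_risk) ** changes_in_release


# Hypothetical per-change risk, chosen only to make the trend visible.
PER_CHANGE_RISK = 0.02

for batch_size in (1, 5, 25, 100):
    risk = release_incident_probability(PER_CHANGE_RISK, batch_size)
    print(f"{batch_size:>3} changes per release -> {risk:.1%} chance of an incident")
```

Under this model, batching does not make any individual change safer; each release simply becomes more likely to break as more changes pile up in it.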

Habit 4: You take every opportunity to automate


First-rate SREs successfully cope with the main difficulty of the role: how to increase the reliability of everything they do without slowing down the company's ability to deliver software quickly. The solution is almost always automation. SREs need to be proactive about time-consuming tasks, bugs, and anything else that requires manual intervention, finding new ways to automate or change the process.

“A significant component of this position is to think about inefficient and time-consuming tasks and eliminate them as soon as possible. Instead of postponing manual tasks, you say: ‘I will take the time to automate this right now and save everyone from having to engage in this painful activity,’” explains Jason.

An obsessive focus on automation is not unique to New Relic: for example, The DevOps Handbook has a whole chapter on the paradoxical effects of accepting manual processes. In SRE job descriptions, “automation” and its variants occur more often than any other words. A recent SRE job posting from Procore Technologies in Los Angeles, a maker of construction management software, lists this as the second item in its description: “Automate, automate, automate and then ... automate!” (Although only 4 days have passed since the original publication, the posting has already been closed, but this link can be used to find many other examples of “automate” in SRE job descriptions at other companies; translator's note.)

Habit 5: You can convince the organization to do what is necessary


The confidence to champion a specific automation task or another SRE initiative is another attribute that distinguishes the best SREs. You must be willing to defend your position and explain why it is critical to automate a process or do some other piece of work. This is not easy, because it can clash with the culture and pace of many traditional software organizations.


New Relic Team Rally in Portland

Good SREs live by their own engineering-oriented version of the self-help classic How to Win Friends and Influence People. Simply put, their job is to convince other people to do things they don't initially want to do: for example, to get a software engineer to focus more on the problems that may arise when scaling a product over the next several years.

The best SREs need to be effective salespeople, able to sell their colleagues on the long-term benefits of automating a particular process or project, even if it will clearly cause difficulties in the short term. The bottom line? “You have to be able to defend your position and say ‘stop’ or ‘no, we really need to do it now,’ which may be difficult in some organizations,” explains Beth.

Habit 6: You expand your skills to include new tools and approaches.


Since the concept of SRE is still new, many SREs previously held other positions. Some may have development experience, while others come from a traditional operations background. Jason and Beth point out that the most effective hiring managers are those who do not reduce the SRE role to one particular background. For example, a traditional QA engineer can be well prepared for an SRE position.

Regardless of your background, there is a good chance that an SRE position will force you to leave your comfort zone and develop new skills. For example, an operations specialist may find it useful to learn a programming language or three, while someone with a development background will need to be willing to think much more thoroughly about operational processes and difficulties than they did in the past. The best SREs embrace this path of learning and skill development.

Habit 7: You trust the process


If successful SREs have a guiding philosophy, it can be expressed like this: you are not actually chasing a Holy Grail that will prevent every possible breakdown. That rarely works. Instead, you work tirelessly to see the big picture, implement automation, encourage healthy patterns, learn new skills and tools, and improve reliability in everything you do. Perfection is never achieved, but a constant striving to make everything better is the path to follow.


New Relic's American engineers on vacation

P.S. All company photos are taken from Glassdoor.

Source: https://habr.com/ru/post/342590/

