Designing reliable databases. Chapter 1. Introduction

Chapter 1. Introduction

The purpose of this book is to guide you on the path to becoming a true database reliability engineer (DBRE). In the title, we deliberately used the word engineer rather than administrator.

Ben Treynor, an engineer at Google, described reliability engineering as follows:

Fundamentally, this is work that has historically been done by an operations team, but carried out by engineers with software design expertise who have both the desire and the ability to replace human labor with automation.

Today's database professionals must be engineers, not administrators. We build and create. In keeping with DevOps principles, we are all in the same boat, and there is no such thing as “someone else's problem.” As engineers, we apply knowledge and expert judgment to design, build, and operate datastores and the data structures within them. As DBREs, we must apply the fundamental principles and the in-depth knowledge of our specialty that we hold more than anyone else.
If you look at today's non-storage infrastructure, you will see systems that are easily created, executed, and destroyed programmatically and often automatically. The life cycle of these components can be calculated in days or even minutes. When one disappears, others appear in its place to maintain the quality of the service at the required level.

Our goal, then, is to learn the principles and practices of designing, building, and operating datastores within the paradigms of reliability engineering and DevOps culture. You can apply this knowledge to any database technology you are asked to work with, at any stage of your organization's growth.

DBRE Guidelines

When we started writing this book, one of the first questions we asked ourselves was: what principles underlie this new stage of the DBA profession? If we are redefining the approach to designing and managing datastores, then we must also define the foundations of the behavior we stand for.

Data protection

Traditionally, data protection has always been and still is the main principle of the DBA profession. Usually the goal was achieved by:

strict division of responsibilities between DBAs and developers;
regularly tested backup and restore procedures;
security procedures with regular audits;
expensive DBMS with guarantees of reliability and stability;
expensive data storage system (DSS) with fault tolerance of all components;
extensive control of change and administrative tasks.

In teams that are used to working together, strict separation of duties can be not only a burden, but also a brake on development. In Chapter 8, Release Management, we discuss methods for creating “safety nets” and reducing the need for segregation of duties.

More often than ever, architects and engineers choose open source DBMSs that cannot guarantee the same reliability as, for example, Oracle. Sometimes this choice brings advantages in performance and scalability. Choosing the right DBMS and understanding the consequences of that choice are discussed in Chapter 11. Knowing that different tools exist, and being able to choose among them effectively, is quickly becoming the norm.

Storage subsystems are also changing significantly. In a world where systems are often virtualized, network and ephemeral storage are used in database design. This is discussed further in Chapter 5.

Production databases on ephemeral storage.

In 2013, Pinterest moved its MySQL databases to ephemeral storage on Amazon Web Services (AWS). Ephemeral storage means that if the virtual machine crashes or shuts down, the entire contents of the disk disappear. Pinterest chose ephemeral storage for its consistently high throughput and low latency.

This choice required significant investment in automated, rock-solid backup and recovery, as well as the ability of compute nodes to run without the storage subsystem for a short time. Ephemeral storage does not support snapshots, so the only recovery method was to copy a full backup over the network, rather than restoring a snapshot and replaying the transaction log on top of it.

This example shows that data can be managed safely on ephemeral storage if you use the right methods and tools.

New approaches to data protection may look more like this:

responsibility for the data shared among cross-functional teams;
standardized, automated backup and restore processes approved by the DBRE team;
standardized security policies and procedures approved by the DBRE and security teams;
all policies enforced automatically;
data requirements and fault tolerance needs determining the choice of DBMS;
automation, common practices, and resiliency instead of expensive, sophisticated hardware;
changes built into deployment, with special attention to testing, rollback, and reducing impact.

Self-service for scale.

A talented DBRE is a far rarer commodity than a site reliability engineer (SRE; the term comes from Google's Site Reliability Engineering book). Most companies cannot afford to employ more than one or two. So we must create as much value as possible, and we do that by building self-service platforms. With standards and tools in place, teams can launch new services and make changes at the required pace without waiting on an overloaded DBA. Examples of self-service methods:

collecting the right metrics from datastores;
building backup and restore tools that can be deployed for new datastores;
defining reference architectures and configurations for datastores;
working with the security team to define standards for datastores;
building methods for applying database changes safely, with prior testing.

In other words, an effective DBRE helps others by guiding them rather than by acting as a gatekeeper.

Elimination of toil

Google's SRE team often uses the phrase “Elimination of Toil,” which is discussed in Chapter 5 of Google's Site Reliability Engineering. In that book, toil is defined as the kind of work tied to running a production service that tends to be manual, repetitive, automatable, lacking enduring value, and that grows linearly as the service grows.

Relieving the DBRE team of toil requires effective automation and standardization. In this book, we give examples of DBRE-specific toil and ways of getting rid of it. Toil is, of course, a vague concept that people perceive differently. In this book, we define it as manual, non-creative work that is repetitive and requires no real thought.

Manual database changes.

At many client sites, database engineers are asked to review and apply database changes, which may include modifications to tables and indexes as well as adding, modifying, and deleting data. Everyone feels reassured that a DBA is applying these changes and monitoring their effects in real time.

At one client, the volume of such changes was high. We estimated that applying them took about 20 hours a week. Needless to say, the unfortunate DBA who spent half of every working week on dull, monotonous work burned out and quit.

Faced with a shortage of hands, management finally allowed the database team to build a tool that lets developers apply a change set automatically once one of the database engineers has reviewed and approved it. Soon everyone came to trust the new tool, which freed the DBRE team to focus on integrating these processes into the deployment process as a whole.

Databases are not special snowflakes.

Our systems are no more and no less important than any other components serving the needs of the business. We must strive for standardization, automation, and resilience. It is important to understand that database components are not sacred. We must be able to lose any component and replace it without much difficulty. The fragile datastore in a glass room is a thing of the past.

To show the difference between a special snowflake and a service component, people often compare pets with cattle. A “pet” server is one you feed, care for, and nurse back to health when it is sick. It also has a name. At Travelocity in 2000, our servers were named after characters from The Simpsons, and our two Oracle servers were called Patty and Selma. I spent many nights with those girls. They were high maintenance!

“Cattle” servers have numbers, not names. You do not spend time customizing servers or logging in to individual hosts. When one shows signs of sickness, you remove it from the herd and keep it around for forensic examination.

Datastores are among the last remaining pets. After all, they hold “The Data” and cannot simply be replaced like cattle with short life spans and complete standardization. What about the special replication rules for our reporting replica? What about the special configuration of the replica that provides fault tolerance for the primary node? These nodes have different jobs.

Removing barriers between development and operations

Your infrastructure, configurations, data models, and scripts are all part of the software. Learn the software development process and take part in it as any engineer would: write code, test, integrate, build, test, and deploy. Did we mention testing?

For those who come from an administration and scripting background, this can be a difficult paradigm shift. In a traditional environment, the processes of designing, building, and testing products are divided among developers, system administrators, and DBAs. The paradigm shift discussed here removes this mismatch, so that DBREs and system administrators end up doing their jobs in similar ways.

Developers must learn to administer!

Administrators are often told to learn to program or “go home.” Although I generally agree, the reverse must also be true. Developers who do not understand the principles of infrastructure management will create fragile, poorly performing, and potentially insecure code.

DBREs can be embedded directly in a development team, working in the same code base, examining how the code interacts with the databases, and changing the code for performance, functionality, and reliability.

Removing these organizational barriers increases productivity and development speed over traditional models, and DBREs must adapt to these new processes and cultures.

Operations

One of the core competencies of the DBRE is operations: the design, testing, building, and running of any system with demanding scalability and reliability requirements. In other words, if you want to be a database engineer, you need to know these things.

At the macro level, operations is not a role. Operations is the sum of all the knowledge, skills, and values that your company has built up around the practice of running quality systems and software. It is your implicit values as well as your explicit ones, your habits, shared experience, and reward systems. Everyone from support staff to the CEO participates in operational outcomes.

Too often this is not done well. In many companies, the operations culture is so dreadful that it burns out anyone who comes near it. Even so, your operations culture is an emergent property of your company and of how it carries out its technical mission. So if you tell us that your company does not do operations, we simply will not buy it.

Perhaps you are a developer or a proponent of “as a service” infrastructure. Perhaps you doubt that operations is necessary for the intrepid database engineer. The idea that a cloud computing model frees developers from operational concerns is wrong. The opposite is true. This is a brave new world in which you have no in-house administrators, and in which that work is done for you by engineers at Google, Amazon, PagerDuty, DataDog, and so on. In this world, developers need to be better at operations, architecture, and performance than they are today.

Hierarchy of needs

Some of you come to this book with experience in large corporations, others with experience in startups. As with any other system, it is worth thinking about what you would do on the very first day you take responsibility for the databases. Do you have backups? Do they work? Are you sure? Is there a replica you can fail over to? Do you know how to do it? Is it on the same power strip, router, or hardware as the primary node? Would you know if backups stopped running? How would you find out?

In other words, we need to talk about database needs.

For people, there is Maslow's hierarchy of needs: a pyramid of needs that must be satisfied for us to thrive. These are physiological survival, safety, love and belonging, esteem, and self-actualization. At the base of the pyramid are the most fundamental needs, such as survival. Once they are satisfied, we move up toward self-actualization, where we can safely explore, play, create, and reach the full extent of our unique potential. That is how it works for people. Now let's apply the same approach to databases.

Survival and safety

The most basic needs of your database are backups, replication, and failover. Do you have a database? Is it running? Does it respond to ping? Is the application responding? Is it backed up? Do restores work? How would you know if any of this stopped working?

Is your data safe? Are there multiple copies of your data? Do you know how to fail over? Are the copies on different hardware with separate power supplies? Are the copies consistent? Can you restore to a point in time? How would you know if the data were corrupted?
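The question of how you would find out that backups have stopped running can be made concrete with a small freshness check. The following is a minimal sketch under stated assumptions, not a prescribed implementation: the `BackupRecord` fields, the `backup_problems` function, and the 24-hour limit are all hypothetical, and a real check would read this metadata from your own backup tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class BackupRecord:
    """Hypothetical metadata about the most recent backup of one database."""
    database: str
    finished_at: Optional[datetime]  # None if no backup has ever completed
    restore_verified: bool           # True if a test restore of it succeeded


def backup_problems(record: BackupRecord, now: datetime,
                    max_age: timedelta = timedelta(hours=24)) -> List[str]:
    """Return human-readable problems; an empty list means the backup looks healthy."""
    if record.finished_at is None:
        return [f"{record.database}: no completed backup found"]
    problems = []
    age = now - record.finished_at
    if age > max_age:
        # Assumed policy: a backup older than max_age counts as "backups stopped running".
        problems.append(f"{record.database}: last backup is {age} old (limit {max_age})")
    if not record.restore_verified:
        # A backup that has never been restored is not known to work.
        problems.append(f"{record.database}: last backup was never test-restored")
    return problems


if __name__ == "__main__":
    now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
    stale = BackupRecord("orders", now - timedelta(hours=30), restore_verified=False)
    for problem in backup_problems(stale, now):
        print(problem)
```

Running a check like this on a schedule, and alerting on a non-empty result, turns “Are you sure?” into something the system answers for you.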

We will look at these questions in more detail in the chapter about backup and recovery.

This is also the time to start preparing for scale. Scaling prematurely is not worthwhile, but growth should be taken into account now, while you are choosing identifiers for key data objects, storage systems, and architecture.

Types of scaling

We will discuss scaling quite often. Scalability is the ability of a system or service to cope with an increasing amount of work. This may be an actual ability, if the whole system was built with growth in mind, or a potential one, if the architecture allows the resources and components needed for growth to be added. There are four common ways to scale:

vertically, by adding resources (scale up);
horizontally, by duplicating systems or services (scale out);
by splitting the workload into smaller parts by function, so that each part can scale independently (functional partitioning);
by splitting the workload into identical parts that differ only in the set of data each one works on (sharding).
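The last two approaches can be illustrated with a small routing function. This is only a sketch under assumptions: the `CLUSTERS` layout, the host names, and `shard_for` are hypothetical, and hash-modulo placement is just one simple way to assign keys to shards.

```python
import hashlib

# Hypothetical layout: each function (service) owns its own pool of shards.
CLUSTERS = {
    "users": ["users-db-0", "users-db-1"],
    "orders": ["orders-db-0", "orders-db-1", "orders-db-2", "orders-db-3"],
}


def shard_for(function: str, key: str) -> str:
    """Pick a host by function first (functional partitioning), then by key hash (sharding)."""
    shards = CLUSTERS[function]
    # A stable hash, unlike the built-in hash(), gives the same answer in every process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


if __name__ == "__main__":
    print(shard_for("users", "user-42"))
    print(shard_for("orders", "order-1001"))
```

Note the trade-off in this sketch: with modulo placement, changing the number of shards moves most keys to a different host, which is why real systems often use a lookup table or consistent hashing instead.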

Specific aspects of these approaches will be discussed in Chapter 5, Infrastructure Design.

Love and belonging

Love and belonging is about making your data a first-class citizen of the software development process. It is about no longer isolating databases from your other systems. This is both a technical and a cultural matter, which is why it could also be called the “DevOps needs.” At a high level, it means that managing your databases should look and feel (as far as possible) like managing all your other systems. It also means that you encourage fluidity and cross-functionality. The love and belonging stage is when you gradually stop logging in to hosts and running ad hoc commands by hand. It is when you start using the same code review and deployment practices as everyone else.

The database infrastructure should go through the same processes as the other components of the architecture. Working with data should feel consistent with working with every other part of the application, so that anyone feels able to support the databases.

Resist the urge to instill fear in developers. It is very tempting to feel that you have everything under control. You do not; that control is an illusion. It is far better for everyone to channel that energy into building guardrails, so that it is hard to break something by accident. Train people and let anyone make their own changes. Accept that there will be failures. In other words, build a resilient system and encourage everyone to work with the databases as much as possible.

Guardrails at Etsy.

Etsy introduced a tool called Schemanator for applying database changes safely in its production environment. To let developers apply their own changes, it relied on a number of guardrails, such as:

heuristic review of changes to verify that standards are followed;
testing that the change scripts run successfully;
preflight checks that show the developer the current state of the system;
rolling changes, applied one at a time to servers that have been taken out of service;
splitting changes into subtasks, so that the process can be stopped if unforeseen problems occur.
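The first guardrail in the list, heuristic review of a change, can be sketched as a small linter. This is not how Schemanator works internally; it is a hypothetical illustration in which the `RULES` table and `lint_change` are invented for the example, and statements are split naively on semicolons (which would misfire on semicolons inside string literals).

```python
import re
from typing import List

# Hypothetical rules; a real tool would encode the team's own standards.
RULES = [
    (re.compile(r"\bDROP\s+(TABLE|DATABASE)\b", re.I),
     "destructive DROP needs manual review"),
    (re.compile(r"\bALTER\s+TABLE\b.*\bDROP\s+COLUMN\b", re.I | re.S),
     "dropping a column may break running code"),
    (re.compile(r"^\s*(UPDATE|DELETE)\b(?!.*\bWHERE\b)", re.I | re.S),
     "UPDATE/DELETE without WHERE touches every row"),
]


def lint_change(sql: str) -> List[str]:
    """Return one warning per rule violated by each statement in the change script."""
    warnings = []
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    for statement in statements:
        for pattern, message in RULES:
            if pattern.search(statement):
                warnings.append(f"{message}: {statement[:60]}")
    return warnings


if __name__ == "__main__":
    change = "ALTER TABLE users ADD COLUMN nickname VARCHAR(64); DELETE FROM sessions;"
    for warning in lint_change(change):
        print(warning)
```

A check like this does not replace review; it simply stops the most obvious mistakes before a person or a production system has to deal with them.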

You can read more on the blog.

Esteem

Esteem sits near the top of the pyramid. For people, it means recognition and mastery. For databases, it means observability, debuggability, introspection, and instrumentation. It is the ability to understand the storage systems themselves, as well as to correlate events across the entire stack. There are two sides to this stage: your services and your people.

Your services should tell you whether they are up, down, or experiencing errors. You should not have to look at graphs to find out. As a system matures, its rate of change slows and its behavior becomes more predictable. Running in production, you learn more every day about the weaknesses of your storage systems, their behavior, and the conditions that lead to failures. This stage can be compared to the teenage years of your data infrastructure. What you need most is visibility into what is happening. The more complex the product, the more moving parts it has, and the more monitoring tools you need to build.

You also need knobs and levers. You need the ability to selectively degrade the quality of service instead of going down completely. For example:

switching to read-only mode;
disabling certain features;
queueing writes to be applied later;
blocking identified bad actors that are causing problems.
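These levers can be sketched with a toy in-memory store. Everything here is an illustrative assumption: the `Mode` values, the `DegradableStore` class, and its return strings are hypothetical, and a real system would place such switches in the application or proxy layer in front of an actual database.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    QUEUE_WRITES = "queue_writes"  # accept writes but apply them later
    READ_ONLY = "read_only"        # reject writes outright


class DegradableStore:
    """Toy key-value store with operator-controlled degradation switches."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.data = {}
        self.pending = deque()           # deferred writes, in arrival order
        self.blocked_clients = set()     # bad actors to shut out
        self.disabled_features = set()   # features switched off under load

    def feature_enabled(self, name):
        return name not in self.disabled_features

    def write(self, client, key, value):
        if client in self.blocked_clients:
            return "rejected: client blocked"
        if self.mode is Mode.READ_ONLY:
            return "rejected: read-only"
        if self.mode is Mode.QUEUE_WRITES:
            self.pending.append((key, value))
            return "queued"
        self.data[key] = value
        return "applied"

    def read(self, key):
        # Reads keep working in every mode; only writes are degraded.
        return self.data.get(key)

    def drain(self):
        """Apply deferred writes in order; return how many were applied."""
        applied = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value
            applied += 1
        return applied
```

In use, an operator (or an automated health check) would set `store.mode = Mode.QUEUE_WRITES` during an incident and call `store.drain()` after recovery, so that users see slower writes rather than an outage.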

Your people have similar, but not identical, needs. Teams often overreact once they reach production. They lack visibility into what is happening, so they start monitoring everything, and end up staring at hundreds of graphs, most of which are meaningless. If there is no useful signal in all that noise, and people are forced to guess at the cause by digging through logs, that is just as bad as having no graphs at all.

This is how you start to burn out your people: by interrupting them, waking them up, and training them to ignore the alerts they receive. Early on, if you expect everyone to be on call, you need to support them. While on-call duty is shared and you are pushing people out of their comfort zones, give them a little help.

Self-actualization

Just as every person who has fully realized their talents is unique, the mature storage systems of every organization are unique. The ideal storage systems for Facebook, Pinterest, and GitHub look different, although they may have looked alike at the startup stage. But just as healthy, well-adjusted people share common traits (they do not throw tantrums in stores, they eat well, and they exercise), healthy, mature storage systems share common traits too.

In this sense, self-actualization means that your data infrastructure helps you get where you need to go without holding back progress. It lets developers do their work and avoid mistakes. Mundane, common problems should resolve themselves without human intervention. It means that your scaling approach works and can absorb tenfold growth each year, so that you need to revisit capacity and performance only after three years. Frankly, your storage system can be called mature when you spend most of your time thinking about other interesting things, such as new products or preventing future problems, rather than fixing current ones.

It is normal to move back and forth between levels. The levels exist mainly to help you prioritize. For example, being confident that you have working backups is far more important than writing a script for automatic sharding and capacity expansion. If you still have only one copy of your data online, or you do not know how to fail over to a standby when the primary fails, you should stop whatever else you are doing and deal with these basics first.

Summary

The DBRE role is a paradigm shift from the existing, well-understood role of the DBA. This framework gives us a new approach to the functions of database management in an ever-changing world. In the following chapters we will look at these functions in detail, prioritizing among the tasks of day-to-day work. With that, my brave engineers, let's move boldly forward!

Source: https://habr.com/ru/post/350084/
