
GitLab PostgreSQL postmortem

On January 31, 2017, GitLab suffered an incident with its PostgreSQL database: part of the data was deleted, and the service was down for the duration of the recovery. Several months have passed, a lot has been written on the topic, and GitLab itself published an exhaustive postmortem describing what happened, what was done to recover, and what measures will be taken to prevent similar accidents. It is a very instructive read, and we recommend it even to those who are far from Postgres.



In the comments to our interview with Alexey Lesovsky, some members of the community jokingly complained that we mentioned the GitLab accident but never did a detailed debriefing. We decided to make up for that and asked Alexey to write a short analysis. The main purpose of this publication is a detailed walk through the postmortem: highlighting the key points, analyzing them, and offering recommendations on how to act in such situations. And, of course, we will look at the measures the GitLab team plans to take to prevent such incidents in the future.

Here is the list of key points, in chronological order, that we will focus on:
  1. LVM snapshots as a means of transferring a database from production to staging.
  2. What to do when a replica lags behind the master?
  3. Doing pg_basebackup correctly, part 1.
  4. Ran out of max_wal_senders, but how?
  5. max_connections = 8000.
  6. A "hanging" pg_basebackup, or doing pg_basebackup correctly, part 2.
  7. strace: what is good for the admin is not always what the DBA needs.
  8. rm or not rm: how to deal with sensitive data?
  9. Backup methods.
  10. Different Postgres versions in production.
  11. Non-working cron mail.

Let's start the debriefing. At the end there is a second list, where we briefly review the measures that have been taken, or will be taken, to avoid a repetition of similar incidents.

1. LVM snapshots as a means of transferring a database from production to staging.

The task of refreshing a staging environment is quite common, so proven practices and solutions exist. One of them is to create a snapshot and transfer it to staging. The solution is quite convenient and configurable, but it has several drawbacks. Most snapshot implementations are based on copy-on-write, so while a snapshot exists it creates additional load on the disk storage. This mainly matters for write-intensive workloads; if latency is important to you, this can be a problem.

A more correct way to transfer a database from production to staging is backups. You have backups configured, and staging is refreshed from them. In this scheme, snapshots create no additional load, and the backups are indirectly verified for successful recovery: if staging fails to refresh, it is a clear sign that something is wrong with the backups.
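As an illustration, a staging refresh from backups can be as simple as the sketch below; the paths, service name, and backup layout are hypothetical and assume plain tar-format base backups.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: refresh staging from the most recent base backup.
set -euo pipefail

BACKUP_DIR=/backups/base                 # assumption: tar.gz base backups are stored here
PGDATA=/var/lib/postgresql/9.6/main      # assumption: staging data directory

latest=$(ls -1t "$BACKUP_DIR"/base_*.tar.gz | head -n 1)

systemctl stop postgresql                # stop staging Postgres before touching PGDATA
rm -rf "${PGDATA:?}"/*                   # staging data is disposable, so rm is acceptable here
tar -xzf "$latest" -C "$PGDATA"          # unpack the base backup into the data directory
chown -R postgres:postgres "$PGDATA"
systemctl start postgresql
```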

2. What to do when a replica lags behind the master?

Unfortunately, the postmortem does not mention how the engineers discovered the replication lag, but, given that the replica eventually fell irrecoverably behind, we can conclude that it was noticed too late. What can be recommended here? Monitoring of replication lag, of course, ideally with alerting. But even a home-grown cron script that checks the lag and sends an email, an SMS, or anything else will do.

When using streaming replication, lag monitoring is one of the things that should be set up immediately after deployment. There are many reasons why a replica may start to lag. For monitoring replication in Postgres there is the pg_stat_replication view; it is a very handy thing that, with some experience, even lets you determine the cause of the lag: the network, the disks, long-running queries, and so on.
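For example, on the master a quick look at how far each replica is behind, in bytes of WAL, might look like this; the query assumes the 9.x names (sent_location, replay_location, pg_current_xlog_location, pg_xlog_location_diff), which in Postgres 10+ became *_lsn, pg_current_wal_lsn and pg_wal_lsn_diff.

```bash
# Replication lag in bytes, as seen from the master (Postgres 9.x naming).
psql -x -c "
SELECT application_name,
       client_addr,
       state,
       pg_xlog_location_diff(pg_current_xlog_location(), sent_location)   AS sent_lag_bytes,
       pg_xlog_location_diff(pg_current_xlog_location(), replay_location) AS replay_lag_bytes
  FROM pg_stat_replication;"
```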

If you find that a replica is lagging and you do not know how soon it will fall off for good, a temporary measure is to increase the wal_keep_segments parameter. This parameter can be changed without restarting Postgres; a regular reload is enough (see pg_reload_conf()). But keep in mind that the price is increased disk usage: one segment is 16 MB. Once the parameter is increased, you can start looking for the cause of the lag and eliminating it.
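A minimal sketch of such a temporary bump; the value 1000 (roughly 16 GB of retained WAL) is purely illustrative:

```bash
# Keep more WAL on the master for a while; 1000 segments is about 16 GB (illustrative value).
psql -c "ALTER SYSTEM SET wal_keep_segments = 1000;"
psql -c "SELECT pg_reload_conf();"   # wal_keep_segments only needs a reload, not a restart
psql -c "SHOW wal_keep_segments;"    # confirm the new value took effect
```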

It is also worth noting that the team admitted they do not use WAL archiving, which is a pity: it would have helped restore the replica. In general, though, the role of a WAL archive is not limited to catching up replicas. Combined with backups taken via pg_basebackup, an archive allows you to manage recovery flexibly (see Point-in-Time Recovery).
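In the simplest case, enabling archiving is two settings; the archive_command below is the stock example from the documentation with a hypothetical path, and a real setup would ship WAL to something like Barman or WAL-E. Note that turning archive_mode on requires a restart.

```bash
# Simplest possible WAL archiving (the archive_mode change takes effect only after a restart).
psql -c "ALTER SYSTEM SET archive_mode = 'on';"
psql -c "ALTER SYSTEM SET archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f';"
sudo systemctl restart postgresql    # /mnt/archive is a placeholder for a real archive location
```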

3. Doing pg_basebackup correctly, part 1.

As we know from the postmortem, before starting pg_basebackup the engineer cleared the directory the backup was supposed to be copied into. Indeed, pg_basebackup exits immediately if it finds that the target directory is not empty. However, when dealing with sensitive data it is always better to use mv instead of rm. The days of small disks are long gone, and this rule can make life much easier. It applies not only to directories, but also to tables in databases, databases with an _old suffix, and much more. This will not work for multi-terabyte databases, of course, but wherever possible it is better to rename an object and quietly delete it a couple of weeks later, once you are sure nobody needs it, than to delete it right away.
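A sketch of this habit, with hypothetical paths and object names:

```bash
# Park the directory with a timestamp instead of deleting it (paths are hypothetical).
mv /var/opt/gitlab/postgresql/data "/var/opt/gitlab/postgresql/data.old.$(date +%Y%m%d)"

# The same habit inside the database: rename instead of dropping right away.
psql -d mydb -c "ALTER TABLE payments RENAME TO payments_old_20170131;"
# ...and only weeks later, once nobody has complained:
# psql -d mydb -c "DROP TABLE payments_old_20170131;"
```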

4. Ran out of max_wal_senders, but how?

When pg_basebackup tried to connect, it hit the max_wal_senders limit. This parameter limits the number of simultaneous replication connections. As you remember, GitLab had only one replica, and it had irrecoverably fallen behind, so all replication connections should have been free. Since the engineers nevertheless ran into this error, the only plausible explanation is that several pg_basebackup processes were started at the same time.

In general, that makes little sense: if the first pg_basebackup is not working, why would a second one? Still, if you do run into this error, remember that changing max_wal_senders requires a Postgres restart (which is often unacceptable); it would have been better to find out who was occupying the connections (in this case, the waiting pg_basebackup processes) and stop them. Why raise the limit at all? To run even more parallel pg_basebackup processes?
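Finding and stopping whoever is holding the replication connections is a couple of queries; pg_terminate_backend works for walsender processes too (the pid below is a placeholder):

```bash
# Who is occupying the replication (walsender) connections?
psql -c "SELECT pid, application_name, client_addr, state FROM pg_stat_replication;"

# Terminate a stuck walsender by pid (12345 is a placeholder).
psql -c "SELECT pg_terminate_backend(12345);"
```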

5. max_connections = 8000

The next point is the connection limit set to 8000, a very large value for this setting. Postgres spawns a separate process for each client connection. This works well as long as the number of actively working Postgres backends does not exceed the number of CPU cores in the system; after that, adding more processes starts to degrade performance.

To avoid keeping a huge number of connections open to Postgres, pgbouncer was invented: it lets you squeeze a large number of client connections into a relatively small number of connections to Postgres. For example, the same 8000 client connections can terminate at pgbouncer, which will open connections to Postgres only as needed. Incidentally, using pgbouncer would also have helped avoid wasting time on this error and on yet another Postgres restart.
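A minimal pgbouncer configuration in this spirit might look like the sketch below; the database name, addresses, and pool sizes are illustrative, and pool_mode has to be chosen with the application in mind, since transaction pooling breaks session-level features such as prepared statements.

```bash
# Hypothetical minimal pgbouncer config: many client connections, few server connections (run as root).
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
; illustrative database name and address
gitlabhq_production = host=127.0.0.1 port=5432 dbname=gitlabhq_production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling breaks session-level features, test with your application first
pool_mode = transaction
; clients connect here instead of directly to Postgres
max_client_conn = 8000
; actual connections opened to Postgres per database/user pair
default_pool_size = 20
EOF
```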

6. “Hanging” pg_basebackup, or doing pg_basebackup correctly, part 2.

Moving on: the connection limits were dealt with, now we need to figure out why pg_basebackup hangs. The reason for the so-called hang lies in how pg_basebackup works. Before it starts copying the database, it has to wait for a checkpoint to complete. A checkpoint is a procedure that can put serious strain on Postgres and noticeably hurt performance, so, in order not to stress the system, pg_basebackup by default waits for the regular, spread-out checkpoint to finish, whose timing and speed depend on the Postgres settings. If you pass the "-c fast" option, pg_basebackup requests an immediate checkpoint, which is executed without delay.
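A typical invocation with an immediate checkpoint and visible progress might look like this; the host, user, and target directory are placeholders, and -X stream is the 9.x-era way to also stream the WAL needed for the backup to be consistent.

```bash
# Base backup with an immediate checkpoint and progress reporting (placeholder host/user/paths).
pg_basebackup \
    -h db1.example.com -U replication_user \
    -D "/backups/base/$(date +%Y%m%d)" \
    -X stream \
    --checkpoint=fast \
    -P -v
```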

As you might guess, while waiting for the checkpoint to complete, pg_basebackup prints no messages at all, and this played a cruel joke on the engineer. However, thanks to the GitLab team, Postgres will now get better in this respect.

7. strace: what is good for the admin is not always what the DBA needs.

Next, the engineers tried to figure out why pg_basebackup was hanging and used strace for that. From my own experience I can say that strace is not quite the right tool here. strace is needed when the process under investigation is failing with errors or misbehaving at particular moments. That a process is "hanging" can be seen with the top utility: if the %CPU field shows 0, the process is most likely waiting for something.

To investigate the causes of such "hangs" you need the process's stack trace, to understand which function it is sitting in. /proc/<pid>/stack is suitable for this, and, with some experience, gdb. Unfortunately, Postgres itself does not provide a way to see that the process is waiting for a checkpoint before it starts copying, so engineers simply need to know this peculiarity of pg_basebackup.
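In practice the investigation might look roughly like this; the pid is a placeholder, /proc/<pid>/stack needs root and shows the kernel-side stack, and gdb shows the userspace backtrace:

```bash
# Is the process doing anything at all? 0% CPU usually means it is waiting for something.
top -p 12345

# Kernel-side stack: what the process is currently blocked on (needs root).
sudo cat /proc/12345/stack

# Userspace backtrace without an interactive session (12345 is a placeholder pid).
sudo gdb -p 12345 -batch -ex 'bt'
```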

8. rm or not rm: how to deal with sensitive data?

We have reached the moment when the database directory on the master was deleted. Essentially this item repeats what was said above: use mv whenever possible, and use rm only once you are sure that nobody misses the supposedly unneeded data.

9. Backup methods.

So, the accident happened, and a restore from backup is needed. Let's see which backup methods the team relied on:


In any case, the general recommendation is: try not to use replicas as backups; there are dedicated tools for backups. The preferred approach is pg_basebackup plus WAL archiving. Its main advantage is the ability to restore to a specific point in time or transaction within the retained archive. The drawback is the storage requirement: each base backup is a full copy of the instance, and the size of the WAL archive depends on the volume of write activity. So when setting up backups you should plan the total storage size with the retention period in mind.
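To give a feel for that flexibility: on the 9.x series a point-in-time restore boils down to unpacking a base backup and writing a recovery.conf like the sketch below (paths and the target timestamp are placeholders; since Postgres 12 the same settings live in postgresql.conf plus a recovery.signal file).

```bash
# Hypothetical PITR sketch for Postgres 9.x: replay WAL from the archive up to just before the accident.
cat > "$PGDATA/recovery.conf" <<'EOF'
restore_command = 'cp /mnt/archive/%f %p'
recovery_target_time = '2017-01-31 23:00:00 UTC'
recovery_target_action = 'promote'
EOF
# Then start Postgres; it will recover to the target time and promote itself.
```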

At the moment there are several products that implement this backup model:

  1. Barman, if you have your own servers.
  2. WAL-E, if your servers are in Amazon.

The last thing to add is the need to monitor that 1) backups complete successfully, 2) backup copies actually exist, and 3) test restores from those backups succeed.

10. Different Postgres versions in production.

The next point worth criticizing is having different Postgres versions in production. In GitLab's case this is exactly what made it impossible to take backups: the pg_dump binary from the older version was being run against the newer server. Therefore, any plan for a major version upgrade should include an item about removing the old packages after the upgrade.

11. Non-working cron mail.

It seems to me (many may disagree) that email is becoming a rather rare notification mechanism. But whatever mechanism you use, you must always be sure it actually works. In my practice there was a case when the monitoring system that notified the team by SMS sent a test SMS to the on-duty administrator every evening, to verify that everything was fine with the mobile operator and the account balance. If the test failed, mail alerts started and a flag lit up in the web interface, and that really helped.

In GitLab's case, mail for cron was simply never configured in the first place. If you put something important into cron, think about notifications.
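The simplest form of "cron that can complain" is a MAILTO line plus jobs whose output is not silenced; the address and the job below are placeholders, and this assumes the host can actually deliver mail.

```bash
# Hypothetical: add a MAILTO and a nightly backup job to the current crontab.
# Cron mails the stdout/stderr of jobs that follow a MAILTO line to that address,
# so do not redirect their output to /dev/null.
( crontab -l 2>/dev/null
  echo 'MAILTO=ops-oncall@example.com'
  echo '30 2 * * * /usr/local/bin/run_base_backup.sh'
) | crontab -
```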

In the end, the database was restored from the LVM snapshot; copying it took 18 hours, which is a very long time. On top of that, the snapshot lived on staging, and it still turned out to be the most acceptable option available. All of this is the result of a poorly chosen backup strategy (once every 24 hours) plus an unlucky coincidence. The takeaway should be a more considered approach to backups, with more flexible recovery options that guarantee minimal data loss.

The GitLab team published the measures they intend to take to avoid similar events in the future. Let's briefly go through them.

1. Update PS1 across all hosts to more clearly differentiate between hosts and environments (#1094).

Looks quite reasonable. Customizing the command prompt per host and environment is a common and correct practice.

2. Prometheus monitoring for backups (#1095).

This measure also looks very reasonable, but monitoring should be set up after the whole backup system has been reviewed and rebuilt. Monitoring should cover not only the age and size of backups, but also the number of available copies and information about which copies have been successfully verified by a test restore.

3. Set PostgreSQL's max_connections to a sane value (#1096).

Even 800 is a lot; I would recommend looking at pgbouncer and trying to use it. It does, however, impose its own limitations on the application that works through it, so this needs to be tested beforehand.

4. Investigate Point in time recovery & continuous archiving for PostgreSQL (#1097).

Yes, this should definitely be done.

5. Hourly LVM snapshots of the production databases (#1098).

A dubious idea, given that the snapshots were used to refresh staging. For that purpose, the backups made under issue #1097 are perfectly suitable as well.

6. Azure disk snapshots of production databases (#1099).

Hard to comment on, but if they want to duplicate the main backup functionality, then why not?

7. Move staging to the ARM environment (#1100).

The task looks strange. Restoring production from staging, where the data may have been significantly altered, is a dubious idea; in the end such a recovery would only make things worse.

8. Recover production replica(s) (#1101).

This, of course, needs to be done ASAP. Any more or less serious Postgres installation these days is deployed with at least one streaming replica.

9. Automated testing of recovering PostgreSQL database backups (#1102).

Yes, this is a very necessary thing; it is definitely needed.

10. Improve PostgreSQL replication documentation/runbooks (#1103).

Yes, the runbooks should be updated so that even the night watchman from the warehouse next door could perform a recovery by following them.

11. Investigate pgbarman for creating PostgreSQL backups (#1105).

Yes, Barman is definitely a good tool, but considering that GitLab uses Amazon S3, WAL-E looks like a better fit. That said, the Barman developers do not rule out this possibility and even invite sponsorship of Amazon S3 support in Barman.

12. Investigate using Backup and Realtime Replication (#494).

Yes, it is a reliable thing: despite the considerable number of issues in the tracker, it works stably and without complaints.

13. Build Streaming Database Restore.

Yes, that is the right approach: there are backups, and restoring them onto the staging environment doubles as a check that they can be restored at all.

14. Assign an owner for data durability.

A responsible person is definitely needed. In addition, it would be a good idea for everyone to read the runbooks produced under #1103.



Many thanks to Alexey for the detailed analysis. Be sure to leave your comments and thoughts on the GitLab accident and the measures taken! And, of course, we invite everyone to discuss this topic with Alexey this summer in St. Petersburg at PG Day Russia! :)

Source: https://habr.com/ru/post/328938/

