Replication is not a backup. Or is it? Here is how we used delayed replication to recover accidentally deleted labels.
Infrastructure specialists at GitLab are responsible for running GitLab.com, the largest GitLab instance in existence: with 3 million users and almost 7 million projects, it is one of the largest open-source SaaS sites running on a dedicated architecture. The GitLab.com infrastructure would not get far without PostgreSQL, so we work hard on fault tolerance against any failure that could lose data. Such a catastrophe is unlikely, but we are well prepared, with a variety of backup and replication mechanisms in stock.
Replication is not a database backup tool (see below). But today we will see how to quickly recover accidentally deleted data using delayed replication: on GitLab.com a user deleted a label on the gitlab-ce project and lost its associations with merge requests and issues.
With a delayed replica, we recovered the data in just 1.5 hours. Here is how it went.
PostgreSQL has a built-in feature that restores the state of a database to a specific moment in time. It is called Point-in-Time Recovery (PITR), and it uses the same mechanisms that keep a replica up to date: starting from a reliable snapshot of the entire database cluster (a base backup), we apply a series of state changes up to a certain point in time.
To use this feature for cold backups, we regularly take a base backup of the database and store it in an archive (GitLab's archives live in Google Cloud Storage). We also track changes to the database state by archiving the write-ahead log (WAL). With all of this in place, we can perform PITR for disaster recovery: start from a snapshot taken before the failure and apply the changes from the WAL archive right up to the moment of failure.
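For concreteness, WAL archiving on the primary might be configured roughly like this with wal-e; this is a sketch, not our exact production settings, and the envdir path simply mirrors the restore_command shown below:

# postgresql.conf: ship every completed WAL segment to the archive with wal-e
wal_level = replica        # 'hot_standby' on PostgreSQL 9.5 and older
archive_mode = on
archive_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-push "%p"'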
Delayed replication is the application of changes from the WAL with a delay: a transaction committed at hour X appears on the replica with a delay d, at hour X + d.
PostgreSQL has two ways to set up a physical database replica: archive recovery and streaming replication. Archive recovery essentially works like PITR, but continuously: we constantly fetch changes from the WAL archive and apply them to the replica. Streaming replication instead pulls the WAL stream directly from an upstream database host. We prefer archive recovery: it is simpler to manage, and its performance keeps pace with the production cluster just fine.
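For comparison, a streaming replica is configured through recovery.conf as well, but points at the primary rather than the archive; here is a minimal sketch, with an illustrative hostname:

standby_mode = 'on'
primary_conninfo = 'host=primary.example.com user=replication'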
The recovery options for our delayed replica are described in the recovery.conf file. Example:
standby_mode = 'on'
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 4 "%f" "%p"'
recovery_min_apply_delay = '8h'
recovery_target_timeline = 'latest'
With these parameters, we have set up a delayed replica using archive recovery. Here wal-e is used to fetch WAL segments from the archive (restore_command), and changes are applied only after eight hours (recovery_min_apply_delay). The replica will follow timeline switches in the archive, for example after a failover in the cluster (recovery_target_timeline).
With recovery_min_apply_delay you can also set up streaming replication with a delay, but there are a couple of gotchas involving replication slots, hot standby feedback, and so on. The WAL archive lets you avoid them.
recovery_min_apply_delay appeared only in PostgreSQL 9.4. In earlier versions, delayed replication required combining the recovery management functions (pg_xlog_replay_pause(), pg_xlog_replay_resume()) or holding WAL segments back in the archive for the duration of the delay.
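Roughly, that manual approach would look like this on the standby; the waiting itself would have to be handled by some external scheduler, which is an assumption of this sketch:

-- stop applying WAL on the replica
SELECT pg_xlog_replay_pause();
-- ...an external job waits out the desired delay...
-- then resume applying WAL
SELECT pg_xlog_replay_resume();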
It's interesting to see how PostgreSQL implements delayed recovery. Let's look at recoveryApplyDelay(XLogReaderState). It is called from the main redo loop for each record read from the WAL.
static bool
recoveryApplyDelay(XLogReaderState *record)
{
    uint8       xact_info;
    TimestampTz xtime;
    long        secs;
    int         microsecs;

    /* nothing to do if no delay configured */
    if (recovery_min_apply_delay <= 0)
        return false;

    /* no delay is applied on a database not yet consistent */
    if (!reachedConsistency)
        return false;

    /*
     * Is it a COMMIT record?
     *
     * We deliberately choose not to delay aborts since they have no effect on
     * MVCC. We already allow replay of records that don't have a timestamp,
     * so there is already opportunity for issues caused by early conflicts on
     * standbys.
     */
    if (XLogRecGetRmid(record) != RM_XACT_ID)
        return false;

    xact_info = XLogRecGetInfo(record) & XLOG_XACT_OPMASK;

    if (xact_info != XLOG_XACT_COMMIT &&
        xact_info != XLOG_XACT_COMMIT_PREPARED)
        return false;

    if (!getRecordTimestamp(record, &xtime))
        return false;

    recoveryDelayUntilTime =
        TimestampTzPlusMilliseconds(xtime, recovery_min_apply_delay);

    /*
     * Exit without arming the latch if it's already past time to apply this
     * record
     */
    TimestampDifference(GetCurrentTimestamp(), recoveryDelayUntilTime,
                        &secs, &microsecs);
    if (secs <= 0 && microsecs <= 0)
        return false;

    while (true)
    {
        /*
         * Shortened:
         * Use WaitLatch until we have reached recoveryDelayUntilTime,
         * and then break;
         */
    }

    return true;
}
The bottom line is that the delay is based on the physical time recorded in the transaction's commit timestamp (xtime). As you can see, the delay applies only to commits and does not touch other records: all changes are applied directly, and only the commit is postponed, so we see the changes only after the configured delay.
Suppose we have a production database cluster and a replica with an eight-hour delay. Let's see how to recover data, using the accidentally deleted labels as the example.
As soon as we learned about the problem, we paused archive recovery on the delayed replica:
SELECT pg_xlog_replay_pause();
With replay paused, there was no risk that the replica would repeat the DELETE query. A useful thing if you need time to sort things out.
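You can confirm that replay really is paused with the built-in check (pre-10 function names, matching the pg_xlog_* calls used throughout this post):

-- returns true while WAL replay is paused
SELECT pg_is_xlog_replay_paused();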
The goal is for the delayed replica to reach the point just before the DELETE. We knew the approximate physical time of the deletion, so we removed recovery_min_apply_delay and added recovery_target_time to recovery.conf. That way the replica catches up to the right moment without delay:
recovery_target_time = '2018-10-12 09:25:00+00'
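Put together, recovery.conf at this step looked roughly like this (a sketch reusing the restore_command from above, with the delay setting removed):

standby_mode = 'on'
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch -p 4 "%f" "%p"'
recovery_target_time = '2018-10-12 09:25:00+00'
recovery_target_timeline = 'latest'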
With timestamps it is better to leave some margin so as not to overshoot. True, the bigger the margin, the more data we lose. If we replay past the DELETE, everything gets deleted again and we have to start over (or even fall back to a cold backup for PITR).
We restarted the delayed Postgres instance, and it replayed WAL segments up to the specified time. You can track progress at this stage with the query:
SELECT
  -- current location in the WAL
  pg_last_xlog_replay_location(),
  -- current transaction timestamp (the state of the replica)
  pg_last_xact_replay_timestamp(),
  -- current physical time
  now(),
  -- the amount of time still to be applied until recovery_target_time is reached
  '2018-10-12 09:25:00+00'::timestamptz - pg_last_xact_replay_timestamp() AS delay;
If the timestamp stops changing, recovery is complete. You can set recovery_target_action to choose whether to shut down, promote, or pause the instance after replay (it pauses by default).
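In recovery.conf that is a single line, for instance:

# what to do once recovery_target_time is reached (PostgreSQL 9.5+)
recovery_target_action = 'pause'   # default; alternatives: 'promote', 'shutdown'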
The database reached the state just before that ill-fated query. Now you can, for example, export data. We exported the deleted label data along with all its links to issues and merge requests and moved them over to the production database. If the loss is massive, you can simply promote the replica and use it as the primary. But then all changes made after the point we recovered to are lost.
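The export itself can be as simple as a psql \copy of the affected rows on the recovered replica; the table and column names below are hypothetical and serve only to show the shape of the operation:

-- run via psql against the recovered replica (names are hypothetical)
\copy (SELECT * FROM labels WHERE title = 'some-label') TO 'labels.csv' CSV HEADER
\copy (SELECT * FROM label_links WHERE label_id = 12345) TO 'label_links.csv' CSV HEADER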
Instead of timestamps, it is better to use transaction IDs. It is useful to log these IDs, for example for DDL statements (such as DROP TABLE), via log_statement = 'ddl'. Had we had a transaction ID, we would have used recovery_target_xid and replayed everything up to the transaction just before the DELETE.
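A sketch of both halves follows; %x in log_line_prefix prints the transaction ID, and the xid value here is made up:

# postgresql.conf on the primary: log DDL together with transaction IDs
log_statement = 'ddl'
log_line_prefix = '%m [%p] xid=%x '

# recovery.conf on the replica: stop just before the offending transaction
recovery_target_xid = '1234567'
recovery_target_inclusive = 'false'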
Getting back to normal is simple: revert all the changes in recovery.conf and restart Postgres. Soon the replica returns to its eight-hour delay, and we are ready for future trouble.
With a delayed replica instead of a cold backup, you do not have to spend hours restoring the whole snapshot from the archive. Fetching the full 2 TB base backup alone takes us about five hours, and then you still have to apply up to a whole day's WAL to reach the desired state (in the worst case).
A delayed replica beats a cold backup on two points: we skip that long restore entirely, and we get continuous verification that PITR from the WAL archive is possible, because by tracking the delayed replica's lag we would quickly notice corruption or other problems with the WAL archive.
In this case, recovery itself took 50 minutes, that is, a speed of 110 GB of WAL data per hour (the archive was still on AWS S3 at the time). In total, we solved the problem and recovered the data in 1.5 hours.
Use delayed replication as a first-aid tool when you lose data accidentally and notice the trouble within the configured delay.
But note: replication is not a backup.
Backup and replication serve different goals. A cold backup is useful when you make an accidental DELETE or DROP TABLE: we restore from cold storage and recover the previous state of the table or the whole database. But a DROP TABLE is reproduced almost instantly on all replicas of the working cluster, so ordinary replication will not save you here. Replication itself keeps the database available while individual servers fail, and distributes the load.
Even with a delayed replica in place, we sometimes really need cold backups in a safe place, in case of a data-center failure, silent corruption, or other events that are not noticed right away. Replication alone is no help there.
Note: at GitLab.com we currently only protect against data loss at the system level; we do not recover data at the user level.
Source: https://habr.com/ru/post/445446/