
Gitlab "lies", the base is destroyed (restored)

Yesterday, January 31, the GitLab service accidentally destroyed its production database (the git repositories themselves were not affected).

Here is how it happened.

For some reason, the hot-standby replica of the database (PostgreSQL) began to lag behind the master (and it was the only replica). A GitLab engineer spent some time trying to fix the situation by adjusting various settings and so on, then decided to wipe the replica and rebuild it from scratch. He went to delete the data directory on the replica, but mixed up the servers and deleted it on the master instead (he ran rm -rf on db1.cluster.gitlab.com instead of db2.cluster.gitlab.com).
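
As a side note, not GitLab's actual runbook but a minimal sketch: before doing anything destructive it helps to confirm which host you are on and how far a standby is really lagging. The hostnames and the psql invocation below are illustrative assumptions.

    # Minimal sketch, not GitLab's procedure: confirm the host and measure
    # standby lag before deciding to wipe and rebuild a replica.
    hostname -f        # expect the standby, e.g. db2.cluster.gitlab.com, NOT db1

    # On the standby, ask PostgreSQL how far behind the primary it is
    # (works with streaming replication on PostgreSQL 9.x and later).
    sudo -u postgres psql -c \
      "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"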

Interestingly, the system had 5 different kinds of backups/replicas, and none of them worked. The only usable copy turned out to be an LVM snapshot that happened to have been taken about 6 hours before the outage.
Here is an abbreviated quote from their incident document. Problems found:

1) LVM snapshots are taken only once every 24 hours.
2) Regular backups — it is not even clear where they are stored.
3) Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
4) The synchronization process removes webhooks once it has synchronized data to staging. Unless they can be recovered from a regular backup taken within the last 24 hours, they will be lost.
5) The replication procedure is very fragile, prone to error, and badly documented.
6) Our backups to S3 apparently don't work either: the bucket is empty.
7) We don't have solid alerting/paging for when backups fail; we are only finding this out now (a minimal backup-check sketch follows this list).
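
A minimal sketch of the kind of check point 7 is about, assuming backups land as .tar.gz files in a local directory and that "paging" is just an email; the path, size threshold, and address are invented placeholders, not GitLab's real setup.

    #!/usr/bin/env bash
    # Minimal sketch: alert if the most recent backup is missing, too small,
    # or too old. All paths and thresholds below are illustrative assumptions.
    set -euo pipefail

    BACKUP_DIR=/var/opt/backups
    MIN_SIZE_BYTES=$((100 * 1024 * 1024))   # backups under 100 MB are suspicious
    MAX_AGE_HOURS=26                        # a little slack past a 24-hour schedule

    latest=$(ls -1t "$BACKUP_DIR"/*.tar.gz 2>/dev/null | head -n 1 || true)

    if [ -z "$latest" ]; then
        echo "CRITICAL: no backups found in $BACKUP_DIR" \
          | mail -s "backup check failed" oncall@example.com
        exit 1
    fi

    size=$(stat -c %s "$latest")
    age_hours=$(( ( $(date +%s) - $(stat -c %Y "$latest") ) / 3600 ))

    if [ "$size" -lt "$MIN_SIZE_BYTES" ] || [ "$age_hours" -gt "$MAX_AGE_HOURS" ]; then
        echo "CRITICAL: latest backup $latest is ${size} bytes and ${age_hours}h old" \
          | mail -s "backup check failed" oncall@example.com
        exit 1
    fi

    echo "OK: $latest (${size} bytes, ${age_hours}h old)"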

Thus, GitLab concludes, of the 5 backup/replication techniques that were in place, none worked reliably or as intended, so recovery is now underway from the LVM snapshot that happened to be about 6 hours old.
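
For reference, an LVM snapshot of the kind that saved the data looks roughly like the following sketch; the volume group and logical volume names are invented, not GitLab's.

    # Take a copy-on-write snapshot of the PostgreSQL data volume
    # (names are hypothetical: vg0/pgdata is assumed to hold the database).
    lvcreate --size 50G --snapshot --name pgdata_snap /dev/vg0/pgdata

    # Later, to recover, mount the snapshot read-only and copy the data out.
    mkdir -p /mnt/pgdata_snap
    mount -o ro /dev/vg0/pgdata_snap /mnt/pgdata_snap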

Here is the full text of the document

Source: https://habr.com/ru/post/320988/

