
How to back up data and MySQL in Amazon Web Services

Hello!
I want to share our experience organizing backups of files and MySQL / XtraDB in Amazon Web Services. I hope the information proves useful, especially if you are “forced” to deploy projects in the cloud and time is limited :-)
But first, let's briefly run through the storage technologies Amazon offers.


Where does virtual machine data live?


So, let's start with the fact that Amazon offers EBS virtual block devices for storing virtual machine data. Such a disk is created and attached to a server in 2 clicks. The maximum disk size is 1TB. By default there is a limit of 5,000 disks and 20TB, but it is raised upon first request.
There is also a local block device technology, whose data... disappears along with the server (and this can easily happen when the machine crashes) - but I will not write about it, because we have not experimented with it.

EBS Performance


It becomes obvious almost immediately that they are slower than "iron" disks. Device saturation (%util in the iostat output) quickly approaches 100% at random-read rates of just a dozen or so MB/s (even less for writes). The slowdown is clearly visible in everyday operations such as copying folders from disk to disk, unpacking archives, etc. Details and benchmarks are easy to find online.

Raid?


To get adequate performance out of Amazon disks, the easiest approach is to combine them into a software raid. For databases we use raid-10 over 8 EBS disks, on both ext4 and xfs. A software raid is set up quite simply , runs for a long time and practically never breaks.
A raid is especially useful if an EBS disk suddenly "dies".
However, for a number of tasks we do not use raids - for example, for storing the MySQL binary log, for backups, etc. And for storing the nginx cache, raid0 over EBS disks works well - it has been running steadily for about a year.
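A raid like this is assembled in a few commands. A minimal sketch, assuming the 8 EBS volumes are already attached as /dev/xvdf../dev/xvdm (the device names and mount point are placeholders - substitute whatever your volumes actually got on attach):

```shell
#!/bin/bash
# Assemble 8 attached EBS volumes into a software raid-10 and mount it.
# /dev/xvdf../dev/xvdm and /data are assumptions for illustration.

RAID_DEV=/dev/md0
MOUNT_POINT=/data
EBS_DEVICES="/dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi /dev/xvdj /dev/xvdk /dev/xvdl /dev/xvdm"

make_raid10() {
    # Build the array, put a file system on it and mount it.
    mdadm --create "$RAID_DEV" --level=10 --raid-devices=8 $EBS_DEVICES
    mkfs -t xfs "$RAID_DEV"
    mkdir -p "$MOUNT_POINT"
    mount "$RAID_DEV" "$MOUNT_POINT"
}
```

The same call with --level=0 gives the raid0 variant we use for the nginx cache.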

EBS Disk Reliability


To be honest, in a year and a half of working with Amazon EBS disks they have never failed us (no nonsense like bad blocks, read errors, etc.)... except when lightning struck Amazon's Irish data center - then several disks flew out of a raid-10 at once :-)
However, if you carefully read what Amazon writes about the reliability of its disks, you realize that you need a raid and, of course, regular backups:
The durability of a volume depends on the amount of data changed since its last snapshot. Volumes with little modified data since the most recent snapshot can expect an annual failure rate (AFR) of 0.1% - 0.5%, where failure refers to complete loss of the volume - compared with an AFR of around 4% for conventional commodity disk drives, i.e. roughly 10 times more reliable.
On the other hand, we have more than a hundred loaded EBS disks in production, and in a year and a half the software raids have never kicked a disk out due to IO errors. With "iron" disks, I am sure, we would have replaced more than one device by now - so draw your own conclusions.

Available backup technologies


When the data is relatively small and changes rarely, you can get by with tar. But imagine a large online store that keeps business information both in the database and in files on disk: new files appear every minute, and the total size of the content runs to hundreds of gigabytes.
DRBD? Possibly, but we have not tried this technology in Amazon, and we often hear from colleagues about its dramatic slowdowns when errors occur.
Amazon offers us something similar to LVM snapshots in copy-on-write mode, only with extra perks. A block device can be snapshotted as many times as necessary. In doing so:
  1. Each subsequent snapshot of an EBS disk stores ONLY THE CHANGES. This is completely transparent, and it becomes obvious when you look at the monthly bill for disk space. Even if you have 100 snapshots of a 500GB disk, as long as the data changed rarely you pay for roughly 600-800GB, which of course plays in the client's favor.
  2. Snapshots can and should be deleted - to keep a balance between the size of the backup window and the cost of storing data. What is delightful is that you can delete ANY snapshot - Amazon automatically consolidates the data as needed. You do not have to worry about which snapshot is the base and which are incremental; just remove the excess from any position (those who have worked with Acronis will appreciate the convenience).
  3. Snapshots are saved in S3 . S3, as everyone probably knows by now, is a store of objects of any format that replicates data to at least 2 more datacenters. That is, a disk snapshot becomes practically "unkillable" and is kept more safely than a hard drive locked in a nightstand under the desk :-).
  4. Taking a snapshot of a disk is almost instant - the data is then transferred to S3 in the background over a certain time (sometimes tens of minutes, if Amazon is loaded).
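Point 2 is easy to automate. A minimal retention sketch using the classic EC2 API tools; the tab-separated output layout of ec2-describe-snapshots assumed here (SNAPSHOT, snapshot id, volume id, status, start time) should be verified against your tool version:

```shell
#!/bin/bash
# Keep only the newest $2 snapshots of volume $1, delete the rest.
# Assumes ec2-describe-snapshots prints lines of the form:
#   SNAPSHOT <snapshot-id> <volume-id> <status> <start-time> ...

prune_snapshots() {
    local volume_id=$1
    local keep=$2
    ec2-describe-snapshots \
        | awk -v vol="$volume_id" '$1 == "SNAPSHOT" && $3 == vol' \
        | sort -r -k5,5 \
        | awk -v keep="$keep" 'NR > keep { print $2 }' \
        | while read -r snap_id; do
              ec2-delete-snapshot "$snap_id"
          done
}
```

Because any snapshot can be deleted from any position, the script does not need to care which snapshots are base and which are incremental.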

All this means that we can take snapshots of a huge folder of frequently changing content as often as once every 5 minutes - they will be stored securely in S3, and if we need to roll 1TB of changing data back 5 minutes, we easily do so:
  1. Create a disk from the saved snapshot.
  2. Attach the disk to the server.

Of course, it is technically impossible to instantly transfer 1TB of data from S3 to the SAN where EBS disks live, so although the block device becomes available to the operating system immediately, the data on it is filled in in the background for some time - therefore the speed of working with the disk may not be very high at first. But still, you will agree, it is remarkably convenient to be able to make an incremental backup of a large amount of data and roll it back to any point, say a week ago, in 5-minute increments :-)
Besides creating snapshots of EBS disks, you can also send files to S3 directly. The s3cmd utility is convenient for this - it can synchronize file system trees with the cloud in both directions (only changes are transferred, based on computing the md5 of the local file and comparing it with the object's md5 stored in S3 in the "ETag" attribute). We also tried solutions based on FUSE technology - s3fs - but noticed slowdowns and long freezes, with growing LA, under intensive use.
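The change detection s3cmd performs can be shown in a couple of lines: compare the local md5 with the object's ETag and transfer only on mismatch. A sketch of the idea, not of s3cmd's actual code:

```shell
#!/bin/bash
# The idea behind s3cmd's incremental sync: a file needs to be
# transferred only if its local md5 differs from the ETag that S3
# stores for the object.

needs_upload() {
    local file=$1
    local remote_etag=$2   # ETag value without the surrounding quotes
    local local_md5
    local_md5=$(md5sum "$file" | awk '{ print $1 }')
    [ "$local_md5" != "$remote_etag" ]
}
```

In practice you just run something like `s3cmd sync /var/www/content/ s3://my-bucket/content/` (the paths and bucket name are placeholders) and let it apply this check per file.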

Snapshotting a raid


As I wrote above, EBS disks show adequate performance when combined into raid0 or raid10. But how do you back up a raid? Take a snapshot of each disk in turn? :-) Clearly that will not work, and Amazon offers us nothing here.
Good people have written a convenient utility - ec2-consistent-snapshot . You can use it, or you can repeat its logic in your own scripts:
  1. Use a file system that supports "freezing" - i.e. one that understands that a snapshot is being taken of it at the block device level and that it must flush buffers, commit transactions and temporarily block changes. XFS has understood such a command for a long time ( xfs_freeze ), and in "recent" linux distributions other common file systems can be "frozen" too: ext3 / ext4, xfs, jfs, reiserfs.
  2. Flush changes and briefly forbid writing to the FS: “fsfreeze -f mountpoint”
  3. Take a snapshot of each raid disk: AWS API call CreateSnapshot .
  4. Allow writing to the FS again: “fsfreeze -u mountpoint”

If you have xfs, you can use the xfs_freeze command instead.
To attach a saved raid, it is better to write a script that attaches the disks to the machine and assembles the software raid from them. A raid saved to snapshots by the method above comes up perfectly, without losing the file system journal - we use this in various places in production.
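In script form, the freeze / snapshot / unfreeze sequence looks roughly like this (the volume ids and mount point are placeholders; ec2-consistent-snapshot does the same thing with more care):

```shell
#!/bin/bash
# Consistent snapshot of a software raid built from EBS volumes:
# freeze the FS, snapshot every member volume, unfreeze.
# The vol-... ids and the mount point are placeholders.

MOUNT_POINT=/data
RAID_VOLUMES="vol-11111111 vol-22222222 vol-33333333 vol-44444444"

snapshot_raid() {
    # Flush buffers and block writes so all members are captured
    # in the same state.
    fsfreeze -f "$MOUNT_POINT"
    for vol in $RAID_VOLUMES; do
        ec2-create-snapshot -d "raid member $vol" "$vol"
    done
    # Writes may resume - the snapshots finish uploading to S3
    # in the background.
    fsfreeze -u "$MOUNT_POINT"
}
```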
So, we have learned how to snapshot raids of any size into s3 as often as once every 5 minutes and to recover data from them. Such things, I am sure, will be useful to many people on various cloud projects.

Snapshotting the machine entirely


Sometimes it is more convenient not to fuss with each raid separately, but to snapshot all the disks of a machine with one command . The snapshot can be taken in 2 modes: with or without stopping the machine. In the latter case we are, logically, warned about possible "corruption" of data on disks / raids:
When taking a snapshot of a file system, we recommend unmounting it first; if the image is created without stopping the instance, the integrity of the file system on the image cannot be guaranteed.
After the machine snapshot is created, an AMI (Amazon Machine Image) object appears, holding links to the saved snapshots of each of its disks. From this object a server with all its disks / raids can be started with one command - AWS API call RunInstances . Feel the power of the technology? Working servers can not only be backed up as a whole, but also brought back up as a whole, raids and all, with a single command! This technology saved us dozens of hours of system administration during the Amazon outage last August - we raised the machines from snapshots and deployed the configuration in another data center.
However, there is a serious pitfall - the CreateImage command is completely opaque, and it is unclear how long it spends taking the snapshots of all the server's disks - one second or 10? Our interval of 5 seconds was chosen by trial and error, and it lets us take complete images of a machine with raids. I warn you - test the script carefully before launching it into production - but, you will agree, it is hard to resist the "goodness" of a full machine backup :-)
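A minimal wrapper around CreateImage with the empirical pause described above (the instance id and image name are placeholders):

```shell
#!/bin/bash
# Whole-machine backup: create an AMI from a running instance without
# stopping it (--no-reboot), then pause before letting anything else
# touch the disks. The 5-second pause is the empirical value from the
# text - CreateImage gives no signal for when the per-disk snapshots
# have actually been initiated.

backup_instance() {
    local instance_id=$1
    ec2-create-image --no-reboot -n "backup-$(date +%Y%m%d%H%M%S)" "$instance_id"
    sleep 5
}
```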

MySQL incremental backup


Let me recall our task - to back up a project with traffic of millions of hits per day and hundreds of gigabytes of frequently changing content (the heaviest content is pushed out to s3 and backed up separately). Let me list the well-known, reasonable approaches to MySQL backup:
  1. Logical backup from a slave. In this case we do not slow down the main server, however... we risk backing up data that has "quietly" fallen out of sync (so synchronism needs to be monitored, for example using pt-table-checksum ).
  2. Binary snapshot using LVM from a production server / slave, or copying blocks to a DRBD disk on a backup machine.
  3. Incremental binary backup from a production server or a slave using xtrabackup or a similar paid tool .

To be able to quickly roll a large online store back 5-10 minutes in the event of catastrophic data deletion in the database (an erroneous query that kills data in several order tables - who hasn't had one yet? :-)) - it would seem that only option 3 will do. However, as it turned out, creating a binary incremental backup puts a considerable load on the already weak EBS disks, and applying the increments to the base binary backup during recovery can take... several hours!
I will not consider recovery scenarios from a logical backup with preliminary editing of the MySQL binary log here - that is not fast either.
And here again Amazon helps us. A MySQL incremental backup is done like this:
  1. Flush MySQL / InnoDB / XtraDB buffers to disk: “FLUSH TABLES WITH READ LOCK”
  2. Flush changes and briefly forbid writing to the FS: “fsfreeze -f mountpoint”
  3. Take a snapshot of all the machine's disks: CreateImage . See above about the pitfalls. If there are concerns, take snapshots of each raid disk of the database instead: AWS API call CreateSnapshot .
  4. Allow writing to the FS again: “fsfreeze -u mountpoint”
  5. Release the global lock on all tables in all databases: “UNLOCK TABLES”
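Put together, steps 1-5 look roughly like this. The tricky part is that FLUSH TABLES WITH READ LOCK only holds while the client connection stays open, so the statements must go over one long-lived mysql session - here kept open through a fifo. Credentials, the mount point and the crude sleep-based synchronization are simplifications for the sketch; ec2-consistent-snapshot handles this properly:

```shell
#!/bin/bash
# Hot MySQL backup: global read lock -> FS freeze -> machine image ->
# unfreeze -> unlock. One mysql session is kept open through a fifo,
# because the global lock dies with the connection.

MOUNT_POINT=/var/lib/mysql

backup_mysql() {
    local instance_id=$1
    local fifo
    fifo=$(mktemp -u)
    mkfifo "$fifo"
    mysql < "$fifo" &
    exec 9> "$fifo"
    echo "FLUSH TABLES WITH READ LOCK;" >&9
    sleep 2   # crude: give the lock time to be acquired
    fsfreeze -f "$MOUNT_POINT"
    ec2-create-image --no-reboot -n "mysql-backup-$(date +%s)" "$instance_id"
    sleep 5   # the empirical CreateImage pause from the previous section
    fsfreeze -u "$MOUNT_POINT"
    echo "UNLOCK TABLES;" >&9
    exec 9>&-
    wait
    rm -f "$fifo"
}
```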

Now we have an AMI object with a hot MySQL backup, and we have done everything possible so that a server starts from this backup as quickly as possible.
Thus we managed, quite simply, to make an incremental backup of a MySQL server into S3 as often as once every 5 minutes, with the possibility of quickly bringing it into production. If the server takes part in replication, it will usually recover without problems - provided you did not forget to set conservative replication settings in the config (or you can quickly bring it back to work manually):
sync_binlog = 1
innodb_flush_log_at_trx_commit = 1
sync_master_info = 1
sync_relay_log = 1
sync_relay_log_info = 1

How to script actions with Amazon?


For the system administrator there are convenient utilities that call the REST methods of the Amazon API. You download several sets of utilities, one for each web service used, and script the calls in bash. Here is an example script that changes a server's hardware type:
#!/bin/bash
# Change cluster node hw type

# Which node to change hardware?
NODE_INSTANCE_ID=$1

# To which hw-type to change?
# Some 64-bit hw types: t1.micro (1 core, 613M), m1.large (2 cores, 7.5G), m1.xlarge (4 cores, 15G),
# m2.xlarge (2 cores, 17G), c1.xlarge (8 cores, 7G)
NODE_TARGET_TYPE='c1.xlarge'

# To which reserved elastic ip to bind node?
NODE_ELASTIC_IP=$2

ec2-stop-instances $NODE_INSTANCE_ID

while ec2-describe-instances $NODE_INSTANCE_ID | grep -q stopping
do
    sleep 5
    echo 'Waiting'
done

ec2-modify-instance-attribute --instance-type $NODE_TARGET_TYPE $NODE_INSTANCE_ID
ec2-start-instances $NODE_INSTANCE_ID
ec2-associate-address $NODE_ELASTIC_IP -i $NODE_INSTANCE_ID

For developers there are libraries for working with the Amazon API in many languages. Here is one for PHP - AWS SDK for PHP .
As you can see, scripting work with Amazon objects is easy.

PS


The architecture of our project is described here . Besides backups, I think it is worth writing about a simple cluster autoscaling technique and about switching traffic between data centers. On May 22 we are holding a FREE seminar on web clusters and high load, in the 1C conference hall.
I hope it will be interesting - come along :-)

Source: https://habr.com/ru/post/143935/

