
How the Airbnb engineering team “offloaded” the project's main database in a couple of weeks

On our Habr blog we like to break down interesting cases from the practical side of how startups use virtual infrastructure. We also pay attention to experience from abroad, analyzing everything related to running complex IT systems, infrastructure, and hardware.

For example, we have recently covered several such stories.


Today we came across Airbnb's engineering blog and decided to talk about this well-known company's experience. According to its engineers, the service's traffic grows 3.5x every year and peaks in the summer. That certainly pleases management - the business is booming - but it also poses new challenges for the engineering team.
Photo: OuiShare, CC

Airbnb provides an online platform for listing, searching for, and short-term renting of private housing around the world. At first glance it looks like a fairly simple service - why would it need cloud technologies and performance optimization at all?

The answer, of course, is simple: a multimillion user base, plus the constant addition of new regions, which means the infrastructure has to scale “on the fly”. All of this experience is collected in the company's engineering blog.

One of the tasks we particularly liked was database scaling. Willy Yao, one of the engineers, described how the company prepared for the summer load peaks (which is quite logical: summer is the most convenient season for traveling).

As often happens in creative, fast-moving teams, there was a solution that could in theory save several weeks of engineering work. The idea was to use MySQL replication to guarantee data integrity; the goal in such situations is always to avoid creating extra work for developers and to avoid wasting time on data migration.

It is worth noting that the Airbnb blog has repeatedly mentioned that the team uses vertical partitioning by function in order to spread the load and contain possible failures: each independent Java and Rails service gets its own dedicated database, and each database runs on its own RDS instance.
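To make the idea of vertical partitioning by function concrete, here is a minimal sketch of what such a setup could look like on the application side. The service names, endpoints, and helper function are invented for illustration and are not Airbnb's actual configuration: each service simply looks up its own dedicated database, which lives on its own RDS instance.

```python
# Hypothetical sketch: one dedicated database (on its own RDS instance) per service.
# Names and endpoints are illustrative only.
SERVICE_DATABASES = {
    "messages": {"host": "messages-db.example.rds.amazonaws.com", "db": "messages"},
    "search":   {"host": "search-db.example.rds.amazonaws.com",   "db": "search"},
    "payments": {"host": "payments-db.example.rds.amazonaws.com", "db": "payments"},
}

def connection_params(service: str) -> dict:
    """Return the connection parameters for a service's dedicated database."""
    return SERVICE_DATABASES[service]
```

The point of this split is that a failure or overload in one functional database does not drag down the others.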

Photo: Sebastiaan ter Burg, CC

The startup's rapid growth did take its toll on the IT side: a huge amount of data was still sitting in the original database, left over from the days when Airbnb was a monolithic Rails application. On top of that, the database had last been split as long as three years earlier, which made it harder to repeat the procedure at today's data volumes.

In the end, the team decided to rely on MySQL replication to keep the design simple and the effort minimal - a proven technique.

It also helped that the MySQL database runs on Amazon RDS, which makes it relatively easy to create read replicas and then promote a replica to an independent master.
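As a rough illustration of the RDS operations involved, here is a sketch using boto3, the AWS SDK for Python. The instance identifiers, region, and instance class are placeholders, not values from the article.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# 1. Create a read replica of the main database instance.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="messages-replica",
    SourceDBInstanceIdentifier="main-db",
    DBInstanceClass="db.r3.2xlarge",
)

# 2. Wait until the replica instance is available
#    (replication lag has to be checked separately).
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="messages-replica")

# 3. Promote the replica to a standalone master that accepts writes.
rds.promote_read_replica(DBInstanceIdentifier="messages-replica")
```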

The plan was to create new replicas and block writes to specific tables in order to preserve data integrity.
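The article does not describe the exact mechanism used to block writes; one possible way, shown in this sketch, is to revoke write privileges from the application user on the affected tables while reads keep working. The host, credentials, user, and table names below are placeholders.

```python
import pymysql

TABLES_TO_FREEZE = ["messages", "message_threads"]  # hypothetical table names

conn = pymysql.connect(host="main-db.example.rds.amazonaws.com",
                       user="admin", password="secret", database="airbnb_main")
try:
    with conn.cursor() as cur:
        for table in TABLES_TO_FREEZE:
            # Remove write access for the application user on this table only.
            cur.execute(
                f"REVOKE INSERT, UPDATE, DELETE ON airbnb_main.{table} FROM 'app_user'@'%'"
            )
        cur.execute("FLUSH PRIVILEGES")
    conn.commit()
finally:
    conn.close()
```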

A query analyzer was used to prepare for the move; its main task was to make sure that existing queries relying on cross-references (joins) between tables would keep working correctly after the split.
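Here is a toy stand-in for that analysis step, not Airbnb's actual tool: it scans captured SQL statements and flags any query that mixes a table being moved out with a table staying behind, since such cross-database joins would break after the split. The table lists and the regex are simplified for illustration.

```python
import re

MOVED_TABLES = {"messages", "message_threads"}        # hypothetical
REMAINING_TABLES = {"users", "listings", "reviews"}   # hypothetical

TABLE_RE = re.compile(r"\b(?:from|join)\s+`?(\w+)`?", re.IGNORECASE)

def flag_cross_database_queries(queries):
    """Yield queries that reference both moved and remaining tables."""
    for sql in queries:
        tables = set(TABLE_RE.findall(sql))
        if tables & MOVED_TABLES and tables & REMAINING_TABLES:
            yield sql

sample = [
    "SELECT * FROM messages WHERE thread_id = 42",
    "SELECT u.id FROM users u JOIN messages m ON m.user_id = u.id",
]
print(list(flag_cross_database_queries(sample)))
# Only the second query is flagged: it joins a moved table with a remaining one.
```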

The plan also had to account for naming: renaming the database would have meant updating every reference to it in the data pipelines, so in the end the database was not renamed - the old and new databases kept the same name.

Next, the team had to estimate how a short outage of the inbox (messaging) service - up to 10 minutes - would affect customer support, and pick the least busy time window for the maneuver. The rough plan looked like this (a condensed sketch of the sequence follows the list):

1) adjust the queries for incoming messages so that changing the database host in the next step requires no code changes - existing tooling can update the connection settings;

2) redirect all write traffic for incoming messages to the message master;

3) drop all remaining connections to the message database on the old master;

4) verify that replication has caught up and everything is ready for the switch;

5) promote (“convert”) the message master to a standalone instance (about 3.5 minutes);

6) deploy against the updated message master ahead of the subsequent automatic RDS backup;

7) drop the now-redundant tables from the corresponding databases.
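Below is the promised condensed sketch of this cutover sequence. Only the RDS promotion and waiter calls are real boto3 operations; the application-side steps (re-pointing traffic, closing connections, dropping old tables) are placeholder functions, and all identifiers are invented.

```python
import boto3

rds = boto3.client("rds")

def redirect_message_writes(new_host):
    """Placeholder: update application config so message writes go to new_host."""

def close_message_connections_on_old_master():
    """Placeholder: drop remaining connections to the message tables on the old master."""

def replication_caught_up():
    """Placeholder: verify the replica has applied all pending binlog events."""
    return True

def drop_moved_tables_from_old_master():
    """Placeholder: remove the migrated tables from the main database."""

def cut_over(replica_id="messages-replica",
             new_host="messages-db.example.rds.amazonaws.com"):
    redirect_message_writes(new_host)                # steps 1-2: re-point write traffic
    close_message_connections_on_old_master()        # step 3
    assert replication_caught_up()                   # step 4
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)  # step 5 (takes minutes)
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)
    # step 6: deploy against the promoted master, let RDS take its automatic backup
    drop_moved_tables_from_old_master()              # step 7
```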

As a result, the load on the main database's master server dropped noticeably. The project itself took about two weeks; over that time the messaging service saw no more than seven and a half minutes of downtime, and the main database shrank by 20%.

An even more important outcome was the improved stability of the main database, achieved by cutting the number of write queries by 33%.

Source: https://habr.com/ru/post/273583/
