
There is a master; the master unexpectedly fails, but the system keeps working because clients migrate to the second database. Another task is making backup copies of the database. If you take backups from the main database, you can run into performance problems and increased response times, which is bad. That is why a fairly common use of asynchronous replication is taking a backup from a slave. Another example is moving heavy queries off the master to a slave, from the main database to the second one, for example building reports.
Sometimes the application needs to receive all updates from the database, preferably in real time. For that there is an open source library called libslave.

This article is a transcript of a talk by Oleg Tsarev ( zabivator ); a year after giving it, Oleg published another article on this topic, PostgreSQL vs MySQL .
Follow the link on the slide and read it; it is a great article.
If we put everything together, we will get something like this:

We have one master and a bunch of slaves: a failover slave in case the master goes down, a slave for backups, a slave for building reports, and several slaves that relay changes to a component called bannerd (that is the name of the daemon; it works through libslave). There are a lot of those daemons, which is why there are such relay proxies.
We have a fairly large project with a fairly large database; we have to be available all the time, our service must not go down. We serve advertising, the load is quite serious, and replication is used everywhere.
The main property of a database is that it guarantees the "all or nothing" principle, i.e. changes happen either entirely or not at all. How is this done, how does the database guarantee data integrity?

There is a server with RAM and a disk, and clients that query the database. Memory is organized in pages. When a request to update data comes in, the page is modified first in RAM and only later makes it to disk.
The problem is that the "all or nothing" principle cannot be guaranteed directly by modern hardware; those are physical limitations of the real world. Even for RAM, transactional memory appeared only recently, with Intel. So how do we live with this? The solution is the journal:

We write what we want to do in a separate place, the journal. First we write all the data to the log, and only after the log is persisted to disk do we change the data in memory itself. Then, possibly much later, that data reaches the disk. So the journal solves many problems; data consistency is just one of its functions.
This algorithm is called Point In Time Recovery, or PITR. I suggest reading more at these links:

This is very informative.
The main questions facing the developer of any database are:
- how to organize the journal?
- how to write it?
- how to write less of it?
- how to make it work faster?
- what does all this have to do with replication?
The most direct way to replicate is to copy the log from the master to the slave and apply it there. The log is what guarantees data consistency on the master; the same mechanism can be applied on the slave. So we get replication while adding almost nothing to the system.
This is how PostgreSQL works. Its journal is called the Write-Ahead Log, and physical changes go into it, i.e. page updates. There is a page in memory with some data on it, we did something to it, and that difference is what we write to the log, which then goes to the slave.
How many logs does MySQL have? Let's figure it out. Initially MySQL had no logs at all. There was the MyISAM engine, and it has no journal.
In the picture you can see the thing called Storage Engines:

A Storage Engine is the component that decides how data is written to disk and how we can read it back, search it, and so on.
Then replication was bolted on. Replication is one line in the leftmost box, Management Services & Utilities.
Replication required a log, so one started to be written. It is called the Binary Log. Nobody thought about using it for anything else; they just made it.
Around the same time MySQL got a new Storage Engine called InnoDB. It is very widely used, and InnoDB has its own journal. So there were now two journals: the InnoDB log and the Binary Log. This moment became the point of no return, after which problems appeared that are very difficult to solve.
The Binary Log is not used for Point In Time Recovery, and the InnoDB undo/redo log is not used for replication. So PostgreSQL has one journal while MySQL effectively has two, and on top of that the Binary Log, which is needed for replication, comes in two or three formats (types).
The very first type to appear, and the easiest to implement, is the Statement-based Binary Log. What is it? It is simply a file where transactions are written one after another, sequentially. It looks like this:

Each transaction records the database against which the updates are made and the timestamp of the transaction's start time, followed by the transaction itself.
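A minimal illustration of what one statement-based entry conceptually contains (schematic; not the exact output of any particular tool, and the database, table and values are made up):

    -- Schematic sketch of a single statement-based binlog entry
    use `shop`;                -- the database the updates are made against
    SET TIMESTAMP=1418659200;  -- start time of the transaction, replayed as-is on the slave
    BEGIN;
    UPDATE orders SET status = 'paid' WHERE id = 42;  -- the original query text itself
    COMMIT;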
The second type is called Row-based replication. This is a journal where it is not the queries themselves that are written, but the rows they change. An entry consists of two parts, the BEFORE image and the AFTER image:

In the picture the BEFORE image is above and the AFTER image is below.
The BEFORE image holds the rows as they were before the transaction. The rows marked in red are the ones being deleted:

They are present in the BEFORE image at the top but absent from the AFTER image at the bottom, which means they were deleted.
In the following picture the green rows are the ones being added:

The blue rows appear in both the BEFORE image and the AFTER image; those are updates.
The problem with this approach is that until recently all columns were written to the row-based log, even if we updated only one of them. In MySQL 5.6 this was fixed, so things should be lighter now.
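For reference, a hedged sketch of the relevant settings: binlog_format selects the log type, and binlog_row_image is the MySQL 5.6+ option that controls how much of each row gets logged.

    -- Sketch: row-based logging, and (5.6+) logging only the changed columns
    SET GLOBAL binlog_format = 'ROW';        -- write row images instead of query text
    SET GLOBAL binlog_row_image = 'MINIMAL'; -- 5.6+: log only the key columns and the changed columns
                                             -- (the default 'FULL' logs every column of every affected row)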
There is one more Binary Log type, Mixed-based. It works either as Statement-based or as Row-based depending on the query, but it is not widely used.
Which of these journals is better?
First, let's talk about relational tables. It is often thought that a relational table is an array. Some even think it is a two-dimensional array. In fact, it is a much more complicated thing: a multiset, a collection of tuples of a certain kind over which no order is defined. There is no order in an SQL table. This is important. As a result, when you run SELECT * against a table (scan all records), the result may change: the rows may come back in one order or in another. This must be remembered.
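A minimal illustration (hypothetical table and data) of why row order cannot be relied on without an explicit ORDER BY:

    -- Hypothetical example: the same query may legally return rows in different orders
    CREATE TABLE t (val INT);          -- no primary key, no index, no implied order
    INSERT INTO t VALUES (3), (1), (2);

    SELECT * FROM t;                   -- may return 3,1,2 or 1,2,3 or any other order
    SELECT * FROM t ORDER BY val;      -- only an explicit ORDER BY guarantees 1,2,3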
Here is an example of a query that will not work correctly under Statement-based replication:

We dropped the primary key from the table and added a new auto-increment one. The master and the slave have the rows in different orders, so the new keys are assigned differently and we end up with inconsistent data. This is a feature of Statement-based replication, and there is not much that can be done about it.
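A hedged sketch of what such a statement might look like (the table and column names are made up):

    -- Hypothetical schema: users(email VARCHAR(255) PRIMARY KEY, name VARCHAR(255)).
    -- The statement replaces the natural key with a new auto-increment surrogate key.
    -- Under statement-based replication it is replayed as-is on the slave, and the id
    -- values are assigned in whatever physical row order the slave happens to have,
    -- which may differ from the master's, silently producing inconsistent data.
    ALTER TABLE users
        DROP PRIMARY KEY,
        ADD COLUMN id INT NOT NULL AUTO_INCREMENT PRIMARY KEY;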
This is a quote from the official MySQL documentation:

The recommendation is to create another table, pour the data into it, and then rename it. This peculiarity can bite you in the most unexpected places.
The next slide is probably one of the most important in the report; it shows how replication can be classified:

Working at the storage level, as PostgreSQL does, is called physical replication: we work directly with pages. Row-based replication, where we store the set of tuples before and after a transaction, is logical replication.
Statement-based replication works at the query level. That is not how it is normally done, but it was done... There is an important and interesting consequence: Row-based replication, i.e. logical replication, does not know exactly how the data is stored on disk. So for replication to work, some of the work has to be performed in memory.
It also turns out that physical replication (PostgreSQL, InnoDB) is bottlenecked mainly by the disk, while MySQL replication, both Row-based and Statement-based, is bottlenecked mainly by the slave. With Row-based you just have to find the rows and apply the update; with Statement-based everything is much worse, because you have to execute the query again. If a query ran for, say, half an hour on the master, it will run for half an hour on the slave as well. That is replication, but a rather unsuccessful kind.
In addition, PostgreSQL writes to disk in two places: the data store and the log. MySQL has three such places: the storage (tablespace), the log (undo/redo log), and the Binary Log used for replication, i.e. it has to write 1.5 times as much to disk. MySQL has a nice architecture, but problems like this come with it.
Many have seen lagging MySQL replicas. How do you find the cause of a slow replica? Diagnosing it is hard. MySQL has a diagnostic tool called the slow query log: you can open it, find the heaviest queries at the top, and fix them. With replication this does not work. You have to do a statistical analysis, reading statistics to see which tables started being hit more often. Doing that manually is very hard.
In MySQL 5.6/5.7 the slave PERFORMANCE SCHEMA appeared, which makes such diagnostics easier. We usually open the commit log in puppet and look at what we rolled out at the moment replication started falling behind. Sometimes even that does not help, and you have to go around all the developers and ask what they did and whether they broke replication. It is sad, but you have to live with it.
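A hedged sketch of the kind of query this makes possible (the table below is from the MySQL 5.7 performance_schema; 5.6 exposes less):

    -- Sketch: inspecting the slave applier threads via performance_schema (MySQL 5.7)
    SELECT WORKER_ID, SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
    FROM performance_schema.replication_applier_status_by_worker;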
In asynchronous replication there is a master that we write to and a slave that we only read from. The slave should not affect the master, and in PostgreSQL it does not. Unfortunately that is not the case in MySQL. For Statement-based replication, which replicates queries, to work correctly, there is a special flag, and notice that it lives in InnoDB: our architecture separates replication at the top from the storage engine at the bottom, yet for replication to work the storage engine has to, roughly speaking, slow down inserts into the table.
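The talk does not name the flag, so this is only my assumption: one likely candidate is InnoDB's auto-increment lock mode, which keeps generated ids deterministic so that statement-based replication replays correctly.

    -- Assumption: innodb_autoinc_lock_mode as a candidate for the flag referred to above.
    -- Modes 0 ("traditional") and 1 ("consecutive") take auto-increment locks so the
    -- generated ids are deterministic and statement-based replication stays correct;
    -- mode 2 ("interleaved") allows faster concurrent inserts but is only safe with
    -- row-based logging. The variable is not dynamic, so it is set in my.cnf:
    --   innodb_autoinc_lock_mode = 1
    SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';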
Another problem is that the master executes queries in parallel, i.e. simultaneously, while the slave can only apply them sequentially. The question arises: why can't the slave apply them in parallel? In fact, this is not easy. There is a transaction serializability theorem that tells us when queries can be executed in parallel and when they must run sequentially. It is a separate, complex topic; dig into it if you are interested, for example starting from this link:
http://plumqqz.livejournal.com/387380.html .
In PostgreSQL replication is bottlenecked mainly by the disk. The disk does not parallelize, and a single apply thread does not really bother us, since we are waiting on the disk anyway and consume almost no CPU.
In MySQL replication is bottlenecked by the CPU. It is a lovely picture: a large, powerful server with 12 cores, and only one core is busy, occupied by replication. Because of that the replica is choking. It is very sad.
To execute queries in parallel there is transaction grouping. InnoDB has a special option that controls exactly how we group transactions and how we write them to disk. The problem is that we could group them at the InnoDB level, but at the level above, the replication level, this functionality did not exist. In 2010 Kristian Nielsen from MariaDB implemented such a feature, called Group Binary Log Commit; we will talk about it a bit later. So we duplicate the journal (and it is a rather complicated data structure) at two levels, the Storage Engine and replication, and we have to drag features from one level to the other. It is a complex mechanism. Moreover, we have to write consistently to both journals at once, a two-phase commit. That is even worse.
In the next picture we see two graphs:

The blue graph shows how InnoDB scales when we add threads to it: throw in more threads and the number of transactions it processes grows.
The red line shows the situation with replication enabled: we turn on replication and lose scalability, because the Binary Log is written synchronously. Group Binary Log Commit solves this.
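For context, a hedged sketch of the durability knobs involved in this two-log write path (names as in stock MySQL; the exact tuning depends on the version):

    -- Sketch: the settings that make every commit hit both logs
    SET GLOBAL sync_binlog = 1;                      -- fsync the binary log on every commit
    SET GLOBAL innodb_flush_log_at_trx_commit = 1;   -- fsync the InnoDB redo log on every commit
    -- MySQL 5.7.6+ can deliberately delay commits so that more of them share one binlog fsync:
    -- SET GLOBAL binlog_group_commit_sync_delay = 100;  -- microseconds to wait, forming larger groups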
It is sad that you have to do this because of the separation: the Storage Engine below, replication above. It is all rather bad. In MySQL 5.6 and 5.7 this problem is solved: there is Group Binary Log Commit, and the master no longer lags. Now they are trying to use it for replication parallelism, so that on the slave queries from one group can be run in parallel. Here is what you need to tune for that (see the sketch after the slide):

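A hedged sketch of the knobs I believe are meant here, the MySQL 5.6/5.7 multi-threaded slave settings:

    -- Sketch: enabling the multi-threaded slave applier (MySQL 5.6/5.7)
    STOP SLAVE SQL_THREAD;
    SET GLOBAL slave_parallel_workers = 4;              -- number of applier threads on the slave
    SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';   -- 5.7: transactions that group-committed
                                                        -- together on the master may run in parallel
    START SLAVE SQL_THREAD;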
Since October 2013, because we have a lot of data, replication is constantly lagging and everyone is upset, I have been trying out this parallelism. Perhaps I misunderstood something or misconfigured something; there were many attempts, and the results look like this:

The blue graph is MySQL 5.5.
The Y axis is CPU consumption on the slave, the X axis is time.
On this graph we can see when replication started catching up with the master and when it finished. An interesting picture emerges: 5.5 with one thread works about the same as parallel replication in 5.7 with four threads. That is, 5.7 consumes more CPU (the green line) but finishes in the same time, with four threads, four streams of work. If you run 5.7 with a single thread, it works worse. This is a kind of regression; they wanted to fix it in 5.7.5, but I checked and the problem is still there. It is so on my benchmarks and on production tests, it is a given. I hope it gets fixed.
There is another problem: to migrate without stopping the service, at some point MySQL 5.5 will be running on the master and 5.7 on the slave. 5.5 has no Group Binary Log Commit, which means 5.7 can only apply in a single thread. That means our 5.7 replica will start falling behind and never catch up. As long as the regression with single-threaded 5.7 replication exists, we cannot migrate; we are stuck on 5.5 with no choice.
Now the most interesting part: I will summarize everything I have told you, plus what was left out of the report because of the time limit (I have about three hours of material).
First, architecturally there are three types of logs, and there is replication at the physical level and at the logical level. The physical level is pages. PostgreSQL is strong in that absolutely everything goes through its journal: updating tables, creating triggers, creating stored procedures, and so there are fewer problems with it. In MySQL, depending on which replication type is enabled, we get either logical replication or query-level replication.
In principle, each of these journals has its pros and cons, so you need to choose carefully.
Here is where each of them is strong and weak:

Note the asterisk (*) in the first row. Let me explain: the log still has to be copied from the master to the slave, and in addition the slave may, for various reasons, ask the master not to delete the log yet.
In MySQL there are two serious penalties, two ways the master is affected by replication:
- the InnoDB option required for statement-based replication to be correct;
- without Group Binary Log Commit we get no scaling.
Row-based replication in MySQL works better, but it also has its own problems.
Next, the slave. PostgreSQL is bottlenecked by the disk, MySQL by the CPU.
From the point of view of disk consumption things are more interesting here. For example, with Row-based replication in MySQL (PostgreSQL would be about the same) we would get tens of terabytes of logs per day; we simply do not have that many disks to store it all, so we stay on Statement-based. This also matters because if the replica lags, the journal has to be kept somewhere. In this respect PostgreSQL looks worse than statement-based replication.
As for the slave's CPU, it is important for us to build good indexes on the slave so that rows are easy to find and queries run well. That is a rather strange metric: we end up optimizing the slave for replication efficiency, i.e. we want the slave in order to build reports, yet we still have to tune it so that it not only builds reports but also manages to catch up. As for the MySQL parallel slave in 5.6/5.7, we are waiting for it to work well; so far it does not meet expectations.
Another important topic is data consistency.
A PostgreSQL replica is a binary copy of the master. Literally: if you stop writing to the master, let replication run to the end on the slave, stop the processes on both the master and the slave, and do a binary comparison of the PostgreSQL master and the PostgreSQL slave, you will see that they are identical. That is not the case in MySQL. In Row-based replication, which works with a logical view, with tuples, all the updates, inserts and deletes are applied correctly; everything is fine there.
Statement-based replication is another matter. If you misconfigure the master and run certain tricky queries, you can get different results. With queries that work with the database schema (creating tables, building indexes, etc.) it is even sadder: they always travel as raw queries... You constantly have to keep the specifics of Statement-based replication in mind.
With Mixed-based the story is even more interesting: it is sometimes one and sometimes the other, and you have to watch everything.
Flexibility. MySQL really is better at the moment, because its replication is more flexible. You can build different indexes on the master and the slave, you can even change the data schema; sometimes that is necessary, and PostgreSQL currently has no such capability. Besides, MySQL has libslave, which is a very powerful thing that we love very much. Our daemons pretend to be MySQL slaves and constantly receive updates in real time. Our delay is about 5 seconds: a user saw a banner or clicked on it, the daemon aggregated it all and recorded it in the database, and 5 seconds later the daemon that serves banners knows about it. PostgreSQL has no such tool.
However, PostgreSQL has the following plans. First, there is such a thing as Logical Log Streaming Replication, a way of transforming the Write-Ahead Log: for example, we do not want to replicate all the tables of a database, only a part of them.
Logical Log Streaming Replication lets the master specify which tables will be sent to the slave. There is also Logical Decoding, a way to render what is in the PostgreSQL Write-Ahead Log in a readable form. In effect, if we can print out what is happening on the slave, or more precisely what arrived through the Write-Ahead Log, it means we can programmatically implement everything that libslave does: an insert, update or delete arrives, the right callback fires, and we learn about the change. That is Logical Decoding (a small sketch of what it looks like in SQL follows the conclusions below).
The conclusion from this is quite interesting: it is best to build one proper journal. In PostgreSQL the journal is proper; all data updates, all schema changes and everything else go into it. That gives a bunch of benefits, for example replication that works correctly and is bottlenecked only by the disk, not the CPU. Having such a journal, you can then add a certain set of patches to the engine itself, for the master and for the slave, which gives more flexibility, i.e. filtering tables, having different indexes.
But for historical reasons MySQL ended up with a bad journal; MySQL is a hostage of its own history. To solve the performance and correctness problems that appeared in MySQL, you would have to rewrite the whole architecture around the Storage Engine, which is unrealistic. I think it will not take long for PostgreSQL to catch up with MySQL.
And finally, even if you have not understood much of this report, or still need to work through it, remember the two most important conclusions:
- Replication is not a backup.
- A table is not a two-dimensional array but a homogeneous multiset of tuples; that is the correct view from the point of view of computer science.
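Returning to Logical Decoding mentioned above, a minimal hedged sketch of how it can be tried from SQL on PostgreSQL 9.4+ with the built-in test_decoding output plugin (the slot and table names are made up; it requires wal_level = logical and a spare replication slot):

    -- Sketch: peeking at committed changes via PostgreSQL logical decoding (9.4+)
    SELECT * FROM pg_create_logical_replication_slot('demo_slot', 'test_decoding');

    INSERT INTO users (name) VALUES ('alice');    -- hypothetical table; any committed change will do

    -- Each committed change comes back as a textual event, much like a libslave callback would see it:
    SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);

    SELECT pg_drop_replication_slot('demo_slot'); -- clean up so WAL is not retained forever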
Many people helped me prepare this report:
Contacts
zabivator
This report is a transcript of one of the best talks at HighLoad++, the conference for developers of high-load systems. We are now actively preparing for the 2016 conference; this year HighLoad++ will take place in Skolkovo on November 7 and 8.
The topic of replication is eternal :) This year we will touch on it, in one way or another, in two talks.