This report is intended for people who know the word "replication", know that it exists in MySQL, and perhaps once set it up, spent 15 minutes on it and forgot about it. Beyond that, they know nothing about it.
We will walk through a bit of theory and try to explain how it all works inside, and after that you can dive into the documentation with tripled strength.
What is replication, in principle? It is the copying of changes. We have one copy of the database, and for some purpose we want another one.
Replication comes in different flavors, and there are several axes along which you can compare the variants.
What the report will not cover: the concrete setup commands and syntax. All of that is on the Internet, and there is no point in retelling syntax.
A funny fact: if you think about it a little, replication in principle only helps us scale reads. That is a somewhat non-obvious conclusion. The reason is this: if we need to apply a certain volume of changes to the same copy of the data, and that particular copy is served by one server, then that server can sustain only a certain number of updates per second, and you cannot push more into it. Say the server can update 1000 records per second, but not 2000. What changes if you attach a replica to that server, whether in master-slave or master-master mode? Can you push the second thousand updates onto that replica? The correct answer is no.
In master-master mode you can, of course, send additional updates to the replica; the trouble is that when they arrive back at the first master and it tries to apply that second thousand of updates, the capacity is simply not there. It is important to understand, and not confuse, two almost obvious points: replication is about one thing, while splitting the data is about another, and if you need to scale not reads but writes, you will have to do something else — replication will not help much.
That is, replication is mostly about reads.
Synchrony is a guarantee of two things: visibility and durability. Visibility in the sense that once the commit has completed and the transaction is committed, everything is fine, the data is visible on one or several nodes of the cluster and can participate in subsequent queries. Durability means that the data exists, in principle, on more than one server, so the transaction is not lost, even though it might not yet be visible everywhere.
The eternal refrain is: "the commit finished successfully — what does that actually mean?" A synchronous commit means that both the local and the remote commit (on at least one replica) have completed: we committed something on this machine, and if we are in synchronous replication mode, those changes are also successfully committed and visible to subsequent queries on the local machine and on at least one remote machine. This means that if a standard emergency happens — a crowbar hits one server and smashes everything from the CPU down to the disk itself — the data has nevertheless not only been copied to a remote server, but can also instantly, without any additional delay, participate in subsequent transactions.
This is all general terminology, completely unrelated to MySQL. Any distributed system is arranged this way.
An asynchronous commit gives no additional guarantees — it is a matter of luck.
A semi-synchronous commit is a pleasant middle ground: the local commit has completed, and nothing definite is known about the remote commit — maybe the slave has caught up, maybe it has not — but at least we received confirmation that the data flew over, was accepted there and was probably written.
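For reference, in MySQL 5.6 semi-sync ships as a pair of plugins that you load and switch on explicitly; a minimal sketch (the plugin file names are the usual Linux ones, adjust for your platform):

-- on the master
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000;  -- ms to wait for a slave ack before falling back to async

-- on the slave
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;

Note the timeout: if no slave acknowledges in time, the master quietly degrades to plain asynchronous mode, which is exactly the "probably written" spirit described above.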
Now about servers for writes: what topologies of replication exist.
Classic master-slave: all changes go to one server and are then copied to a mass of replicas.
True master-master: changes flow into a bunch of masters at once and somehow propagate from one to another, from that one to a third, and between all of them, which gives rise to a number of joys and automatically to a number of problems. It is clear that when you have one "golden copy" and several replicas of it, which should (ideally instantly) repeat that "golden copy", everything is relatively simple in terms of how to push data back and forth and what to do on each particular copy. The interesting "headache" begins with master-master — and, I emphasize, not specifically with MySQL, but purely in theory. What do you do if two nodes simultaneously try to run the same transaction, which changes the same data and, to keep the example simple, changes it in different ways? Clearly we cannot apply both changes at once. At the moment we start changing something on one node, there is still nothing on the second node. Conflict. One of the transactions will have to be rolled back. On top of that, separate "dances" begin around clock synchronization and so on.
A curious point: even the variant where all the changes from all the masters eventually have to spread everywhere still does not help with that very write bandwidth. A shame, but that is how it is.
A pleasant option is called "master-slave + request routing". It is pleasant because it is easy to program: you have one main copy and you replicate it to a bunch of machines. That is much simpler than master-master, where everyone has equal rights and so on, but from the application's point of view it still looks as if you have many write points: you come to any node, it knows where to send you, and routes you successfully. Well, and reads scale — that is the whole happiness of replication: you can read everything from all points, always.
Now closer to the databases themselves: the "magic" of statement-based, row-based and so on — the format of the changes.
What can you send? You can send the queries themselves, or you can send only the changed rows. I emphasize: we have not yet dived into the wilds of MySQL — any DBMS has queries that generate a large number of changes, i.e. update a lot of data. The question arises: what exactly do we copy? We can push the queries themselves back and forth between the nodes, or we can push only the changed data. Interestingly, both ways are very bad! You can also try to mix them.
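In MySQL this choice is exposed as the binlog_format setting; a minimal sketch (changing it on a live server requires the appropriate privileges):

SET GLOBAL binlog_format = 'STATEMENT';  -- SBR: the query text goes into the binary log
SET GLOBAL binlog_format = 'ROW';        -- RBR: the changed rows go into the binary log
SET GLOBAL binlog_format = 'MIXED';      -- the server picks, per statement, whichever is safer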
One more point about replication: the distribution model. Probably somewhere the push-based model has not yet completely died out, where the node that made a change has to push it to all the other nodes. From the point of view of programming and tracking state, this is still a pain. That is why pull-based wins: pulling updates from one node or another is much easier to program than monitoring a chaotic cluster of your replicas from a single node.
So much for the common terminology. Now let us move on to what was actually done in MySQL.
MySQL itself is a kind of deception. There is a logical layer called MySQL, which handles everything common and isolated from data storage — networking, the optimizer, caches, etc. The concrete physical layer responsible for storing the data lives one floor below. There are several built-in engines and a set of pluggable ones, but even the built-in MyISAM, InnoDB and so on live on the physical layer. The pluggable architecture is cool — you can plug in a new engine — but an immediate non-optimality appears. In principle, the transactional write-ahead logs (WAL) that the physical storage layer writes anyway would be good to reuse for replication, and if the system knew that a specific physical layer exists, or were coupled tightly enough with that physical layer, it could avoid writing a separate log at the logical level and use that same WAL. But in MySQL this is conceptually impossible — or rather, if you change the pluggable storage engine interface so that it becomes conceptually possible, it would be an enormous amount of work.
Replication is implemented at the level of MySQL itself. There is good in this too: in addition to the one log buried deep in the storage engine's internal data, there is a more or less logical log — possibly at the statement level — maintained separately from the engine. And that is "extra" safety, and so on. Plus, since there are no constraints from below, you can do creative things like swapping the engine on the fly.
In these terms, MySQL 4.1 implemented the following: master-slave, pull-based, strictly asynchronous and strictly SBR. If you are stuck in the ancient 4.x epoch, things are probably bad for you. The 5.x versions are almost 10 years old — it is time to upgrade.
It is funny to trace through the versions how people stepped on all sorts of rakes and, when nothing could be done about it, bolted a new rake onto the old one so that life would not be so painful. So, in version 5.1 they bolted on RBR to compensate for the inevitable problems with SBR, and bolted on mixed mode. In version 5.6 they bolted on a few more nice pieces: semi-sync, the delayed slave, GTID.
One more thing. Since MySQL is a common layer on the one hand and a bunch of pluggable engines on the other, including built-in ones, from a certain point there has been the divine NDB Cluster, about which cool things are told: fully synchronous master-master replication, a highly available in-memory database... But there is one nuance — as soon as you start looking for people who run NDB Cluster in production, there are very few such people.
What does the master do when you decide to enable replication? There are surprisingly few extra motions on the master. As usual, we accept queries over the network, parse them, run transactions, commit them, etc. On top of that, at the logical MySQL level, the master starts maintaining a binary log — a file, not quite text, into which all the changes are streamed. The master can also serve these logs over the network. All of this is very simple and seems to work.
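A minimal sketch of what switching the binary log on typically looks like (the option names are the standard ones, the concrete values are illustrative):

# my.cnf on the master
[mysqld]
server-id     = 1
log-bin       = mysql-bin     # start writing the binary log
binlog_format = MIXED

-- and in SQL: an account for the slaves, plus a look at the current log coordinates
CREATE USER 'repl'@'%' IDENTIFIED BY 'secret';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
SHOW MASTER STATUS;   -- current binlog file name and position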
What does the slave do? It is better not to push changes to the slave, because you can end up in an incomprehensible state. The slave has a bit more work. In addition to keeping one more log and serving it on request, there is a thread that goes to a remote master — possibly even more than one — and downloads the binary log from there. The decision "let us go to several remote masters and download different logs from them" is ambiguous: on one hand it is not bad, on the other it produces instant divergence. You cannot simply copy the files physically over SCP: a log accumulates on the slave server, we track our position in it locally, pull events over the network and append them into a separate log — the relay log — and another thread then runs and tries to replay these local logs. The most hellish thing, in my opinion, is that up to version 5.6 a transaction in the log was identified by file name and position on the master. An interesting decision.
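That file-plus-position bookkeeping is exactly what you type in by hand when pointing a slave at a master; a sketch (host, credentials, file name and offset are made up):

CHANGE MASTER TO
  MASTER_HOST     = 'master.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000042',  -- file name on the master
  MASTER_LOG_POS  = 107;                 -- byte offset inside that file
START SLAVE;
SHOW SLAVE STATUS\G   -- state of the IO and SQL threads, current positions, lag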
Here is the write path a simple insert takes without replication: the application connects to the server, puts the data into the table and hangs up.
With replication there are several additional steps: the writing application goes to the master in exactly the same way, but in addition the data ends up, in one form or another, in the binary log, is then downloaded over the network into the relay log, then gradually replayed from the relay log (if we are lucky and the slave does not lag — replayed immediately) into the table on the slave, and after that it becomes visible to the reader.
What exactly gets into the binary log depends on the SBR/RBR/mixed setting. Where does all this come from? Imagine you are the database. We receive a simple query, "update one specific record": UPDATE users SET x = 123 WHERE id = 456
What do we write to the binary log? Essentially, it makes little difference: we can write the short query itself, or (since it updated one record) we can write the change in one row format or another.
Now another situation. Imagine the same kind of query arrives — small in itself, but changing a lot of data: UPDATE users SET bonus = bonus + 100
Here there is only one efficient option — write the query itself, because the query is some 32 bytes, while it can update an arbitrary number of records: 1,000, 100,000, 1,000,000, as many as you like. Writing the changed rows to the log is inefficient.
And what happens if we put into the log a query as simple as "let us disable all users who have not logged in for a long time": UPDATE users SET disabled = 1 WHERE last_login < UNIX_TIMESTAMP(NOW()) - 100 * 86400
Suddenly, horror. The problem is that if you replicate the query itself, then, first, time is never in sync between two nodes, and besides, because the write path is so long, at replay time that "NOW" will drift. The replica suddenly diverges from the master, and all subsequent changes are, formally speaking, no longer safe and can lead to anything.
Generally speaking, for such queries, regardless of how much data they change, ideally the rows themselves should be copied. In this particular case you can avoid copying the rows and instead pin the constant: write into the log not "NOW" but the specific timestamp the master used when executing the statement.
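That pinning is visible if you inspect a statement-based binary log with mysqlbinlog: roughly, before the statement itself there is an event that fixes the timestamp, and on the slave NOW() resolves to it (the values here are made up):

SET TIMESTAMP=1445802715;  -- the master's clock at execution time
use mydb;
UPDATE users SET disabled = 1 WHERE last_login < UNIX_TIMESTAMP(NOW()) - 100 * 86400;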
There are fun facts that you pick up by accident while diving into the jungle of replication — and you do not even have to dive deep, you trip over them almost at once.
In short, everything is arranged without much cleverness: a stick, a rope, one log, a second log. And even that log has some rather funny "childhood diseases".
For someone who has been using replication for two days, all this is scary and hard. But once you know how simple it really is, it becomes clear how to live with it. And it is better to configure things correctly right away, so that you do not have to pick apart strange garbage later.
For the situation of "the log has rotted, the position is lost, nobody knows what is happening" there are certain tools: we look at the events, trying to understand which transactions have already gone through and which have not, and whether the whole thing can be saved or restored, etc. If you have managed to turn GTID on, life becomes easier.
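A sketch of what switching to GTID looks like in 5.6 — the option names are the standard ones, the values and host are illustrative:

# my.cnf on every server (5.6+); requires the binary log to be enabled
gtid_mode                = ON
enforce_gtid_consistency = ON
log_slave_updates        = ON

-- on the slave: no more file names and positions, just ask for auto-positioning
CHANGE MASTER TO MASTER_HOST = 'master.example.com', MASTER_AUTO_POSITION = 1;
START SLAVE;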
One more observation about replication. It is interesting to watch how a crooked internal design provokes, if not competition, then the creation of additional products. The "magic" Tungsten Replicator, they say, solves the problem called "a single-threaded slave is bad", and if it were not for those inherent difficulties, there would be no additional product that, on the one hand, lets you use this mechanism to move data into other systems and, on the other, solves a number of problems baked into the existing system.
As usual, it is impossible to give universal advice: it helps some people, others will spit hard. But, they say, there are situations in which Tungsten copes well with the inevitable single-threaded lag. I am sure there are all sorts of fascinating tricks in there, but the internal single-threaded slave remains hard.
What if, for some reason, you use replicas as a backup? I think you should bang your head against the wall, because a replica and a backup are two different things. Nevertheless, if you are creative folks and run a reasonably new version, delayed replication saves you on the one hand; on the other hand, if you do not take full backups, nothing will save you.
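For reference, in 5.6 a delayed slave is literally one option (the delay value here is illustrative):

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 86400;  -- keep this slave a full day behind the master
START SLAVE;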
Next, another element of creativity. It is not hard to imagine a situation where the master has filled a whole 10 PB cloud disk with logs, or has saturated the entire network with them, while we do not need 90% of those updates, because we are interested in replicating, say, one specific table or one specific database — and by default everything falls into the binary log in bulk: all changes across all databases, all tables, everything. The solution is again striking in its creativity. On one side there are the settings {binlog|replicate}_{do|ignore}_db, which allow filtering on the master — what gets written to the log and what gets ignored. On the slave, correspondingly, they allow doing the same. That is, on the master we can filter what gets into the binary log — into that funnel which then drains into the network — and on the slave we can put an incoming filter on what arrives from the network, or write only part of the data to disk and then replay, again, only part of the data on the slave. Suddenly even this simple story turns into horror, because the combination "USE one database, but update a table in another database via the db.table syntax" behaves... somehow. And exactly how it will behave is unknown, because the different filters trigger at different moments.
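A sketch of those filters, with the well-known caveat spelled out (database and table names are made up):

# my.cnf on the master: filter what even reaches the binary log
binlog-do-db     = shop
binlog-ignore-db = analytics

# my.cnf on the slave: filter what the SQL thread replays from the relay log
replicate-do-db         = shop
replicate-ignore-db     = analytics
replicate-wild-do-table = shop.orders%

# caveat: with statement-based logging the *-do-db filters look at the current
# default database (USE ...), not at the table the statement actually touches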
There are no built-in niceties like "re-elect the master if it suddenly died" — you have to do it by hand. The lack of cluster-management tooling is, in my opinion, a good thing: it creates competition, it creates additional products. Indeed, if plain MySQL had a really good master-master replication, or at least automatic failover, why would anyone need Galera, Percona/MariaDB Cluster, and so on?
A few more tricks. The implementation of replication is curious: as simple as a stick and a rope, without any checks on one side, and without any tools to make managing a replicating slave cluster more pleasant on the other. That is bad. But on the other hand, out of it you can hand-sculpt configurations so interesting that everyone who comes after you will shudder.
Configuration number 1: a master-master "on the knee", MySQL style, is done like this.
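Roughly the kind of recipe those tutorials give — two servers, each configured as the other's slave, with interleaved auto-increments so they do not hand out the same ids (values are illustrative):

# my.cnf on server A
server-id = 1
log-bin   = mysql-bin
auto_increment_increment = 2   # so the two masters never generate the same auto_increment id
auto_increment_offset    = 1

# my.cnf on server B
server-id = 2
log-bin   = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 2

-- then, on each server, CHANGE MASTER TO point at the other one and START SLAVE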
What is scary is how many idiots there are in the world: google "MySQL master master replication" — every second link is a recipe like this. Hell and Holocaust.
Trick number 2, the catch-all slave, is more pleasant. There are no unnecessary checks of what came from whom, where it should go and what to do with it. Thanks to this, you can do funny things like a slave onto which either a precisely chosen part of the data from a heap of servers is merged, or all the data from all the servers — a server with all the backups. But, I repeat, all there is is replication, i.e. a basic tool that copies table A over to B, and that is it.
And finally, trick number 3 — replace everything. Remember that replication lives at the logical level, which is not connected in any way to the physical storage level. Thanks to this, you can do extremely interesting perversions. You can change the engine "on the fly" for obscure purposes — a true story: they say, replication from InnoDB databases into MyISAM tables just so that full-text search works at least somehow. There is a creative trick called "schema change through replication". What the point is, I refuse to understand, but such tricks exist. And there is a perfectly clear and interesting mode of operation called "paranoid version upgrade through replication".
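The "replace everything" trick rests on the fact that, at the statement level, replication does not care which engine the table on the slave uses; a sketch of the InnoDB-to-MyISAM full-text story mentioned above (table and column names are made up):

-- run on the slave only; the master keeps its InnoDB table
ALTER TABLE posts ENGINE = MyISAM;
ALTER TABLE posts ADD FULLTEXT INDEX ft_body (body);
-- statement-based replication from the master keeps applying to the table as before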
So what did we learn during this report? That, for all its crude simplicity, it is possible to live with this hell if you understand, at least roughly, how it works. That, in essence, is the main message.
In 2015, at the HighLoad++ Junior conference, Andrei Aksyonov gave an updated version of his talk on how replication works in MySQL. We transcribed it as well and published it on our blog.
Source: https://habr.com/ru/post/309326/