
Fear and Loathing in Distributed Systems



Roman Grebennikov explains the complexity of building distributed systems. This is a talk from HighLoad++ 2016.

Hello everyone, my name is Roman Grebennikov. I work at Findify, where we build search for online stores. But this talk is not about that: at Findify I work on distributed systems.
What is a distributed system?



As the slides suggest, "fear and loathing" is a common thing in IT, and distributed systems are far from ordinary. We will first try to understand what distributed systems are, why they cause so much pain, and why this talk is needed.

Imagine you have some application running somewhere. Say it is server-side, but unlike ordinary applications it has some internal state. For example, you have a game: the internal state is a world that players enter, and so on. Sooner or later you grow, your internal state swells, keeps changing, and stops fitting on one server.



Here in the picture Winnie the Pooh is stuck in a hole.

If you have grown and no longer fit on one server, you need to do something.

You have several options.

You can buy a more powerful server, but this may be a dead end, because you may already be running on the fastest server you can get.

You can optimize, but it is not obvious where: there is no button that magically makes everything twice as fast.

Or you can take a very shaky path: building a distributed system.

This path is shaky and scary.

What am I going to talk about today?


First we will talk about distributed systems: a bit of theory, why it matters, what consistency is and how to achieve it, what approaches exist for designing distributed systems that do not lose data (or lose only a little of it), and what tools exist to check whether your distributed system really keeps your data safe, or almost safe.

Since pure theory is boring, we will mix in a little practice. We will take this laptop and write our own small, simple distributed database. Not really a database: not even a key-value store, just a value store. Then we will try to prove that it does not lose data, repeatedly, in particularly nasty situations.

Next, a little philosophy of "how to live with it."

We tend to imagine distributed systems as things that work somewhere far away while we all sit here. They can be servers, or mobile applications that communicate with each other and share some internal state. If you and your friends are texting, you are also part of a distributed system: you have a shared state you are trying to agree on. In general, a distributed system is a thing that consists of several parts which communicate with each other. Everything is complicated by the fact that they communicate with delays and errors, and that makes it all very hard.

A small example from life.

We once wrote a web spider that crawled the Internet, downloading all sorts of pages. It had a big queue of tasks that we kept piling up. The queue had a few basic operations: take something from the queue, and put something into it. We also had a third operation: check whether an item is already in the queue, so that we don't enqueue the same thing twice.



The problem was that the queue was quite large and did not fit into memory. We thought: what is so complicated here? We are smart, we go to HighLoad. Let's cut this queue into pieces and roll it out to different servers; each server will handle its own piece of the queue. Yes, we lose a little in consistency, in the sense that we can no longer take the very first element from the queue, but we can take almost the first one: we just pick a shard, take from it, and all is well. So taking from the queue is almost fine, the logic just got a bit more complicated. Putting into the queue is also easy: we look at which shard the item belongs to and put it there. Checking whether an item is in the queue is no problem either. Yes, the business logic became a bit more complicated, but nothing critical, and no blood is spilled here, it seems.
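Since the implementation lives on slides, here is a minimal sketch of the idea in Scala; the class name, the hash-based shard choice, and the shard count are my assumptions for illustration:

```scala
import scala.collection.mutable
import scala.util.Random

// a toy sharded queue: each shard owns the items that hash into it
class ShardedQueue[T](shards: Int) {
  private val queues = Vector.fill(shards)(mutable.Queue.empty[T])

  private def shardOf(item: T): Int = math.abs(item.hashCode) % shards

  // put: route the item to its shard by hash
  def put(item: T): Unit = queues(shardOf(item)).enqueue(item)

  // take: "almost the first" element, taken from any non-empty shard
  def take(): Option[T] =
    Random.shuffle(queues.filter(_.nonEmpty)).headOption.map(_.dequeue())

  // contains: only the owning shard needs to be checked
  def contains(item: T): Boolean = queues(shardOf(item)).contains(item)
}
```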

What could be the problem


We understand that once we add network interaction and more components, the system becomes less reliable. And if it is less reliable, something will surely go wrong. With software and hardware everything is clear: hardware can be replaced with a newer server, and with software, just don't deploy drunk on a Friday. But the network breaks no matter what you do with it. Microsoft has a great article with failure statistics for network hardware in Windows Azure, broken down by device type. The probability that a load balancer port dies during a year is about 17%. So if you have not planned what to do in case of a failure, sooner or later it will eat you alive.

The most popular problem that happens to a network is a netsplit: the network falls apart in half, either permanently or with packet loss, so it keeps falling apart and coming back together.

What happens to our sharded queue if we have network problems?

If we take something from the queue, and we are ready to take not quite the very first element, then we can simply take from the shards that are reachable and somehow keep living.

If we need to check whether an item is in the queue, things are harder: if the shard we need is unreachable, there is nothing we can do.

But everything truly falls apart when we need to put something into the queue and its shard is unreachable. We have nowhere to put the item, because its shard is somewhere far away. Nothing remains but to lose it, or to stash it somewhere and deal with it later. All because we did not design failures into the system.



In the picture the guy is sad, because he did not design failures into his system. We are smart, and we know that other very smart people have already come up with many ways to live with this problem.



Here the CAP theorem enters the scene.

The CAP theorem is the cornerstone of distributed system design. Formally it is not quite a theorem, more of a rule of thumb, but everyone calls it a theorem.

It goes like this. There are three pillars of building distributed systems: consistency, availability, and partition tolerance. Out of the three we can pick any two. Well, not exactly any two, almost any two; we will talk about that a little later.

Let's go through them in order: what consistency, availability, and partition tolerance are. Since it is called a theorem, there should be formal definitions.

Availability


Availability here means constant availability: every request to any live node of the system must be processed successfully. If we postpone some requests for later, or push writes aside because something went wrong, that is not availability. If not all nodes answer requests, or not all requests get answers, that is also not availability from the point of view of the CAP theorem.

If it answers you like this:



That is also not availability.

Consistency


The next item is consistency. There are a great many different consistency models in distributed systems:



There are about 50 of them. Which one is in the CAP theorem?

The CAP theorem uses the strictest one, called linearizability.

Consistency as linearizability sounds very simple, but it has far-reaching consequences. If operation B started after operation A finished, then B must see the system in the state it had at the moment A completed, or in a newer state. That is, once A has completed, no later operation may see the state as it was before A. Everything seems logical, nothing complicated. To rephrase it: "there exists a non-contradictory history of sequential operations."

Now we will talk more about these histories.



Imagine we have some sort of register: a single slot for a variable, from which we can read only what was written into it before. With one reader-writer we just read and write, nothing complicated. Even with several readers and writers it is not too complicated.



But as soon as we move from the slide to the real world, the diagram looks a bit different, because of network delays. We do not know exactly when the write happened: somewhere between w and w1, but exactly where, we do not know. The same with the reads. In terms of histories, we can write down, for example, three simple ones.



First we read a, then we wrote b, then we read b, exactly as in the picture. Another history is also possible: we read a, we read a again, and then we wrote b; this also matches the picture.

The third history: we read a, then suddenly read b, and only then write b. It contradicts itself, because we read b before it was written. Such a history is not linearizable. Linearizability requires that there exists at least one history that does not contradict itself; then your system is linearizable. There may be several such histories.
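As a small illustration in plain Scala (the names are mine, and a real linearizability checker must additionally enumerate the operation orderings permitted by their start and end times), a sequential history is self-consistent if every read returns the value most recently written:

```scala
// operations over a single register
sealed trait Op
final case class Read(v: String) extends Op
final case class Write(v: String) extends Op

// walk the history, tracking the register's current value
def selfConsistent(history: List[Op], state: String): Boolean =
  history match {
    case Nil              => true
    case Read(v) :: rest  => v == state && selfConsistent(rest, state)
    case Write(v) :: rest => selfConsistent(rest, v)
  }

// the three histories from the talk, with the register initially holding "a"
selfConsistent(List(Read("a"), Write("b"), Read("b")), "a") // true
selfConsistent(List(Read("a"), Read("a"), Write("b")), "a") // true
selfConsistent(List(Read("a"), Read("b"), Write("b")), "a") // false: b read before written
```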

Partition tolerance


The last item is the letter P: partition tolerance, or resistance to network failures. It looks like this:



Imagine you have several servers, and early in the morning an excavator drove by and cut the wires between them. Your cluster has split in half, and you have two options. First: the bigger half keeps living, and the smaller half is cut off. You have lost availability, because the smaller half is down, but you have not lost consistency. Second: both halves keep working, accepting writes here and there, and everything seems fine. Only later, once the wire is soldered back together, do you realize that you used to have one system, and now you have two, each living its own life.

From the point of view of the CAP theorem, we have three possible approaches to system design: CP, AP, and CA systems, depending on which two of the three we pick.

There is a problem with CA systems. On the one hand they promise both high availability and consistency, and everything is cool until the network breaks. And since that happens often, in the real world CA systems can be used only if you understand the trade-offs you are making.

In the real world there are two choices: you can lean toward consistency and lose availability, or lean toward availability and lose consistency. There is no third option.

In real life there are many algorithms implementing various CP/AP/CA systems: two-phase commit, Paxos, quorums, all sorts of Rafts, Gossip, and heaps of others.

We will now try to implement some of them and see what happens. You might say: "10 minutes in, our heads have already exploded, and we only just got here." So let's try to do something in practice.

What we will do in practice


We will write a simple master-slave distributed system. For this we take Scala and Docker and pack everything up. Our master-slave system will have asynchronous/synchronous replication. Then we take Jepsen and show whether we actually wrote it all correctly, and we will try to explain the result after we run it. What is Jepsen? I will tell you a little later; surely many of you have heard of it but have not seen it with your own eyes.

So, master-slave. It looks like a basic thing. The client sends a write request to the master. The master writes it to disk. The master synchronously or asynchronously fans the write out to the slaves, and responds to the client either before it has fanned out the writes or after.

We will try to understand, from the point of view of the CAP theorem, how things stand with consistency, availability, and the rest.

Let's try something.



There is a little prepared boilerplate here; the rest I will write live, karaoke-style. We have two functions that will help us talk to other servers. We will use HTTP as the simplest way for nodes of a distributed system to communicate. Why not?

We have two functions. One writes given data to a given node:



Another one reads data from a given node, and does all of this asynchronously:



There is also a useful helper that parses a Response: it takes a Response and returns a String:
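The code itself is on slides that are not reproduced here. Below is a minimal sketch of what such helpers might look like, assuming akka-http; the function names, the /local endpoint, and port 8000 follow the rest of the talk, while the exact API calls are my guess:

```scala
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.http.scaladsl.unmarshalling.Unmarshal
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("db")
import system.dispatcher // ExecutionContext for Future transformations

// write some data to the given node's internal endpoint
def writeTo(node: String, data: String): Future[HttpResponse] =
  Http().singleRequest(HttpRequest(
    method = HttpMethods.POST,
    uri    = s"http://$node:8000/local",
    entity = HttpEntity(data)))

// read whatever value the given node currently stores, asynchronously
def readFrom(node: String): Future[HttpResponse] =
  Http().singleRequest(HttpRequest(uri = s"http://$node:8000/local"))

// collapse a response body into a plain String
def parse(response: HttpResponse): Future[String] =
  Unmarshal(response.entity).to[String]
```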



Nothing complicated.

To begin, we will write a simple server of our distributed system.



Some of this is prepared homework. We will store the entire state of our system in one variable, because this is a live demo, not a real database.



So we just store a string, which we will replicate and so on. There are various useful bits here, such as reading HOSTNAME and the list of neighboring NODES from environment variables.
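That code is also on a slide; a sketch of what the state and configuration might look like (the names are assumptions):

```scala
// the whole "database" is one variable holding a string
@volatile var value: String = "0"

// configuration from the environment, as mentioned above
val hostname: String   = sys.env("HOSTNAME")
val nodes: Seq[String] = sys.env("NODES").split(",").toSeq.filterNot(_ == hostname)
```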



We will write stubs for two functions here.



Reading from our distributed system, and writing to it. Of course we will not implement them right now; we will come back to them.
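As a sketch, the two stubs might form an interface like this (the trait name is my invention); the concrete replication strategies later in the talk would implement it:

```scala
import scala.concurrent.Future

// the two operations every replication strategy must provide
trait Store {
  def read(): Future[String]
  def write(data: String): Future[String]
}
```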

Here we launch our balalaika:



Everything is very simple. We have a route function, not yet implemented, that describes the REST routes we will use.



We bind all of this to port 8000:
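A sketch of the bootstrap, assuming the akka-http API of that era and reusing the implicit ActorSystem from the helper sketch above; route(store) is sketched a bit further down, and store stands for whichever replication strategy we plug in:

```scala
// bind the routing function to port 8000 on all interfaces
Http().bindAndHandle(route(store), "0.0.0.0", 8000)
```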



What routes will we have? Two. The first is db.



This route is for us, the database clients: we talk to it, and the database does something under the hood.

On a GET, we call our magic read function, which we have not yet implemented.



If we POST some data there, we call our write function. Nothing complicated so far.



In addition, we will have another route, called local.



It exists not for us, but so that the members of the distributed system, the different nodes, can communicate with each other: one node can read what another has written.

On a GET, we read our magic variable value.



On a POST, we store the posted data in that variable.
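Assembled together, the two routes might look like this sketch (akka-http directives assumed; Store and value come from the sketches above):

```scala
import akka.http.scaladsl.server.Directives._
import akka.http.scaladsl.server.Route

// /db is the public API backed by the replication logic;
// /local is the node-to-node API that exposes the raw variable
def route(store: Store): Route =
  path("db") {
    get  { complete(store.read()) } ~
    post { entity(as[String]) { s => complete(store.write(s)) } }
  } ~
  path("local") {
    get  { complete(value) } ~
    post { entity(as[String]) { s => value = s; complete("ok") } }
  }
```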



Nothing complicated. Scala probably makes your brain explode a little, but that's okay.

So we have more or less written our server. It remains to write the logic for our MasterSlave, so we will create a separate class implementing it. Reading from an asynchronous MasterSlave is, in fact, just a local read of what we have written into our variable.



Writing is a bit more complicated.



We first write to ourselves.



Then we go over all the slaves we have and write to each of them.



But this function has a catch: it returns a Future.



We fire off all the write requests to the slaves asynchronously and tell the client: "Everything is OK, we have written it, carry on." Typical asynchronous replication, possibly with pitfalls; we are about to see them.
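A compact sketch of this asynchronous strategy, reusing the hypothetical helpers and Store trait from above:

```scala
import scala.concurrent.Future

// asynchronous master-slave: ack the client before the slaves confirm
class MasterSlave(slaves: Seq[String]) extends Store {
  // reading is just a local read of the variable
  def read(): Future[String] = Future.successful(value)

  def write(data: String): Future[String] = {
    value = data                                // 1. write locally
    slaves.foreach(node => writeTo(node, data)) // 2. fire-and-forget replication
    Future.successful("ok")                     // 3. ack immediately
  }
}
```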

Now we try to compile it all. This is Scala, so it takes a while. It should compile, I rehearsed. Compiled.

What to do next


We have written the code, but we need to run it all, and I have one laptop, while we are making a distributed system. A distributed system with one node is not quite a distributed system. Therefore we will use Docker.



Docker is an application containerization system; everyone has probably heard of it, though not everyone has used it in production. We will try to. Simplifying a lot, it is a lightweight virtualization system. Docker has a rich ecosystem around it, most of which we will not use. But since we need to launch not one container but a whole group at once, we will use Docker Compose to roll them all out together.

We have a simple Dockerfile, though not quite simple.



The Dockerfile installs Java and puts SSH in there; do not ask yet why that is needed. Then it starts our application.

And we have a Compose file that describes all 5 nodes.



Here we have a description of several nodes. Let's bring it all up now; I have a script for this.



While it is deployed, I will change the colors.



Now it creates the Docker containers and launches them. All five of our nodes are running.



Now we wait for our distributed system to report that it is alive. It takes some time; it's Java. Our MasterSlave says it has started.



In addition, we have some unpretentious scripts that let us read and write things in this distributed system.

Let's see what we have in node n1.



It holds 0.

Since there is no enforcement of who is master and who is slave, we will take it on faith that n1 is always the master, and we will write only to it, purely to keep things simple.

Let's try to write something into this master.

put n1 1

A one was written. Let's see what we have in the logs.



Here is our 1 in the write; the write was rolled out to all the slaves, and here it got applied.



We can even go to node n3 and see what is there.



There is a 1.

We deserve a bonus: we wrote a distributed system on our knees, and it even works. So far, so good.

Now we will try to make it suffer. For that we take a framework called Jepsen. Jepsen is a framework for testing distributed systems, with one subtlety: it is written in Clojure. Clojure is a Lisp, for those who don't know. Jepsen is a set of ready-made tests for existing databases, queues, and other systems, and you can always write your own. In addition, there are plenty of articles about the problems it has found in probably every database except the most exotic ones. Perhaps only ZooKeeper and RethinkDB got off lightly; they were hit too, but only slightly compared to the others. You can read about all of it.

How does Jepsen work


It simulates network failures while generating random operations against your distributed system. Then it compares how those operations were applied to your system with the reference behavior, the model of this distributed system, and checks whether there are any problems.

If Jepsen has found you a problem like this:



This is where I wanted to joke.

If Jepsen found a problem, it found a counterexample for your distributed system, and there really is a flaw. But since Jepsen tests are probabilistic in nature, if it does not find anything, it may simply not have looked hard enough. If it looks long enough, it will usually find something. In the case of RethinkDB, for example, they ran the tests for about two weeks before the release to show that it works more or less correctly.

We are not going to run tests for two weeks; we will run them for five seconds. Our task for the Jepsen test is to write to the master, read from the whole MasterSlave system, and understand how things stand with consistency: whether we wrote our master/slave replication correctly or not.

A Jepsen test consists of several important parts.



We have a generator, which produces the random operations we apply to our distributed system. We have the distributed system itself, launched somewhere: in Docker, or maybe on real iron servers, why not. And we have a reference model that describes the behavior of the distributed system. In our case it is a register: we must read back what we wrote there, nothing complicated. Jepsen has a huge number of models for all occasions, but we will use only the register, plus a Checker, which verifies that the history of operations applied to the distributed system conforms to the model.

The problem is that Jepsen is written in Clojure, and the tests have to be written in Clojure too. If it were possible to write them in something else, that would be cool. But alas. Clojure is a language where everything is a list. If, for example, you want to add two numbers, you make a list whose first element is addition followed by the two numbers, and in the end you get three.



You can define your own function by calling another function called defn, saying that the first argument is the name of the function, then come the function's arguments, then the function body. If you call it like this, it will say "hello, highload!". That is your beginner's Clojure course.

You can say that this Clojure course looks like this:



And now it will surely turn out like the picture on the right. In fact, yes. But I told you this so that you can at least follow what is about to appear in the frames, because Clojure is a bit peculiar.

So, Jepsen. To begin, we describe our test.



This wall of text should not be read left to right, top to bottom; it is Lisp, so it is better read from the inside out. We have some function that returns the description of our test, which we have not implemented yet. We pass this description into another function that runs the test, and it returns a map. From that map we pick out the value under the results key and check whether it contains a key called valid. If it is there, the test passed. That is how Clojure reads: first you must break your brain, then everything becomes clear.

Now we describe our test.



Our test is also a function with no arguments. It extends another test that does nothing at all, and fills in a few things: the name; the client it will use to talk to our database, because Jepsen knows nothing about our HTTP API; the checker, which we have not written yet but will; the model we will use as the reference for our distributed system; and a generator that takes random read and write operations, inserts a 10-millisecond delay between them, runs them through the client, and keeps it going for 5 seconds. The rest there deals with SSH.

Now we describe our read and write operations. They are also functions. It's Clojure: everything is a function, there is nothing else.



We also describe our client, but first an HTTP client.



Here we have several functions for writing over HTTP and reading over HTTP. We will not go into how they are implemented. Essentially, if we got 200 back, everything is OK. If we got 409, HTTP 409 Conflict, a very useful code, then something went wrong. We will use it a little later.

Next we describe our client.



This is a function that takes the host we are going to talk to and extends the interface called client. It has three functions. setup, which returns the client itself. teardown, which has nothing to do, since there is nothing to clean up. And invoke, which applies the operation we pass in to this host. We have a five-second timeout here: if our distributed system has not answered within 5 seconds, we assume it never will. If the operation is a read, we perform an HTTP read, a GET request. If it is a write, an HTTP write, a POST request. Since we have MasterSlave with only one master, we tweak one thing here: we always write to the master.



The last item we have left is the Checker. We could use the linearizability checker that ships inside Jepsen, but we will not, because I stepped on a rake with it.



I do not know what to do about that, so I wrote my own checker that does not crash. It actually uses the same machinery underneath; it just does not try to draw the beautiful picture at the end, which is where the built-in one usually falls over.



We wrote our mega-test, let's try to run it.



We have our distributed system, the test, we are waiting for it to start.

lein is a build tool for Clojure.

It starts printing different hosts, different reads and writes; in the logs the magic is happening too: reads, writes, replication. In the end it tells us "fail". We forgot something here. There is a list of histories that are not linearizable, and something is clearly messed up.

Let's look at this situation:



Everyone knows that with asynchronous master/slave replication the slave always lags. What does slave lag mean in terms of linearizability? It means a write happened here, some time later the write was replicated, and a read happened here. As a result, we read something we should not have been able to read: we should have read B, but read A. We went back in time: the operation had seemingly already finished, yet we saw the state from before it. This is because with asynchronous replication the master says "ok" too early, before all the slaves have caught up with the replication stream. The slave always lags behind the master.

Of course you will say this is kindergarten stuff; let's do everything like adults and write synchronous replication, we are smart. It will be almost the same as asynchronous, only synchronous.



We override the one function that writes to our distributed system. Now it behaves intelligently: it waits for the moment when all the slaves say "ok", and only then reports that everything has been written. If one of the slaves did not say "ok", we are in trouble and do not report success.
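A sketch of the synchronous variant under the same assumptions (an ExecutionContext such as system.dispatcher is assumed to be in scope):

```scala
import scala.concurrent.Future

// synchronous master-slave: ack only after every slave has replied
class MasterSlaveSync(slaves: Seq[String]) extends MasterSlave(slaves) {
  override def write(data: String): Future[String] = {
    value = data
    Future.sequence(slaves.map(node => writeTo(node, data))).map { responses =>
      // if any slave did not answer 200 OK, we signal trouble instead
      require(responses.forall(_.status.isSuccess()), "replication failed")
      "ok"
    }
  }
}
```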

Now we will try to launch this whole balalaika. You probably sense that a catch is coming; everything cannot keep going this well. 25 minutes in, everything live, everything working. So we wait until it compiles. It is building everything; this is not your dynamic languages where everything runs right away, here everything is long and painful. There is time to think about whether you wrote what you meant, and to rewrite it all before it finishes deploying.

Synchronous master/slave has started, great. Now let's run the Jepsen test again in the hope that everything will be fine: we did synchronous replication, what could go wrong? It starts, begins writing, though not so fast anymore. It writes again, reads again, something happens in there. And in the end, something bad happens again.



We got a fail again: it produced a huge number of histories that are not linearizable.

What is the problem?

We thought we did everything right, yet somehow it all turned out badly.



The picture shows that everything has become even worse than it was.

Imagine this situation.



We have our master, node 1. We wrote operations B and C to it, and it replicated them to the other nodes. But we do everything asynchronously over HTTP, where anything can go wrong. Node 2 received the operations in the right order. On node 3 things got shuffled: we just fired the requests off randomly, forgetting that such things as logs exist, and the operations were not applied in the order they should have been. As a result, instead of C we ended up with B there, and what a reader gets is unclear: B or C, although they should read C. Hence the trouble.

From the point of view of the CAP theorem, whether master/slave gives consistency depends on the replication. If it is asynchronous, there is clearly no linearizability, because the slaves lag. If it is synchronous, it depends on how crooked your hands are and whether you implemented everything correctly: if you are lucky, good; if you are like me, bad. From the point of view of availability, the problem is that a slave cannot accept writes, only reads. In terms of high availability we cannot serve all the requests that come to us, only the reads. Therefore we are not available.

So the CAP theorem is a very peculiar thing. It must be applied to databases very carefully, because its definitions are strict and do not always describe what real databases do. It describes a single register: if you can reduce all your transactions and operations to one register, good, but usually that is difficult or impossible. Availability is availability, but it says nothing about latency: if your distributed system is perfectly consistent but answers a request once a day, it is also very hard to use in production. There are many practical aspects that the CAP theorem does not consider at all, such as partial packet loss, where the network partition is not permanent but intermittent, losing a few packets every 5 seconds.

From personal experience: when people start writing distributed systems with an immature mindset, sooner or later their knowledge crystallizes, the crutches collect new bumps, and out comes something like a crooked, lopsided consensus algorithm, where all your nodes try to agree on a common state, but the algorithm is poorly tested in the extreme cases, because they are rare, so why worry. Consensus in the general case is agreement on a shared state, with the ability to survive failures if they happen.

Such an example:



We have a client writing to our distributed system of seven nodes, and the nodes vote. If the majority of nodes agree that we should write it, then everything is fine. The same with reads. If a minority of nodes falls off, one, two, or three, no worries: we have lost neither availability nor consistency, everything is fine. If the majority falls off, we have lost availability, because we can no longer serve writes, but we still have not lost consistency.

Let's do another live demo on the knee.



We will try to implement a quorum, then run Jepsen again and try to explain what comes out.

Everything will be a little more complicated here, because by now we are supposedly smart. To begin, we write a function that gives the size of the quorum.



For three nodes it is two, for five nodes it is three, and so on. Just a plain majority.
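As a sketch:

```scala
// majority quorum size: 2 of 3, 3 of 5, and so on
def quorum(nodes: Int): Int = nodes / 2 + 1
```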

Next we describe our magic functions read and write.



To read from our quorum-based distributed system, we ask all the nodes over HTTP what they have:



Then we call our quorum-handling function, which we will write in a moment:



Finally, we produce the answer we ended up with:



Writing is roughly the same. We write to all the nodes and look at what got written:



Did we reach a quorum or not:



Then we form a response to the user:



How will we live with a quorum?



This is a bit of Scala, fairly canonical. We have a sequence of answers that came to us from different servers, different nodes of our distributed system. It is practically just a sequence of strings that we read, for example 0000.



We group identical answers together:



We count how often each group occurs:



Sort by popularity:



We take the most popular answer:



It seems like everything is correct.

Next is the function that forms the answer:



It is also rather canonical Scala, so it is not entirely obvious what is happening.

We take a pair: the most popular answer and its number of votes:



And if the number of votes is greater than or equal to our quorum, everything is OK: we report that the write succeeded and the quorum was reached.



What can go wrong?

If something went wrong, we answer with our magic HTTP 409 Conflict:
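A sketch of this whole vote-counting pipeline, under the same akka-http assumptions as before:

```scala
import akka.http.scaladsl.model.{HttpEntity, HttpResponse, StatusCodes}

// group identical answers, count the votes, sort by popularity, take the winner
def mostPopular(answers: Seq[String]): (String, Int) =
  answers
    .groupBy(identity)                          // group equal answers together
    .map { case (a, votes) => (a, votes.size) } // votes per distinct answer
    .toSeq
    .sortBy { case (_, votes) => -votes }       // most popular first
    .head

// OK if the winner gathered a quorum of q votes, 409 Conflict otherwise
def respond(answers: Seq[String], q: Int): HttpResponse = {
  val (answer, votes) = mostPopular(answers)
  if (votes >= q) HttpResponse(StatusCodes.OK, entity = HttpEntity(answer))
  else HttpResponse(StatusCodes.Conflict)
}
```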



Now we will try to wire it all in:



And redeploy.

Now let's run our magic deployment scripts again.



Now it will be slow, a meditative process. Almost there. We are waiting for it to come up. It's up.

Let's see what we have in node n2.



There is a 0 there. Let's see what happened here:



A local read: we read 0 with five votes, which is pretty cool. A real distributed system.

Let's write something there:



We wrote a one there, and it got written:



The one was rolled out to the other nodes. We have a quorum here: 1 with five votes. Everything seems fine.

Let's get back to Jepsen. We now have a quorum, and with a quorum there can be many masters; in fact, a quorum has no such notion as a master at all.

Here we will remove our crutch in the client:



We will try to restart it all and see what happens to our quorum when we run our consistency test again. If we are going to torture the code, let's do it at full force.

It started writing; a lot is happening here.

In the end, it told us true:



Everything seems fine. But as I told you, Jepsen tests are probabilistic in nature. So instead of 5 seconds, let's put 15 here:



And to be sure, we will run this test again, in the hope that it digs up something else; catching bugs is probabilistic, a very interesting and relaxing process. Let's hope that on the second attempt it finds something for us. Now we chase it a little longer, 15 seconds, hoping our quorum stumbles somewhere. Almost. Everyone is waiting for the verdict. It told us fail:



Just what we were waiting for. Why did it fail?

Let's take a closer look at these two functions:



Something is wrong here. First, because it is Scala, something is always wrong. Second, there is a problem in the logic.

Let's look at the following sequence of writes:



We have two clients. One wrote operation A here; the nodes said "ok", and we got a quorum. But before we managed to tell that client that A was written, a faster client came along and managed to write B; the nodes told it that the quorum converged, and we managed to tell it that everything was fine. So we told one client that we wrote B before we told the other that we wrote A. Now we read something here. What did we read?



Since time is short, I will just tell you: we read B here, and the whole history is non-linearizable, because we should have read A. Such a situation must not arise: our system has to resolve these conflicts.

To avoid this, do not reinvent the wheel and write quorums yourself, unless you are the author of the Raft or Paxos algorithm. These algorithms help the rest of us build normal distributed systems.

The Paxos and Raft quorum algorithms describe a finite state machine: your state machine moves between states, and all these transitions are written to a log.



These algorithms describe how to agree on the order of operations in this log: how to elect a leader, how to apply operations, how to spread the log across your nodes. And if all nodes have the same log, then your finite state machine, executing this log from beginning to end, arrives at the same state on every node. That is the good part.
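The core idea fits in a few lines of Scala (a toy sketch; real Raft and Paxos of course also cover leader election, log replication, and failure handling):

```scala
// the "state machine" here is just our register; an operation is a write
final case class RegisterState(value: String) {
  def apply(op: String): RegisterState = RegisterState(op)
}

// replaying the same log in the same order yields the same state everywhere
def replay(log: Seq[String]): RegisterState =
  log.foldLeft(RegisterState("0"))((state, op) => state.apply(op))

replay(Seq("B", "C")) // RegisterState("C") on every node, unlike our demo
```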

The problem is that Paxos is quite complex. It was written by a mathematician for mathematicians. If you try to implement it, you will end up with yet another variation of a Paxos implementation, of which there are frighteningly many already. It is not written for people: it is not split into phases, it is just a huge wall of mathematical constructions, and a lot has to be filled in by yourself. All those variations and implementations of Paxos are essentially attempts to flesh out what is written in that paper into a real implementation that can be used in production.

To avoid this, there is the Raft algorithm. It is newer, the known problems are taken into account, and all the steps needed to implement a good consensus algorithm are described. Everything is cool there. There are a huge number of different implementations:



There are implementations not only for Java but for all languages; even PHP has one, I think, although it is unclear why. You can take such a library and test it yourself, because not all of them implement Raft correctly. I tried akka-raft, which is supposed to work, but for some reason it did not pass the Jepsen tests, although it is written that it should, at least sometimes.

Consensus algorithms are used in many places: even MongoDB got Raft in its third version, and Cassandra has had Paxos all along. In many databases and queueing systems, as they grow to maturity, a consensus algorithm appears sooner or later.

In general, when you write your own distributed system, you need to understand that there are many well-trodden paths you can follow. You should know that these paths exist, rather than carving yet another one of your own. And when you choose a path, you must understand that every path is a compromise between consistency, latency, throughput, and other things. If you understand that compromise and can apply it to your domain, then all is well.



We cannot say which of these approaches is better. It all depends on you, on your business problem. These are just tools.


Report: Fear and Loathing in Distributed Systems.

Source: https://habr.com/ru/post/322876/

