In summer, consumer activity and the pace of change in web-project infrastructure traditionally decrease, Captain Obvious tells us. Simply because even IT people go on vacation sometimes. And CTOs, too. It is all the harder for those who stay at the office, but that is another story: perhaps this is exactly why summer is the best time to take a hard look at your existing redundancy scheme and draw up a plan for improving it. And here you will benefit from the experience of Egor Andreev from AdminDivision, which he shared at the Uptime day conference.

When building backup sites, there are several traps you can fall into. And you do not have to fall into them at all. Perfectionism and... laziness ruin us here, as in many other things. We try to make everything, everything, everything perfect, but we do not need to! We need to do only certain things, but do them right, finish them so that they actually work.
Failover is not some fun toy to have "just because it would be nice"; it should do exactly one thing: reduce downtime so that the service, and the company, loses less money. And I suggest thinking about every redundancy method in the following context: where is the money?
The first trap : it seems that when we build big reliable systems and add redundancy, we reduce the number of failures. This is a terrible delusion. When we add redundancy, we most likely increase the number of failures. And if we do everything right, then overall we reduce the downtime: there will be more failures, but each will cost us less. What is redundancy? It is a complication of the system. Any complication is bad: we get more cogs, more gears, in a word, more elements, and therefore a higher chance of breakage. And they really do break. And they will break more often. A simple example: say we have a certain site, with PHP and MySQL. And it urgently needs redundancy.
Welp (c). We take a second site and build an identical system... The complexity doubles: now we have two entities. And on top of that we roll in some logic for moving data from one site to the other: data replication, copying of statics and so on. And replication logic is usually very complicated, so the total complexity of the system may grow not 2 times but 3, 5 or 10 times.
The second trap : when we build really big, complex systems, we fantasize about what we want to get in the end. Voila: we want a super-reliable system that works without any downtime and switches over in half a second (or better, instantly), and we start making the dream come true. But there is a nuance: the shorter the desired switchover time, the more complex the system logic has to be. The more complex we have to make this logic, the more often the system will break. And you can land in a very unpleasant situation: we are trying to reduce downtime by all means, but in fact we are complicating everything, and when something goes wrong, the downtime will be even longer. Here you often catch yourself thinking: it would have been better not to reserve at all. It would be better if it ran on a single site, with an understandable downtime.
How do you deal with this? We must stop lying to ourselves, stop flattering ourselves that we are about to build a spaceship here, and adequately understand how long the project can afford to be down. And for this maximum time we will choose which methods will actually increase the reliability of our system.

Now it's time for stories from sh... from life, of course.
Example number one
Imagine a business-card site for Pipe-Rolling Plant No. 1 of city N. In huge letters it says: PIPE-ROLLING PLANT No. 1. Just below is the slogan: "Our pipes are the roundest pipes in N". And below that, the CEO's name and phone number. We understand that this needs redundancy; it is a very important thing! We start figuring out what it consists of. HTML statics: a couple of pictures where the CEO, at a table in the bathhouse with his partner, is discussing yet another deal. We start thinking about downtime. The first thought: five minutes down, no more. Then the question: how many sales came from this site? How many? What does "zero" mean? Exactly that: all four deals over the past year were made by the CEO at that same table, with the same people he goes to the bathhouse with. And we understand that even if the site is down for a whole day, nothing terrible will happen.
Given that input, we have a whole day to bring this thing back up. We start thinking about the redundancy scheme and choose the most ideal one for this example: no redundancy at all. Any admin can raise the whole thing in half an hour, smoke breaks included: install a web server, put the files in place, done. It will work. Nothing needs monitoring, nothing needs special attention. That is, the conclusion from example number one is pretty obvious: services that do not need redundancy should not be given any.
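The entire "disaster recovery plan" here can be one shell script. A minimal sketch (nginx, the archive path and the docroot are assumptions for illustration):

```bash
#!/usr/bin/env bash
# rebuild.sh - recreate the business-card site from scratch on any fresh machine.
# No standby site, no replication: half an hour of work fits the one-day budget.
set -euo pipefail

apt-get update && apt-get install -y nginx               # a plain static web server is enough
mkdir -p /var/www/plant1
tar -xzf /backups/plant1-site.tar.gz -C /var/www/plant1  # a couple of HTML pages and pictures
systemctl enable --now nginx
```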

Example number two
A company blog: specially trained people write news there, like "we took part in such-and-such exhibition", "we have released yet another new product" and so on. Say it is standard PHP with WordPress, a small database and a bit of statics. The first thought, of course, is again that it must never go down, "no more than five minutes!", and that is that. But let's think further. What does this blog do? People arrive from Yandex and Google on some queries, organically. Great. Are sales related to it in any way? Insight: not really. Advertising traffic goes to the main site, which lives on another machine. We start thinking about the backup scheme for the blog. In a good way, it should come back up within a couple of hours, and it would be nice to prepare for that. It would be reasonable to take a machine in another data center, roll the environment onto it (web server, PHP, WordPress, MySQL) and leave it sitting there switched off. The moment we realize everything is broken, we need to do two things: roll out the MySQL dump of about 50 MB, which will fly over in a minute, and roll out some number of pictures from the backup. That is not God knows how much either. So the whole thing comes up in half an hour. No replication and no, God forgive me, automatic failover. Conclusion: what we can quickly roll out from a backup does not need redundancy.
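On the standby machine, the whole switchover fits into a few commands. A sketch (host names, paths and the database name are invented):

```bash
#!/usr/bin/env bash
# restore-blog.sh - run on the cold standby when the primary blog machine dies.
set -euo pipefail

systemctl start mysql                       # the stack is pre-installed, just asleep

# 1. Roll out the latest dump (~50 MB, arrives in about a minute).
gunzip -c /backups/blog-latest.sql.gz | mysql blog

# 2. Pull the uploaded pictures from the backup host.
rsync -a backup-host:/backups/blog-uploads/ /var/www/blog/wp-content/uploads/

systemctl start php-fpm nginx
# Then point the blog's DNS record at this machine by hand.
```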

Example number three, more difficult
An online store. A slightly customized open-source PHP application, MySQL with a solid database. Quite a lot of statics (the store has beautiful HD pictures and all that), Redis for sessions and Elasticsearch for search. We start thinking about downtime. Here it is obvious that the store cannot painlessly lie down for a day: the longer it is down, the more money we lose. It is worth speeding up. By how much? I suppose that if we are down for an hour, nobody will go mad. Yes, we will lose something, but if we get overzealous, it will only get worse. So we settle on a scheme with one hour of allowed downtime.
How do we reserve all this? A standby machine is needed in any case: an hour is not much time. MySQL: here we already need replication, live replication, because 100 GB most likely will not pour in from a dump within an hour. Statics, pictures: again, 500 GB may not manage to copy over in an hour, so it is better to copy the pictures continuously, as they appear. Redis is more interesting. Redis holds the sessions, so we cannot just take it and bury it. That would not be pretty: all users logged out, carts emptied and so on. People would be forced to re-enter their logins and passwords, many would drop off and never complete the purchase. Conversion would fall again. On the other hand, an exact one-to-one Redis, down to the very last logged-in user, is probably not needed either. A good compromise is to restore Redis from a backup: yesterday's, or, if you take one every hour, one hour old. Fortunately, restoring it from a backup means copying a single file. And the most interesting story is Elasticsearch. Who here has ever set up MySQL replication? And who has ever set up Elasticsearch replication? And for whom did it work properly afterwards? This is my point: we see a certain entity in our system. It seems sort of useful, but it is complicated.
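That "single file" is the RDB snapshot. An hourly copy job could look like this (a sketch; paths and host names are assumptions):

```bash
#!/usr/bin/env bash
# snapshot-redis.sh - run hourly from cron; losing up to an hour of sessions is acceptable here.
set -euo pipefail

redis-cli --rdb /tmp/dump.rdb                # fetch a fresh RDB snapshot from the server
scp /tmp/dump.rdb standby:/backups/redis/dump.rdb

# Restoring on the standby is literally copying that one file back:
#   systemctl stop redis
#   install -o redis -g redis /backups/redis/dump.rdb /var/lib/redis/dump.rdb
#   systemctl start redis
```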
Complicated in the sense that our fellow engineers have no experience with it. Or a negative experience. Or we understand that it is still a fairly new technology with nuances, or still raw. We think: damn, Elastic is huge too, restoring it from a backup also takes a long time, what do we do? We realize that in our case Elastic is used for search. And how does our online store actually sell? We go to the marketers and ask where people come from. They answer: "90% come from Yandex Market straight to the product card." And they either buy or they do not. Therefore, only 10% of users need the search. And keeping Elastic replication running, especially between data centers in different zones, really does have many nuances. Which way out? We put Elastic on the reserve site and do nothing with it. If the outage drags on, we will probably raise it at some point, but this is not certain. Actually, the conclusion is more or less the same: services that do not affect the money, once again, we do not reserve. To keep the scheme simple.

Example number four, even harder
An aggregator: selling flowers, calling taxis, selling goods, in general anything. A serious thing that works 24/7 for a large number of users. With a full-fledged interesting stack, where there are interesting databases, solutions, high load, and most importantly, it hurts it to be down for more than 5 minutes. Not only, and not so much, because people will not buy, but because people will see that the thing does not work, get upset and may simply never come back.
Okay. Five minutes. What do we do with that? In this case, like grown-ups, with all the money, we build a real backup site, with replication of everything and everything, and maybe we even automate the switchover to that site as far as possible. And in addition to this, we must remember to do one important thing: actually write the switchover rules. The rules, even if you have automated everything and everything, can be very simple. Along the lines of "run such-and-such ansible playbook", "tick such-and-such checkbox in Route 53" and so on, but it must be an exact list of actions.
And everything seems clear. Switching replication over is a trivial task, or it will switch by itself. Rewriting the domain name in DNS is from the same series. The trouble is that when such a project goes down, panic sets in, and even the strongest, most bearded admins can succumb to it. Without a clear instruction of the form "open the terminal, go here, the address of our server is such-and-such", it is hard to stay within the 5 minutes allotted for resuscitation. And as a bonus, when we use these rules, it is easy to record changes in the infrastructure and update the rules accordingly.
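The "exact list of actions" can itself be an executable script, so panic cannot reorder the steps. A sketch (the playbook name, hosted zone ID and file paths are invented):

```bash
#!/usr/bin/env bash
# failover.sh - the switchover runbook, executable end to end.
set -euo pipefail

# 1. Promote the standby databases (hypothetical playbook).
ansible-playbook playbooks/promote-standby.yml

# 2. Point DNS at the reserve site via Route 53 (zone ID and change batch are placeholders).
aws route53 change-resource-record-sets \
    --hosted-zone-id ZXXXXXXXXXXXXX \
    --change-batch file://dns/switch-to-reserve.json

# 3. Verify before declaring victory.
curl -fsS https://example.com/health && echo "reserve site is serving traffic"
```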
And if the backup system is very complicated and at some point we make a mistake, then we can take down the backup site too and, on top of that, turn the data into a pumpkin on both sites; that will be completely sad.

Example number five, full hardcore
An international service with hundreds of millions of users worldwide. Every time zone there is, high load at maximum, it cannot go down at all. One minute down, and it gets sad. What to do? Reserve, again, to the fullest. We did everything mentioned in the previous example, and a bit more. An ideal world, and our infrastructure lives up to every DevOps notion of infrastructure as code. That is, everything is in git, and you just press the button.
What is missing? One thing: drills. Without them it is impossible. It seems everything is perfect here, everything is under control. Press the button and it all happens. Even if that is so, and we understand that it is not, our system interacts with some other systems. For example, the DNS in Route 53, the S3 storage, integration with some API. We cannot foresee everything in this thought experiment. And until we actually pull the switch, we do not know whether it will work or not.
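A drill can be as mundane as regularly running the same switchover against a staging copy and checking, end to end, that the service comes back. A sketch (the staging runbook, hosts and endpoints are made up):

```bash
#!/usr/bin/env bash
# drill.sh - scheduled failover exercise: actually pull the switch, but in staging.
set -euo pipefail

./failover-staging.sh        # a staging copy of the production runbook (assumed to exist)

# Did the reserve site really come up, including DNS, storage and third-party APIs?
for url in https://staging.example.com/health https://staging.example.com/api/ping; do
    curl -fsS --max-time 5 "$url" || { echo "DRILL FAILED: $url"; exit 1; }
done
echo "drill passed: the switch works outside the thought experiment"
```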

That is probably all. Do not be lazy and do not overdo it. And may the uptime be with you!