Rollout history that affected everything

Enemies of Reality by 12f-2

At the end of April, while the white walkers besieged Winterfell, something more interesting happened to us, we made an unusual rollout. In principle, we constantly roll new features in the prod (as well as all). But this one was different. Its scale was such that any potential errors that we could make would have affected all our services and users. As a result, we rolled out everything according to the plan, in the planned and announced downtime, without consequences for the sale. The article is about how we have achieved this and how anyone can do it at home.

I will not now describe our adopted architectural and technical solutions, to tell how it all works. These are rather notes on the margins of how one of the most difficult roll-outs took place, which I observed and in which I took direct part. I do not pretend to be complete or technical details, perhaps they will appear in another article.

Background + what is this functionality

We are building the Mail.ru Cloud Solutions (MCS) cloud platform where I work as a technical director. And now - it's time to attach to our platform IAM (Identity and Access Management), which provides unified management of all user accounts, users, passwords, roles, services, and so on. Why is it needed in the cloud? The question is obvious: it contains all user information.
')
Usually such things begin to build at the stage of the most starts of any projects. But MCS has historically been a little different. MCS was built in two parts:

Openstack with its own Keystone authorization module,
Hotbox (S3 storage) based on the Cloud Mail.ru project,

around which then appeared new services.

In fact, it was two different types of authorization. Plus, we used some of the individual Mail.ru developments, for example, Mail.ru shared passwords, as well as a self-written openid connector, which provided SSO (pass-through authorization) in the Horizon panel of virtual machines (native UI OpenStack).

To make IAM meant for us to connect all this into a single system, completely our own. At the same time, we don’t lose any functionality along the way, create a reserve for the future, which will allow us to transparently modify it without refactoring and scale in terms of functionality. Also at the start, users had a role-based model of access to services (central RBAC, role-based access control) and some other trivialities.

The task turned out to be non-trivial: python and perl, several backends, independently written services, several development teams and admins. And most importantly - thousands of live users on the production combat system. All this had to be written and, most importantly, rolled out without victims.

What are we going to roll out

If it is very rough, somewhere in 4 months we have prepared the following:

We made several new demons that aggregated functions that used to work in different parts of the infrastructure. The rest of the services were assigned a new backend in the form of these demons.
We wrote our central password and key repository available for all our services, which can be freely modified as we need.
We wrote from scratch 4 new backends for Keystone (users, projects, roles, role assignments), which, in fact, replaced its base, and now acts as a single repository of our user passwords.
They taught all our Openstack services to go after their policies to a third-party policy service instead of reading these policies locally from each server (yes, by default Openstack works this way!)

Such a big rework requires large, complex and, most importantly, synchronous changes in several systems written by different development teams. After assembly, the entire system should work.

How to roll out such changes and not screw it up? At first we decided to look a little bit into the future.

Rolling out strategy

It would be possible to roll out in several stages, but this would increase the development time by a factor of three. In addition, for some time we would have a complete desynchronization of data in the databases. I would have to write my own synchronization tools and live a long time with several data warehouses. And this creates a variety of risks.
Everything that could be prepared transparently for the user was done in advance. It took 2 months.
We allowed ourselves downtime for a few hours - only on user operations to create and modify resources.
For the operation of all already created resources, downtime was invalid. We planned that when rolling out resources should work without downtime and affect for customers.
To reduce the impact on our customers, if something goes wrong, we decided to roll out on Sunday evening. At night, fewer customers manage virtual machines.
We warned all our customers that the service management will be unavailable during the period chosen for rolling out.

Retreat: what is rolling out?

<careful philosophy>

Each IT specialist will easily answer what rolling out is. You put CI / CD, and everything is automatically delivered to the prod. :)

Of course, this is true. But the difficulty is that with modern tools for automating the delivery of code, the understanding of rolling out itself is lost. How do you forget about the epic invention of the wheel, looking at modern transport. Everything is so automated that the rollout is often carried out without awareness of the whole picture.

And the whole picture is as follows. The rollout consists of four large aspects:

Delivery code, including data modification. For example, their migration.
Rollback code - the ability to return if something goes wrong. For example, through the creation of backups.
The time of each roll out / rollback operation. It is necessary to understand the timing of any operation of the first two points.
Affected functionality. It is necessary to evaluate both the expected positive and possible negative effect.

All these aspects must be considered for successful rolling out. Usually, only the first one is evaluated, at best the second point, and then the rollout is considered successful. But the third and fourth are even more important. Which user will like it if the rollout takes 3 hours instead of a minute? Or if there is something superfluous on the rollout? Or downtime one service will lead to unpredictable consequences?

Act 1..n, preparation for release

At first I thought to briefly describe our meetings: the whole team, its parts, heaps of discussions in coffee points, disputes, tests, brainstorms. Then I thought it would be superfluous. Four months of development always consist of this, especially when you write not something that can be delivered all the time, but one big feature on a live system. Which affects all services, but users should not change anything, except for "one button in the web interface."

Our understanding of how to roll out, changed from each new meeting, and very significantly. For example, we were going to update our entire billing database. But they counted the time and realized that it was impossible to do it in a reasonable time. They took an extra week for extra charge and archiving the billing base. And when the expected roll-out speed didn’t work after that, they ordered an additional, more powerful hardware, where they dragged the entire base. Not that we did not want to do this before, but the current necessity of rolling out did not leave us any options.

When one of us had doubts that rolling out could affect the availability of our virtual machines, we spent a week testing, experimenting, parsing the code and getting a clear understanding that this would not happen on our production, and even the most hesitant people agreed.

Meanwhile, the techsport guys conducted their independent experiments to write to customers instructions on how to connect, which should have changed after rolling out. They worked through a custom UX, prepared instructions, and provided personal advice.

We have automated all rollout operations that were possible. Any operation was scripted, even the simplest, constantly drove tests. They argued about how best to turn off the service — lower the daemon or close access to the service with a firewall. We created a checklist of teams for each stage of rolling out, constantly updated it. We drew and constantly updated the Gantt chart for all roll-out works, with timings.

And so…

Final act, before rolling out

... it's time to roll out.

As they say, a work of art can not be completed, just finish working on it. It is necessary to make a volitional effort, knowing that you will not find everything, but believing that you made all reasonable assumptions, provided for all possible cases, closed all critical bugs, and all participants did everything they could. The more code you roll out, the harder it is to convince yourself of this (besides, anyone understands that it is impossible to foresee everything).

We made the decision that we were ready to roll out when we were convinced that we had done everything possible to close all the risks for our users associated with unexpected affects and downtime. That is - anything can go wrong except:

Affect (sacred for us, the most precious) user infrastructure,
Functionality: the use of our service after vykatki should be the same as before it.

Rolling out

Two roll, 8 do not interfere

We take downtime to all requests from users within 7 hours. For this time we have both a rollout plan and a rollback plan.

The rollout itself takes about 3 hours.
2 hours - for testing.
2 hours - reserve for possible rollback of changes.

A Gantt chart is drawn up for each action, how long it takes, what goes in series, what happens in parallel.

A piece of Gantt chart roll-out, one of the earlier versions (without parallel execution). The most valuable synchronization tool

All participants have their role in rolling out, what tasks they do, what they are responsible for. We try to bring each stage to automatism, roll out the rollback, collect feedback and roll again.

Chronicle of events

So, 15 people came to work on Sunday, April 28, at 10 pm. In addition to the key participants, some came just for the support of the team, for which they had a special thanks.

We should also mention that our key tester is on vacation. Rolling out without testing is impossible, we are working on options. A colleague agrees to potest us from vacation, for which she immense gratitude from the whole team.

00:00 Stop
We stop user requests, we hang up the nameplate, they say, technical work. Monitoring screams, but all is normal. We check that nothing fell, except for what should have been. And we begin work on migration.

Everyone has a printed plan for rolling out the items, everyone knows who does what and at what point. After each action, check the timings that do not exceed them, and everything goes according to plan. Those who do not participate in the rollout directly at the current stage, prepare by launching an online toy (Xonotic, type 3 Quake), so as not to interfere with colleagues. :)

02:00 Rolled out
A pleasant surprise - we finish the rollout one hour earlier, due to the optimization of our databases and migration scripts. The universal cry, "rolled out!" All the new features in the sale, but so far only we see in the interface. All go into test mode, sort out the piles, and begin to look at what happened.

It turned out not very much, we understand it in 10 minutes, when nothing is connected and does not work in the projects of the team members. Quick sync, we voice our problems, set priorities, break into teams and go to debug.

02:30 Two big problems vs four eyes
We discover two big problems. Understood that customers will not see some of the connected services, and there will be problems with the accounts of partners. Both are related to the imperfection of migration scripts for some marginal cases. It should be fixed now.

We write requests that fix it, at least 4 eyes. We roll on the prefet to make sure that they work and do not break anything. You can roll on. In parallel, our usual integration testing is launched, which reveals a few more problems. All of them are small, but also need to be repaired.

03:00 -2 problems +2 problems
The two previous big problems are fixed, almost all the minor ones too. All those unoccupied in fixes are actively working in their accounts and report what they find. We prioritize, we distribute by teams, we leave the non-critical in the morning.

Again we run tests, they reveal two new big problems. Not all service policies arrived correctly, so some user requests are not authorized. Plus, a new problem with the accounts of partners. We rush to watch.

03:20 Emergency sync
One new issue fixed. For the second, we arrange an emergency sync. We understand what is happening: the previous fix fixed one problem, but created another. We take a pause to figure out how to do it correctly and without consequences.

03:30 Six eyes
We are aware of what the final state of the base should be, so that all partners have everything well. We write a request in 6 eyes, we roll on the predoct, we test, we roll on the prod.

04:00 Everything is working
All tests have passed, critical problems are not visible. From time to time in the team something does not work for someone, we promptly react. Most often the alarm is false. But sometimes something did not reach, somewhere a separate page does not work. Sit, fix, fix, fix. A separate team launches the last big feature - billing.

04:30 Point of no return
The point of no return is approaching, that is, there is a time when, if we start to roll back, we do not keep within the given downtime. There are problems with billing, which knows everything and writes everything, but stubbornly does not want to write off money from customers. There are several bugs on individual pages, actions, statuses. The main functionality works, all tests pass successfully. We make the decision that the rollout took place, we will not roll back.

06:00 Open at all in UI
Bugs fixed. Some, not affecting users, are left for later. We open the interface to all. We continue to conjure over billing, waiting for feedback from users and monitoring results.

07:00 API Load Issues
It becomes clear that we have a little wrong plan out the load on our API and testing this load, which could not identify the problem. As a result, ≈5% of requests fail. Mobilizing, looking for the cause.

Billing persistent, also does not want to work. We decide to postpone it for later in order to make changes in a quiet mode. That is, all the resources in it are piling up, but customers do not write off. Of course, this is a problem, but compared to the overall rollout, it does not seem fundamental.

08:00 Fix API
Rolled out a fix for the load, the feils are gone. We are starting to go home.

10:00 Everything
Everything is fixed. In monitoring and with customers quietly, the team gradually goes to sleep. There is a billing, we will restore it tomorrow.

Further, during the day, there were rollouts, which were repaired by logs, notifications, return codes and custom codes from some of our clients.

So, rollout was successful! It could be, of course, better, but we drew conclusions about what we lacked to achieve perfection.

Total

Within 2 months of active preparation, 43 tasks were completed for a couple of hours to several days.

During rolling out:

New and modified demons - 5 pieces, replacing 2 monoliths;
Changes within the databases - all 6 of our databases with user data are affected, unloading from three old databases into one new one has been performed;
completely reworked frontend;
the number of code rolled out - 33 thousand lines of new code, ≈ 3 thousand lines of code in tests, ≈ 5 thousand lines of migration code;
All data is intact, not a single virtual client has suffered. :)

Good practices for good roll-out

We were guided by them in this difficult situation. But, generally speaking, they are useful to comply with any rolling out. But the harder rollout, the greater the role they play.

The first thing to do is understand how rollout can affect or affect users. Will it be downtime? If so, what is downtime? How will this affect the users? What are the best and worst scenarios possible? And close the risks.
Plan everything. At each stage, you need to understand all aspects of rollout:
- code delivery;
- code rollback;
- time of each operation;
- affected functionality.
Play scenarios until all stages of rolling out become obvious, as well as risks on each of them. If there is any doubt, you can take a pause and explore the dubious stage separately.
Each stage can and should be improved if it helps our users. For example, reduce downtime or remove any risks.
Testing rollback is much more important than testing code delivery. It is necessary to verify that as a result of the rollback, the system will return to its original state, confirm this with tests.
Everything that can be automated should be automated. Anything that cannot be automated must be written on the cheat sheet in advance.
Record success criteria. What functionality should be available and at what time? If this does not happen, start the rollback plan.
And most importantly - people. Everyone should be aware of what he is doing, why and what depends on his actions in the process of rolling out.

And if in one phrase, with good planning and elaboration, you can roll out without consequences for selling anything. Even the fact that affect all your services in the sale.

Source: https://habr.com/ru/post/453364/

All Articles