
“Hurray, we're moving!” Or how to change data centers under load and without downtime when everything goes to hell



A couple of years ago we were hosted in the most cost-effective (read: "cheap") data center in Germany. To give you an idea of the conditions: routing could fail between racks or inside a rack; the switch in a rack would get overloaded or hang; the data center itself was constantly under DDoS; hard drives failed; motherboards and network cards burned out; servers randomly shut down and rebooted, and network cables fell out like autumn leaves in a hurricane.

Periodically, when it was time to scale horizontally, the DC would also run out of space for new servers, and we were offered a different location in another city, which was unacceptable given our constraints (data schema limitations, cluster topology and how critical client latency is for us).
The boiling point was reached and we decided to move. Although at some point it even seemed cheaper to hire more operations staff to keep the situation in the aforementioned DC under control, in the end we chose stability in order to "make life better".
The choice fell on a data center in Amsterdam, the Netherlands. And here is the most interesting part: by that time the game already had a decent DAU, and the move had to be done online, without downtime, simultaneously on both platforms (Android and iOS). On top of that, we got featured on Google Play and marketing launched an advertising campaign, so, as you can imagine, there was a lot of extra traffic.

In short, not the most trivial task, and this is how we coped with it.

Overall architecture


The first synchronous mobile PvP shooter for a company that mostly makes casual games for social networks is a step into the unknown. Our colleagues have already written about how the architecture of the War Robots project evolved at the very beginning, and we, the development team and the system administrators, had our own challenges.

So, our overall architecture:



In order:


Purpose of the move


First, we wanted to change the DC to improve the quality of the game. Availability and responsiveness of the services are among the most important components of reliability. Second, it was time to refresh the hardware.

But the secondary goal of moving to a fresh Cassandra unfortunately did not pan out. In our tests, versions 2.1.13 and 2.1.15 refused to communicate with each other normally. We investigated and tried to understand why, but deadlines were tight, so we left the version as it was. For the same reason, even newer Cassandra versions were not considered at all.

What we faced


Since the preparation and testing phase dragged on and we came under increased load (the Google Play feature and the marketing campaign launch), we had to make the operation extremely reliable.


DAU growth during the move to the new DC, with the advertising campaign and the featuring running at the same time.

At its core, the relocation comes down to first adding a second DC to the cluster, then switching traffic over, and finally removing the first DC. We were moving not just the code but all of the players' data. The distance from the application to the database should be minimal. For consistency you have to keep writing to both rings, choosing an optimal consistency level, and that adds the cost of shipping data to the remote data center on top of the response time. We also had to ensure fast access to the database from the new DC (a wide and reliable link between the data centers).
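To make the "write to both rings" part concrete: in Cassandra, replication to a second data center is a property of the keyspace, while the consistency level only controls how many replicas must acknowledge a write before the client gets a reply. A rough cqlsh illustration of the trade-off described above; the keyspace, table and values here are made up:

```
# Hypothetical keyspace and table, purely to illustrate the trade-off.
# EACH_QUORUM makes every write wait for a quorum in each ring, adding the
# round trip to the remote DC to the response time; LOCAL_QUORUM waits only
# for the local ring and relies on Cassandra delivering the write to the
# other DC asynchronously.
cqlsh -e "CONSISTENCY EACH_QUORUM;
          UPDATE wr_profiles.player SET gold = 100 WHERE id = 42;"
```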

The old DC had chronic problems with hardware and the network. The replication factor was supposed to protect us from sudden hardware failures. The risk of the cluster falling apart because of the network could be mitigated by monitoring the data streaming process and, if necessary, streaming the data several times.
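We do not know the exact scripts that were used, but with stock Cassandra tooling that kind of monitoring and re-streaming looks roughly like this:

```
# Run on a node of the new ring: shows active streaming sessions and how far
# along they are, which is enough to notice a transfer that has died.
nodetool netstats

# If a streaming session was killed by the flaky network, the data can simply
# be pulled from the old ring again (DC1 here is the old data center's name).
nodetool rebuild -- DC1
```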

As for the choice of the new data center, in short, it had to be more stable in terms of its internal network and connectivity with the outside world, and more reliable in terms of hardware. We were ready to pay more for better quality and for real, not imaginary, 24/7 availability of the servers.

Preparing for the move


For this we needed to do quite a lot, namely:


How to do it


* DC1 is the old DC, DC2 is the new DC.
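The original checklist is not reproduced here, but the standard documented sequence for adding a second Cassandra data center and later retiring the first one looks roughly like the sketch below; the keyspace name and replication factors are illustrative, not taken from the project:

```
# 1. Bring up the DC2 nodes with the new data center name in the snitch
#    configuration and auto_bootstrap: false, so they join the ring empty.

# 2. Ask the keyspace to keep replicas in both rings (names and RF are examples):
cqlsh -e "ALTER KEYSPACE wr_profiles
          WITH replication = {'class': 'NetworkTopologyStrategy',
                              'DC1': 3, 'DC2': 3};"

# 3. On every DC2 node, stream the existing data over from the old ring:
nodetool rebuild -- DC1

# 4. Switch application traffic to DC2.

# 5. When DC1 is no longer needed, drop it from the replication...
cqlsh -e "ALTER KEYSPACE wr_profiles
          WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': 3};"

# 6. ...and decommission the old nodes one by one:
nodetool decommission
```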


How we did it and why



Risks in our implementation

We could have ended up with downtime because of the high load on the database in the old ring (which, to be honest, was not in its best shape). We could also potentially have lost some data during the transfer in the event of a network failure, which is partly why we decided to play it safe and run the rebuild twice.

Positive sides

We could switch back to the old DC at any moment, right up to the point of no return. Even after the point of no return we still could, but then we would lose the players' progress for those few days, although we could have streamed that data back from DC2 to DC1.

We also got a chance to verify that both Cassandra DCs were healthy and that data flowed in both directions.
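We do not know exactly which checks were run, but a basic sanity check with stock tooling could look like this:

```
# Both rings should be visible, every node Up/Normal, and the ownership split
# between DC1 and DC2 should match the configured replication.
nodetool status

# All nodes should agree on a single schema version after the keyspace change.
nodetool describecluster
```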

What we failed to anticipate, and how we dealt with the consequences

During the move, a node in DC1 died (its hard drive failed and the node's local data got corrupted); we had to shut it down and carry on the move with one spare node in the ring.
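For completeness (the post does not say which command was used): when a node's disk is gone for good, the usual way to take it out of the ring is to remove it by host ID from any live node:

```
# The host ID is taken from the output of `nodetool status`.
nodetool removenode <host-id>
```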

We also ran out of space for the repeated rebuild on the iOS platform and had to quickly buy and add disks. We tried using servers with HDDs to spread the DC2 ring's data across and avoid buying more disks (buying disks takes time), but that drastically reduced the data transfer speed, so we had to wait for the disks to be delivered after all.

Result


The result is best shown by the graphs (the sharp dip in the graphs is due to a short-lived problem with analytics, not to downtime, as it might seem):


Average response time for a client request to load a player profile.


The average response time of the database.

Unfortunately, the hardware load metrics were not preserved. From memory, Cassandra was eating about 60% of CPU in the old DC. At the moment the value stays around 20%, even though DAU has grown, and during periods of high DAU the load can rise to 40%.

Conclusions


Oddly enough (anything could have happened, right up to the classic scenario of someone cutting the data center's main cable), we moved. In fact, we did not invent anything during the operation; it is all, by and large, written in the official documentation and other open sources. Looking back at what was done, we can say it went quite well: a plan was developed and tested, the work was carried out carefully, the game world did not suffer in the process, and no data was lost. Of course, a lot more had to be polished and done after the move, but that is another story.

At the moment, the availability and reliability of the game world and of the players' data have grown significantly:

Source: https://habr.com/ru/post/348060/

