
A couple of years ago we were hosted in the most cost-effective (read: "cheap") data center in Germany. To give you an idea of the conditions: routing could fail between racks or even inside one; a rack switch would get overloaded or hang; the data center itself was constantly getting DDoSed; hard drives failed; motherboards and network cards burned out; servers randomly shut down and rebooted; and network cables fell out like autumn leaves in a hurricane.
Periodically, when it was time to scale horizontally, the DC would also run out of space, and we would be offered a different location in another city, which was unacceptable given our constraints (data-schema limitations, cluster topology, and how critical client latency is for us).
Eventually we reached the boiling point and decided to move, although at some point it even seemed cheaper to hire more maintenance staff to keep the situation in that DC under control. In the end, in order to "make life better", we chose stability.
We settled on a data center in Amsterdam, the Netherlands. And here is the most interesting part: by that time the game already had a decent DAU, and the move had to be done online, without downtime, simultaneously on both platforms (Android and iOS). On top of that, we got featured on Google Play and marketing launched an advertising campaign, so there was a lot of extra traffic.
All in all, the task was far from trivial, and here is how we handled it.
Overall architecture
The first synchronous mobile PvP shooter for a company that mostly makes casual games for social networks is a step into the unknown. Our colleagues have already written about how the architecture of the War Robots project evolved at the very beginning, and we, the development teams and system administrators, had our own challenges.
So, here is our overall architecture:

In order:
- Frontend and backend. Frontend: nginx; backend: apiserver (Tomcat) + Hazelcast.
- Fault tolerance. Both before the move and now, we can lose up to two Cassandra nodes from the cluster without consequences. Each platform had 20 to 30 nodes.
- Replication factor. RF = 5.
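To put the fault-tolerance numbers in context: with RF = 5, a QUORUM read or write needs floor(5/2) + 1 = 3 replicas to respond, so any two replicas of a partition can be lost without breaking quorum operations. A minimal CQL sketch of such a keyspace (the name and options here are illustrative, not the project's actual schema):

CREATE KEYSPACE IF NOT EXISTS wr_example
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 5};
-- QUORUM = floor(RF / 2) + 1 = 3 of the 5 replicas,
-- so two nodes holding replicas can be down and quorum still holds.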
Purpose of the move
First, we wanted to change the DC to make the game better: availability and responsiveness of services are among the most important components of reliability. Second, it was time to upgrade the hardware.
But we unfortunately failed to reach the secondary goal of moving to a fresher Cassandra. Versions 2.1.13 and 2.1.15 refused to talk to each other properly in our test environment. We investigated and tried to understand why, but deadlines were tight, so we kept the version as it was. For the same reason, newer Cassandra releases were not even on the table.
What we were facing
Since the preparation and testing phase had dragged on and we came under increased load (the Google Play feature and the marketing campaign launch), we had to make the operation extremely reliable.
DAU growth during the move to the new DC, with the advertising campaigns and the feature running at the same time.

At its core, the relocation meant first adding a second DC to the cluster, switching traffic over, and then removing the first one. We were moving not only the code but all of the players' data. The distance from the application to the database should be minimal. To keep the data consistent, you have to keep writing to both rings with a suitable consistency level, which adds the cost of shipping data to the remote data center to the response time. We also had to ensure fast database access from the new DC, i.e. a wide and reliable link between the data centers.
The old DC had very persistent problems with hardware and the network. The RF was supposed to protect us from sudden hardware failures, and the risk of the cluster falling apart because of the network could be countered by monitoring the data transfer and, if necessary, streaming the data several times.
As for the choice of a new data center: in short, it had to be more stable in terms of its internal network and connectivity to the outside world, and more reliable in terms of hardware. We were ready to pay more for better quality and for real, not nominal, 24/7 availability of the servers.
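The consistency-level trade-off mentioned above is easy to feel in practice: EACH_QUORUM waits for a quorum of replicas in every DC, so each write pays the round trip to the remote ring, while LOCAL_QUORUM only waits for the local one. A quick, rough way to compare them from cqlsh (the statement you run is up to you; nothing below is project-specific):

CONSISTENCY LOCAL_QUORUM;
TRACING ON;
-- run a representative INSERT or UPDATE here and note the trace timings,
-- then repeat the same statement after:
CONSISTENCY EACH_QUORUM;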
Preparing for the move
To prepare, we needed to do quite a lot, namely:
- There were servers on which both the backend code and Cassandra nodes were running at the same time. We had to unload them: leave only the Tomcats on those machines, tear down the Cassandra nodes there, and deploy the nodes on separate machines instead (the goal: if Cassandra comes under load during the move and the influx of players, the database should get all the resources of its machine).
- Unfortunately, the old ring ran on SimpleSnitch. It is not flexible and does not let you use several DCs properly (in Cassandra terms). For production on real hardware, GossipingPropertyFileSnitch is recommended: it lets you control the ring dynamically through its settings and the gossip protocol. So we had to change the snitch (SimpleSnitch -> GossipingPropertyFileSnitch) and the replication strategy (SimpleStrategy -> NetworkTopologyStrategy); a configuration sketch follows right after this list. No instructions for adding a DC or changing the database configuration had to be invented; it has all been in the documentation for a long time.
- Test the move, because nothing like this had been done before. For that we took a separate server, wrote code that generates load, and under that load changed the settings, added a DC to Cassandra, streamed the data, switched the load to the new DC, and checked the data in the new ring. At this stage we also tried to upgrade Cassandra, but after the upgrade the rings stopped seeing each other and spat out errors; we tried upgrading both rings as well, but the problem was somewhere in that Cassandra version. We also tried cutting the link between the rings and watching how the cluster and the rebuild behaved. After that, separate code verified that we had pulled over all the data.
- Think through rollback and fallback options for when the point of no return comes.
- Get fresh, powerful hardware, because the load was expected to grow (this is when we put Cassandra on SSDs, by the way). We had to estimate the data volume and take disks with a margin, because Cassandra needs 2x or more free space for a rebuild; it all depends on the compaction strategy, but you should always assume at least 2x.
- Prepare the configuration for the new machines. Along with moving to the new DC, we decided to migrate from Puppet to Ansible. As part of the same task we planned to tidy up the infrastructure and streamline server configuration and the quick addition of new hardware to the system. We also needed to improve monitoring.
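For reference, the snitch change from the list above boils down to two small pieces of per-node configuration (the DC and rack names are whatever you choose for your topology; the values here are just an example):

# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

# cassandra-rackdc.properties (read by GossipingPropertyFileSnitch)
dc=DC1
rack=RAC1

Each node has to be restarted to pick up the new snitch, after which the keyspace replication strategy can be switched from SimpleStrategy to NetworkTopologyStrategy so that the replica count can be set per DC.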
How to do it
DC1 is the old DC, DC2 is the new DC.
- Deploy the backend to DC2 in a disabled state, with write CL = EACH_QUORUM. Why EACH_QUORUM? If we have to fall back to DC1, there is less chance of losing data, because every write lands in both DCs.
- Deploy an empty Cassandra with the following settings:
auto_bootstrap = false
seeds = the list of seed nodes from the old DC, for now
endpoint_snitch = GossipingPropertyFileSnitch
dc = DC2
- Start the new DC's ring and check that nodetool status in both DCs shows everything is fine: all nodes are up (UN) and there are no errors in the database logs.
- Configure the keyspace (declare that the data will live in both DC1 and DC2); the CQL equivalent of the command below is sketched after this list:
UPDATE KEYSPACE WarRobots with placement_strategy = 'NetworkTopologyStrategy'
and strategy_options = {'dc1': 5, 'dc2': 5};
- Enable EACH_QUORUM on writes in the DC1 backend.
- Run a data rebuild on DC2, telling it where to pull the data from:
nodetool rebuild dc1
- After the rebuild finishes, start the code in DC2, connect as a client, and test.
- Gradually, under close watch, shift nginx traffic to the API in DC2.
- Observe, then switch DNS.
- Wait for traffic to move fully to DC2, shut down the backend in DC1, and switch the DC2 backend to write CL = LOCAL_QUORUM (the point of no return).
- Remove DC1 from the WarRobots keyspace.
- Dismantle the DC1 ring.
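The keyspace command in the plan above uses the old cassandra-cli syntax; in CQL the same changes look roughly like this (keyspace name taken from the text, DC names as in the command above):

-- declare replicas in both DCs before the rebuild:
ALTER KEYSPACE "WarRobots"
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5, 'dc2': 5};

-- after the point of no return, drop DC1 from the keyspace:
ALTER KEYSPACE "WarRobots"
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc2': 5};

Dismantling the DC1 ring is then typically done by running nodetool decommission on each DC1 node in turn (or nodetool removenode for nodes that are already dead).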
How we did it and why
- Deployed the backend (in a disabled state, with write CL = EACH_QUORUM).
- Deployed an empty Cassandra and configured it.
- Started the DC2 ring.
- Configured the keyspace.
- Ran a full data rebuild on DC2, with DC1 as the source.
- Started the backend in DC2.
- Connected as a client and checked that everything was fine.
- Enabled EACH_QUORUM on writes in the DC1 backend.
- Started the data rebuild on DC2 again, pulling data from DC1: we had had network and hardware problems during the move, so we considered a second rebuild necessary.
- After the rebuild finished, gradually shifted nginx traffic to DC2.
- Observed, then switched DNS.
- Waited for traffic to move fully to DC2, shut down the DC1 backend, and switched the DC2 backend to write CL = LOCAL_QUORUM (the point of no return).
- Removed DC1 from the WarRobots keyspace.
- Dismantled the DC1 ring.
Risks in our implementation
We could have ended up with downtime because of the high load on the database in the old ring (which, to be honest, was not in the best shape). We could also potentially have lost some data during the transfer in case of a network failure, which is partly why we decided to play it safe and run two rebuilds.
Positive sides
We could switch back to the old DC at any moment, right up to the point of no return. We could even do it after the point of no return, but then we would have lost a few days of player progress, although we could have copied that back from DC2 to DC1.
We also got a chance to verify that both Cassandra DCs worked fine with data flowing in both directions.
What we failed to anticipate and how we dealt with the consequences
During the move, a node in DC1 went down (its hard drive died and the node's local data got corrupted); we had to take it out and carry on with one spare node in the ring.
We also ran out of disk space on the iOS platform for the repeated rebuild and had to quickly buy and add disks. We tried spreading the DC2 ring's data across servers with HDDs so we wouldn't have to buy more disks (buying disks takes time), but that drastically reduced the transfer speed, so we ended up waiting for the disks to be delivered.
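For completeness, one common way to drop a dead node from a 2.1 ring (we are not claiming this is exactly what was run; the host ID is a placeholder you read off nodetool status):

nodetool status                   # the dead node shows up as DN; note its Host ID
nodetool removenode <host-id>     # re-replicate the dead node's ranges from surviving replicas
nodetool removenode status        # check progress; 'force' exists as a last resort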
Result
The graphs illustrate the result best (the sharp dip in them is due to a short-lived problem with analytics, not to downtime, as it might seem):
Average response time of a client request to load a player profile.
Average database response time.

Unfortunately, the hardware-load metrics have not been preserved. From memory, Cassandra used to eat about 60% of CPU in the old DC. Now the figure stays around 20%, even though DAU has grown, and during peak-DAU periods the load can rise to 40%.
Conclusions
Oddly enough (given that anything could have happened, right up to the classic case of someone cutting the data center's main cable), we moved. We did not really invent anything during the operation; by and large it is all written in the official documentation and other open sources. Still, looking back at what was done, we can say it went quite well: a plan was drawn up and tested, the work was done properly, the game world did not suffer in the process, and no data was lost. Of course, a lot more had to be polished and finished after the move, but that is another story.
Today, the availability and reliability of the game world and of the players' data have grown significantly:
- we sped up the services thanks to more powerful, more reliable hardware and moving the database onto SSDs;
- we reduced response latency for clients by placing the database servers and the code as close to each other as possible within the DC;
- we replaced the infrastructure management system and improved monitoring;
- we gained invaluable experience in managing a Cassandra cluster and migrating data to another data center under production conditions;
- and, of course, the amount of nightly sleep for server developers and system administrators has returned to normal.