
Friday to Monday Night: How We Launched Skyforge

As many of you know, on March 26 the Allods Team (a Mail.Ru Group studio) launched the open beta test (MBT) of the MMORPG Skyforge. My name is Sergey Zagursky, I am a programmer on the server team, and I want to tell you how the launch went, what incidents we ran into, and how we got out of them.



The early MBT phase


On March 26 the MBT opened for owners of early access packs. For a week they had exclusive access to the game, which they could use to get a small head start over other players. This phase brought no big surprises, because the load was, on the whole, comparable to what we had seen during the closed beta test and the stress test. The fun started later...

Access is open to all


On Thursday, April 2, during routine maintenance, we unchecked the box "Entry is allowed only for owners of early access packs". Despite some delays getting into the game, the first day went smoothly enough. That night there was another maintenance window dedicated to ironing out rough edges in the hardware and software configuration.

Friday!


The fun began on Friday, April 3. In the morning, after the night maintenance ended, players started to fill up the server capacity. Twenty minutes after the servers came up we passed the stress test load level and entered uncharted territory. We had, of course, run synthetic stress tests at many times that scale, but a player is an unpredictable creature who presses buttons our bots have not yet learned to press. The first signs of trouble did not take long to appear: just two hours after opening, the server stopped admitting new users for the first time.

One of the main focuses of our server development is quality of service, so we have built several mechanisms to estimate it for players who are already in the game. If this estimate falls below a certain threshold, we suspend the entry of new players, on the principle of "better fewer, but better". Two hours after the start this threshold was crossed for about half a minute, and players began to see a login queue that only moved when someone decided to leave it. Many of the developers were in the game at that moment and were somewhat puzzled, because by subjective feel the quality of service had not suffered. There were several reasons for that.

Load analysis revealed several nodes under abnormal load. The most heavily loaded was the service behind the in-game market, although inside the game the market seemed to respond with an acceptable delay. Second in line were several of the nodes that store character data; explaining those requires a short digression into their recent history.
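In principle the mechanism is a simple gate in front of the login queue. Below is a minimal sketch of the idea in Python; the names (`quality_source`, `login_queue`, `admit`) and the threshold values are invented for the example and stand in for the real services:

```python
import time

QUALITY_THRESHOLD = 0.8    # hypothetical floor for the in-game quality estimate (1.0 = perfect)
CHECK_PERIOD_SEC = 5

class LoginGate:
    """Pauses admission of queued players while in-game service quality is degraded."""

    def __init__(self, quality_source, login_queue):
        self.quality_source = quality_source   # callable returning the current quality estimate
        self.login_queue = login_queue         # list of players waiting to enter the game
        self.paused = False

    def tick(self):
        quality = self.quality_source()
        # "Better fewer, but better": stop letting new players in while quality is low.
        self.paused = quality < QUALITY_THRESHOLD
        while not self.paused and self.login_queue:
            admit(self.login_queue.pop(0))

    def run(self):
        while True:
            self.tick()
            time.sleep(CHECK_PERIOD_SEC)

def admit(player):
    print(f"admitting {player}")
```

The "corrections" described a bit further down amounted to tuning exactly this kind of threshold.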

At the closed beta stage we were somewhat worried that our main database nodes might not cope with the load, so shortly before the start of the MBT we provisioned several additional machines for the databases. Their configuration matched the others in everything except the disk subsystem, so SSDs pulled out of our spare stock were installed in them, and the machines joined the server farm in this new role.

A peculiarity of how characters are balanced across the DB nodes played a trick on us. The specific node on which a player's characters will be stored is chosen at the moment the account is created, even if the account is created through the game portal. It is easy to guess that everyone who had shown interest in the game, gone to the portal and registered an account there had been balanced onto the "old" nodes; all of the developers were among them. Later this kept us from seeing the picture of service quality from the players' point of view, because on those nodes the load stayed within the normal range. Of course, we knew about this in advance and arranged for new users to be registered onto the "old" nodes only after the load on the "old" and "new" nodes evened out. So our game ended up with what we called the "old-timer" nodes, home to all the owners of early access packs.
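To illustrate the balancing behaviour described above, here is a rough sketch; the `node_load()` metric is hypothetical and the real selection logic is certainly more involved:

```python
def pick_character_node(old_nodes, new_nodes, node_load):
    """
    Choose the DB node that will permanently store a newly created account's characters.

    While the "old-timer" nodes are noticeably more loaded than the freshly added ones,
    route every new registration to the least loaded new node; once the load has
    evened out, balance across the whole pool again.
    """
    least_loaded_new = min(new_nodes, key=node_load)
    most_loaded_old = max(old_nodes, key=node_load)

    if node_load(most_loaded_old) > node_load(least_loaded_new):
        return least_loaded_new
    return min(old_nodes + new_nodes, key=node_load)
```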

Back to the chronicle. As I said, not all of the database nodes were overloaded, only a few of them. Moreover, on the "old-timer" nodes the quality of service was higher even though they held more players at that moment. Meanwhile the server closed its doors for the Nth time, and since nothing was lagging inside the game, we decided to adjust the load threshold at which the server pauses player logins. There were several such adjustments, each of which temporarily improved the login situation; we were trying to find the value at which the load would stabilize. After the latest adjustment the load on the database nodes stabilized, but the market kept degrading.



And then it happened. The dashboard reported that the server had started an emergency shutdown. A brief investigation showed that the server had shut itself down in strict accordance with the logic built into it. By design, as they say.

A distributed game server has many nodes responsible for storing various data about the characters and the game world in databases. Our fast-recovery strategy includes periodically creating consistent restore points on all database nodes. At regular intervals a dedicated service coordinates the creation of these restore points and reports any problems with them. The system is configured so that, in the event of a data integrity violation, character progress can never be rolled back by more than 10 minutes. In our case the service reported that it could not create restore points for the market database and initiated a stop of the game server. During this involuntary maintenance we decided to disable the in-game market; the most reliable way was to have the server return empty lists of possible operations to the client. We deployed the patch that emptied those lists via hotswap immediately after the restart.
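A simplified sketch of what such a coordinator might look like; `create_restore_point()` and `shutdown_server()` are invented stand-ins for the real services:

```python
import time

MAX_ROLLBACK_SEC = 10 * 60       # never lose more than 10 minutes of character progress
RESTORE_POINT_PERIOD_SEC = 60

def coordinate_restore_points(db_nodes, shutdown_server):
    """Periodically create a consistent restore point across all DB nodes.

    If restore points cannot be created for long enough that a failure would roll
    characters back by more than MAX_ROLLBACK_SEC, stop the game server.
    """
    last_success = time.time()
    while True:
        results = [node.create_restore_point() for node in db_nodes]
        if all(results):
            last_success = time.time()
        elif time.time() - last_success > MAX_ROLLBACK_SEC:
            # Better an emergency stop than hours of lost progress.
            shutdown_server(reason="cannot create consistent restore points")
            return
        time.sleep(RESTORE_POINT_PERIOD_SEC)
```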

Along the way we found the cause of the abnormal load on some of the character data nodes: through an unfortunate coincidence, the new database nodes had been fitted with lower-performance disks. Replacement disks were located, and the swap was scheduled for the planned maintenance on Monday. Our colleagues also managed to prepare a patch fixing the abnormal load on the market, whose rollout was planned for Monday as well.

After the restart


The situation seemed to have stabilized. The load on some database nodes was high but steady. Most players experienced it as delays when picking up loot, performing operations with adepts and advancing along the development graph. The market was still completely disabled, and that blocked a significant part of the user experience. A deeper analysis showed that, in terms of load, sell operations carry a much greater weight, so with the same hotswap mechanism we re-enabled buying on the market. The analysis did not lie: the market began responding within acceptable time.
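Schematically, the hotswapped switch looked something like the fragment below; all names here are invented for the example, and the real market service is of course not two global flags:

```python
# Hypothetical runtime switches, flipped via the hot-swap mechanism.
MARKET_BUY_ENABLED = True     # buying is cheap enough to turn back on
MARKET_SELL_ENABLED = False   # sell operations dominate the load, keep them off

def get_sell_operations(character):
    """Sell operations list shown to the client."""
    if not MARKET_SELL_ENABLED:
        return []             # the "empty list" patch: the client simply sees nothing to do
    return query_market(character, kind="sell")

def get_buy_operations(character):
    """Buy operations list shown to the client."""
    if not MARKET_BUY_ENABLED:
        return []
    return query_market(character, kind="buy")

def query_market(character, kind):
    # Stand-in for the real (and expensive) market database query.
    return []
```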

The server was running, and at 19:00 the team was due to hold a small office party to celebrate the launch of the MBT. And then suspicious messages from the database started slipping into the logs, reporting integrity violations in the databases on the nodes with the slower disks. A quick exchange with the operations team, a check on the state of the replicas, and at 19:00, during the official part of the party, to shouts of "Hurray!" and popping champagne, the suspicious nodes were sent off for maintenance.



pg_dump on both the replicas and the master produced the same disappointing error message. Everything pointed to the loss of part of the character data. Plans change dramatically. The routine maintenance is moved from Monday to the night from Friday to Saturday. A courier is urgently dispatched to deliver the SSDs to the right data center. We quickly start building a new version of the server that includes all the fixes and optimizations made so far. And we, going grey, sit and pick apart the broken databases.
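For context: attempting a full dump is a crude but effective integrity probe, because pg_dump fails with an error if it cannot read some of the data. A sketch of such a check (host, user and database names here are placeholders):

```python
import subprocess

def dump_check(host, dbname, user="postgres"):
    """Try to dump a database and report whether pg_dump completed cleanly.

    A failing dump is a strong hint that some of the on-disk data cannot be read.
    """
    result = subprocess.run(
        ["pg_dump", "-h", host, "-U", user,
         "--format=custom", "--file=/dev/null", dbname],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"{host}/{dbname}: FAILED\n{result.stderr}")
    else:
        print(f"{host}/{dbname}: OK")
    return result.returncode == 0
```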

Analysis showed that not all of the databases had lost integrity. The operations team reconfigured the server to start without the corrupted databases. A calmer and deeper look revealed that the Postgres failure had damaged only the indexes, which, to everyone's relief, were successfully rebuilt.
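The article does not say exactly how the indexes were recreated; the standard tool for this in PostgreSQL is REINDEX. A minimal sketch, assuming the psycopg2 driver and that the damage really is limited to indexes:

```python
import psycopg2  # assumes the psycopg2 PostgreSQL driver is installed

def rebuild_indexes(dsn, database_name):
    """Rebuild all indexes of one database after index-only corruption."""
    conn = psycopg2.connect(dsn)
    conn.autocommit = True   # REINDEX DATABASE cannot run inside a transaction block
    try:
        with conn.cursor() as cur:
            cur.execute(f'REINDEX DATABASE "{database_name}"')
    finally:
        conn.close()
```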

New crashes


Just as I brought this news to the operations team, they were remotely coordinating the replacement of the slow disks with normal ones. Telemetry showed a failure and a drop in the number of players (yes, all this time Skyforge kept running for players whose characters lived on the nodes that were still up). Human error: at 3 a.m. the administrator on duty pulled the disks not from the 12th unit but from the 11th. The 11th unit housed the node with the authentication service. Hysterical laughter.

Fortunately, the RAID the disks were pulled from did not fall apart, and this incident only slightly prolonged the work. By the way, until the server was fully stopped there were six people still playing on it: after the authentication service goes down you can keep playing until you transition to another map (which, incidentally, is a bug; in future versions map transitions will not depend on the authentication service).

Launch at 7 a.m.


By morning the server was running with all databases online. The results of that night's maintenance work were: a) the slow SSDs on some nodes were replaced with proper ones, and b) the synchronous replicas were temporarily moved to a RAM disk to reduce the load on the SSDs. The load on the nodes storing character data no longer gave cause for concern.

What, again?


At 1:45 p.m. on Saturday the load on the nodes holding character data shot up to infinity, and the server was stopped by the restore point service we already know. The reason: the RAM disks holding the synchronous replicas had overflowed, and the nodes blocked on commit. Since by then we knew there was no pressing need to keep the replicas on RAM disks, during an unscheduled maintenance we switched the synchronous replicas back to SSD and restarted the server. It also turned out on Saturday that, in the rush to build the new server version, the market optimization had not made it in, so selling on the market had to be switched off again.



Let's sum up


The top problem areas where we lost more time than we should have:

And the things that, in the context of the adventures described above, we take pride in:

Source: https://habr.com/ru/post/256155/

