
Behind the Scenes of Skyforge Closed Beta



Today I want to talk about the first part of the Skyforge closed beta test. This is not the first closed test, but it is the largest so far. Most of the players are not company employees or their friends but fans of the game, either selected at random from those who signed up for the test or holders of early access kits that they bought or won. From Friday evening, February 6, until the end of the weekend there will be a special beta weekend, during which the closed beta will be open to everyone with a Mail.Ru mail account. This post is a narrative and reflects my personal view of the events before and during the closed beta.

Preparing for the closed beta


The very first Skyforge tests were carried out entirely within the development team, which did everything itself. As the game became more stable, the circle of people admitted to testing grew. First we showed the game to our colleagues at Allods Team, then to the Mail.Ru Group games department, then to the whole company. From that stage on, all tests were run by a separate team, the operations team. External tests ran on the environment that was being prepared for the closed beta and, later, the open beta.

To make the closed beta productive, we prepared our infrastructure carefully. We ran load tests on the closed beta servers and set up metrics collection, crash reporting, log analysis and much more. When everything was ready, we opened the valve and let in the first real users.

First moments


As soon as players started logging in, we went looking for a Twitch stream to see how users were reacting, to judge how well the server was coping, and simply out of curiosity. Most of the studio watched the stream of an unsuspecting gamer, and our director of quality even gave him some advice as if he were a regular user. The first reports from the production servers started coming in: authorization problems on the web portal, client crashes, connection errors for individual players. On the whole, though, the launch was a success: the account server coped and the game mechanics servers did not slow down. That is hardly surprising, because by our estimates the hardware had enough capacity for a much larger number of users.

Live patching


We are lucky that the Skyforge server is written in Java. Java natively supports HotSwap, a mechanism for replacing code on the fly without stopping the application. We even wrote a special utility that goes through the servers and patches them. This tool proved useful right at the start of the closed beta. When we realized that we had a bug in loading user avatars for chat, we simply commented that functionality out.



Users hardly noticed the loss of avatars, but life became much easier for some of the servers. In the next patch this functionality was disabled outright.

In principle, targeted live patching is as cool as it is scary, and without complete confidence in success it is better not to use it.

Client difficulties


When it comes to testing, the server was luckier than the client. We know the hardware the server will run on, we know its exact environment, and we configure it ourselves. The client is much harder in this respect. During development the client runs on perhaps 10 to 30 different PC configurations: the developers' own computers. After release the number of configurations runs into the thousands, and each may have its own quirks: exotic drivers, video cards and other components. As a result, the first wave of new players fell victim to widespread crashes and performance problems. Thanks to our system for collecting hardware and performance statistics, however, we were able to make the necessary improvements quickly. The team is still focused on improving stability and raising FPS.



We use heatmaps like this one to assess players' FPS in open areas.
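
A heatmap of this kind can be built by bucketing client FPS samples into grid cells by world position and averaging within each cell. The following is only a minimal Java sketch of that idea, assuming the client reports (x, y, fps) samples. The class `FpsHeatmap`, its methods and the cell size are hypothetical and are not taken from the Skyforge code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: averages FPS samples per square grid cell of the world map. */
class FpsHeatmap {
    private final double cellSize;
    // Cell key -> {sum of FPS samples, number of samples}
    private final Map<Long, double[]> cells = new HashMap<>();

    FpsHeatmap(double cellSize) {
        this.cellSize = cellSize;
    }

    /** Packs the cell coordinates into one key (assumes each fits in 32 bits). */
    private long key(double x, double y) {
        long cx = (long) Math.floor(x / cellSize);
        long cy = (long) Math.floor(y / cellSize);
        return (cx << 32) ^ (cy & 0xffffffffL);
    }

    /** Records one FPS measurement taken at world position (x, y). */
    void addSample(double x, double y, double fps) {
        double[] acc = cells.computeIfAbsent(key(x, y), k -> new double[2]);
        acc[0] += fps;
        acc[1] += 1;
    }

    /** Average FPS in the cell containing (x, y), or NaN if the cell has no samples. */
    double averageAt(double x, double y) {
        double[] acc = cells.get(key(x, y));
        return acc == null ? Double.NaN : acc[0] / acc[1];
    }
}
```

Cells with a low average then point to areas of the map that are worth profiling first.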

Server difficulties


Testing outside the circle of friends and family introduced the server team to some oddities of the real world. For example, several users suffered from a peculiarity of their network infrastructure: every new connection got a new external IP. So when a player teleported from one map to another, his IP changed, and we dutifully disconnected him from the server because we considered this situation invalid. After several tickets were filed, we had to disable the IP check within a single game session as unnecessarily paranoid. The other user validity checks remain in place.

If you read the article about our server architecture, you know that it consists of a set of distributed servers, each living on its own host and playing a specific role: authorization server, game mechanics server, database server and so on. During development we tried to make sure that the crash of a single server would not be fatal and most users could keep playing. But reality turned out to be harsher than we expected.



Sometimes servers go down a whole rack at a time. The first time we got "lucky" was when the rack holding the portal and several game mechanics servers lost power completely. We expected players to behave as follows: those unlucky enough to be on the servers that lost power would quietly log back in and keep playing, having lost only their progress on the personal maps hosted on the dead mechanics servers. Unfortunately, because the shutdown was abnormal and the mechanics hosts vanished completely, the coordinator server never received a disconnect message.

For me personally, this was not the TCP keep-alive behavior I expected. It turned out that keep-alive starts sending packets on an idle TCP connection only after a timeout has elapsed, and by default that timeout is very long. This is done so as not to clutter the channel when everything is fine. In principle the timeout could be set for each connection individually, but that would have required dirty hacks. So we decided on a solution of our own: a simple ping-pong mechanism for polling servers. Its advantages are that admins at remote sites cannot accidentally disable it, unlike TCP keep-alive, and that it lets us disconnect servers that are unresponsive for other reasons, for example because they have gone into a series of Full GCs.

Having discussed this idea, we agreed to implement it before the open beta. After all, hosts do not lose power every day.
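
The ping-pong polling idea can be sketched roughly as follows. This is a minimal Java illustration and not the actual Skyforge implementation: the class `HeartbeatMonitor`, its methods and the timeout value are hypothetical, and sending the pings over the network is left out. The coordinator records when each server last answered a ping and evicts any server that has been silent for longer than the timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: tracks when each peer server last answered a ping. */
class HeartbeatMonitor {
    private final long timeoutMillis;
    private final Map<String, Long> lastPongAt = new ConcurrentHashMap<>();

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Starts tracking a server; registration counts as its first sign of life. */
    void register(String serverId, long nowMillis) {
        lastPongAt.put(serverId, nowMillis);
    }

    /** Called when a pong arrives; ignored for servers that are not tracked. */
    void onPong(String serverId, long nowMillis) {
        lastPongAt.computeIfPresent(serverId, (id, previous) -> nowMillis);
    }

    /** Removes and returns the servers that have been silent longer than the timeout. */
    List<String> evictSilent(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastPongAt.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
            }
        }
        dead.forEach(lastPongAt::remove);
        return dead;
    }

    boolean isTracked(String serverId) {
        return lastPongAt.containsKey(serverId);
    }
}
```

In such a design the coordinator would call `evictSilent` periodically, for example from a scheduled task. Because the check lives at the application level, it catches both a host that lost power and a process that is alive but stuck, such as one in a long garbage collection pause. Time is passed in explicitly here only to keep the sketch deterministic.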



When we saw the graph above in real time, we thought the sharp drop in the number of active users marked the end of the working day. It turned out that the host with mechanics server number 13 had gone down. Two crashes in a week could still be a coincidence, but by then a significant part of the server team was already studying the logs to see what could be improved in our system. Then, on the last Friday of the year, a few hours before the corporate party, the host with mechanics server number 13 went down again, with the New Year ahead and 12 days of holidays that everyone wanted to spend away from work. We decided to switch the unlucky mechanics server off, and there have been no such incidents since.

Still, so that the New Year would pass more calmly and the loss of a single mechanics server would not lead to emergency maintenance, we prepared the first version of the fix that same evening of the corporate party. We tested it the honest way: we ran a distributed version of the server on local PCs and pulled the power cord out of the machine running the mechanics server. The fix was judged to work and went to production during the last maintenance window of 2014. The New Year holidays turned out to be calm.

Error 107


107 is the code for a timeout error when connecting to the server. Unfortunately, this error became widely known among closed beta players, and several rather interesting facts are connected with it.

The bug, or rather the bugs, behind error 107 had been in the network engine for a long time, since a few months before the closed beta began. The bots were the first to stumble on the timeout. But then, either for lack of time or because the cause could not be tracked down, all timeouts were simply switched off in the bots. As a result, the bug unfortunately reached production.

Then we saw how a few active users can create the illusion that a particular problem is important. Error 107, as our investigation showed, could occur only for users with a ping of around ~3 ms, and it involved the handling code on both the server and the client. Unfortunately, our fix only made things worse: the error disappeared for a very small percentage of users but appeared for a much larger number of players.

These were users who play Skyforge over 3G / 4G modems and / or on weak PCs. We could not reproduce this new bug locally. Fortunately, we have bots. We switched their timeouts back on, repaired the timeout handling code, and immediately got a flood of 107 errors. From there, fixing the bug was a matter of technique. After the fix was deployed to production, error 107 all but disappeared. Only 1 complaint now remains open.

In fact, there were several reasons for error 107:

Error 107 showed once again how important it is to investigate behavior in the system that is not fully understood, and how important it is to have objective statistics on how widespread particular errors are among players. Such statistics, by the way, should be in place by the start of the open beta.

Dry statistics on the progress of players who took part in testing:

X - Registrations total
0.96 * X - Z1. Dankit Island
0.84 * X - Z2. Isola excavations
0.58 * X - Z3. Lanber Forest
0.37 * X - Z4. Naori Island
0.11 * X - Z5. Milensky caves

Unfortunately, absolute values cannot be published, although in my humble opinion they are more than respectable.
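
The relative figures still allow a stage-to-stage reading. Assuming each coefficient is the share of all registered players who reached that zone (an interpretation, not something stated explicitly above), dividing neighboring values gives the share of players who moved on from one zone to the next. The class and method names in this Java sketch are hypothetical.

```java
/** Hypothetical sketch: turns cumulative "reached zone" shares into step-to-step retention. */
class ZoneFunnel {
    /**
     * reached[i] is the share of all registrations that reached stage i
     * (stage 0 is registration itself, so reached[0] is 1.0).
     * Returns, for each step, the share of players at stage i who reached stage i + 1.
     */
    static double[] stepRetention(double[] reached) {
        double[] out = new double[reached.length - 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = reached[i + 1] / reached[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Registration, then Z1..Z5, using the coefficients listed above.
        double[] reached = {1.0, 0.96, 0.84, 0.58, 0.37, 0.11};
        String[] steps = {"Reg->Z1", "Z1->Z2", "Z2->Z3", "Z3->Z4", "Z4->Z5"};
        double[] retention = stepRetention(reached);
        for (int i = 0; i < retention.length; i++) {
            System.out.printf("%s: %.3f%n", steps[i], retention[i]);
        }
    }
}
```

For example, 0.84 / 0.96 = 0.875, so under this reading seven out of eight players who reached Dankit Island went on to the Isola excavations, while fewer than a third of those who reached Naori Island (0.11 / 0.37) made it to the Milensky caves.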

In lieu of a conclusion


Closed beta testing is one of the most important stages before launching a game for a wide audience. In my opinion, our team handled the latest stage of the closed beta quite well. I want to thank all the users who took part in the closed beta: your reports help make the game better. I also invite everyone to the stress-test weekend. I hope that together we can bring the server to its knees :)

Source: https://habr.com/ru/post/249791/

