Online game server architecture, using Skyforge as an example

Hi, Habr! I'm Andrei Frolov, a lead programmer at Mail.Ru working on the next-gen MMORPG Skyforge. You may have read my article about database architecture in online games. Today I will reveal how the Skyforge server is built. I will try to go into as much detail as possible, with examples, and explain why each architectural decision was made. Without exaggeration, a whole book could be written about our server, so to fit into an article I will cover only the main points.

Overview

The server is nearly two million lines of Java code. To connect to the server and display a beautiful picture, players use a client written in C++.
Fifty programmers have contributed to the server code. It has been written over many years by the best specialists of the Russian "Orthodox" game dev. It contains all the most successful ideas from around the world.
At the moment we have about 5,200 automated tests, and continuous integration and load testing with bots are in place.
The server can run on tens or hundreds of machines and support hundreds of thousands of simultaneous players. We decided to abandon the sharding technique traditional for MMOs and put all players into one big world.

The first and most important rule of server development: the client is in the hands of the enemy. The client is protected, but in theory it can be hacked and the client-server protocol can be deciphered. A hacked client can lead to bypassing game rules, cheating, botting, etc. Such things ruin the game for everyone. To avoid this, we must simulate the entire game world with all the game rules on our server, and the client should be used only to display a beautiful picture. In addition, clients must be checked for tampering, suspicious behavior must be tracked, and so on.

Service Architecture

One of the main peculiarities of development is that we do not know how many players we will have. Maybe only one, the developer himself, or maybe 100,000 at a time. Therefore, the server should be able to run in a small configuration on a laptop and, if necessary, stretch to tens or hundreds of powerful servers.

The second feature is that at the start of development we had no idea what our game would be about, what features, services, etc. would be in it. The server structure should be as flexible as possible in terms of adding new services and features.

The third big problem is multithreading. As you know, the best way to cope with multithreading is to avoid it. Deadlock, livelock, lock contention, and other issues that are dear to the programmer’s heart can be circumvented if the server architecture saves you from having to synchronize the threads manually. Ideally, a programmer should generally write simple single-threaded code and not think about such things.

This is how the universal server structure used in Skyforge came about:

There is a pool of physical servers on which the game runs. This set of servers, together with our server application running on them, is called a realm.
Each server runs a server application (a JVM) called a role. There are different roles: account server, game mechanics, chat, etc. Each role takes on a large piece of functionality. Some roles exist as a single instance, some run in multiple instances.
A role consists of a set of services. A service is an ordinary thread that deals with its own specific task. Examples of services are an authorization service, a name reservation service, a load balancer, etc. A service knows nothing about the physical location of other services. They may be nearby, or they may be on a different physical machine. Services interact through a messaging system that hides such details from them.
Each service consists of a set of modules. A module is a “piece of functionality” that has a single tick() method. Examples of modules are a statistics module, a transaction execution module, and a time synchronization module. All a service does is call tick() on its modules in an infinite loop. One iteration of this loop is called a “server frame”. We consider performance good if a server frame takes from 3 to 20 ms.
This whole structure is described in XML files. The startup system only needs to be “fed” a role name. It finds the appropriate file, launches all the necessary services and gives them their lists of modules. The modules themselves are created using Java reflection.

Thus, we can launch the “local server” role, which contains everything needed, or we can split the server into several dozen roles (account server, item server, game mechanics, etc.) and run it on dozens of different physical servers. The structure turned out to be extremely flexible and convenient, and I advise you to take a serious look at it.
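
The service and module structure described above can be sketched roughly as follows. This is only an illustration, not the Skyforge code: the names ServerModule, StatisticsModule and Service are hypothetical, and the list of class names stands in for what the startup system would read from the role's XML file.

```java
import java.util.ArrayList;
import java.util.List;

/** A module is a piece of functionality with a single tick() method. */
interface ServerModule {
    void tick();
}

/** Hypothetical example module: counts how many server frames it has seen. */
class StatisticsModule implements ServerModule {
    int frames = 0;

    @Override
    public void tick() {
        frames++;
    }
}

/** A service owns a list of modules and ticks them in a loop on its own thread. */
class Service implements Runnable {
    private final List<ServerModule> modules = new ArrayList<>();
    private volatile boolean running = true;

    /** Creates modules by class name via reflection (names assumed to come from XML). */
    Service(List<String> moduleClassNames) {
        for (String name : moduleClassNames) {
            try {
                Object instance = Class.forName(name).getDeclaredConstructor().newInstance();
                modules.add((ServerModule) instance);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot create module " + name, e);
            }
        }
    }

    /** One "server frame": call tick() on every module exactly once. */
    void runFrame() {
        for (ServerModule module : modules) {
            module.tick();
        }
    }

    /** The whole job of the service thread: run frames until stopped. */
    @Override
    public void run() {
        while (running) {
            runFrame();
        }
    }

    void stop() {
        running = false;
    }

    List<ServerModule> modules() {
        return modules;
    }
}
```

A service built this way would be started with new Thread(service).start(). Because every module is ticked from the same thread, module code can stay single-threaded.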

Basic services

There is a set of services that carries the main game load. Each such service must be able to scale, ideally to infinity. Unfortunately, writing services that scale is not an easy task. Therefore, we started by making the main services scalable, and every additional trifle that does not carry the main load was left for later. If we get a lot of users, we will have to rework those as well to make them scalable.

Account service. Responsible for authorizing and connecting new clients.
Game mechanics service. This is the game itself. After authorization, the client connects here and plays here. The client does not interact directly with other services. There are several dozen or even hundreds of such services. These services bear the main load.
Database services. These services perform operations on the data of game characters: their items, money, and progress. Usually a few of them are running. More information about the database architecture can be found in my previous article ( habrahabr.ru/company/mailru/blog/182088 ).
Chat. Routes chat messages between users.
All other services. There are several dozen of them, and they usually do not create a heavy load, so they do not require separate servers.

Network

By the word “network” I mean the system that delivers messages from one service to another or from one object to another. For historical reasons, we have two such systems at once. The first is based on messages. The second is based on remote procedure calls (RPC). In Skyforge, the message system is used inside the game mechanics service, for example to send a message from an avatar to a mob, as well as for communication between the client and the server. RPC is used for communication between services.

Messages

All objects that want to send or receive messages are called subscribers. Each subscriber is registered in a common directory and receives a unique identifier, its address. Anyone who wants to send a message to a subscriber must specify the “from” and “to” addresses. The network engine knows where the subscriber is located and delivers the message to it. A message is a Java object that has a run() method. When sent, this object is serialized, arrives at the target subscriber, is deserialized there, and then its run() method is called on the target subscriber.

This approach is very convenient because it lets you implement simple commands like “strike”, “grant an unlock”, or “launch a fireball”. All this logic is external to the object on which the action is performed. The big disadvantage is that if the command logic requires executing code on several subscribers, we need several messages that send each other along a chain. The logic is fragmented across several classes, and the message chain is often quite long and hard to untangle.
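
A minimal sketch of such a message system might look like this. All names here (Message, Subscriber, StrikeMessage, MessageBus) are hypothetical, and built-in Java serialization is used only to keep the example short. It is not how the article's engine serializes messages.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** A message carries its own logic: run() is executed on the target subscriber. */
interface Message extends Serializable {
    void run(Subscriber target);
}

/** Anything that can receive messages; it only has health here, for illustration. */
class Subscriber {
    final long address;
    int health = 100;
    final Queue<byte[]> inbox = new ArrayDeque<>();

    Subscriber(long address) {
        this.address = address;
    }
}

/** Hypothetical "strike" command: the logic lives outside the target object. */
class StrikeMessage implements Message {
    private static final long serialVersionUID = 1L;
    final long from;
    final int damage;

    StrikeMessage(long from, int damage) {
        this.from = from;
        this.damage = damage;
    }

    @Override
    public void run(Subscriber target) {
        target.health = Math.max(0, target.health - damage);
    }
}

/** Directory of subscribers plus delivery; a real engine would also route between machines. */
class MessageBus {
    private final Map<Long, Subscriber> directory = new HashMap<>();
    private long nextAddress = 1;

    /** Registers a new subscriber and assigns it a unique address. */
    Subscriber register() {
        Subscriber subscriber = new Subscriber(nextAddress++);
        directory.put(subscriber.address, subscriber);
        return subscriber;
    }

    /** Serializes the message and puts the bytes into the target's inbox. */
    void send(long to, Message message) {
        Subscriber target = directory.get(to);
        if (target == null) {
            throw new IllegalArgumentException("Unknown address " + to);
        }
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(message);
            out.flush();
            target.inbox.add(bytes.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Deserializes every pending message and runs it on the target subscriber. */
    void deliver(Subscriber target) {
        byte[] data;
        while ((data = target.inbox.poll()) != null) {
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                ((Message) in.readObject()).run(target);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (ClassNotFoundException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}
```

Note that the sender never touches the target object directly. It only knows an address, which is what allows the target to live on another machine.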

RPC

Remote procedure call, or RPC, appeared to solve the message chain problem.
The basic idea is to use cooperative multitasking (coroutines, fibers). If you are not familiar with this concept, I advise you to look it up on Wikipedia: en.wikipedia.org/wiki/Coroutine .
A service that wants to be callable through RPC must implement a special interface and register in a special directory. Then anyone can ask the directory for the interface of this service, and the directory will return a special wrapper around the service. Calling services via RPC is possible only inside a coroutine, i.e. a special execution context that can be interrupted and resumed at break points. When a wrapper method is called, the wrapper sends the RPC call to the remote service, suspends the current fiber while waiting for a response, and returns the result when the remote service answers.

Thus, we concentrate the logic in one method rather than spreading it over hundreds of messages. The code is greatly simplified: it can be written in terms of calling functions on objects, not in terms of sending messages. But problems reminiscent of multithreading remain, because by the time we return from a remote call, the environment may already have changed. In general, this approach is very convenient when the service has a limited interface of a dozen methods. When there are many methods, it is better to split the interface into several.
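
The idea of a directory that returns a wrapper instead of the service itself can be illustrated with a standard Java dynamic proxy. This is a sketch under a loud assumption: plain Java has no fibers here, so the caller's thread simply blocks until the reply arrives, whereas the article's implementation suspends a coroutine. The names NameReservationService, NameReservationServiceImpl and RpcDirectory are hypothetical, and the "remote" service is just another thread in the same JVM.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical service interface exposed over RPC. */
interface NameReservationService {
    boolean reserve(String name);
}

/** Plain single-threaded implementation; only its own service thread touches it. */
class NameReservationServiceImpl implements NameReservationService {
    private final Set<String> taken = new HashSet<>();

    @Override
    public boolean reserve(String name) {
        return taken.add(name);
    }
}

/** Directory that hands out wrappers instead of direct references to services. */
class RpcDirectory {
    private final Map<Class<?>, Object> services = new ConcurrentHashMap<>();
    private final Map<Class<?>, ExecutorService> threads = new ConcurrentHashMap<>();

    /** Registers a service and gives it a dedicated (daemon) thread. */
    <T> void register(Class<T> api, T implementation) {
        services.put(api, implementation);
        threads.put(api, Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, api.getSimpleName());
            thread.setDaemon(true);
            return thread;
        }));
    }

    /** Returns a wrapper whose methods forward each call to the service thread. */
    <T> T lookup(Class<T> api) {
        Object target = services.get(api);
        ExecutorService thread = threads.get(api);
        if (target == null || thread == null) {
            throw new IllegalArgumentException("No service " + api.getName());
        }
        Object proxy = Proxy.newProxyInstance(api.getClassLoader(), new Class<?>[] {api},
            (self, method, args) -> {
                Callable<Object> call = () -> method.invoke(target, args);
                try {
                    // A fiber would be suspended here; this sketch blocks the thread instead.
                    return thread.submit(call).get();
                } catch (ExecutionException e) {
                    Throwable cause = e.getCause();
                    throw cause instanceof InvocationTargetException ? cause.getCause() : cause;
                }
            });
        return api.cast(proxy);
    }
}
```

From the caller's point of view, directory.lookup(NameReservationService.class).reserve("name") reads like an ordinary method call, which is the simplification the RPC approach is after.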

You can learn more about our implementation of fibers from the lecture by Sergey Zagursky ( www.youtube.com/watch?v=YWLHELcvNbE ).

Serialization

For the message system and remote procedure calls to work, we need a client-server protocol and a way to serialize and deserialize objects. Let me remind you that we need to send commands and data from the client to the server, i.e. from C++ to Java and back. To do this, we use Java classes to generate copies of them in C++, and we also generate methods for serializing and deserializing objects to and from a byte stream. The serialization code is embedded directly in the classes and accesses their fields directly. Thus, the server does not spend processor time walking over classes with reflection. We generate all this with a self-written plug-in for IntelliJ IDEA. The intra-server protocol for communication between services is completely analogous to the client-server protocol.

When serializing any class into a byte stream, first the class id is written, then the field data of this class. On the other side, the id is read, the corresponding class is selected, and a special constructor is called from it, which restores the class from the byte stream.
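
The "class id first, then fields" scheme can be sketched as follows. In the article this code is generated by an IDE plug-in. Here it is written by hand, and the names WireObject, WireReader, MoveCommand and WireCodec as well as the concrete id value are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Every wire object knows its class id and how to write its own fields. */
interface WireObject {
    int classId();

    void writeFields(DataOutputStream out) throws IOException;
}

/** Restores an object from the stream; stands in for the "special constructor". */
interface WireReader {
    WireObject read(DataInputStream in) throws IOException;
}

/** Hypothetical command with two fields. */
class MoveCommand implements WireObject {
    static final int CLASS_ID = 1;
    final float x;
    final float y;

    MoveCommand(float x, float y) {
        this.x = x;
        this.y = y;
    }

    /** Deserializing constructor: reads fields in the order they were written. */
    MoveCommand(DataInputStream in) throws IOException {
        this.x = in.readFloat();
        this.y = in.readFloat();
    }

    @Override
    public int classId() {
        return CLASS_ID;
    }

    /** Direct field access, no reflection involved. */
    @Override
    public void writeFields(DataOutputStream out) throws IOException {
        out.writeFloat(x);
        out.writeFloat(y);
    }
}

class WireCodec {
    private final Map<Integer, WireReader> readers = new HashMap<>();

    void register(int classId, WireReader reader) {
        readers.put(classId, reader);
    }

    /** Writes the class id, then the field data. */
    byte[] encode(WireObject object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(object.classId());
            object.writeFields(out);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the class id, picks the matching reader and lets it restore the object. */
    WireObject decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int classId = in.readInt();
            WireReader reader = readers.get(classId);
            if (reader == null) {
                throw new IllegalArgumentException("Unknown class id " + classId);
            }
            return reader.read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A class is registered with codec.register(MoveCommand.CLASS_ID, MoveCommand::new). A C++ counterpart would need to read and write the same fields in the same order and byte order.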

Game mechanics

The service that will probably interest you most is the game mechanics service. This is where all the code directly related to the game is executed, where the entire game world is simulated, where fireballs fly and caravans get robbed.

Maps and load balancing

On the game mechanics servers, maps are created; this is where the players and mobs are and where all the fun happens. Each map has a limit on the number of players that can be on it. For example, the limit may be one for personal adventures, 10–30 for group activities and 250 for large maps. What happens if another player wants to enter a map when the limit is reached? Then another copy of the same map is created. Players on different copies do not see each other and do not interfere with each other. That is, there may be thousands of people in any game city, but it will never be overcrowded. This way of organizing players is called “channels”.

Map creation is the responsibility of the central map balancer service, which distributes maps among the game mechanics services depending on population, load and other magical reasons, trying to keep the load evenly distributed and the density of players normal so that they do not get bored.
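
The channel mechanism and the balancer's choice of server might be sketched like this. The names MechanicsServer, MapInstance and MapBalancer are hypothetical, and "load" is assumed to be simply the number of players on a server. The real balancer takes more factors into account, as noted above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** A game mechanics server that hosts map copies. */
class MechanicsServer {
    final String name;
    final List<MapInstance> maps = new ArrayList<>();

    MechanicsServer(String name) {
        this.name = name;
    }

    /** Assumption: load is just the total number of players hosted here. */
    int load() {
        int total = 0;
        for (MapInstance map : maps) {
            total += map.players;
        }
        return total;
    }
}

/** One running copy ("channel") of a map. */
class MapInstance {
    final String mapName;
    final int channel;
    final int playerLimit;
    final MechanicsServer host;
    int players = 0;

    MapInstance(String mapName, int channel, int playerLimit, MechanicsServer host) {
        this.mapName = mapName;
        this.channel = channel;
        this.playerLimit = playerLimit;
        this.host = host;
    }

    boolean hasRoom() {
        return players < playerLimit;
    }
}

class MapBalancer {
    private final List<MechanicsServer> servers;
    private final List<MapInstance> instances = new ArrayList<>();

    MapBalancer(List<MechanicsServer> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("At least one server is required");
        }
        this.servers = servers;
    }

    /** Puts a player on an existing copy with free room, or creates a new channel. */
    MapInstance enter(String mapName, int playerLimit) {
        int copies = 0;
        for (MapInstance instance : instances) {
            if (!instance.mapName.equals(mapName)) {
                continue;
            }
            copies++;
            if (instance.hasRoom()) {
                instance.players++;
                return instance;
            }
        }
        // Every copy is full (or none exists): create one on the least loaded server.
        MechanicsServer host =
            Collections.min(servers, Comparator.comparingInt(MechanicsServer::load));
        MapInstance created = new MapInstance(mapName, copies + 1, playerLimit, host);
        created.players = 1;
        host.maps.add(created);
        instances.add(created);
        return created;
    }
}
```

With a limit of two players per map, the third player entering the same city lands in channel 2, hosted on whichever server currently has fewer players.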

Each game mechanics server loads information about the walkability map, collisions and other similar things. When a player or mob tries to move to some point, the server calculates whether the player can get there or whether he is trying to cheat and walk through a wall. When a player tries to throw a fireball at an enemy, the server uses the same information to calculate whether the player can see the enemy and whether there are obstacles in the way.

Avatars and Mobs

An avatar is a character controlled by a player; a mob is a monster that a player kills. These are very different, yet in many ways very similar entities. Both a mob and an avatar can walk around the map, they have health, they can use spells, etc. Only the avatar is controlled by the player, while the mob has its own brain. In addition, the maps contain a lot of chests, plants and other interactive entities. Very often you need to implement some piece of functionality and attach it to different entities. For this we use the component approach, assembling game entities from a set of functional pieces. Let me explain with an example. Suppose both a player and a mob have a health indicator. In this case, we design “health” as a separate Java class in which we describe how health behaves: how it can decrease, how it recovers, what timers it has, and so on. Then we simply put all such components into a special HashMap inside the entity and take them from there as needed. We have hundreds of such components, and half of the game mechanics is built on them.
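
A minimal sketch of this component approach, assuming components are keyed by their class (the article only says they are kept in a HashMap). The names Component, Health and Entity are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Marker interface for a reusable piece of entity functionality. */
interface Component {
}

/** "Health" described once and attached to avatars, mobs or anything else. */
class Health implements Component {
    private final int max;
    private int current;

    Health(int max) {
        this.max = max;
        this.current = max;
    }

    void damage(int amount) {
        current = Math.max(0, current - amount);
    }

    void heal(int amount) {
        current = Math.min(max, current + amount);
    }

    int current() {
        return current;
    }

    boolean isDead() {
        return current == 0;
    }
}

/** An entity is little more than a bag of components. */
class Entity {
    private final Map<Class<? extends Component>, Component> components = new HashMap<>();

    /** Attaches a component, keyed by its concrete class. */
    <T extends Component> Entity add(T component) {
        components.put(component.getClass(), component);
        return this;
    }

    /** Returns the component of the given type, or null if the entity lacks it. */
    <T extends Component> T get(Class<T> type) {
        return type.cast(components.get(type));
    }

    boolean has(Class<? extends Component> type) {
        return components.containsKey(type);
    }
}
```

An avatar and a mob would both be created as new Entity().add(new Health(100)), while a chest might get no Health component at all.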

Since the server application is very complex, errors are inevitable. We need to make sure that an error, even an unhandled NullPointerException, does not bring the server down. You can simply log the error and move on, but if the error occurs in the middle of some long action on an avatar, the avatar may be left in a broken, inconsistent state. Here a concept called “locale” comes to our aid. A locale is a context within which objects can refer to each other. Objects from one locale cannot refer to objects from another. If an unhandled exception escapes from a locale, the locale is deleted entirely. Avatars, mobs and other entities are locales: they are deleted as a whole and cannot hold references to other avatars and mobs. Therefore, all interaction between avatars and mobs goes through the message system, even though they are on the same machine and in theory could hold direct references to each other.
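
The "delete the whole locale on an unhandled exception" rule could be sketched as below. GameLocale and LocaleContainer are hypothetical names, and the sketch assumes each locale is driven by a tick() call. The article does not describe how locales are actually processed.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** A self-contained context (an avatar, a mob, ...) that is processed as a unit. */
interface GameLocale {
    void tick();
}

class LocaleContainer {
    private final Map<Long, GameLocale> locales = new LinkedHashMap<>();

    void add(long id, GameLocale locale) {
        locales.put(id, locale);
    }

    boolean contains(long id) {
        return locales.containsKey(id);
    }

    /**
     * Ticks every locale. A locale that throws is logged and removed entirely,
     * so a half-updated object never stays in the world. Returns how many were removed.
     */
    int tickAll() {
        int removed = 0;
        Iterator<Map.Entry<Long, GameLocale>> it = locales.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, GameLocale> entry = it.next();
            try {
                entry.getValue().tick();
            } catch (RuntimeException e) {
                System.err.println("Locale " + entry.getKey() + " removed: " + e);
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

Because no other locale holds a reference to the removed one, dropping it from the container is enough. Anything addressed to it afterwards is just an undeliverable message.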

Replication

The game world needs to be simulated not only on the server, but also partially on the client. For example, the client needs to see other players and mobs that are near it. For this, a client-server replication mechanism is used, in which updates about the surrounding game world are sent from the server to clients. This is done with the help of a code generator that embeds the sending of updates into the setters of server Java objects. A circle of a certain radius is defined around the player, and if someone, for example another avatar, enters this circle, it starts being replicated to the client. There is a fundamental problem with replication. If N avatars crowd together in one place, then N replicas have to be sent to each of them. Thus, a quadratic dependence arises, which limits the number of avatars that can gather in one place. It is because of this fundamental quadratic growth that all MMO clients slow down in capital cities. We avoid this problem by limiting the number of players on a map and distributing them across channels.
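
The quadratic growth is easy to see in a small sketch. Avatar and ReplicationScope are hypothetical names. The code only counts how many replicas one round of updates would produce and does not send anything.

```java
import java.util.ArrayList;
import java.util.List;

/** Just an id and a position on the map. */
class Avatar {
    final long id;
    final double x;
    final double y;

    Avatar(long id, double x, double y) {
        this.id = id;
        this.x = x;
        this.y = y;
    }
}

class ReplicationScope {
    /** Everyone except the observer who is inside the circle around the observer. */
    static List<Avatar> visibleTo(Avatar observer, List<Avatar> all, double radius) {
        List<Avatar> result = new ArrayList<>();
        for (Avatar other : all) {
            if (other == observer) {
                continue;
            }
            double dx = other.x - observer.x;
            double dy = other.y - observer.y;
            if (dx * dx + dy * dy <= radius * radius) {
                result.add(other);
            }
        }
        return result;
    }

    /** Total number of replicas that one round of updates sends to all clients. */
    static int replicasPerUpdate(List<Avatar> all, double radius) {
        int total = 0;
        for (Avatar observer : all) {
            total += visibleTo(observer, all, radius).size();
        }
        return total;
    }
}
```

For a crowd standing in one spot the count is N × (N − 1): 10 avatars produce 90 replicas per round and 20 avatars produce 380. Splitting those 20 avatars into two channels of 10 brings the total down to 180.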

Resource system

The game has hundreds and thousands of spells, items, quests and other similar entities. As you can probably guess, programmers do not write all those hundreds of quests; game designers do. A programmer develops one Java quest class, and the descriptions of all quests with their logic, tasks, and texts are kept in XML files called resources. When the server starts, we load these resources and build from them the Java objects that describe the world. These objects can then be used by the server. Roughly the same system exists on the client side, only there the resources are not loaded from XML files; instead a pre-built “piece of memory” is loaded, containing all the necessary objects and the links between them. Our server has hundreds of thousands of resource files, yet loading them on the server takes about two minutes. On the client, everything is loaded in seconds. The system is quite sophisticated and supports features such as prototypes and inheritance, nested descriptors, etc. On top of the resource system, we have built specialized programs for editing maps and other game entities.
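
As a rough illustration of "one Java class, many XML descriptions", here is a sketch that loads quest descriptions with the standard JDK DOM parser. The XML layout (a quest element with id, title, targetMob and killCount attributes) and the names QuestResource and ResourceLoader are invented for this example. The real resource format is not shown in the article.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** One quest class written by a programmer; the data comes from designers. */
class QuestResource {
    final String id;
    final String title;
    final String targetMob;
    final int killCount;

    QuestResource(String id, String title, String targetMob, int killCount) {
        this.id = id;
        this.title = title;
        this.targetMob = targetMob;
        this.killCount = killCount;
    }
}

class ResourceLoader {
    /** Parses every quest element of the document into a QuestResource. */
    static List<QuestResource> loadQuests(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = document.getElementsByTagName("quest");
            List<QuestResource> quests = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element element = (Element) nodes.item(i);
                quests.add(new QuestResource(
                    element.getAttribute("id"),
                    element.getAttribute("title"),
                    element.getAttribute("targetMob"),
                    Integer.parseInt(element.getAttribute("killCount"))));
            }
            return quests;
        } catch (Exception e) {
            // Parser configuration, I/O, malformed XML or a bad number all end up here.
            throw new IllegalArgumentException("Bad resource XML", e);
        }
    }
}
```

Adding a new quest then means adding one more XML element, with no changes to the Java code.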

Server in action

Let's now look at a few scenarios that show how the whole system works in action.

Kill the Dog

The classic test that we always run when we have changed the infrastructure significantly and want to check that everything works is called “Kill the Dog”. You need to log in to the server with a client and kill some mob there. This test covers almost all the main parts of the server and serves as an excellent example that ties together everything described above. Let's go step by step through what happens when the unfortunate dog is killed. Of course, some steps are simplified, but this is not critical for understanding.

The client sends a message to the account server: “I want to enter the game.”
The account server queries the database, performs authorization and asks the balancer for the map on which the player was last.
The balancer selects a map from those already loaded or creates a new one on the least loaded game mechanics server.
The client connects to the game mechanics server where the map was created for him. While he is connecting, his avatar is loaded.
The server begins replicating all the objects around the avatar to the client. The client draws a beautiful picture and sends the player's commands to the server.
The player starts running around the map, and the server moves him around the world and replicates changes in the surrounding reality. The player finds a mob and presses the “hit” button.
The “hit” command arrives at the server, the server checks that the strike is possible, and a damage message is sent to the mob.
The “damage” command is executed on the mob: it calculates resists and similar things, then takes the “health” component and deducts a certain amount.
The client is sent a response confirming the damage, and the client draws the blow.

Scaling

Now let's look from the other side and see how the server behaves under load.

0 clients. If there is no one on the server, it can be launched as a single application with minimal settings and without maps. There is no activity on the server, and most of the time it is idle.
1 client. For one client, you have to create a map, mobs and server objects, which start consuming memory and processor time just to exist.
500 clients. 500 clients are usually already enough that the processor time of a single machine is no longer sufficient for the server. You have to run the realm on several machines or on more powerful servers.
10,000 clients. 10,000 clients already require multiple servers. Since most of the load falls on game mechanics, you need to run the realm with additional game mechanics services.
100,000 clients. With 100,000 simultaneous players, more than half of the servers are busy with game mechanics.
More clients than hardware. If there are suddenly more players and the hardware runs out, we have to restrict entry to the game until new servers are brought up. For this there is a login queue, which makes players wait until the server is ready to accept them. This queue guarantees that a realm never contains more players at once than we are ready to handle. Players may also be put into the queue if, because of a bug or for any other reason, the server starts running slower than a certain threshold. It is better to provide acceptable service to a limited number of clients than to go down for everyone.
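
The login queue from the last item might be sketched as follows. LoginQueue is a hypothetical name, and measuring "slower than a certain threshold" by the duration of the server frame is an assumption. The article does not say which metric is used.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class LoginQueue {
    private final int capacity;
    private final long maxFrameMillis;
    private final Queue<String> waiting = new ArrayDeque<>();
    private int online = 0;

    LoginQueue(int capacity, long maxFrameMillis) {
        this.capacity = capacity;
        this.maxFrameMillis = maxFrameMillis;
    }

    /** The realm accepts players only if it has room and is not running too slowly. */
    private boolean canAccept(long currentFrameMillis) {
        return online < capacity && currentFrameMillis <= maxFrameMillis;
    }

    /** Returns true if the player got in at once, false if he was put into the queue. */
    boolean tryEnter(String player, long currentFrameMillis) {
        if (waiting.isEmpty() && canAccept(currentFrameMillis)) {
            online++;
            return true;
        }
        waiting.add(player);
        return false;
    }

    /** Called when a player leaves the game and frees a slot. */
    void leave() {
        if (online > 0) {
            online--;
        }
    }

    /** Lets the next waiting player in, or returns null if nobody can be admitted now. */
    String admitNext(long currentFrameMillis) {
        if (waiting.isEmpty() || !canAccept(currentFrameMillis)) {
            return null;
        }
        online++;
        return waiting.poll();
    }

    int online() {
        return online;
    }
}
```

The important property is that online never exceeds capacity, and that a slow server stops admitting players even when there are free slots.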

Conclusion

I hope our experience will help you understand how modern game servers work, and create your own, if it comes to that.
To better understand other aspects of game development, I would recommend you read the articles of my colleagues.