
Technologies that improve the resiliency of VPS

Recently, we decided to go beyond the budget server segment, revise our approach to hosting virtual machines, and build the most fault-tolerant service we can.
In this article I will describe how our standard VPS platform is organized and what techniques we used to improve it.

How our standard VDS platform works
Currently, our virtual server hosting is organized as follows:

1U servers of roughly the same configuration are installed in racks:

One server acts as the master: VMmanager is installed on it, and the other servers are attached to it as nodes.

Client virtual servers run on the master server alongside VMmanager.
Each server faces the Internet through its own network interface, and to speed up VDS migration between nodes, the servers are also interconnected via separate dedicated interfaces.

(Fig. 1. The current scheme of hosting virtual servers)

All servers operate independently of each other. If one of them develops performance problems, its virtual servers can be redistributed (the "Migration" function in VMmanager) across neighboring nodes or moved to a newly added node.
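The redistribution step can be sketched as a simple least-loaded placement. This is an illustrative model, not VMmanager's actual algorithm; the node names and RAM figures below are hypothetical.

```python
# Hypothetical sketch of redistributing VMs from a failing node to its
# neighbors, always picking the node with the most free RAM.
# VMmanager's real "Migration" logic is more involved; this is illustrative.

def redistribute(vms, nodes):
    """vms: {vm_name: ram_mb}; nodes: {node_name: free_ram_mb}.
    Returns {vm_name: node_name}, or raises if a VM does not fit anywhere."""
    placement = {}
    # Place the largest VMs first to reduce fragmentation.
    for vm, ram in sorted(vms.items(), key=lambda kv: -kv[1]):
        target = max(nodes, key=nodes.get)  # node with the most free RAM
        if nodes[target] < ram:
            raise RuntimeError(f"no node can host {vm} ({ram} MB)")
        nodes[target] -= ram
        placement[vm] = target
    return placement

placement = redistribute(
    {"vm1": 4096, "vm2": 2048, "vm3": 2048},
    {"node2": 6144, "node3": 4096},
)
print(placement)  # {'vm1': 'node2', 'vm2': 'node3', 'vm3': 'node2'}
```

The greedy "largest first, most free space first" heuristic is a common simplification of the bin-packing problem that any such scheduler ultimately solves.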

When a server fails outright (kernel panic, drives dropping out of the array, a dead power supply, etc.), its client virtual machines become unavailable. Of course, the monitoring system immediately notifies the on-call specialists, who start diagnosing and fixing the problem. In 90% of cases, replacing the failed components takes less than an hour, plus additional time to clean up after the emergency shutdown (storage synchronization, file system errors, and so on).

All of this is, of course, unpleasant for us and our customers, but the simple scheme lets us avoid unnecessary expenses and keep prices low.

New Cloud VDS

To satisfy the most demanding customers, for whom server uptime is critical, we created a service with the highest reliability we could achieve.

This required new software and hardware.

Since we already work with ISPsystem products, the logical step was to look at VMmanager-Cloud. This panel was designed specifically to solve the fault-tolerance problem; by now it is mature and reasonably stable. It suited our needs, so we did not consider alternatives.

Ceph was adopted unconditionally as the distributed storage system. It is a free, actively developed product, flexible and scalable. We tried other storage systems, but Ceph was the only one that fully satisfied our requirements. It seemed complicated at first, but after a few attempts we finally figured it out, and we have not regretted it.
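One practical consequence of replicated storage is worth spelling out: with Ceph's common 3x replication, usable capacity is roughly a third of raw capacity, minus headroom kept free for recovery. The disk counts below are hypothetical, purely for illustration.

```python
# Rough usable-capacity estimate for a replicated Ceph pool.
# The numbers (10 OSDs of 4 TB each) are hypothetical, for illustration only.

def usable_capacity_tb(osd_count, osd_size_tb, replication=3, fill_ratio=0.8):
    """Raw capacity divided by the replication factor, with headroom:
    Ceph clusters are normally kept well below full so that recovery
    after an OSD failure has room to re-replicate the lost copies."""
    raw = osd_count * osd_size_tb
    return raw / replication * fill_ratio

print(usable_capacity_tb(10, 4))  # 40 TB raw -> roughly 10.7 TB usable
```

This overhead is part of why the cost of the new cluster came out higher than the old scheme, where each VM's disk lived on a single node.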

The nodes of the new cluster are built on the same hardware as the working VMmanager cluster, with a few changes:
We switched to multi-node chassis with redundant power supplies.
For the interconnect between cluster nodes, instead of the usual gigabit links we used InfiniBand, which raises the link speed to 56 Gbit/s (Mellanox Technologies MT27500 Family ConnectX-3 IB cards and a Mellanox SX6012 switch).
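The gain from InfiniBand is easy to quantify with a back-of-the-envelope estimate of how long it takes to push a VM's RAM image over the interconnect during live migration. The 16 GiB VM size is a hypothetical example.

```python
# Back-of-the-envelope live-migration transfer time: moving a VM's RAM
# image over the interconnect, ignoring protocol overhead and dirty-page
# re-transfers (both change the real figure in practice).

def transfer_seconds(ram_gib, link_gbit):
    bits = ram_gib * 8 * 2**30       # RAM size in bits (GiB -> bits)
    return bits / (link_gbit * 1e9)  # link speed in bits per second

for link in (1, 56):  # gigabit Ethernet vs. 56 Gbit/s InfiniBand
    print(f"{link:>2} Gbit/s: {transfer_seconds(16, link):.1f} s")
```

Even in this idealized model, a 16 GiB VM takes over two minutes to move across a gigabit link versus a few seconds over InfiniBand, which matters when an entire node's worth of VMs must be evacuated.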

CentOS 7 was chosen as the operating system for the cluster nodes. However, to make all of the above work together, we had to build a custom kernel, rebuild qemu, and request several improvements to VMmanager-Cloud.


(Fig. 2. New scheme of cloud hosting of virtual servers)

The benefits of the new technology

As a result, we got the following:


The cluster has been running in production since early December of last year and currently serves several hundred clients. In that time we hit plenty of snags, dealt with bottlenecks, performed the necessary tuning, and modeled all the abnormal situations.
While testing continues, our economists are calculating the cost. Because of the extra redundancy and the use of more expensive technologies, it came out higher than for the previous cluster. We have taken this into account and are developing a new tariff for the most demanding customers.

There are some risks we cannot eliminate at all: the data center's power supply and the external communication channels. Such problems are usually solved with geographically distributed clusters; perhaps that will be one of our next projects.

If you are interested in the technical details of the implementation described above, we are ready to share them in the comments or to write a separate article following the discussion.

Source: https://habr.com/ru/post/250207/
