In mid-2014, we decided that we needed to migrate our public virtual machine rental service (hereinafter the VPS service) off the OpenQRM platform, which had been chosen at the time without a proper analysis of customer needs and met neither our manageability requirements nor our idea of how such a system should behave (I must say the OpenQRM developers took a rather strange approach, assembling the product from a pile of bash scripts, PHP code, and assorted workarounds). In short, our users were unhappy, the service was mediocre and caused more damage than profit. It should be noted that our subsidiary, which provides the carrier services, is a small regional company; we were not planning a large VPS service at that point, and the main task was a transition to a stable and reliable product that would meet the following requirements:
- easy deployment and configuration for the needs of a VPS service;
- production readiness and a reasonably wide user base;
- straightforward error diagnosis;
- a convenient user interface;
- an API for managing virtual machines.
We did not plan a large infrastructure: at that time we expected to use 512-1024 GB of RAM, 128-256 Xeon E5-2670 cores, 10-20 TB of storage, and 200+ virtual machines. The service offered virtual machines with directly assigned public IPv4 addresses; IPv6 support was not discussed. We settled on KVM as the virtualization technology and classic NFSv3 for storage.
We conducted a comparative analysis (read: tried to deploy each of them by following the manuals) of several products - Apache CloudStack, OpenStack, Eucalyptus - and chose Apache CloudStack (hereinafter ACS) to provide the service. We did not consider using a system without an API. It is quite difficult to reconstruct the selection process retrospectively; I can only note that we had a functioning ACS infrastructure within 1-2 days. At the time that was ACS version 4.3, which we still use in this cloud (upgrading to current versions makes no sense, because the infrastructure is stable, responds adequately to adding and replacing its various parts, and allows us to meet the needs of users). At the time of writing, ACS 4.10 is being prepared for release, and it does not include that many changes that add new functionality. A small digression is needed here: ACS provides a large number of different services, and the final choice among them determines the resulting cloud - with or without load balancing, with NAT or direct IP assignment, with or without external security gateways, and so on. As a result, within some combinations of deployment topology, hypervisor, storage, and network topology there are almost no changes between releases 4.3 and 4.10, while within other topologies the changes can be significant.
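Since the presence of an API was a hard requirement, a short illustration may be helpful. Below is a minimal sketch of issuing a signed ACS API call (listVirtualMachines) in Python; the endpoint URL and the API/secret keys are placeholders, and error handling is omitted.

```python
# A minimal sketch of calling the CloudStack API with request signing.
# The endpoint and keys are placeholders; adjust them to your management server.
import base64
import hashlib
import hmac
import urllib.parse
import urllib.request

ENDPOINT = "http://acs-management.example.com:8080/client/api"  # placeholder
API_KEY = "YOUR_API_KEY"        # issued per account in the ACS UI
SECRET_KEY = "YOUR_SECRET_KEY"  # issued together with the API key

def signed_request(command: str, **params) -> bytes:
    """Build and send a signed GET request, as described in the CloudStack API docs."""
    params.update({"command": command, "response": "json", "apikey": API_KEY})
    # The signature is an HMAC-SHA1 over the sorted, lower-cased query string.
    query = "&".join(
        f"{k}={urllib.parse.quote(str(v), safe='')}" for k, v in sorted(params.items())
    )
    digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(), hashlib.sha1).digest()
    signature = urllib.parse.quote(base64.b64encode(digest).decode(), safe="")
    with urllib.request.urlopen(f"{ENDPOINT}?{query}&signature={signature}") as resp:
        return resp.read()

if __name__ == "__main__":
    print(signed_request("listVirtualMachines").decode())
```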
We use the simplest deployment option - a public cloud with a shared address space and no special network services (in ACS terms, basic zones without security groups). Within this deployment model it is rather difficult to invent anything new, so in ACS 4.10 we are only waiting for IPv6 support.
The point is that ACS is most often used to provide integrated virtual services and develops faster in that direction (the so-called advanced zones), so IPv6 support has existed for advanced zones for a long time and is only now appearing for basic zones. If the cloud is going to be used to provide services to large B2B customers, or simply as a private cloud inside an organization, you need to look at which capabilities are required; it may well turn out that between releases 4.3 and 4.10 the set of those capabilities has changed significantly. Within our business we currently see no way to offer such services to regional customers (more precisely, they are not ready to buy them, or we are not able to sell them), so ACS with basic zones is our everything.
So, how did the operation of our infrastructure go over these three years, and what difficulties did we run into? It is quite possible that if you follow the notes described here, operating such a cloud can be almost painless. What we learned about ACS over three years is described below.
Availability
Let's start with uptime: we have servers with uptime of more than a year, and in three years we have not detected any instability that would send ACS into a tailspin. The overwhelming majority of outages were caused by power supply problems. Over the whole period of operation we paid compensation for violating the 99.9% SLA once.
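As a point of reference, a 99.9% availability target leaves a rather small downtime budget; the trivial calculation below (purely illustrative, not tied to how our SLA is actually measured) shows how small.

```python
# Downtime budget implied by a 99.9% availability SLA.
SLA = 0.999

hours_per_year = 365 * 24
minutes_per_month = 30 * 24 * 60

print(f"per year:  {(1 - SLA) * hours_per_year:.2f} hours")       # ~8.76 hours
print(f"per month: {(1 - SLA) * minutes_per_month:.1f} minutes")  # ~43.2 minutes
```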
Virtual router
The worst, most complex, and most opaque component of ACS is the virtual router. Its role is to provide DHCP, forward and reverse DNS zones, routing, load balancing, static NAT, injection of passwords and ssh keys into VM templates (cloud-init), and user data. In our cloud it is used only for DHCP, forward and reverse DNS zones, password and ssh-key injection (cloud-init), and user data. The component can be made fault-tolerant, but within our deployment that makes little sense, since ACS automatically brings it back up after a failure, which does not affect the functionality of running machines.
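To illustrate the cloud-init part of this list: a guest VM obtains its ssh keys and user data over HTTP from the virtual router. Below is a minimal sketch of such a request, assuming the usual metadata paths; the router address is a placeholder and would normally be taken from the DHCP lease.

```python
# A minimal sketch of pulling metadata and user data served by the virtual router
# from inside a guest VM. The router address is a placeholder: in practice it is
# usually the DHCP server address from the guest's lease.
import urllib.request

VR_ADDRESS = "10.1.1.1"  # placeholder: virtual router / DHCP server IP

def fetch(path: str) -> str:
    with urllib.request.urlopen(f"http://{VR_ADDRESS}/latest/{path}", timeout=5) as resp:
        return resp.read().decode().strip()

if __name__ == "__main__":
    print("instance-id:", fetch("meta-data/instance-id"))
    print("public-keys:", fetch("meta-data/public-keys"))
    print("user-data:  ", fetch("user-data"))
```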
If we used advanced zones with non-trivial network functions, the virtual router would play a critical role. In general, the virtual router in ACS 4.3 has a number of problems, some of which survived up to 4.9, and 4.10 should finally bring changes that resolve them. The first problem we discovered was with the DHCP server in Debian: it failed to hand out DHCP information because of a bug that is described, for example, here. Next, we had problems with log rotation, which caused the virtual router's file system to overflow, after which it stopped working. As a result, we made a significant number of changes to the virtual machine itself and fixed its scripts, possibly breaking compatibility with other functions, but we made it work as it should. Currently we reboot this component once every 1-2 months, because the cloud is at the final stage of its life cycle, when making further changes has no practical point. It is worth noting that large infrastructures with tens of thousands of VMs run into other virtual router problems, for example the one described here. I have not yet checked whether that problem is solved in 4.10, but the committers' enthusiasm for solving it seems high (in the Cosmic fork it has definitely been solved already). It is also worth noting that instead of the Debian-based virtual router you can use Juniper SRX or Citrix NetScaler. There is currently an initiative to implement the virtual router on top of VyOS (I think it will not get implemented, since there is no serious player behind it who needs this solution).
The script that programs iptables and ebtables rules on the virtualization host
When a virtual machine starts, the ACS agent running on the host configures iptables and ebtables rules that restrict the network capabilities of the virtual machine (changing the MAC address, assigning foreign IP addresses, running rogue DHCP servers). For reasons unknown, in ACS 4.3 this script did not work correctly: rules were lost and traffic to the machines stopped flowing. It should be noted that in our current ACS 4.9.2 test cloud this problem is solved and causes no inconvenience. In the end, we rewrote the Python script and made it work correctly. Regarding this problem, we suspect that we "broke" ACS ourselves because of our experimental deployment mode, which is why this behavior was observed; it is possible that with a deliberate, careful configuration the problem would not have manifested itself.
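To give an idea of what this script does, here is a simplified sketch of the kind of anti-spoofing rules programmed per VM NIC on a KVM host. This is not the actual CloudStack script, just an illustration; the interface name, MAC, and IP address are hypothetical placeholders.

```python
# A simplified sketch of per-NIC anti-spoofing rules on a KVM host.
# NOT the actual CloudStack security-group script; values are placeholders.
import subprocess

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def lock_down_vm_nic(vif: str, mac: str, ip: str) -> None:
    # Drop L2 frames whose source MAC differs from the one assigned to the VM.
    run(["ebtables", "-A", "FORWARD", "-i", vif, "-s", "!", mac, "-j", "DROP"])
    # Drop ARP packets advertising an IP address that was not assigned to the VM.
    run(["ebtables", "-A", "FORWARD", "-i", vif, "-p", "ARP",
         "--arp-ip-src", "!", ip, "-j", "DROP"])
    # Block rogue DHCP servers: no DHCP offers may originate from the guest.
    run(["iptables", "-A", "FORWARD", "-m", "physdev", "--physdev-in", vif,
         "-p", "udp", "--sport", "67", "--dport", "68", "-j", "DROP"])

if __name__ == "__main__":
    lock_down_vm_nic("vnet0", "02:00:11:22:33:44", "192.0.2.10")
```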
Multiple primary NFS storages for one cluster
This is just a heuristic rule that we eventually came to follow: do not use multiple primary storages within a single cluster (a cluster is a hierarchical ACS entity that combines multiple virtualization hosts and storages and is a way of isolating failure domains). While we used several storages within a cluster, the stability of our cloud was lower than after we merged them all into one. Currently the entire cloud uses one large server with software RAID6 on Samsung Pro 850 SSDs and regular backups.
ACS User Self-Service Interface
The ACS interface is quite conservative and administrator-oriented; for a user who has not previously worked with full-featured VM administration tools it is frankly intimidating, and it takes substantial work to document its functions and explain how to perform various tasks. In this sense, the interfaces provided by AWS, DO, and other leading VPS providers offer the user a much better UX. As a result, from time to time the support service has to spend quite a long time on the phone explaining to a user how to perform some non-trivial operation (for example, how to create a template from a running VM).
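For reference, one possible way to perform that particular operation programmatically is sketched below, assuming the signed_request helper from the API example above; the IDs are placeholders and the polling of asynchronous jobs is omitted for brevity.

```python
# A sketch of the "template from a running VM" flow via the API, reusing the
# signed_request helper from the earlier example. IDs are placeholders.
import json

def template_from_running_vm(vm_id: str, template_name: str, os_type_id: str) -> None:
    # 1. Find the VM's ROOT volume.
    volumes = json.loads(signed_request("listVolumes",
                                        virtualmachineid=vm_id, type="ROOT"))
    root_volume_id = volumes["listvolumesresponse"]["volume"][0]["id"]

    # 2. Snapshot the ROOT volume (an async job; poll it before the next step).
    snap = json.loads(signed_request("createSnapshot", volumeid=root_volume_id))
    snapshot_id = snap["createsnapshotresponse"]["id"]

    # 3. Build a template from that snapshot (also an async job).
    signed_request("createTemplate",
                   name=template_name,
                   displaytext=template_name,
                   ostypeid=os_type_id,
                   snapshotid=snapshot_id)

template_from_running_vm("vm-uuid", "my-template", "os-type-uuid")
```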
Instead of a conclusion
It should be noted that, after three years of operation, these are all the problems we can identify as important and as having a significant impact on the quality of service. Of course, there were other, less significant problems, incidents, and situations that required administrator intervention.
We are currently planning to deploy a new cloud with 288 Xeon E5-2670 cores, 1536 GB of RAM, and 40 TB of SSD storage using ACS 4.10 (basic zones with security groups). To provide our users with a better service, we have also initiated the creation of an alternative interface specifically for this deployment. It is being developed as an open product, CloudStack-UI, and takes into account the experience we have gained operating the current cloud.