In one of our previous posts, we described the architecture of our disk storage. The article received a lot of feedback, and that gave us the idea of describing the entire current architecture of our cloud.
We have repeatedly talked about individual components at professional conferences. But, firstly, not everyone has the opportunity to attend them, and secondly, our architecture is dynamic, constantly evolving and being extended, so much of that information is no longer up to date.
Broadly, the current architecture of the Scalaxi platform can be divided into three major parts:
- virtualization pools;
- network system;
- management cluster.
Virtualization pool
A virtualization pool is a limited set of physical servers that run virtual machines, plus a system that stores the virtual machines' disk images, joined by a common high-performance communication bus.
The pool's components are:
- virtualization system;
- block-level storage system;
- Infiniband bus.
The overall architecture is shown in the diagram:

Virtualization system
The virtualization system consists of diskless virtualization servers running the open-source Xen 4.0.1 hypervisor.

Virtualization server configuration
- Intel Xeon 55xx/56xx CPU;
- 68-96 GB of RAM;
- no local drives;
- Infiniband adapter;
- Ethernet adapter (for power management and emergency access via IPMI 2.0).
The virtualization server boots over the Infiniband network from the management cluster. During boot, a Xen control domain running SUSE Linux Enterprise is created. Then, at the request of the management cluster, client paravirtualized and HVM domains (virtual machines) running Linux and Windows are created on the virtualization server.
Each virtual machine has:
- 16 virtual CPU cores;
- the specified amount of allocated RAM;
- internal and external network interfaces, for communication with other client machines of the same account and with the Internet, respectively (IP addresses for both interfaces are issued by a DHCP server on the management cluster);
- primary and secondary disks (block devices).
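For illustration only, here is a minimal sketch of what a paravirtualized domain configuration for such a machine could look like (Xen domU configuration files use Python-style syntax; the names, bridges, and volume paths below are hypothetical and not taken from the actual platform):

```python
# Hypothetical Xen domU config for a client virtual machine (illustrative only).
name       = "client-vm-0001"          # assumed naming scheme
vcpus      = 16                        # 16 virtual CPU cores, as described above
memory     = 8192                      # allocated RAM in MB (example value)
bootloader = "/usr/bin/pygrub"         # boot the guest kernel from its disk

# Internal and external interfaces; MACs and bridge names are placeholders,
# real IP addresses are issued by the DHCP server on the management cluster.
vif = ["mac=00:16:3e:00:00:01,bridge=br-int",
       "mac=00:16:3e:00:00:02,bridge=br-ext"]

# Primary and secondary disks: LVM logical volumes carved from the
# multipath device (the volume group name is an assumption).
disk = ["phy:/dev/vg_pool/client-vm-0001-primary,xvda,w",
        "phy:/dev/vg_pool/client-vm-0001-secondary,xvdb,w"]
```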
Depending on the settings passed by the management cluster, the following limits are configured:
- CPU usage, via the Xen credit scheduler (sched-credit); the weight of each client domain equals the number of GB of RAM allocated to it;
- internal and external network bandwidth, via the Linux tc utility;
- disk bandwidth, via the Linux blkio-throttle mechanism.
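A minimal sketch of how such limits could be applied on a virtualization server (this is not the platform's actual code; domain names, interface names, and the per-domain blkio cgroup layout are assumptions):

```python
# Illustrative sketch of applying the per-domain limits described above.
import subprocess

def apply_limits(domain, ram_gb, net_rate_mbit, disk_bps, vif, blk_dev):
    # CPU: the Xen credit scheduler weight equals the number of GB of allocated RAM.
    subprocess.check_call(["xm", "sched-credit", "-d", domain, "-w", str(ram_gb)])

    # Network: shape the domain's virtual interface with tc (token bucket filter).
    subprocess.check_call(["tc", "qdisc", "add", "dev", vif, "root", "tbf",
                           "rate", f"{net_rate_mbit}mbit",
                           "burst", "32kbit", "latency", "400ms"])

    # Disk: throttle read/write bandwidth via blkio-throttle, assuming a
    # blkio cgroup exists per domain; blk_dev is the backing device's "major:minor".
    for knob in ("blkio.throttle.read_bps_device", "blkio.throttle.write_bps_device"):
        with open(f"/sys/fs/cgroup/blkio/{domain}/{knob}", "w") as f:
            f.write(f"{blk_dev} {disk_bps}\n")
```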
Block Access Storage System
The block-level storage system consists of two types of servers: proxies and storage nodes. More details, with photos, can be found in the previous post.
For block-level access we use SRP, an implementation of the SCSI protocol that runs over Infiniband. As in other networked SCSI implementations, SRP has targets and initiators (servers and clients). Targets expose SCSI LUNs (logical units) to the initiators.
SRP initiators and the multipathd daemon run on the virtualization servers. The latter aggregates identical LUNs presented by different proxy servers into a single virtual block device, providing fault tolerance. If one of the proxy servers fails, multipathd switches the path to another proxy server, so the virtual machines on the virtualization servers do not notice the failure.
The device created by multipathd is divided into logical volumes within a single LVM volume group. The resulting block devices are passed to the virtual machines, which see them as disks. To change the size of a virtual machine's disk, it is enough to resize the corresponding logical volume in the LVM volume group, which is a very simple operation.
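For example, growing a client disk comes down to a single LVM call; a hedged sketch (the volume group and volume names are made up):

```python
# Illustrative sketch: resizing a virtual machine disk by resizing its
# LVM logical volume. Device names are hypothetical.
import subprocess

def resize_vm_disk(vg, lv, new_size_gb):
    # Set the logical volume to the requested size; the guest then sees
    # a larger block device and can grow its filesystem.
    subprocess.check_call(["lvresize", "-L", f"{new_size_gb}G", f"/dev/{vg}/{lv}"])

resize_vm_disk("vg_pool", "client-vm-0001-primary", 50)
```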
Storage nodes are servers with disks running SUSE Linux Enterprise. They expose SCSI targets using the SCST driver. In place of our own nodes, any storage system that supports the SRP, FC, or iSCSI protocols (NetApp, EMC, and others) can be used.
Proxy servers handle data replication and combine the storage nodes' space into a single LVM volume group. Replication between storage nodes is performed with the Linux md driver: a RAID 1+0 array with a given redundancy level is assembled from the LUNs of several storage nodes. By default the redundancy is 2x (each virtual machine disk is stored on two storage nodes).
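A minimal sketch of this step, assuming hypothetical LUN paths and names (not the proxies' actual code): assemble a RAID 10 array from the nodes' LUNs with mdadm, then add it to the pool's LVM volume group.

```python
# Illustrative sketch: building a RAID 1+0 array from storage node LUNs
# and extending the LVM volume group with it. All names are hypothetical.
import subprocess

def build_mirrored_stripe(luns, md_name="/dev/md0", vg="vg_pool"):
    # RAID 10 with the default near-2 layout gives 2x redundancy:
    # every block is stored on LUNs of two different storage nodes.
    subprocess.check_call(["mdadm", "--create", md_name, "--level=10",
                           f"--raid-devices={len(luns)}", *luns])
    # Make the array available to LVM and extend the pool's volume group.
    subprocess.check_call(["pvcreate", md_name])
    subprocess.check_call(["vgextend", vg, md_name])

build_mirrored_stripe(["/dev/mapper/node1-lun0", "/dev/mapper/node2-lun0",
                       "/dev/mapper/node3-lun0", "/dev/mapper/node4-lun0"])
```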
Storage node configuration
- Intel Xeon 55xx CPU;
- 96 GB of RAM;
- 36 SAS2 600 GB disks;
- Infiniband adapter;
- Ethernet adapter (for power management and emergency access via IPMI 2.0).
Proxy server configuration
- Intel Xeon 56xx CPU;
- 4 GB of RAM;
- no local drives;
- 2 x Infiniband adapters (one for connections to the storage nodes, the other for connections to the virtualization servers);
- Ethernet adapter (for power management and emergency access via IPMI 2.0).
Infiniband bus
The Infiniband bus consists of two main elements: Infiniband switches and Ethernet gateways.

The pools use 324-port Grid Director 4700 Infiniband switches. Currently each switch is redundant at the module level (it has a fully passive backplane and a modular architecture, so operation is not interrupted if a module fails). As the platform grows, Infiniband switches will also be made redundant at the chassis level.
An Ethernet gateway is a server running SUSE Linux Enterprise with an Infiniband adapter and an Ethernet adapter. The gateways are redundant.
Network system
The network system consists of two main elements: switches and multiservice gateways.
Juniper EX8208 switches route IP traffic. The switches are redundant.
Juniper SRX3600 multiservice gateways protect the system from unwanted traffic by recognizing various types of attacks with a signature library. The multiservice gateways are redundant. Photos of the equipment can be viewed as well.
Management cluster
The main elements of the management cluster are its storage system, its nodes, and the system services running on those nodes.

The management cluster storage system consists of two HP MSA2312sa hardware RAID arrays, each with two controllers. Each array is connected to four management cluster nodes over SAS. Data is duplicated between the two arrays at the service level.
The nodes of the management cluster are servers running SUSE Linux Enterprise and the Xen 4.0.1 hypervisor. Each management cluster service is one or more virtual machines with the corresponding software.
Services
The management system includes the following system services:
- a virtual resource management system (CloudEngine), an application developed with Ruby on Rails;
- a virtual resource control panel, a web application developed with Ruby on Rails;
- a Zabbix-based system for monitoring the status and load of virtual resources;
- an LDAP-based user rights management system;
- DHCP servers that issue IP addresses to virtual machines and cluster servers;
- DNS servers for domain name resolution;
- a console service that provides access to the VNC consoles of virtual machines;
- a TFTP server for network-booting cluster server images;
- an HTTP server for downloading virtual machine template images.
Each service runs in two or more instances; data is kept consistent between them using MySQL database replication.
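As a hedged illustration (not the platform's actual tooling; host and credentials are placeholders), a standby instance could verify that its MySQL replica is in sync like this:

```python
# Illustrative sketch: checking MySQL replication health on a standby
# service instance. Connection parameters are placeholders.
import subprocess

def replica_is_healthy(host="127.0.0.1", user="monitor", password="secret"):
    out = subprocess.check_output(
        ["mysql", "-h", host, "-u", user, f"-p{password}",
         "-e", "SHOW SLAVE STATUS\\G"], text=True)
    # Both the I/O and SQL replication threads must be running.
    return "Slave_IO_Running: Yes" in out and "Slave_SQL_Running: Yes" in out
```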
All of this is documented on our wiki, which also has a FAQ and a description of our API with examples.