Last week we put the second Clodo cloud segment into commercial operation, in the KIAEHOUSE data center located on the grounds of the Kurchatov Institute. We spent a long time testing and debugging it. The main goal we set ourselves for the second segment was to do our homework on past mistakes: to eliminate the shortcomings that led to the incidents which you, dear readers of our Habrablog, never tire of reminding us about under our posts.
As you know, hardly any outage or downtime happens without the human factor being involved. People, however professional they are, make mistakes. We have therefore tried to minimize the human factor and automate all of the cloud's routine processes. In addition, we set ourselves a number of stability-related goals:
- When new nodes are connected, or existing nodes fail and are later brought back into service, they must be configured without human intervention.
- All critical systems must be clustered. When a node of a clustered resource (for example, a database) fails, the resource must keep running on the remaining nodes, and the failed node must come back up automatically.
- The architecture must be as portable as possible (we want to use it in two DCs and be able to expand it with minimal overhead).
- If connectivity between the DCs is lost, or one DC fails entirely, the client must not lose the ability to work with resources in the other DC(s).
The main workhorse in removing humans from the loop is Chef, described on Habr many times. Chef is a configuration management tool responsible for making the software architecture reproducible. Chef has the following entities:
- Chef server
- Chef client
- Chef Solo, a standalone client that works without a server.
Ohai gathers data about a node (its attributes), and Knife is the command-line tool for interacting with the Chef Server API. A tiny illustrative recipe and a couple of Knife commands follow below.
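For readers who have not worked with Chef: a recipe describes the desired state of a node, and the client converges the node to that state on every run. The snippet below is a minimal sketch of our own devising; the nginx cookbook and template name are assumptions for illustration, not part of our actual cookbooks.

    # Minimal illustrative Chef recipe: install nginx, manage its config
    # file from a template, and keep the service enabled and running.
    package "nginx"

    template "/etc/nginx/nginx.conf" do
      source "nginx.conf.erb"
      owner  "root"
      group  "root"
      mode   "0644"
    end

    service "nginx" do
      action [:enable, :start]
    end

Knife then handles day-to-day interaction with the server, for example knife cookbook upload nginx to publish a cookbook or knife node list to see the registered nodes.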
The boot sequence of our cluster is:
- Chef Solo runs on the cluster controller. It has a single task: to install and configure the Chef Server.
- Hardware nodes boot from a Debian Live image with Chef Solo preinstalled. Chef Solo configures the Chef Client and assigns the node a role (a storage node, a Xen node, a relay, or a web server).
- The Chef Client, talking to the Chef Server, configures the hardware node according to its role (a sketch of what a role definition looks like follows below).
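For illustration, a role in Chef can be declared like this; the role name, recipe names and attribute below are hypothetical and do not reflect our actual run lists.

    # roles/xen_node.rb -- hypothetical example of a node role
    name "xen_node"
    description "Hardware node that hosts Xen virtual machines"
    run_list(
      "recipe[networking]",        # assumed cookbook names,
      "recipe[xen]",               # used here only for illustration
      "recipe[monitoring::client]"
    )
    default_attributes(
      "xen" => { "dom0_mem" => "2048M" }
    )

A freshly booted node only needs the right role in its run list; on every chef-client run it is converged back to whatever that role prescribes, which is what lets failed nodes re-enter the cluster without manual configuration.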
Storage nodes, web servers and databases are clustered.
Pacemaker with the Corosync communication layer is used as the cluster resource manager; a rough sketch of a Pacemaker resource definition follows below.
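As a sketch of what a Pacemaker-managed resource looks like (the resource names, IP address and timeouts are invented for this example and are not our production configuration), a clustered database with a floating IP could be declared through the crm shell roughly like so:

    # Hypothetical crm shell snippet: a MySQL instance plus a virtual IP
    # that Pacemaker restarts on a surviving node if the current one fails.
    crm configure primitive p_vip ocf:heartbeat:IPaddr2 \
        params ip="192.0.2.50" cidr_netmask="24" \
        op monitor interval="10s"
    crm configure primitive p_mysql ocf:heartbeat:mysql \
        op monitor interval="30s" timeout="30s" \
        op start timeout="120s" \
        op stop timeout="120s"
    # A group keeps the IP and the database together and starts them in order.
    crm configure group g_db p_vip p_mysql

Corosync takes care of membership and messaging between the nodes, so when a failed node rejoins the cluster, Pacemaker sees it again and can place resources on it.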
Our control panel deserves a separate mention. Along with major external changes, the panel's internals have changed radically: it is now a client of our API. Physically, there is a set of API servers in each DC (kh.clodo.ru for KIAEHOUSE and oversun.clodo.ru for Oversan-Mercury), plus api.clodo.ru hidden behind DNS round robin (a schematic example is shown below). The panel is an API client, and all management of servers and storage from the panel goes through the API. For servers located in a given DC, the panel talks to that data center's API. If connectivity between the DCs is lost, the panel keeps working, and if a really serious accident happens in one data center, machines in the second DC can be managed as before.
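To make the scheme concrete: DNS round robin here simply means that api.clodo.ru resolves to API front ends in both data centers, while the per-DC names each point at a single data center. The addresses below are placeholders from documentation ranges, not our real IPs.

    ; Hypothetical zone fragment illustrating DNS round robin for the API
    api.clodo.ru.       300  IN  A  192.0.2.10      ; API front end in KIAEHOUSE
    api.clodo.ru.       300  IN  A  198.51.100.10   ; API front end in Oversun-Mercury
    kh.clodo.ru.        300  IN  A  192.0.2.10
    oversun.clodo.ru.   300  IN  A  198.51.100.10

A client that resolves api.clodo.ru gets both addresses in rotating order, whereas the per-DC names always lead to one data center, which is what the panel relies on when connectivity between the DCs is broken.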
We are now busy reworking the monitoring of all nodes; soon we will have a completely new monitoring system. At the moment we are taking a close look at Icinga, a very interesting tool that, unfortunately, has not been covered on Habr yet. We have a number of serious requirements for monitoring. We pay particular attention to the precision of alerts: when a node that several systems depend on fails, we want to be alerted specifically about the faulty node rather than drown in alerts about everything downstream (a small example of how this is expressed in Icinga is given below). In addition, monitoring must work independently of connectivity between the DCs, be clustered, be able to distribute the load across the cluster nodes, and keep a close watch on itself.
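Classic, Nagios-compatible Icinga configuration expresses exactly this kind of alert discipline through host parents: when a parent is down, its children are flagged UNREACHABLE instead of DOWN, and notifications about them can be filtered out. The host names, addresses and the generic-host template below are assumptions made for the sake of the example, not our real configuration.

    # Hypothetical Icinga host objects showing a parent/child relationship
    # used to suppress cascading alerts.
    define host{
        use         generic-host
        host_name   core-switch-kh
        address     192.0.2.1
    }

    define host{
        use         generic-host
        host_name   storage-node-01
        address     192.0.2.21
        parents     core-switch-kh   ; reported UNREACHABLE, not DOWN, if the switch dies
    }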
Of course, there are many more changes, but we cannot cover them all in one short write-up. We still have plenty of work ahead: so that a fault-tolerant solution can be built on segments in two DCs, we will definitely add a load-balancer rental service. Still, it seems to us that with the commissioning of the segment in the second data center we have completed an important part of our homework on mistakes. In the meantime, we invite Habr's audience to choose what we should cover in more detail in the next post.
The poll will be in the next post.

Important P.S. Anyone who wants to test the updated Clodo can write to info@clodo.ru. After asking you a couple of questions, we will offer a promo code or other suitable testing conditions.