Missing building blocks in OpenStack at the enterprise level: Part 1 - High Availability

Author: Dmitry Novakovsky

Now is a great time to be a company participating in the OpenStack initiative: you get most of the input you need for marketing and product management simply by talking with customers and partners every day. Still, competition in this area is fierce, so it is important for the community and for individual vendors to build their feature backlog and set priorities carefully, with a clear understanding of who wants what. At the risk of playing “Captain Obvious”, I’ll still say that the needs of the Enterprise are very different from those of a service provider, a government, or an IT department operating on the web.

In this post (and in the next few), I will share my thoughts on the functionality (the “building blocks”) that is still missing from OpenStack but is necessary for the platform to be used successfully at the Enterprise level. You will also find out whether work is currently under way to close each gap and, if so, what solutions exist.

Missing building block number 1: high availability / resiliency at the enterprise level

High availability (HA): for the Enterprise, these are probably the two most important letters in the virtualization / cloud context. In a nutshell, HA means that if a virtual machine (VM) fails for any reason, for example because the operating system crashes or the entire hypervisor node goes down, the data center / cloud management platform “brings it back to life” in the shortest possible time. This can be done through a quick restart on the same hypervisor host or an evacuation to another hypervisor host. The “extreme” mode for VIP virtual machines is “fault tolerance”: two VMs run on different hypervisors with CPU / memory state mirroring, so that at least one VM always remains operational and can take over in the event of a disaster.

Why does the Enterprise need high availability support?

Historically, the success of vSphere at the Enterprise level has largely been based on treating existing applications as “pets”. Such applications have been developed over many years, run on bare metal, and are kept in working order by dedicated teams. Applications of this type are usually not cloud-ready. They have practically no built-in intelligent failure handling, but they successfully serve the needs of the business, and the budget for their development is planned for many years ahead.

In addition to consolidating them onto fewer physical servers, vSphere improves the “quality of life” of these applications by helping them recover from failures without requiring them to be “virtualization / cloud aware”. To succeed, the OpenStack platform must be able to do the same.

What about high availability in OpenStack?

The good news is that the “bits” needed for high availability are already there, so building a complete “availability-as-a-service” for OpenStack requires less effort than one might expect.

OpenStack supports several shared and distributed storage systems that are suitable for live migration / evacuation (at Mirantis, Ceph is our favorite), and Nova even implements the “nova evacuate” command, which makes a series of API calls to evacuate a VM to another hypervisor host.
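As a rough illustration of what invoking that command might look like from a script, here is a minimal Python sketch that only builds the command line and does not run it. The helper name, the server and host names, and the use of the legacy `--on-shared-storage` flag are assumptions for illustration; the exact options depend on the novaclient and API version in use.

```python
from typing import List, Optional


def build_evacuate_command(server: str,
                           target_host: Optional[str] = None,
                           shared_storage: bool = True) -> List[str]:
    """Build the argv for a `nova evacuate` call without executing it."""
    cmd = ["nova", "evacuate"]
    if shared_storage:
        # Assumption: legacy novaclient flag; newer API versions detect
        # shared storage on their own and no longer accept it.
        cmd.append("--on-shared-storage")
    cmd.append(server)
    if target_host:
        # If no host is given, the scheduler is left to pick one.
        cmd.append(target_host)
    return cmd


if __name__ == "__main__":
    # Hypothetical VM and hypervisor names.
    print(" ".join(build_evacuate_command("pet-db-01", "compute-02")))
```

Keeping the command construction separate from its execution makes it easy to log or review what would be run before handing it to something like `subprocess.run`.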

What is missing is the control + monitoring component (and, of course, a beautiful user interface and powerful PR). Some process still has to watch each HA-enabled VM closely at several levels (reachability of the hypervisor, health of nova-compute, whether the VM responds to ping, etc.) and, once it decides “that's it, it's dead”, trigger an evacuation through Nova. In addition, of course, such a system must verify that the evacuation succeeded.
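The control loop described above could be sketched roughly as follows. This is not an existing OpenStack component: the class, the probe names, the failure threshold, and the `evacuate` / `is_running` hooks are all hypothetical and stand in for real checks and real Nova calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# One probe per monitoring level; each returns True when that level is healthy.
Probe = Callable[[str], bool]


@dataclass
class HaMonitor:
    probes: Dict[str, Probe]           # e.g. hypervisor, nova-compute, VM ping
    evacuate: Callable[[str], None]    # hypothetical hook that would call Nova
    is_running: Callable[[str], bool]  # hypothetical post-evacuation check
    threshold: int = 3                 # failed rounds in a row before "it's dead"
    failures: Dict[str, int] = field(default_factory=dict)

    def check(self, vm: str) -> str:
        """Run one monitoring round for a VM and return its resulting state."""
        failed = [name for name, probe in self.probes.items() if not probe(vm)]
        if not failed:
            self.failures[vm] = 0
            return "healthy"
        self.failures[vm] = self.failures.get(vm, 0) + 1
        if self.failures[vm] < self.threshold:
            return "suspect"  # wait for more evidence before acting
        self.evacuate(vm)
        self.failures[vm] = 0
        # Evacuation is only useful if the VM actually came back.
        return "recovered" if self.is_running(vm) else "evacuation-failed"


if __name__ == "__main__":
    monitor = HaMonitor(
        probes={"vm_ping": lambda vm: False},  # stub: the VM never answers
        evacuate=lambda vm: print("evacuating", vm),
        is_running=lambda vm: True,            # stub: pretend recovery worked
    )
    for _ in range(3):
        print(monitor.check("pet-db-01"))
```

Requiring several consecutive failed rounds is one simple way to avoid evacuating a VM over a single dropped ping; a production system would also need fencing of the failed host, which this sketch leaves out.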

The bad news is that the OpenStack community has been (and to some extent still is) inconsistent about the direction OpenStack should take on application availability. Fortunately, the last Atlanta Summit strengthened the view that “conquering the Enterprise” is necessary and, while keeping respect for the original OpenStack principles of “DevOps / cloud-ready applications”, many vocal community members support the idea of “creating a service that would use the Nova API to monitor other services or all VMs and automatically perform certain actions, such as launching another instance from the last snapshot of the volume, creating additional copies, etc.”

The most unpleasant (or perhaps just unfortunate) part is that, while the community was working out a consistent position, some potential customers considering OpenStack may have received the wrong message and concluded: “OpenStack will never care about high availability beyond its own controller infrastructure.” I wonder if we still have time to regain the trust of these people.

And now comes the moment of truth: who will write the code, and when will it turn into useful functionality?

Temporary solution

Some may argue that a possible solution is to configure Nagios or Zabbix to poll “pet” class virtual machines intensively, together with scripts that trigger evacuation. This may work in some odd “do-it-yourself” environment, but I think it is too cumbersome to manage at the enterprise level. Do not forget that IT is often still a cost center in the Enterprise, so we need to make the work of IT staff easier, not harder. You could also consider using Heat as a state machine and Ceilometer to raise the alarms, but there are currently no suitable success stories to talk about.

The real compromise in this case is to deploy OpenStack with both the KVM and vSphere hypervisors (provided that the enterprise already has the necessary vSphere licenses). OpenStack can handle self-service / multi-tenancy / orchestration and host cloud-ready applications on KVM, while vSphere does what it does best: hosting pet-class applications and keeping them content with virtualization that feels “like bare metal”.

Fortunately, VMware has invested heavily in the development of the vCenter driver for Nova and, as Kenneth Hui explained in a series of excellent posts, HA, DRS and vMotion keep working even under OpenStack. You can easily try this setup yourself: see our latest posts on how to use Mirantis OpenStack to build your first OpenStack + vSphere deployment.

What other functionality, in your opinion, does OpenStack need in order to succeed at the Enterprise level?

PS: Do not forget that high availability support is included in vSphere starting with the Essentials Plus Kit, the second least expensive VMware offering after the ESXi-only Essentials Kit, but to use it you will also need a vCenter license.

Original article in English.

Source: https://habr.com/ru/post/237493/
