How we got our bank's infrastructure systems to work together using ManageIQ

A couple of years ago, the main trends were automation, DevOps practices and faster delivery of value to the market. Home Credit Bank decided to keep up and set a course for technological development, all the more so because the murmur of users tired of waiting several days for new resources for their important projects was growing louder in the open-plan office.

We decided to start with the process of approving requests across departments, which, as in many large companies, took time and effort. As the first task, we chose the creation of a virtual machine regardless of the virtualization environment. While drawing up the list of tasks, we realized that we would need to integrate with other systems used in the bank's infrastructure, for example via their APIs.

The most suitable solution turned out to be ManageIQ. Red Hat acquired this project in 2012 and built the commercial Red Hat CloudForms product on top of it. ManageIQ itself remained open source and is developed in parallel with CloudForms.

ManageIQ is written in Ruby and supports a large number of virtualization, public cloud and container providers. At the moment, Home Credit runs the Gaprindashvili release in a high-availability configuration.

How the process has changed

Previously, each team made separate settings within its own area of responsibility. After preliminary preparation, all the data was collected and sent to an administrator, who deployed and configured the virtual machine. Then someone had to inform, for example, the monitoring team that a new host had appeared and needed to be added to monitoring. Communication delays, specialists' workload and human error could stretch this process to several days.

Having fit the whole process into ManageIQ, we got the following results:

Virtual resource type	Before introducing ManageIQ	After introducing ManageIQ
Linux virtual machine in VMware / oVirt	Up to one working week	~10 minutes
Rancher environment on virtual machines	Up to one working week	~15 minutes
Windows virtual machine in VMware	Up to one working week	~25 minutes

The time difference arises because, in the second case, additional time is needed to prepare the host for Docker and to download and integrate images for the infrastructure containers from Artifactory, since at this stage there is still no access to Docker Hub. For Windows, the difference has two causes. First, creating a Linux VM without customization takes approximately 2 minutes, while a Windows VM takes 6 minutes. Second, customizing Windows itself takes about 10 minutes, versus 2 minutes for Linux.

Ten minutes is not that fast, considering that only about 2-3 minutes go directly into creating the VM. In the remaining time, ManageIQ does the following:

1. The system collects the parameters the user entered in the order form and breaks them down into variables.
2. A new change request is created in the incident management system, showing data about the new resource.
3. The resource naming system, queried by ManageIQ, returns a name for the new resource.
4. The IP address management system issues a new address based on the entered parameters.
5. A new DNS record is registered on the local DNS server.
6. Based on the parameters, environment and resource load, the virtualization type and the cluster for placement are selected.
7. The virtual machine is created with the specified parameters.
8. Once the virtual machine has been deployed from the template, scripts are run to apply the final settings:
   - expanding the disk to the specified size,
   - generating a new root password, changing it on the Linux host and writing it to the password manager,
   - creating a YAML configuration file for Puppet in GitLab,
   - running runbooks that apply the necessary settings and updates on Windows VMs, or
   - launching Puppet, which updates and configures Linux machines.
9. The change request created in step 2 is closed, and fresh data such as the IP address and host name is added to it.
10. A new item is registered in the configuration management database (CMDB).
11. The virtual machine is registered in Zabbix and added to monitoring.
12. The customer and other interested parties receive an email with information about the new unit created using ManageIQ.
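
The sequence above can be pictured as an ordered pipeline in which every step reads the shared request state and adds its own results to it. The following Ruby sketch is purely illustrative: it covers only some of the steps, and the step names, the PIPELINE constant and all returned values are hypothetical stand-ins for the real integrations, not the bank's code.

```ruby
# Illustrative only: each step is a lambda that receives the request state
# (a Hash) and returns the keys it adds. All names and values are made up.
PIPELINE = [
  [:collect_parameters,   ->(s)  { { cpu: s[:form][:cpu], os: s[:form][:os] } }],
  [:open_change_request,  ->(_s) { { crq: "CRQ-0001" } }],
  [:acquire_hostname,     ->(_s) { { hostname: "vm-example-01" } }],
  [:acquire_ip,           ->(_s) { { ip: "192.0.2.10" } }],
  [:register_dns,         ->(s)  { { fqdn: "#{s[:hostname]}.example.test" } }],
  [:provision_vm,         ->(_s) { { vm_created: true } }],
  [:close_change_request, ->(s)  { { crq_closed: s[:vm_created] } }],
].freeze

# Runs the steps in order, merging each step's output into the state
# and recording which steps have completed.
def run_pipeline(form)
  state = { form: form, completed: [] }
  PIPELINE.each do |name, step|
    state.merge!(step.call(state))
    state[:completed] << name
  end
  state
end

result = run_pipeline({ cpu: 2, os: "CentOS" })
puts result[:completed].join(" -> ")
```

Because later steps depend on values produced by earlier ones (the DNS record needs the host name, the closing step needs the provisioning result), the order of the list matters as much as the steps themselves.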

What's inside

Let's delve into the technical details of the product. Out of the box, ManageIQ can create a virtual machine from a template. How does this differ from what we do, for example, in vCenter? The correct answer: it doesn't. ManageIQ uses the same methods as the virtualization systems, but does so from a single place. On top of that, you can add your own scripts for anything that does not fit into the standard feature set. So if you have resources in, say, public Azure, in a vCenter deployed on your own hardware, and a Kubernetes cluster running somewhere else, all of this can be conveniently managed from ManageIQ.

In addition to a wide variety of providers for integration, ManageIQ has convenient customization tools. One example is building custom forms tailored to your task:

Thanks to this, it was possible to construct a full-fledged interface for ordering a virtual machine, fitting all the necessary parameters into it:

We select the amount of computing resources and the OS, and fill in all the additional information needed for integration with external systems. Then, using internal mechanisms (more on them a little later), the system chooses where the new resources will be placed: the data center, cluster, host and datastore are selected based on all the entered parameters and on current resource load.
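
The article does not spell out the placement rules, so the following is only a minimal sketch of the idea, assuming that clusters are first filtered by environment and free capacity and that the one with the most free memory wins. The Cluster structure, its fields and the selection rule are all hypothetical.

```ruby
# Hypothetical inventory record: the fields and the "most free memory" rule
# are assumptions for illustration, not ManageIQ's or the bank's real logic.
Cluster = Struct.new(:name, :environment, :free_cpu, :free_ram_gb)

# Returns the best-fitting cluster, or nil if nothing has enough capacity.
def pick_cluster(clusters, environment:, cpu:, ram_gb:)
  candidates = clusters.select do |c|
    c.environment == environment && c.free_cpu >= cpu && c.free_ram_gb >= ram_gb
  end
  candidates.max_by(&:free_ram_gb)
end

clusters = [
  Cluster.new("prod-a", "production", 16, 64),
  Cluster.new("prod-b", "production", 32, 128),
  Cluster.new("test-a", "test", 8, 32),
]
chosen = pick_cluster(clusters, environment: "production", cpu: 2, ram_gb: 4)
puts chosen ? chosen.name : "no cluster has enough capacity"
```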

Keep in mind that people may order too many resources, or not what they actually need. This is where the system of requests and approvals comes into play:

Any resources ordered by a user must be approved by a responsible person. At Home Credit, a group of architects does this.

Automation structure

If you decompose all the automation processes in ManageIQ into small parts, you will notice a certain structure.

Automate Domain

The Datastore hosts all the domains that ManageIQ has.

By default, there is a ManageIQ domain that is locked and is something like a reference model. If you need to make changes, another domain is created, into which elements from the ManageIQ domain are copied and changed for your own tasks.
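
The copy-and-override idea can be sketched as a lookup over domains in priority order: a copy in a custom domain placed above the locked ManageIQ domain is found first and therefore wins. This is a simplified model rather than ManageIQ's own resolver; the "Bank" domain name and the instance paths are hypothetical.

```ruby
# Domains in priority order, highest first. "Bank" and the instance paths
# are hypothetical; only the locked "ManageIQ" domain comes from the text.
DOMAINS = [
  { name: "Bank",     instances: ["Infrastructure/VM/Provisioning/acquire_ip"] },
  { name: "ManageIQ", instances: ["Infrastructure/VM/Provisioning/acquire_ip",
                                  "Infrastructure/VM/Provisioning/email_owner"] },
].freeze

# Returns the fully qualified path from the first domain that contains
# the instance, or nil if no domain has it.
def resolve(path, domains = DOMAINS)
  owner = domains.find { |d| d[:instances].include?(path) }
  owner && "/#{owner[:name]}/#{path}"
end

puts resolve("Infrastructure/VM/Provisioning/acquire_ip")
puts resolve("Infrastructure/VM/Provisioning/email_owner")
```

Only the elements that were actually copied are overridden; everything else still resolves to the reference model.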

Automate Namespace

Inside, the domains are divided into parts that are responsible for individual processes: this may be the section responsible for managing the infrastructure (Infrastructure) or for working with services (Service). We have our own Namespace, which contains everything related to the bank's systems.

Let's look at the structure in more detail using the provisioning process for a new virtual machine as an example. It is described in the Automate Class called VMProvision_VM.

Automate Class

The class has a structure that includes Instances, Methods, Properties and Schema. From the point of view of automation, Schema is of most interest:

The schema resembles a pipeline in CI/CD systems. It describes the steps that will be performed during the automation process.

Automate Instance

The class described above has two Automate Instances. Each of them inherits from the schema the steps for which a Default Value is set. Steps that have null values in the schema are defined in the instance.

In the instance, values appeared for the steps that were empty in the schema description. It is also visible who and when made the last change.
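
The relationship between schema defaults and instance values can be modelled as a simple merge: for every step, the instance value is used if present, otherwise the schema default applies. The sketch below is a simplified model with hypothetical step names and paths, not the actual contents of VMProvision_VM.

```ruby
# Schema: ordered steps, each with an optional default value.
# Step names and paths are hypothetical examples.
SCHEMA = [
  { step: "AcquireIPAddress", default: nil },
  { step: "Provision",        default: "/Infrastructure/VM/Provisioning/Methods/provision" },
  { step: "EmailOwner",       default: nil },
].freeze

# Combines schema defaults with the values set on a particular instance,
# keeping the order defined by the schema.
def effective_steps(schema, instance_values)
  schema.map do |field|
    [field[:step], instance_values[field[:step]] || field[:default]]
  end
end

instance = {
  "AcquireIPAddress" => "/Bank/Methods/acquire_ip",
  "EmailOwner"       => "/Bank/Methods/email_owner",
}
effective_steps(SCHEMA, instance).each { |step, value| puts "#{step}: #{value}" }
```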

Let's see what one of the values represents:

This is an Automate Class called Methods, which has one Automate Instance. Its schema describes the ipam_base_uri attribute and the execute method. The execute method, in turn, calls the Automate Method acquire_ip.

Automate Method

This is a Ruby script that lets the provisioning process talk to other systems via REST API, as is the case with the IPAM address management system. From IPAM we get the address, mask, subnet and VLAN for the VM. The difficulty is that the machine may be deployed in a test or production environment, for applications or for databases. Or the security service may have decided to place it in the PCI DSS perimeter. All this information is collected when the VM is created or passed in the parameters of the called instance (in the screenshot above, the parameter contains the URI through which the method accesses IPAM):

Here is some Ruby code

base_uri = $evm.object['ipam_base_uri']
prov = $evm.root["miq_provision"]
site = prov.get_option(:site)
app = prov.get_option(:dialog_dropdown_list_information_system)
crq = prov.get_option(:crq)
descr = prov.get_option(:dialog_textarea_box_usernotes)
owner = $evm.root['user'].name
scope = prov.get_option(:dialog_dropdown_scope)
environment = prov.get_option(:landscape)

$evm.root is a method that returns the root object, which gives access to everything stored in ManageIQ for the current process: information about the user, the environment, variables, the current request ('miq_request'), and so on. We are interested in the current provisioning process.

Next, we pick up the values we need: get_option(:site) returns a value that was passed at one of the previous stages, while get_option(:dialog_dropdown_list_information_system), for example, returns a value from the form the user fills in when ordering new resources.
All the values obtained are passed as variables in the request body in JSON format:

 # NOTE: env and role are not defined in the excerpt above; they are assumed
 # to be set elsewhere in the method.
 options = {
   verify: false,
   headers: {"Content-Type" => "application/json"},
   body: {
     "site" => "#{site}",
     "env" => "#{env}",
     "app" => "#{app}",
     "scope" => "#{scope}",
     "role" => "#{role}",
     "crq" => "#{crq}",
     "descr" => "#{descr}",
     "owner" => "#{owner}",
   }.to_json,
 }

Using this set of parameters, IPAM will unambiguously determine in which VLAN the virtual machine should be located, and will return the network parameters.
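
The exchange with IPAM can be sketched with the standard library alone. The helper names below are hypothetical, and the response format is an assumption: the article only says that IPAM returns the address, mask, subnet and VLAN, so the key names and the sample reply are invented for illustration.

```ruby
require "json"

# Keys we assume IPAM returns; the real field names are not given in the article.
REQUIRED_KEYS = %w[address mask subnet vlan].freeze

# Builds the JSON request body, skipping parameters that are nil or empty.
def ipam_request_body(params)
  params.reject { |_key, value| value.nil? || value.to_s.empty? }.to_json
end

# Parses the reply and fails loudly if any network parameter is missing,
# so the VM is never created with an incomplete network configuration.
def parse_ipam_response(raw)
  data = JSON.parse(raw)
  missing = REQUIRED_KEYS - data.keys
  raise ArgumentError, "IPAM response is missing: #{missing.join(', ')}" unless missing.empty?
  data
end

puts ipam_request_body({ "site" => "dc1", "env" => "test", "descr" => "" })
sample = '{"address":"192.0.2.10","mask":"255.255.255.0","subnet":"192.0.2.0","vlan":120}'
puts parse_ipam_response(sample)["vlan"]
```

Validating the reply before continuing keeps a malformed IPAM answer from surfacing later as a harder-to-diagnose provisioning failure.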

In addition to obtaining data for the correct VM configuration, ManageIQ can also generate additional information for settings applied at the so-called post-provisioning stage (after the virtual machine has been deployed and started). At Home Credit, we use Puppet to manage Linux host configurations. For each compute unit, we create a YAML file in GitLab with a set of groups:

Some more Ruby code

 options = {
   headers: {"Private-Token" => "#{api_token}", "Content-Type" => "application/json"},
 }
 body = {
   "branch" => "#{branch}",
   "author_email" => "email@your.domain",
   "author_name" => "ManageIQ Bot",
   "content" => "",
   "commit_message" => "New host created by ManageIQ",
 }
 descr = prov.get_option(:long_description)
 # Choose the Puppet groups from the VM description.
 # The string key "content" is used so that the empty placeholder above is overwritten.
 if descr.include?('rancher') && descr.include?('test')
   body["content"] = "---\ngroups:\n - #{yaml_server}\n - rancher\n - user-devops-UDCR"
 elsif descr.include?('rancher')
   body["content"] = "---\ngroups:\n - #{yaml_server}\n - rancher\n"
 else
   body["content"] = "---\ngroups:\n - #{yaml_server}\n - #{$is_id}"
 end

Groups depend on the type of virtual machine, the environment in which it is created, and the information system.
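
As an alternative to concatenating strings, the same group selection can be expressed as data and serialized with Ruby's standard YAML library. This is a sketch of that variant, not the method used in production; the function name and keyword arguments are hypothetical, while the group names follow the code above.

```ruby
require "yaml"

# Builds the Puppet YAML for a host from its description.
# yaml_server and information_system mirror the values used in the article's
# code; the function itself is a hypothetical refactoring.
def puppet_groups(description, yaml_server:, information_system:)
  groups = [yaml_server]
  if description.include?("rancher")
    groups << "rancher"
    groups << "user-devops-UDCR" if description.include?("test")
  else
    groups << information_system
  end
  { "groups" => groups }.to_yaml
end

puts puppet_groups("rancher test host", yaml_server: "linux_base", information_system: "is_42")
```

Letting the library produce the document avoids indentation mistakes, which are easy to make when YAML is assembled by hand.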

After successful completion of the procedure, the user receives an email with information:

The text of the email can also be adjusted by adding the necessary information.
If an error occurs at any of the critical stages of the process, you can add a condition explicitly stating that the process must be aborted. If the error is not fatal, you can likewise indicate that the process may continue despite the problem.
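
In ManageIQ state machines, a method conventionally reports its outcome through $evm.root['ae_result'] ('ok', 'error' or 'retry'). The sketch below shows one way to separate critical from non-critical failures; a plain Hash stands in for $evm.root so that it runs outside ManageIQ, and the run_step helper and its critical flag are hypothetical.

```ruby
# `root` stands in for $evm.root so the sketch runs outside ManageIQ.
# run_step and the `critical` flag are hypothetical helpers.
def run_step(root, critical:)
  yield
  root["ae_result"] = "ok"
rescue StandardError => e
  root["ae_reason"] = e.message
  # A critical failure stops the state machine; a non-critical one is
  # recorded and the process is allowed to continue.
  root["ae_result"] = critical ? "error" : "ok"
end

root = {}
run_step(root, critical: true) { raise "IPAM is unavailable" }
puts "#{root['ae_result']}: #{root['ae_reason']}"
```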

Logging

ManageIQ logs everything that can be tracked. The automation process is written to automation.log. There are also API logs, logs for the various cloud providers and security logs; even the output of the top command is logged.

For each step in the schema, you can configure a log entry for its start and end:

In addition, you can write your messages in the logs:

 $evm.log(:info, "Call job status uri: #{item_uri}/#{job_id}/api/json")

This is very useful when calling other systems through their APIs, to understand why something went wrong, or to track the current status of a lengthy process such as a Jenkins job or an SCCM runbook:

 $evm.log(:info, "acquire_osname --- naming jobStatus: #{jobStatus}")
 break if jobStatus.to_s == "Completed"
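
The two lines above sit inside a polling loop that is not shown. A minimal standalone version of such a loop might look like the sketch below. Everything in it is an assumption: the wait_for_job helper, the "Failed" status and the default limits are hypothetical, and a callable stands in for the real API request and for $evm.log.

```ruby
# fetch_status: any callable returning the current job status.
# log: optional callable used instead of $evm.log, so the sketch runs standalone.
# "Failed" is an assumed terminal status; the article only mentions "Completed".
def wait_for_job(fetch_status, attempts: 30, interval: 10, log: nil)
  attempts.times do |i|
    status = fetch_status.call.to_s
    log&.call("attempt #{i + 1}: jobStatus: #{status}")
    return status if %w[Completed Failed].include?(status)
    sleep(interval)
  end
  raise "job did not finish after #{attempts} attempts"
end

# Usage with a stubbed status source instead of a real Jenkins or runbook call.
statuses = %w[Running Running Completed].each
final = wait_for_job(-> { statuses.next }, interval: 0, log: ->(msg) { puts msg })
puts "final status: #{final}"
```

Bounding the number of attempts matters here: without a limit, a job that never completes would keep the provisioning request hanging indefinitely.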

You can use the standard functions for exceptions to write to the logs:

 raise "VM not specified" if vm.nil?

By default, all logs are stored in the /var/log/manageiq/ directory, but from my own experience I can say that hunting for a problem with tail and grep is not very convenient. Given that ManageIQ writes a lot of different logs, it is worth shipping them to, for example, the ELK stack.

ManageIQ API

In addition to a user-friendly web interface, ManageIQ has a functional API. With it, for example, we solved the problem of dynamically determining the identifier of the template to be specified when creating a VM:

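 require 'httparty'
 require 'json'

 def get_template(vendor, os, ems)
   # Credentials are placeholders here; host is defined outside this excerpt.
   user = '#{user}'
   pass = '#{pass}'
   options = {
     verify: false,
     headers: {"Accept" => "*/*", "accept-encoding" => "gzip, deflate"},
     basic_auth: { username: "#{user}", password: "#{pass}" },
   }
   # Search templates by vendor, OS name (wildcard match) and provider id.
   response = HTTParty.get("#{host}/api/templates?filter[]=vendor=%27#{vendor}%27&filter[]=name=%27%2A#{os}%2A%27&filter[]=ems_id=%27#{ems}%27", options).to_s
   link = JSON.parse(response)
   # Remember the href of the last matching template.
   link["resources"].each do |r|
     $url = r["href"]
   end
   # Fetch the template details and return "id, name" wrapped in an array.
   response = HTTParty.get($url, options).to_s
   details = JSON.parse(response)
   template = ["#{details['id']}, #{details['name']}"]
   return template
 end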

Using a GET request with search filters, we get the desired template.
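
The filter URL in the method above is percent-encoded by hand. As a sketch of an alternative, the same query can be built with Ruby's standard URI module, which handles the escaping. The templates_url helper and the example host are hypothetical; the filter expressions follow the ones used above.

```ruby
require "uri"

# Builds the template search URL with properly escaped filter parameters.
# The helper name and keyword arguments are hypothetical.
def templates_url(host, vendor:, os:, ems:)
  filters = [
    "vendor='#{vendor}'",
    "name='*#{os}*'",
    "ems_id='#{ems}'",
  ]
  query = URI.encode_www_form(filters.map { |f| ["filter[]", f] })
  "#{host}/api/templates?#{query}"
end

puts templates_url("https://miq.example.test", vendor: "vmware", os: "CentOS", ems: 3)
```

Encoding through the library also protects against values that contain spaces, ampersands or quotes, which would break a hand-built query string.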
In addition to solving internal problems, you can create new API methods for use by external systems. At the beginning of the article, the process of ordering a new virtual machine through the web interface was shown. And this is how it looks when done with a POST request:

 curl -X POST \
   http://Manageiq.hostname/api/service_catalogs/4/service_templates/31 \
   -H 'Authorization: Basic Token-Value' \
   -H 'Content-Type: application/json' \
   -d '{
     "action": "order",
     "resource": {
       "radio_button_vcpu": "a_2",
       "radio_button_vram": "a_2",
       "hdd_size": "40",
       "dropdown_os": "CentOS",
       "text_box_filter": "dns",
       "dropdown_list_information_system": "DNS ",
       "text_box_validator": "OK (DNS )",
       "textarea_box_usernotes": " ",
       "dropdown_env": "production",
       "date_control_retirement_dt": "2022-05-21",
       "dropdown_scope": "-"
     }
   }'
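
An external system written in Ruby could prepare the same order with the standard Net::HTTP library. The sketch below only builds the request and leaves sending to the caller; the build_order_request helper, the example host and the shortened resource are hypothetical, while the URL layout and the "order" action come from the curl call above.

```ruby
require "json"
require "net/http"
require "uri"

# Prepares (but does not send) the POST request that orders a service.
# The helper name and its parameters are hypothetical.
def build_order_request(base_url, catalog_id:, template_id:, auth_header:, resource:)
  uri = URI("#{base_url}/api/service_catalogs/#{catalog_id}/service_templates/#{template_id}")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = auth_header
  request["Content-Type"] = "application/json"
  request.body = { action: "order", resource: resource }.to_json
  [uri, request]
end

uri, request = build_order_request(
  "http://miq.example.test",
  catalog_id: 4,
  template_id: 31,
  auth_header: "Basic Token-Value",
  resource: { "dropdown_os" => "CentOS", "hdd_size" => "40" }
)
puts "#{request.method} #{uri}"

# Sending is left to the caller, for example:
# Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") { |http| http.request(request) }
```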

Conclusion

Pros:

Incredible flexibility: ManageIQ not only allows you to customize the automation process as you need it, but also makes it possible to change its visual part by adding additional buttons, fields, etc.
Built-in code editor with syntax highlighting and code validation. I found it a very handy solution when you need to fix something quickly.
A large number of providers that the system can work with. Clouds: Amazon EC2, Google Compute Engine, Azure, OpenStack, VMware vCloud. Infrastructure: Microsoft SCVMM, OpenStack Platform Director, Red Hat Virtualization, VMware vCenter. Containers: Kubernetes, OpenShift.

Cons:

The tool's extensive capabilities also have a downside. Not all documentation is well structured, and sometimes it is hard to figure out where to look for what you need. It is worth noting, however, that the situation is improving: the documentation is being expanded and refined.
Small community. If you run into a very specific problem, you may not be able to quickly google the answer, or may not find it at all.
A point that follows from the previous two. Some basic things, settings and scripts can be found in the documentation or on the Internet, but more specific and narrow questions took a lot of time to study and understand, including by trial and error.

Where we are now:

Because ManageIQ lets you take full advantage of the Ruby language, we were able to integrate it with the following APIs:

Password Manager. It generates a root password that meets the security service's requirements and stores it in its database, and ManageIQ sets it inside the OS;
Service Center Orchestration. Services for managing DNS records and host names;
BMC Remedy. The whole process is recorded as comments on the request. After successful execution, the request is closed;
CMDB. Information about new configuration items is created in the database with all the necessary data;
Zabbix. Depending on the information system and environment they belong to, hosts are added to the corresponding monitoring groups;
Rancher. Creation of new environments, installation of agents and registration of hosts in existing environments;
Jenkins. Jenkins runs jobs that configure VMs in oVirt;
LDAP. Creation of new groups used to control access in Rancher environments and to configure policies in Vault;
Vault. At Home Credit, the integration of this product into banking processes has only just begun, but we have already written methods for creating new groups, policies and storage sections;
Puppet and IPAM were mentioned earlier.

The system's functionality is very extensive; I have become familiar with much of it and keep discovering more as the implementation goes on.
For example, I did not mention that you can create your own dashboards with statistics, configure billing, or add buttons to which individual scripts or entire scenarios can be attached. You can also add your own fields to record additional information about services and virtual machines, and so on.

What Home Credit is aiming for:

An upgrade to the Hammer release, in which the built-in Ansible can be tried in HA mode.
A move from approving each unit of virtual resources to quota management. Teams will be able to get new VMs even faster as long as their quota is not exhausted.
Development of new methods to be offered to external systems, for example various SaaS offerings such as Jenkins, Logstash, etc.
Implementation of new API methods in the existing portal for information system owners. Users will not need to think about how to integrate with a new infrastructure element; they will simply use it as a service to obtain new resources or change existing ones.

At the very end, I would like to remind you that tools are great, but do not forget about the importance of interaction between different teams. The changes described in the article would not have been possible without well-established communication and constant interaction on emerging issues of all interested parties.

Source: https://habr.com/ru/post/461891/
