We have more than 350 software developers and testers across the country, and on top of that we often work with customers' engineers and developers. To put DevOps into practical use, we needed not only to introduce the methodology but also to teach our beloved Russian customers some basic culture. A couple of dialogs to set the scene:
- Why did everything go down?
- Because you ran it on the test stand, tested everything, and then launched it in production. And production has a setting that never made it into the documentation and lived only in the head of the old admin.
Or:
- Why won't it roll out across the country?
- Because you have dozens of different regional installations, each done by hand, and each with different configs. And in a couple of cases the engineer made a mistake.
- Can you fix it by tomorrow? We really need it! Only we won't give you remote access.
- ...! Of course. We have a whole team of highly paid specialists who just love traveling to the Far East. No problem.
Culture, schmulture
- What's going on with the test stand?
- It's busy, there's a demo for the CFO running on it.
- Then where am I supposed to test?
- Well, spin up an identical one for yourself, then roll it out in the private cloud.
- Two hours to build a stand. I just love you guys!
- Wait, don't you have a template for it?
Or here:
A parallel contractor on the project:
- Hello, dear government customer, we've brought you a binary!
We intervene:
- No, no, no. With our dear government customer it's a repository, builds, the whole works. Put your code here, check it in, watch how it deploys...
- Oh! WE DON'T WORK LIKE THAT!
- By the way, how many admins do you have?
- Well, about a hundred, who's counting.
- Why so many?
- The environment is big, and automation hasn't been done yet, it's expensive somehow... Wait a minute!
In general, by basic culture I mean something like the following:
- The infrastructure, from spinning up machines in the cloud to rolling out system software, should exist in the form of code.
- A successful build should travel from the build server through the process automatically, without anything being touched by hand (that is, uniformly, with only the settings controlled in special cases). Pressing the Start button by hand is allowed.
- The movement of all builds along the delivery pipeline, from check-in to automated test runs, should be easy to monitor visually.
The result is saved time and money. And when code comes from contractors, the dear customer won't discover two years later that the source disk gathering dust on the rack has nothing to do with what is running on the server, while the contractor has long since sunk into oblivion.
Instilling this culture
This means pain, snot, tears, and assorted drama. A typical conversation goes roughly like this:
- Yes, yes, cool. I've read the posts on Habr, I know how great it is. Automation is needed.
- So let's implement it?
- Oh no, not us. Our Documentum isn't the latest version.
- So what?
- And our Java is old...
- Look, this pipeline has already been run in on five projects, with exactly your software configuration, including Alfresco and WebSphere.
- Our deadline is looming.
- OK, so when will things stop being on fire?
- Well, it's been burning for two years now, so probably another year or so. Then we'll implement it.
- You do realize you have no free engineers and this could genuinely speed up the project?
- Yeah. So what do you propose?
- Download the ready-made role, hook it up to Ansible, and everything deploys itself. Here's an engineer who will help you set it all up. And teach you, too.
- Well, let me review your proposal at the January meeting...
Until last year, our development projects strictly required only continuous integration (builds on TeamCity/Jenkins, code analysis, unit tests). A full company-wide transition to continuous delivery and continuous deployment was complicated by the variety of infrastructure and technologies in use and by the nuances of working with external customers. So until last year we had no single standard or set of recommendations for the operations engineers (the customers' maintenance teams); each worked with their own stack.
Few people understand how automation pays off. Some believe it is expensive and difficult: "We won't take that risk on this project." But once several teams had run their stacks through it inside CROC, many saw in practice that the time and people saved are far more real than they appear, and from that moment things took off. So if an engineer from such a project receives a task from a manager like "deploy a dozen servers for me by hand", he has every right to suggest automating it instead, and as a manager I will be on his side.
Since we're not in the habit of instilling culture by administrative fiat, we ran a survey, selected tools, and tested them on the trickiest projects with a variety of technology stacks (integration projects on WebSphere, ECM projects on Documentum and Alfresco, custom development in Java and .NET). As a result we got satisfied guys (and girls), reinforced-concrete arguments for the staunch pessimists, and a base of reusable groundwork.
What follows is a review of the tools and how we rolled them out. If you know all this already, skip straight to the bottom; experts are unlikely to find anything new in the "Tools" section, it's more of a primer.
Tools
350 developers work with their own technology stacks, their own zoos, plus there is the Amazon EC2 API, and OpenStack with VMware in the CROC virtual laboratory. At the lower level we also had a pile of scripting tools. But at the top it was impossible to tell what had fallen over where, which build was where, what the pipeline looked like, and what was going on. We needed a visualization tool.
We faced the difficult task of assembling a comprehensive delivery-pipeline automation solution out of free, open-source software that would not be much inferior to the proprietary industrial solutions in this class.
A system of this class consists of three components:
• At the top level, a solution for automating and visualizing the application's delivery pipeline. This is above all a web interface in which the end user can follow every stage of a release, from the initial build of the package from source code through the passing of all tests, restart any stage, and read the logs if a step has fallen over.
• A solution for automating infrastructure deployment. This is essentially an orchestrator whose job is to create the infrastructure for our application: virtual machines at a particular cloud provider. This component is responsible for creating the basic infrastructure; fine-tuning and software configuration are the job of the next component.
• A configuration management solution for setting up the infrastructure once it has been created: installing system software and the additional applications needed for the release, configuring the database and application servers, kicking off various scripts, and installing the release of our application itself.
For each block we analyzed the most relevant and widely used tools that might suit us. Let's look at each component in more detail, starting from the top level.
What we wanted from the tools and what we saw
The main criteria for selecting the tools (besides being open source) were:
• universality of integrations: the tool must support all the main version control systems used on our projects (Git, Mercurial, TFS), be friendly with the Jenkins and TeamCity CI servers, work with the Selenium and JMeter testing tools, and make it easy to wire up open-source infrastructure deployment and configuration management systems;
• the ability to export/import the configured pipeline and its settings to a text config (Pipeline as Code);
• the learning curve: ultimately we would have to train ordinary implementation engineers on the tool, so it should be easy to learn and configure.
We started with Jenkins: as a CI server it was already installed and working. Would it serve us as a CD and pipeline automation solution too? As it turned out, no, despite the huge number of plug-ins and integrations, and here's why:
• the Blue Ocean project does not yet provide an administrative interface: you can watch pretty pictures of a release moving through its stages, but you cannot restart a specific step from that same interface;
• the pipeline itself must be written as source code, and the declarative pipeline language is still raw, which complicates administration and training;
• alongside Jenkins we also run TeamCity, which is more convenient for some projects.
For these reasons, plus considerations of stability (we had a stable, working Jenkins that we didn't want to load up with extra plugins) and separation of duties (after all, we use Jenkins, like TeamCity, as a CI server for actually building software), we continued the search.
Hygieia, a project by developers at the bank Capital One, attracted us above all with its informative user interface, which shows, among other things, statistics on commits and contributors, a summary of the sprint's tasks, and a list of features implemented in the sprint. Nevertheless, despite the comprehensive web UI, the tool turned out to be highly specialized for its creators' own tasks.
Hygieia did not fit our criteria: it lacked integrations we needed, its documentation was thin at the time of our study, and extending and customizing its functionality looked difficult.
Next we turned our attention to Concourse CI, created by the Cloud Foundry community primarily to support projects related to Cloud Foundry. Concourse CI suits very large projects that need to deploy and test several releases per minute, and its interface lets you assess the status of several release branches of an application at once. For our tasks Concourse did not fit: it lacked the flexibility and universality of integrations we needed, and we wanted a more detailed picture of how different release versions of a particular branch move through the pipeline (Concourse's overview shows only the latest versions).
Last, we reviewed GoCD. The project traces its history back to 2007, and in 2014 the source code was opened to the community. GoCD met our criteria for universality of integration: everything we needed was either supported out of the box or easy to configure. GoCD also implements the Pipeline-as-Code paradigm, which is important for DevOps practice.
Among the minuses:
• the nesting of entities is not intuitive at first: it takes a while to sort out all the pipelines, stages, jobs, tasks, and so on (see the sketch after this list);
• agents have to be installed, and scaling and parallel execution require installing additional agents.
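To give a feel for both the Pipeline-as-Code support and that nesting of entities, here is a minimal sketch in the format of GoCD's YAML config plugin; the pipeline name, repository URL, and build command are made up for illustration:

```yaml
format_version: 10
pipelines:
  demo-app:                       # a pipeline...
    group: examples
    materials:
      source:
        git: https://example.com/demo-app.git
        branch: master
    stages:                       # ...contains stages...
      - build:
          jobs:                   # ...which contain jobs...
            compile:
              tasks:              # ...which contain tasks
                - exec:
                    command: make
                    arguments: [build]
```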
With the top level settled, let's consider the remaining two blocks.
The main criterion for choosing an infrastructure deployment tool was support for the heterogeneous environments of the well-known cloud platforms: AWS, OpenStack, and VMware vSphere. We looked at BOSH, Terraform, and Cloudify.
BOSH, like Concourse CI, comes from the Cloud Foundry community. It is a powerful tool with a far-from-simple architecture; among its pluses:
• the ability to monitor services deployed via BOSH;
• recovery of deployed services after failures, both automatically and at the push of a button;
• an object repository for deployable components;
• built-in version control of deployed components.
However, that rich functionality rests on proprietary virtual machine template formats; for this reason, and because of BOSH's overall complexity, it did not suit us.
Terraform is a tool from HashiCorp, well known in the community. Like the company's other free products, it is architecturally a single binary. It is easy to install and easy to learn: you describe the infrastructure needed for a deployment in a declarative format, check what will be deployed with the terraform plan command, and deploy it with terraform apply. For industrial use, the drawback of free Terraform may be the absence of a server component that aggregates logs and monitors the status of running processes (in other words, the absence of centralized management and auditing). The paid Terraform solves this, but we were interested only in free tools, so in some cases we decided to use Cloudify alongside Terraform.
Cloudify is an infrastructure deployment tool created by GigaSpaces. Its distinctive feature is support for the OASIS TOSCA standard in its declarative IaC templates. In functionality Cloudify is close to BOSH: it can monitor the services it has deployed, recover them after failures, and scale a deployed service. Unlike BOSH, Cloudify has an intuitive web interface and deploys infrastructure from the cloud provider's standard VM templates. Its functionality is extended through plug-ins, and you can write your own in Python. It must be said, though, that in the latest version of the product most of the graphical interface has moved to the paid edition.
So for infrastructure deployment we chose Terraform, for its simplicity and the speed with which it can be learned and adopted, and the free version of Cloudify, for its functionality and versatility. We quickly built an OpenStack-compatible API for the virtual lab, which made it friends with Terraform.
That left the configuration management solution. We considered the big four: Ansible, Chef, Puppet, and Salt.
Puppet and Chef are the old-timers of the configuration management market (first released in 2005 and 2009, respectively). Both are written in Ruby, scale well, and have a client-server architecture. For our tasks they fell short mainly because of the entry barrier: we really didn't want to have to teach our implementation engineers Ruby, which the configuration templates are also written in.
Salt is a newer solution, written in Python, with configuration templates in convenient YAML. An interesting feature of Salt is that it can talk to managed machines both with and without agents. Still, Salt's documentation is poorly structured and hard to digest, and we wanted a tool that was even simpler for newcomers.
And such a tool exists: Ansible. It is easy to install, easy to use, and easy to learn. Ansible is also written in Python, and its configuration templates are in YAML. Its architecture is agentless: no clients need to be installed on the managed machines, and interaction happens over SSH / PowerShell. Admittedly, on an infrastructure of a thousand-plus hosts an agentless architecture is not the best choice because of the likely slowdown, but we didn't have that problem: our projects' infrastructures are much smaller, so Ansible suited us perfectly.
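To show the format, here is a minimal playbook sketch; the host group and package are hypothetical examples:

```yaml
# Minimal Ansible playbook; the host group and package are hypothetical.
- hosts: app_servers        # inventory group, reached over SSH
  become: true              # escalate privileges on the target
  tasks:
    - name: Install nginx
      yum:
        name: nginx
        state: present
    - name: Ensure nginx is running and starts on boot
      service:
        name: nginx
        state: started
        enabled: true
```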
The solution
As a result, we arrived at the following set of tools for automating the DevOps pipeline.

Under the hood: Ansible and, depending on the case, Cloudify or Terraform. At the top level: GoCD.
Next, a look at applying the selected tools in practice on one of our projects. The DevOps case on this project was non-trivial. Development was spread across several Mercurial repositories, and there was no centralized release repository. The technical deployment documentation handed to the release engineers (specially trained people who installed updates by hand) deserves a separate mention: to get at the truth, information had to be gathered bit by bit from Word documents, Confluence pages, and Slack messages. At times the documentation read more like fiction with fantasy elements. A quote:
Part Two: the gates of hell. Populating the database.
2.1. Especially many difficulties and trials await us in this part; be attentive. First, let's connect to the database as the user SYS with DBA rights.
2.2. We recite the incantations: we execute the scripts.
After auditing and unifying all the operations, the developers managed to rework the documentation into a convenient format understandable to both humans and configuration management systems.
Development targeted several stands with minor differences between them. The test environment carried the most complete CI/CD pipeline, which can be broken into seven stages:
1. Release build in Jenkins
The customer had used Jenkins before us, but builds were done locally on a developer's machine and releases were handed over as archives. We suggested moving the build to a centralized Jenkins server and storing the artifacts in Artifactory. This gave us end-to-end indexing of all artifacts, the ability to quickly install any required version, to test individual blocks of the pipeline, and to see which changes led to errors in it.
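Downstream stages can then pull a specific version out of the repository. A sketch of such a step as an Ansible task; the Artifactory URL and path layout are invented for illustration:

```yaml
# Hypothetical Artifactory layout; release_version is set per pipeline run.
- name: Fetch the release build from Artifactory
  get_url:
    url: "https://artifactory.example.com/releases/app/app-{{ release_version }}.zip"
    dest: "/tmp/app-{{ release_version }}.zip"
```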
2. Deploying the infrastructure
We moved the test environment into the CROC virtual laboratory and stopped keeping it up permanently. At a tester's request, or automatically when a new release appears in Jenkins, the environment is raised from a Terraform manifest using VM templates preloaded into the OpenStack of our virtual laboratory.

Terraform's output variables give us the addresses of the stands, which we use further along in the pipeline.
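A sketch of how such a step can be driven from Ansible, assuming the community.general.terraform module and an output variable called stand_ip, which is a made-up name:

```yaml
# Sketch: apply a Terraform manifest and pass its outputs to later plays.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Raise the test stand from the Terraform manifest
      community.general.terraform:
        project_path: ./infra       # directory with the .tf files
        state: present
      register: tf

    - name: Add the new stand to the in-memory inventory
      add_host:
        name: "{{ tf.outputs.stand_ip.value }}"   # hypothetical output name
        groups: stands
```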
3. Configuring the stands
The most labor-intensive part of the pipeline was developing a mechanism for automated configuration of the stands.
We had to configure the Oracle database, the IBM MQ bus, and the IBM WAS application server. The situation was complicated by the fact that all of this had to be done on Russian-language Windows (sic!). While things were more or less transparent with the database, nobody on the project had ever configured MQ and WAS from the CLI or by any other automated means before us; people could only show where to click in the GUI to make it work.
We managed to write universal playbooks that do in 11 minutes what used to take an engineer a couple of days.
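A rough sketch of what one such step might look like, assuming WinRM connectivity; the file paths and the wsadmin script are invented for illustration:

```yaml
# Sketch: configure WAS on a Windows stand; paths and script are hypothetical.
- hosts: was_servers
  tasks:
    - name: Copy the wsadmin configuration script to the stand
      ansible.windows.win_copy:
        src: files/configure_was.py
        dest: C:\temp\configure_was.py

    - name: Apply the WAS configuration through wsadmin
      ansible.windows.win_command: >
        C:\IBM\WebSphere\AppServer\bin\wsadmin.bat
        -lang jython -f C:\temp\configure_was.py
```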
4. Installing the base applications
To install and update applications in WAS, the customer ran .bat scripts; we didn't dig into them much and embedded them as they were. We check the exit code, and if it is non-zero, only then do we go into the logs and figure out why a given application failed to come up. The execution log of the script is stored among the pipeline artifacts.
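A sketch of such a wrapper task, with a hypothetical script path:

```yaml
# Sketch: run the customer's .bat script as-is and fail loudly on a bad exit code.
- name: Deploy the application with the customer's script
  ansible.windows.win_command: C:\deploy\install_app.bat   # hypothetical path
  register: deploy_result
  failed_when: deploy_result.rc != 0

- name: Keep the script output so it can be collected as an artifact
  ansible.builtin.copy:
    content: "{{ deploy_result.stdout }}"
    dest: logs/install_app.log
  delegate_to: localhost
```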
5. Installing the user interface application
This stage is very similar to the previous one and runs from the same pipeline template; the only difference is which configuration file the .bat script uses. It still had to be a separate stage for ease of updating, since not every stand needs a user interface.
6. Testing
Different tools were used to test the applications (Selenium, JMeter). The pipeline can run any of several test plans, chosen via variables at this stage. Testing produces reports, for which we made a custom tab, so they can now be viewed right in the GoCD interface.
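In GoCD's YAML config, such a job can be described roughly as follows; the stage and job names, report paths, and the TEST_PLAN variable are illustrative:

```yaml
# Sketch of a GoCD test stage; names, paths, and the variable are made up.
- testing:
    environment_variables:
      TEST_PLAN: smoke            # which test plan to run on this stand
    jobs:
      run-tests:
        tasks:
          - exec:
              command: /bin/bash
              arguments:
                - -c
                - jmeter -n -t plans/${TEST_PLAN}.jmx -l reports/results.jtl
        artifacts:
          - test:
              source: reports
        tabs:
          report: reports/index.html
```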
7. Tearing down the infrastructure
Based on the test results, the responsible people receive notifications and decide whether to let the build through to production or send it back for changes. If all is well, the test stands can be removed so as not to waste the resources of the CROC virtual laboratory.
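Teardown can reuse the same Terraform manifest; a sketch assuming the community.general.terraform module again:

```yaml
# Sketch: destroy everything the stage-2 manifest created.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Remove the test stand
      community.general.terraform:
        project_path: ./infra
        state: absent
```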
The resulting logic of the DevOps pipeline on this project looks like this:

And here are the interfaces:
Value stream map (the release's journey from a commit to the version control system through to the test runs)
Visualization of the stage statuses of a single pipeline

Detailed information on pipeline stage execution (logs, the pipeline's change history, stage execution times)

Value stream map (another project, more compact)

General view of a scenario consisting of two pipelines. Another example, a general view of the seven-pipeline scenario, is in the picture before the cut.
After rolling out the new delivery pipeline, it was important to get feedback from the team (quoted uncut): “The time to create new stands has dropped significantly, from days to hours. It became possible not to keep separate stands for different versions: when needed, the required version is raised from scratch, and the stand is guaranteed to match its state at the time that version was released, since the DevOps code is versioned together with the project code. The quality of the operational documentation has improved: since the DevOps code exactly mirrors the instructions, no 'implied' steps are left in them. The number of routine operations for engineers has decreased.
Testers can create test stands for complex cases (load, fault tolerance) with minimal involvement of engineers. It's the first step toward managing the project's infrastructure like code: versioning, testing, and so on. It's fast, convenient, and unfamiliar.”
“Unclear” and “unfamiliar” are quite frequent words, so we periodically run training sessions on the adopted tools.
And here is the feedback from a project for an energy company:
“Long routine operations have been automated, saving roughly 0.5–1 person per week; during release preparation, one engineer used to spend essentially 50% of his time on builds and updates. The update process has sped up: maintenance windows for updating all the sites (7 of them) now average 3 hours, including the automated update, autotests, and a visual check (in other words, overtime is minimized). Information about all the environments is stored in an organized way, guaranteed to be relevant and updated on time. Nightly automated runs and autotests have made the state of the product more transparent. Product quality has improved; there is virtually no need to release hotfixes. Release costs have gone down.”
What we got
- Up-to-date information on the quality of every build and a visual view of the delivery pipeline.
- The build is always available in the standard distribution repository.
- The ability to quickly see the current changes on a stand.
- The ability to program the infrastructure (infrastructure as code) across our whole zoo of virtualization platforms.
- The ability to quickly deploy the right infrastructure and the desired version of an application system on demand.
- Reusable Ansible roles for configuring typical projects.
We did have to pause work on some project tasks in order to set up the infrastructure. The earlier DevOps is introduced, the less time it costs; otherwise you have to introduce it during lulls in the project, and when do those ever happen?
We keep gently spreading the word about the need for automation, inside the company and beyond. Politely. Carefully.
References:
- A funny story from our ops team a year before all of this was centralized
- The results of a project in the nuclear power industry where we used this stack
- About our development centers across the country (by the way, centers in Voronezh and Chelyabinsk have been added since)
- My email: sstrelkov@croc.ru