
Creating a service does not end with development and testing; the service also has to be operated in the production infrastructure.
A project therefore needs specialists with different competencies: developers and infrastructure engineers. The former know how to write code; the latter know how the software behaves and in what configuration it runs most efficiently.
This article describes how we at 2GIS organized the work of the Infrastructure & Operations team (9 people) and its interaction with the development teams (5 teams). At first glance everything looks simple and logical.
But it is not. If you take these people and say, "You, developers, just write code, and you, administrators, please be responsible for the stable operation of this code in production," you get a number of problems:
- Developers make changes often. Changes are a frequent source of failures. Administrators do not like failures, and therefore do not like changes.
- Differences in terminology and experience seriously hinder communication.
- Shifting responsibility after incidents is not uncommon.
Back to our problems
If the three problems above did not impress you, take a look at the list of issues our team faced:
- Communication was difficult. Asking for help was not customary, and tools were chosen without consultation, based on convenience rather than the task at hand. As a result, people came with a ready-made solution, often on the fly, instead of with a problem.
- Partially manual configuration of services. As the number of services grew, so did the amount of manual work. People were overloaded, and nervousness in the team grew.
- Unique services. Each project was unique as a snowflake, so incident handling took a long time: first you had to understand the service, then understand the problem, and only then fix it. On top of that, developers were often not interested in whether a PostgreSQL or RabbitMQ cluster already existed in the company and deployed their own.
- The system administrators' working rules and task tracking were never formulated or written down. The feedback from developers was: "who knows what they do over there."
- Communication channels were not formalized: something was written to email, something to Skype, something to Slack, and something was agreed over the phone.

Together, these factors increased product development time and nervousness in the team.
What to do?
Principle
We decided to act as follows: free developers from infrastructure concerns as much as possible, so they can focus on planning and writing business logic.
The solution lay in two areas: technical and process organization. Technical, because the available solutions did not provide the necessary functionality; process, because a clear process had yet to be created.
Technical solutions
So, on the technical side of the infrastructure we had OpenStack. What does it let you do? Create all the infrastructure a project needs: virtual machines, DNS, IPs, and so on. But the need to write product deployment, monitoring, logging, and fault tolerance still remains. All of that stays in the development team's area of responsibility and eats up their time.
Platform
We decided to run an experiment and radically change the way applications are delivered. Several small projects that were due to be released soon would be shipped in Docker.
We wanted our developers to get a solution to their problems, not an extra headache. Forced learning of new technologies was not part of the plan, and we did not want to invest a huge amount of time in the experiment.
The choice fell on Deis, a platform for running web applications. As they describe themselves, it is a MicroPaaS, a small clone of Heroku: micro, because it can only run stateless services. Deis is no longer supported, but it served us well in getting people used to new technologies. My colleagues have already given talks here and here covering the platform and its technical details, so I will only dwell on what our developers got out of it.
Deis offers three ways to run an application:
- You can build a Docker image yourself and hand it to Deis.
- You can simply write a Dockerfile.
- The easiest option is the 'git push deis master' command: Deis builds the Docker image itself using a Heroku buildpack.
The second version, Deis Workflow, which runs on Kubernetes, has since been released. We upgraded successfully, preserved the user experience, and gave Kubernetes to our more technologically sophisticated projects.
What did we gain by moving to a container infrastructure?
- More efficient hardware utilization.
- A unified approach to service development. Most services now look the same, and it is much easier to understand what is happening.
- With a single platform, it is now easy to set standards for logging and monitoring.
Backing services
So we have a solution for applications, but besides the code itself there are also databases. Many projects had unique database instances of different versions, with varying degrees of fault tolerance, and all of them took time to maintain. Here we realized we could not make absolutely everyone happy. Alas. We took the typical projects (the majority), which do not require presence in several data centers. For them we set up one large Postgres cluster on bare metal and gradually moved those projects onto it.
Now developers can simply deploy their code to the platform, request a database from the admins, and that is it. However, "can" does not mean they will necessarily do it.
Technology diffusion
It is worth saying a few words about the fact that when a new technology appears in a company, not everyone rushes to use it right away.
First, the benefits are not always obvious; second, everyone is busy; third, everything already works anyway.
To get people to actually start using the technology, we did the following:
- Wrote documentation - FAQ, Quick start.
- We held a series of internal tech talks, where we explained exactly which development problems we were solving.
- In some cases we sat down with the team directly and worked through emerging questions and problems.
IaC, CI and that's it
Introducing new tools would not have been possible without the IaC + CI approach. In our work we follow these rules:
- All infrastructure must be in Git.
- There must be CI.
- We log in to the servers directly only in extreme cases; everything else goes through CI.
- Each change must be reviewed by at least two colleagues.
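As an illustration of the last rule, here is a minimal sketch of a CI gate that refuses to apply an infrastructure change unless at least two colleagues have approved it. The function and names are hypothetical, not our actual tooling:

```python
def review_gate(approvals, required=2):
    """Return True if the change may be applied.

    approvals: names of reviewers who approved the change.
    required: minimum number of distinct approvers (we require two).
    """
    # set() deduplicates, so approving twice does not count double
    return len(set(approvals)) >= required

# A change approved by two different colleagues passes the gate;
# a single (or duplicated) approval does not.
print(review_gate(["alice", "bob"]))    # True
print(review_gate(["alice", "alice"]))  # False
```

In a real pipeline a check like this would run before the deployment step and fail the job when it returns False.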
Process solutions
Input and Output Review
To control how projects enter our new infrastructure, we came up with the following process: a technical review of projects.
It consists of two stages.
The input review happens at the design stage, as soon as the team has decided what its service will look like.
- Objective: validate the architecture and chosen technologies, and discuss the required resources.
- Committee: product manager, infrastructure engineer, QA, regional experts, and other interested people.
- Output: a list of issues to fix before the output review.
The output review happens a few days before the planned release date.
- Objective: verify that the product is ready for release.
- Committee: the same people, plus technical support.
- Output: a release date and a list of joint actions.

The process did not settle in right away, because it was perceived as passing an exam or defending a thesis. The question immediately arose: "Why should I actually do this?"
We replied that it was not an exam before some particular department, but a consultation with experts on how to make the product better and more stable.
This, plus a few cases where a review saved a project from over-engineering or uncovered overlooked problems, helped the technical review take off.
Over time the system calibrated itself and the process began to take less time. An extra benefit is release awareness: we always know what will be released and when, as well as the architecture and weak points of every project.
Planning
Again, to know what is happening and what will happen, we introduced a planning process that the system administration team did not have before. We set it up the same way as for the development teams: task tracking in Jira, monthly iterations, and 30-40 percent of the time reserved for unplanned work. We also take part in the big planning sessions, where the work for the next six months is decided.
On-call rules
Another source of hassle was the complete chaos in incident handling. When a service stopped working, it was unclear who should call whom and what to do. So we introduced some fairly simple things:
Duty roster
It works as a queue. There is a primary on-call, a secondary on-call, and everyone else. Every week there is a rotation: the primary moves to the end of the queue and the secondary becomes primary. The on-call engineer's task is to restore service availability as soon as possible. If the admin realizes the problem is outside his area of competence, he escalates it to the service's developers.
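The weekly rotation above can be sketched as a simple queue. This is just an illustration of the scheme, not our actual scheduling tool, and the names are made up:

```python
from collections import deque

def rotate(queue):
    """One weekly rotation step.

    queue[0] is the primary on-call, queue[1] the secondary.
    The primary moves to the end of the queue, so the
    secondary becomes the new primary.
    """
    q = deque(queue)
    q.append(q.popleft())  # primary goes to the back
    return list(q)

team = ["anna", "boris", "vera", "gleb"]
week2 = rotate(team)
print(week2[0], week2[1])  # boris is now primary, vera secondary
```

After a full cycle every engineer has been primary exactly once, which keeps the load even.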
Postmortem meeting, or debriefing
During working hours the on-call engineer schedules a meeting with all parties involved. Representatives of the affected service, those responsible for hardware, and technical support must attend.
The following aspects are documented at this meeting:
- Chronology of events: when the problem was discovered and when it was fixed.
- Impact: what effect the incident had on users.
- Root cause: what the original reason was.
- Action items: what each party should do so the incident does not happen again.
The resulting list of tasks is executed in priority order. It may look like tedium and bureaucracy, but it is a very useful thing: it seriously helps in arguing for and pushing through tasks from the technical debt backlog.
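The four aspects above map naturally onto a structured record. A minimal sketch follows; the field names and sample data are my own illustration, not a 2GIS template:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Postmortem:
    # Chronology: when the problem was discovered and when it was fixed.
    discovered_at: str
    fixed_at: str
    # What impact the incident had on users.
    impact: str
    # The original (root) cause.
    root_cause: str
    # What each party should do so the incident does not recur.
    action_items: List[str] = field(default_factory=list)

pm = Postmortem(
    discovered_at="2017-03-01 14:05",
    fixed_at="2017-03-01 14:40",
    impact="API returned 5xx for about 35 minutes",
    root_cause="Connection pool exhausted after a config change",
    action_items=["Add a pool-size alert", "Validate the config in CI"],
)
print(len(pm.action_items))  # 2
```

Keeping postmortems in a uniform shape like this is what makes it possible to track the action items as ordinary prioritized tasks.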
Communications
Indiscriminate communication makes it difficult to understand what is happening and to keep track of everything.
We had the following initial data:
- Existing channels: Slack, mail, telephone, Mattermost.
- Incoming information: problems, questions, feature requests.
- Outgoing information: work on services.
Now the processes are organized as follows:
Quick questions and problem reports go to an open channel on Slack, so that everyone can see them.
If it turns out to be a real problem, a ticket is always created in Jira after the message. Feature requests go straight to Jira.
To inform the teams about work on the infrastructure, we wrote our own Status Board service. In it, a person creates an event for a service, specifying when the work will be done, how long the service will be unavailable, and so on, and the Status Board sends the necessary notifications.
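The Status Board is an internal tool, so its API is not public; the sketch below only illustrates the idea of turning a maintenance event into a notification text, with all names and fields being hypothetical:

```python
def format_notification(service, start, duration_min, details=""):
    """Build the message a Status Board-like tool would send
    to the teams affected by planned maintenance."""
    msg = (f"[Status Board] Maintenance on '{service}' "
           f"starts at {start}, expected downtime {duration_min} min.")
    if details:
        msg += f" {details}"
    return msg

print(format_notification("postgres-main", "2017-03-10 02:00", 15,
                          "Minor version upgrade."))
```

The point is not the formatting but the single entry point: one event in one tool fans out to every channel a team actually reads.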
Conclusions
Operation is an integral part of a service's life, and you need to start thinking about how the service will work in production from the very beginning of work on it. Service stability is ensured by both infrastructure engineers and developers, so they must be able to communicate: sit together, sort out problems together, keep their processes transparent to each other, share responsibility for incidents, and plan work jointly. Do not forget to take the time to create convenient collaboration tools. Write an FAQ, hold tech talks, and make sure there is no misunderstanding.
Making friends between the administrators and the development teams is not easy, but it is definitely worth it.
A video version of this article can be viewed at techno.2gis.ru.