
From 15 and up: how to ensure CI scalability



There are plenty of articles and talks about specific DevOps technologies: Docker, Kubernetes, Ansible... But I want to talk about building processes: how, over two and a half years at Wrike, we evolved a release system for 15 front-end developers into one that serves almost 60, with 2-3 deploys per day.

This article is about the lessons we learned along the way. It is based on my talk at the DevOps meetup at the Wrike Tech Club; if you have no time to read, there is a video of the presentation.

A short introduction


The task of any product company in a highly competitive industry is to find the best solution and do it faster than its competitors. You need to go through options, experiment, release quickly, learn from mistakes, then invent even faster and release even faster. The speed at which you test your product ideas is your competitive advantage. Continuous integration, continuous delivery and all the other tricks exist precisely to outpace competitors. That is exactly what the DevOps department at Wrike is for.
We started in 2015. The front end then consisted of 15 people, and a deploy looked like two zip files. They could only be rolled out together and were unpacked with bash by the warm, skillful hands of the system administrators.

Now we have 12 SCRUM teams. They are cross-functional and independent. In total, these 12 teams include more than 60 front-end developers. And our product, those two zip files, is now 60 Git repositories. We deploy one to three times every day. And this is only the front end.
I should mention right away that we write our front end in Dart. It has its own runtime written in C, it is a strongly typed language, and its syntax is most similar to C# and Java. There is a compiler, its own dependency resolver and its own libraries. You can read about how we live with Dart in other articles; let's get back to our topic.

Stage 1


So, in 2015 we had no automation: one repository on Mercurial and no collaboration on the code base. That is, no GitLab; there was no way to discuss a diff or a patch, no way to raise an issue in a merge request, discuss it and resolve it. The model of working with the VCS (version control system) was not formalized.

That is fine when you have a team of five people who simply agree informally: "Let's cut a release together." Who merges into whom? Agreed; off we go. The next day there are new faces, and you agree again. But such a model cannot be scaled, and it does not evolve.

The first version was cheap and fast. We took the Wrike UI itself, although it could have been any issue tracker: Jira, YouTrack or anything else. The main thing is that you can exchange comments in it and mark tasks with arbitrary tags or folders.

All we wrote was a stateless Python application with no persistent data layer at all.

In the end, we connected TeamCity and Wrike and built a solution around tags. The developer imperatively set tags that the robot interpreted as commands: run auto-tests, integrate branches, and so on.



Since our product is SaaS and we deploy a couple of times a day, we do not need long-lived release branches. There is no need to backport patches or keep hotfixes there. So from the complex Git-flow model you can move to GitHub flow, and from it to an even simpler one. We created two templates in the tracker, listing which features go into the release, numbers, links and the people involved. That was the whole first version of the integration. In fact, we took data from two sources, GitLab and TeamCity, and pushed it into Wrike through its API. All work with the front-end code base went through TeamCity.
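To make the mechanics concrete, here is a minimal sketch of such a "tags as commands" robot in Python. Everything in it is illustrative (the tag names, build configuration IDs, hosts and the fetch_tagged_tasks placeholder are assumptions, not our actual integration); the only real API detail is that TeamCity accepts build requests posted as XML to its build queue endpoint.

```python
# A minimal sketch of the stage-1 "tags as commands" robot.
import time
import requests

TC_URL = "https://teamcity.example.com"   # assumption: internal TeamCity host
TC_AUTH = ("ci-bot", "secret")            # assumption: credentials of the bot user

# Which tracker tag triggers which TeamCity build configuration (illustrative IDs).
TAG_ACTIONS = {
    "run-autotests": "Frontend_AutoTests",
    "integrate":     "Frontend_IntegrateBranch",
}

def fetch_tagged_tasks():
    """Placeholder: return [(task_id, tag, branch), ...] from the issue tracker."""
    return []

def trigger_build(build_type: str, branch: str) -> None:
    # TeamCity accepts build requests as XML posted to its build queue.
    payload = (
        f'<build branchName="{branch}">'
        f'<buildType id="{build_type}"/></build>'
    )
    resp = requests.post(
        f"{TC_URL}/app/rest/buildQueue",
        data=payload,
        headers={"Content-Type": "application/xml"},
        auth=TC_AUTH,
    )
    resp.raise_for_status()

while True:                               # stateless loop: no persistent storage
    for task_id, tag, branch in fetch_tagged_tasks():
        if tag in TAG_ACTIONS:
            trigger_build(TAG_ACTIONS[tag], branch)
    time.sleep(30)
```

The robot itself stays stateless, as described above: all the state lives in the tracker tags and in TeamCity.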

Stage 1: lessons

We moved from Mercurial to Git, which helped us scale smoothly from 15 to 60 front-end developers. There are plenty of people on the market who understand, one way or another, how Git works. You will surely break something in Git, and you will certainly learn how to fix it. Hiring people who know how to work with Git, in your model, is three hundred times easier than with any other VCS.

The visualization of work on the code base in GitLab came in very handy for us. People began working with merge requests, and this improved the code base. The visualization we built in Wrike was very cheap and very fast to make.

And the specifics of SaaS and frequent releases played into our hands when it came to working with branches.

Stage 2


Having rolled out the first version, we immediately found one important flaw. In it we used squash commits: Git lets you group several commits into one. Suppose a developer has made 5 commits and we want to integrate them. We grouped those five commits into one with squash and thereby got atomicity: any feature was exactly one commit, and either you have it in the branch or you don't. It was also immediately clear where people had tried to merge earlier and had pulled code from each other. The history was beautiful: each commit in the release branch was a whole feature, convenient to read, and atomic.
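For reference, the squash integration itself boils down to two git commands; below is a rough sketch of how a robot could perform it via subprocess (branch and message names are illustrative, this is not our exact code).

```python
# Sketch: squash a whole feature branch into a single commit on the release branch.
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(cmd, check=True)

def squash_merge(feature_branch: str, release_branch: str, task_id: str) -> None:
    run("git", "checkout", release_branch)
    # --squash stages the whole feature as one change set without committing...
    run("git", "merge", "--squash", feature_branch)
    # ...so the feature lands in the release history as exactly one commit.
    run("git", "commit", "-m", f"{task_id}: {feature_branch} (squashed)")
```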

But trouble came from where we did not expect it. It turned out that we had features that live for two or three weeks. What does this mean? While master moves forward every single day, someone is developing off to the side. When a conflict with upstream appears, the person faces not a diff of one or two commits but two huge walls of changes: one squash commit against another squash commit. Blood flows from the eyes, and merging becomes simply impossible: a mountain of text.
Also, after we automated and formalized the process, the gears started spinning much faster, and master quickly began to run away from the features still in development. So when a person starts a new feature, or keeps working on one, they do not have the fixes from upstream and can easily be working in a state where regression defects are present: the bug is already fixed, but they do not have the patch.
As a result, we quickly had to add a mechanism that merged master into the actively developed branches.
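A sketch of what that mechanism amounts to, again with illustrative names: walk the list of active feature branches and merge origin/master into each one, backing off on conflicts.

```python
# Sketch: keep long-lived feature branches up to date with master.
import subprocess

ACTIVE_BRANCHES = ["feature/task-1234", "feature/task-1260"]  # would come from the tracker

def run(*cmd: str, check: bool = True) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, check=check)

run("git", "fetch", "origin")
for branch in ACTIVE_BRANCHES:
    run("git", "checkout", branch)
    merge = run("git", "merge", "origin/master", check=False)
    if merge.returncode != 0:
        # Conflicts cannot be resolved automatically: abort and notify the author.
        run("git", "merge", "--abort")
        print(f"{branch}: conflict with origin/master, needs manual resolution")
    else:
        run("git", "push", "origin", branch)
```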

Stage 2: lessons

It would seem we had done our colleagues a favor, yet in doing so we created two defects. In the end we had to sacrifice the squash commits, even though we really liked them. It was a very beautiful solution on paper, but if we had sat a little longer with a pencil and a notebook instead of programming, we could have seen these risks and prevented them in advance.

Stage 3


The third version was provoked by challenges from the outside world. As I said, our product initiatives are about testing ideas quickly. One day the product team came in and said: "We want to test ideas even faster." We are talking about MVPs (Minimum Viable Product). The idea is that you build a small piece of product functionality to test a business idea and roll it out very quickly. If earlier you tested ideas sequentially in your big product, now the task is to test many ideas at the same time. That is, we have two applications, but we need ten, or more likely twenty.

Let me remind you that we have a Single Page Application, so what the client downloads is bytes from the server. For versioning artifacts we used an HTTP version parameter. Also, when syncing via rsync, old versions of artifacts sometimes remained on the server.
And that created the most wonderful bugs. A user downloaded, say, half of Wrike, because part of the application was loaded on demand. Two weeks later he pulls the second half, but it is a different version. It was hard to debug and impossible to reproduce.

We decided to repackage everything completely differently, using RPM, which has a built-in dependency resolver and hash checks. That is, at any moment you can verify across your fleet of machines that nobody has modified your files and they match the original. Modern distributions also have multithreaded incremental repository indexes, and there are ready-made patterns and solutions for distributing artifacts around the world, indexing them and everything else, plus signing with GPG keys.
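The hash check mentioned above is the standard rpm verify mode; here is a small sketch of how one could poll it from Python (the package name is hypothetical):

```python
# Sketch: verify that installed files still match the RPM they came from.
import subprocess

def verify_package(name: str) -> bool:
    # `rpm -V` prints one line per modified or missing file and exits
    # non-zero if anything deviates from the packaged original.
    result = subprocess.run(["rpm", "-V", name], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"{name}: modified files detected:\n{result.stdout}")
        return False
    return True

verify_package("wrike-frontend")  # hypothetical package name
```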

How do we solve the problem of two halves of Wrike being downloaded by people at arbitrary points in time? Our Single Page Application is designed so that a person can work on an old version for a month, until they reload the tab or open a new one. How do we fix this? We started referencing all artifacts by permalinks, with the version baked directly into the link. You either download the correct artifact or you download nothing at all.
In addition, we decided to keep the 50 latest versions of Wrike on the file system, with the current one exposed via a symlink. While people are downloading assets, we can switch the version on the fly. So for us a version rollback is just moving the symlink: we can always roll back to the version we need within seconds. It is fast and simple, and the bytes are already everywhere they are needed.
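Here is a minimal sketch of that layout, assuming one directory per released version and a "current" symlink that the web server follows; the paths and helper names are illustrative.

```python
# Sketch: atomic "current version" switching via a symlink, keeping the last 50 releases.
import os
from pathlib import Path

RELEASES = Path("/srv/wrike/releases")   # /srv/wrike/releases/<version>/...
CURRENT = Path("/srv/wrike/current")     # symlink served by the web server
KEEP = 50

def activate(version: str) -> None:
    target = RELEASES / version
    tmp = CURRENT.with_suffix(".tmp")
    # Create the new link under a temporary name, then rename over the old one:
    # rename is atomic, so clients never see a missing or half-switched link.
    if tmp.exists() or tmp.is_symlink():
        tmp.unlink()
    tmp.symlink_to(target)
    os.replace(tmp, CURRENT)

def prune_old_versions() -> None:
    # Keep only the newest KEEP versions so rollback (and roll-forward) stays cheap.
    versions = sorted(RELEASES.iterdir(), key=lambda p: p.stat().st_mtime)
    for old in versions[:-KEEP]:
        print(f"would remove {old}")  # deletion left as a no-op in this sketch
```

Because the rename is atomic, switching versions in either direction never leaves clients without a valid link.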

There is also a positive side effect: we can not only roll back but also roll forward. We can publish a version that is not yet publicly available but can already be clicked through and tested. That is, we can get ahead of events.

This was achieved because we store many versions and manipulate the current one with a symlink. Naturally, RPM cannot do that out of the box, so we wrote a custom Ansible module in Python. For those who have never written Ansible modules, I highly recommend trying: the API is simple and compact. In one evening, after reading the official docs, you will learn how to write and debug them.
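As an illustration of how compact such a module can be, here is a sketch of a custom module that switches the "current" symlink to a requested release. The parameter names and logic are hypothetical; only the AnsibleModule boilerplate is the actual Ansible API.

```python
# Sketch of a custom Ansible module: point the "current" symlink at a release.
from ansible.module_utils.basic import AnsibleModule
import os

def main():
    module = AnsibleModule(
        argument_spec=dict(
            releases_dir=dict(type="path", required=True),
            current_link=dict(type="path", required=True),
            version=dict(type="str", required=True),
        ),
        supports_check_mode=True,
    )
    target = os.path.join(module.params["releases_dir"], module.params["version"])
    link = module.params["current_link"]

    if not os.path.isdir(target):
        module.fail_json(msg=f"release {target} does not exist")

    changed = os.path.realpath(link) != os.path.realpath(target)
    if changed and not module.check_mode:
        tmp = link + ".tmp"
        if os.path.lexists(tmp):
            os.unlink(tmp)
        os.symlink(target, tmp)
        os.replace(tmp, link)    # atomic switch, same trick as above

    module.exit_json(changed=changed, version=module.params["version"])

if __name__ == "__main__":
    main()
```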

A little about Docker. Its role in 2015 was extremely modest. We had a huge, long-suffering build collector that spawned processes, sometimes forgot about them, and they ate gigabytes of memory. We solved this with Docker isolation, because we were very short on time and could not fix the root cause. When the container terminated, our resources were released, and everything was fine.
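The workaround boiled down to wrapping the leaky build step in a disposable container, something like this sketch (the image name and memory limit are illustrative):

```python
# Sketch: run the leaky build step in a throwaway container so forgotten
# child processes and their memory die together with it.
import subprocess

def run_build_in_container(workdir: str) -> None:
    subprocess.run(
        [
            "docker", "run",
            "--rm",                  # remove the container (and its processes) when done
            "--memory", "4g",        # cap the damage a leak can do
            "-v", f"{workdir}:/build",
            "builder-image:latest",  # hypothetical image with the collector inside
            "/build/collect.sh",
        ],
        check=True,
    )
```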

Stage 3: lessons

What is special about our third version? Remember I mentioned a notebook and a pencil? We had one week, and we realized that we could not implement all of it in a week even if we neither slept nor ate. So we spent a third of the time thinking through this engineering solution with symlinks, RPM and atomicity. We agreed with the sysadmins, who operate our solution in production, and with the developers. We simply wrote an engineering spec: how it would be assembled and how it would be laid out.
Later our colleagues wrote a library in Dart, an abstract project builder; this library now builds roughly 60 projects. We turned out to be right in this decision, because we barely touched the spec afterwards, while the implementation was rewritten. The main idea: if you are changing something substantially, you need to think through the spec and the contract with other departments first. The initial implementation can be sacrificed.

Another important point is not to be afraid to defend your point of view and to combine the solutions that give your product the properties you need. It is fashionable now: no Docker, not cool. And RPM looks like some archaic old system nobody needs. In fact, it has very cool features that many other systems lack. I strongly urge you to think with your own head and always choose what you need, rather than follow someone else's opinion.

Stage 4


Remember I was talking about Dart and libraries? We have created dozens of applications, which are essentially consumers of library code. This triggered a wave of library creation. Say we have 10 front-end applications, and every one of them needs a layer for working with data. So a DAL (Data Access Layer) library appeared, an abstraction for working with data; a component library appeared; and a huge number of other libraries, dozens of them. The task arose of composing all this code.

For example, a colleague is rolling out a feature. It touches three libraries and the consumer code, some application where it is embedded. What happens? He edits the code of the first library and goes through the merge request procedure, then the second, then the third: the same thing each time. Then he edits the feature itself, referencing these libraries, and goes through the merge request procedure once more. It is very long, redundant and involves a great deal of communication.

On the one hand, we had helped our colleagues: we let them reuse code and test product ideas quickly. On the other hand, we had put a gigantic burden on their shoulders. The procedure had become modern (it is all in GitLab, through merge requests), but it was wildly complicated, unusual and inconvenient. The fourth version arose because we had increased the number of libraries and the amount of code reuse, but now it all had to be done efficiently. Some response to this need was required.

We also have MVPs. You test a business idea and then make a decision. Say we have achieved the desired product properties; now development can slow down. We do not know what will happen to this part of the product: maybe we will need it in a year, maybe never. We chose between a mono-repository and multiple repositories, and decided on the following: each library or application gets its own repository, with its own readme, its own changelog, its own version and its own tests.
Thanks to this, our applications link to specific, pinned versions of libraries. If a component is frozen, it naturally depends on old versions of libraries, but a year later it can still be patched: a security fix, for example, can be applied without upgrading to new library versions. And if the team decides to keep developing it in a year, it simply bumps to the new library versions.

If we had stayed in a mono-repository and said, "Folks, we want to guarantee compatibility from beginning to end," we would be dragging along a huge baggage of applications that had either stalled in their development or died, and we would have to guarantee compatibility for them. That would be very expensive. Besides, using Git submodules and subtrees, you can assemble a mono-repository out of multiple repositories if the task requires it.
To implement our solution we needed smart dependency management. We wrote it in Python, with a persistence layer on PostgreSQL and a UI in Angular.

The developer simply declared: I am writing a feature that touches three libraries and two consumers. Our application then allowed all of this to be rolled into RC as one atomic unit. The robot itself went through the repositories, merged everything and updated the references. And if the feature was rolled back, everything returned to its original state. If the tester confirmed that everything was fine and gave the green light, the feature went to RC. We now want all of this to happen purely on the basis of green tests, without any human participation.
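To show the shape of the idea, here is a sketch of how such an atomic multi-repository integration might be modeled. All class and function names are illustrative; the real tool kept its state in PostgreSQL and talked to GitLab rather than using these placeholders.

```python
# Sketch: a feature declared across several repositories is integrated
# (or rolled back) as one atomic unit.
from dataclasses import dataclass, field

@dataclass
class Feature:
    task_id: str
    libraries: dict[str, str]        # repo -> feature branch
    consumers: dict[str, str]        # repo -> feature branch
    merged: list[str] = field(default_factory=list)

def merge_branch(repo: str, branch: str, target: str = "rc") -> None:
    """Placeholder for 'merge branch into target via the GitLab API'."""

def revert_merge(repo: str) -> None:
    """Placeholder for reverting the merge commit in a repo."""

def integrate(feature: Feature) -> bool:
    # Libraries go first, then the consumers that reference their new versions.
    for repo, branch in {**feature.libraries, **feature.consumers}.items():
        try:
            merge_branch(repo, branch)
            feature.merged.append(repo)
        except Exception:
            # Any failure rolls back everything already merged, so the
            # feature is either fully in RC or not there at all.
            for done in reversed(feature.merged):
                revert_merge(done)
            feature.merged.clear()
            return False
    return True
```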
At some point we outgrew the possibilities of Wrike as an issue tracker for this. We needed our own UI, and the guys built it in Angular.



But this is not the end. Wrike keeps growing: the team, the product, and new challenges keep appearing. In the browser, Wrike is something of a monolith, and from the DevOps point of view that is no joy. The next big step for us is to decompose our solution and make many applications; this will speed up integration and let us roll out even more features. We also have a lot of automated testing: our Selenium farm moved to Google Cloud, which lets us run the whole regression suite in 20 minutes for about $1.

General conclusions


I think it is critically important to hire the right people, invest in tools, and work with feedback from developers and the business.

Many decisions about how to evolve the system at a given stage, evolutionary or revolutionary, were made only thanks to feedback. The DevOps department in a product company is lucky: its consumers are its colleagues, engineers with whom it should be easy to communicate, whom it must be able to hear and listen to.

And the last thing: you do not need to build a spaceship for every task right away. Solve problems iteratively. I strongly advocate not inventing very long-lived, expensive-to-implement solutions up front. We all know about the internal services of large companies or about cool open-source projects, and at the start it is tempting to say: "I know how to solve this, we just need to write this, this and this." And that takes a year and a half. The fact that we did not copy other people's solutions, but built tools that met our current needs, allowed us to keep up the pace and save money.

Source: https://habr.com/ru/post/345140/

