Hi, Habr! It's time for the long-awaited post about the internals of Vexor, a cloud-based continuous integration service for developers that lets you test projects effectively and pay only for the resources you actually use.
Story

The project grew out of internal development at Evrone. Initially we used Jenkins to run our tests. But when it became obvious that the CI service needed to be adapted to our specific tasks, we switched to GitlabCI. There were several reasons:
- GitlabCI is written in Ruby, the team's native language.
- It is small and simple, which makes it easy to modify.
During use, as often happens, our GitlabCI mutated quite a bit. By the time it had little in common with the original, we simply rewrote everything from scratch. That is how the first version of Vexor appeared, used only within the team.
At Evrone, several projects are developed in parallel. Some of them are very large, and a lot of tests run on every commit. So we constantly had to keep many servers ready for the workers, and pay for them in full.
But if you think about it, two things become clear:
- At night and on weekends, workers for tests are not needed at all.
- If the team is large and the process is set up so that many commits land at the same time, a great many parallel workers are needed. For example, with weekly iterations, several features are usually released at the end of an iteration, and 5-20 pull requests are opened at once, each accompanied by a test run. That creates situations where you need, say, 20+ workers.
Obviously, workers need to be spun up and shut down automatically, based on current demand.
The first version of autoscaling was written in a couple of hours on top of Amazon EC2. The implementation was very naive, but even so our server bills dropped immediately. CI also became much more stable, because we eliminated the situation where a sudden influx of tests led to a shortage of workers. The cloud integration has since been reworked several times.
Now a pool of servers is kept in the cloud, managed by a separate application to which the workers connect. The application monitors their state: alive / crashed / failed to start / idle. It automatically resizes the pool depending on the current state of the workers, the size of the task queue, and a rough estimate of the time needed to process it.
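The pool-sizing rule described above can be sketched roughly like this. This is a hypothetical illustration: the names, constants and formula are ours, not Vexor's actual implementation.

```ruby
# Illustrative sketch: size the worker pool from the queue length and an
# average build-time estimate. All constants here are assumptions.
AVG_BUILD_MINUTES   = 5.0   # rough estimate of one build's duration
TARGET_WAIT_MINUTES = 10.0  # how long a queued build may wait at most
MIN_WORKERS = 2
MAX_WORKERS = 50

def desired_pool_size(queued_jobs, busy_workers)
  # keep the busy workers, plus enough extra ones to drain the queue
  # within the target wait time
  extra = (queued_jobs * AVG_BUILD_MINUTES / TARGET_WAIT_MINUTES).ceil
  (busy_workers + extra).clamp(MIN_WORKERS, MAX_WORKERS)
end

desired_pool_size(0, 0)    # quiet night or weekend: shrink to the minimum
desired_pool_size(20, 5)   # end-of-iteration burst: grow the pool
```

A rule like this captures both observations above: the pool collapses to a minimum when nobody is committing, and expands when 20 pull requests land at once.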
Initially, we used Amazon EC2 as the cloud. But on Amazon, the disks attached to the servers are not physically located on the host; they sit in separate storage connected over the network. Under intensive disk use (and test run speed depends heavily on disk speed), throughput is limited by the bandwidth of the channel to the storage and the allocated IOPS. Amazon solves this problem only for extra money, which we did not want to pay at all. We considered other options: Rackspace, DigitalOcean, Azure and GCE. After comparing them, we settled on Rackspace.
Architecture

Now a little about the architecture.
VexorCI is not a monolithic application but a set of related services. They communicate mainly through RabbitMQ. What makes the rabbit good in our case:
- It supports message acknowledgments and publisher confirms. This solves a whole class of problems; in particular, it lets you write applications in the "Let it crash" style popular in Erlang: if anything goes wrong, the process crashes, but as soon as the service returns to normal operation, all tasks get processed and none are lost.
- RabbitMQ is a broker that lets you build branching topologies of queues and exchanges and set up routing between them. This makes it possible, for example, to easily test new versions of services in the production environment on live production tasks.
- RabbitMQ handles large messages reliably. Our record so far is 120 MB in a single message. VexorCI does not need to process millions of messages per minute, but a single message can weigh tens of megabytes or more (for example, when transferring logs).
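The acknowledgment semantics behind "Let it crash" can be shown with a toy model. This is deliberately not real RabbitMQ client code, just an illustration of the at-least-once guarantee: a message leaves the queue only after an explicit ack, so a crash before the ack means redelivery rather than loss.

```ruby
# Toy model of ack-based delivery (not real RabbitMQ code).
class ToyQueue
  attr_reader :pending

  def initialize
    @pending = []
  end

  def publish(msg)
    @pending << msg
  end

  def consume
    msg = @pending.first
    return if msg.nil?
    yield msg        # processing may raise ("crash")
    @pending.shift   # ack: remove only after successful processing
  rescue StandardError
    # no ack was sent: the message stays queued for redelivery
  end
end

queue = ToyQueue.new
queue.publish("run build #1")

attempts = 0
queue.consume { |_m| attempts += 1; raise "worker crashed" } # crash, no ack
queue.consume { |_m| attempts += 1 }                         # redelivered, acked
```

After the crash the message is still in the queue, and the second attempt processes and acknowledges it, which is exactly why a crashed service can pick up where it left off.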
RabbitMQ also has some known flaws we have to deal with:
- It requires a perfectly functioning network between client and server. Ideally, the server should sit on the same physical host as the clients. Otherwise RabbitMQ clients behave like a canary in a coal mine: they fall over at the slightest network problem that no other service even notices.
- With RabbitMQ it is difficult to ensure high availability. There are three solutions for this, but only federation and shovel provide real high availability. Unlike clustering (which you can read about here), they are not so easy to integrate into an existing application architecture, since they do not guarantee data consistency.
Since our servers are physically located in several data centers, and the worker pool can fail over to another data center if Rackspace has problems, we use federation to keep RabbitMQ running reliably.
Logs

An SOA architecture brings another difficulty: collecting logs becomes a non-trivial task. With only a couple of applications you can ignore the problem: the logs live on a few hosts you can always log in to and grep. But when there are many applications, and a single event is handled by several services, you need a single place where the logs are stored.
In Vexor, the elasticsearch + logstash + logstash-forwarder stack is responsible for this. All our applications write logs in JSON format from the start; every application event is logged, and on top of that we collect PG, RabbitMQ, Docker and other system messages (dmesg, mail and so on). We try to log as much as possible, because workers only live for a limited time: after a worker's server is shut down, the logs are the only place left to learn anything about a problem.
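Writing JSON logs from a Ruby application might look like the sketch below. The field names (`@timestamp`, `level`, `message`) are assumptions in the usual logstash spirit, not Vexor's actual schema.

```ruby
require "json"
require "logger"
require "time"

# Minimal sketch of structured JSON logging for a logstash pipeline.
# Field names are illustrative assumptions.
logger = Logger.new($stdout)
logger.formatter = proc do |severity, time, _progname, msg|
  JSON.generate(
    "@timestamp" => time.utc.iso8601,
    "level"      => severity,
    "message"    => msg
  ) + "\n"
end

logger.info("worker started")
```

One JSON object per line is trivial for logstash-forwarder to ship and for elasticsearch to index, which is the whole point of logging in JSON from the start rather than parsing free-form text later.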
Container

To run tests on the workers we use Docker. It is an excellent solution for working with isolated containers and provides all the necessary tooling. Docker is now very stable and causes minimal problems (especially with a fresh OS kernel), although bugs do turn up, for example like this.
Tests in Vexor run in a container based on Ubuntu 14.04, with the popular services and libraries needed for work preinstalled. A full list of packages and their versions can be found here. We update the image periodically, so the set of preinstalled software is always fresh.
To use a single image for all supported languages without making it too large, the required language versions (Ruby, Python, Node.js, Go; the complete, up-to-date list of supported languages is here) are installed from packages when a build starts. This is a quick procedure that takes a few seconds, and it lets us easily support a large set of language versions without bloating the image.
We rebuild the deb packages for the image at least once a week. They are always available in the public repository at "https://mirror.pkg.vexor.io/trusty main". If you use Ubuntu 14.04 amd64, for example, you can connect the repository and immediately get 12 versions of Ruby, precompiled with the latest bundler and gem, fully ready to install.
To avoid running apt-get update when installing packages at runtime, and to get fuzzy version matching, we wrote a utility that can very quickly install packages of the required versions from our repository, for example:
```shell
$ time vxvm install go 1.3
Installed to /opt/vexor/packages/go-1.3
...
real    0m3.765s
```
Configuration

Ideally, Vexor itself figures out what needs to be started to run your project. We are working on automatically detecting which technologies you need and launching them, but this is not always possible. So for unrecognized cases we ask users to add a configuration file to the project.
To avoid reinventing the wheel, we use the .travis.yml configuration file. Travis CI is one of the most popular and well-known services today, so it is good if users have as little trouble as possible when migrating from it. If the project root already contains a .travis.yml, everything starts working instantly, and the team gets the joys of fast CI at our modest per-minute rates :)
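For reference, a minimal .travis.yml for a hypothetical Ruby project with a PostgreSQL-backed test suite might look like this (the exact keys your project needs may differ):

```yaml
language: ruby
rvm:
  - 2.1
services:
  - postgresql
script: bundle exec rake test
```

A file like this in the project root is enough for either Travis CI or Vexor to pick the project up without any extra setup.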
Servers

We administer many servers running many tasks, so we make active use of tools such as Ansible, Packer and Vagrant. Ansible handles server provisioning and configuration and does its job perfectly. Packer and Vagrant are used to build and test the Docker images and the worker servers. The image builds themselves are automated with VexorCI, which automatically rebuilds everything needed.
Who is our project for?

Small projects that don't run tests all that often, don't want to pay a lot, and don't want to think about system administration and deployment, while still enjoying the benefits of continuous integration.
For large projects with a lot of tests, we provide an unlimited amount of resources and can parallelize tests, speeding up runs severalfold.
For projects with a large team, we solve the problem of queues for test runs: any number of builds can now run at the same time, eliminating long waits.
Friends, in conclusion, we invite everyone to Vexor. Connect your projects and enjoy the benefits.