A preface from the author of the original: In March 2011, I wrote a draft article about how the team responsible for Google Chrome develops and releases its product, after which I safely forgot about it. I stumbled upon it again only a few days ago. Even though it is outdated in places (Chrome forked WebKit into Blink in 2013, and I no longer work for Google myself), I'm inclined to believe that the ideas in it are still valid.
Today I am going to tell you about how Chromium works. No, it's not really about Chrome, the product, but rather about Chromium, the group of people involved in creating the browser.
Hundreds of engineers work on the Chromium project. Together we commit approximately 800 changes to the code base every week. We also depend on many other large, actively developed projects like V8, Skia and WebKit.
We ship a new stable release to hundreds of millions of users every six weeks, right on schedule. And we maintain several early-access channels that update even faster; the fastest channel, canary, “quietly” auto-updates almost every weekday.
How does all this work? Why haven't the “wheels” of this “bus” fallen off yet? Why haven't all the developers gone crazy?
From a technological point of view, the Chromium team's speed is made possible by reliable, efficient and “silent” auto-updates.
From a human point of view, it is the merit of the dedicated, hardworking and smart QA teams and release engineers, without whom the entire project would fall apart in a matter of weeks. And also of the designers, product managers, writers, PR people, lawyers, security specialists and everyone else who works together smoothly on each stable release. I will not talk about all of them today, and will stick to engineering topics so as not to slip into a giant post in the spirit of Steve Yegge.
I'm going to talk about the Chromium development process, which is built specifically to make quick releases possible. Along the way there will be interesting findings that may be useful to other projects regardless of what their release schedule looks like, and I will also mention the difficulties we face.
No branches
In many projects, it is common practice to create a branch to work on a new major feature. The idea is that destabilization caused by the new code will not affect other developers and users. Once the feature is complete, it is merged back into the trunk, which is usually followed by a period of instability while integration issues are ironed out.
With Chrome, this approach would not work, as we release every day. We cannot allow large chunks of new code to land in our trunk, because then the canary or dev channels would likely stay broken for a long time. Besides, the trunk in Chrome moves forward so fast that it is impractical for developers to remain isolated on their branch for too long. By the time they merged their branch back, the trunk would look so different that integration would be time-consuming and error-prone.
We create release branches before each of our beta releases, but they live a very short time, a maximum of six weeks, until the next beta release. And we never develop directly in these branches: all late fixes that should be included in the release are first landed in the trunk and then cherry-picked into the branch.
A pleasant side effect of this process: the project has no special “second-class” development team that deals exclusively with the release branch. All developers always work with the latest version of the source code.
Runtime switches
Suppose we do not create branches, but still need a way to hide unfinished features from users. The natural way to do this is with compile-time checks; the problem is that this approach is not very different from feature branches: in effect, you still have two independent code bases that must each be tested. And since the code that is off by default is neither compiled nor tested, developers can easily break it by accident.
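To make the contrast concrete, here is a hedged sketch of compile-time gating; the macro and functions are hypothetical, not actual Chromium code. Whichever branch of the #if is not selected simply does not exist in the build, so nothing catches it when it rots:

```cpp
// Hypothetical compile-time feature gate; the macro and both
// functions are illustrative, not real Chromium code.
void ShowOldOmnibox() { /* ... */ }
void ShowNewOmnibox() { /* ... */ }

void ShowOmnibox() {
#if defined(ENABLE_NEW_OMNIBOX)
  // Compiled only when the build defines ENABLE_NEW_OMNIBOX; on every
  // other configuration the compiler never even sees this call, so
  // nothing catches it when it breaks.
  ShowNewOmnibox();
#else
  ShowOldOmnibox();
#endif
}
```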
Instead, the Chromium project uses runtime checks. Every feature that is under development is compiled and tested on all configurations from the very beginning. We have command-line flags that are checked in a very few places; the rest of the code base is mostly unaware of which features are enabled. This strategy means that work on new features is integrated into the project code from the very beginning, to the maximum extent possible. At a minimum, the new code compiles, so any changes to the main code that the new feature requires get tested, while users see no difference because the flag is off by default. And we can easily write automated tests that exercise features that are not yet available, by temporarily “overriding” the command line.
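As a rough illustration, here is a minimal sketch of a runtime switch in the style of Chromium's base::CommandLine API; the switch name and the helper function are hypothetical:

```cpp
// A hypothetical runtime feature gate. base::CommandLine is a real
// Chromium class; the switch and helper here are illustrative only.
#include "base/command_line.h"

namespace switches {
// Declared once; checked in a very few top-level places.
const char kEnableNewOmnibox[] = "enable-new-omnibox";
}  // namespace switches

bool IsNewOmniboxEnabled() {
  return base::CommandLine::ForCurrentProcess()->HasSwitch(
      switches::kEnableNewOmnibox);
}

// A test can flip the feature on by appending the switch before
// exercising the code under test:
//   base::CommandLine::ForCurrentProcess()->AppendSwitch(
//       switches::kEnableNewOmnibox);
```

Unlike the compile-time version above, both code paths are always compiled on every configuration, so a change that breaks the hidden feature fails immediately rather than rotting unseen.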
When a feature gets closer to completion, we expose the option as a flag in chrome://flags, so that advanced users can start testing it and give us feedback. Finally, when we think the feature is ready to ship, we simply remove the command-line flag and enable it by default. By that time, the code has usually been tested far and wide and exercised by many users, so the potential damage from enabling it is minimized.
A huge amount of automated testing
To be able to release every day, we must be sure that our code base is always in good shape. That requires automated tests, a very large number of them. As of this writing, Chrome has about 12,000 class-level unit tests, about 2,000 automated integration tests, and a huge range of performance tests, bloat tests, thread-safety and memory-safety tests, and probably many others that I can't remember right now. And all that is just for Chrome itself; WebKit, V8 and the rest of our dependencies are tested independently. WebKit alone has approximately 27,000 tests that verify that web pages are displayed and behave correctly. Our basic rule is that every change must come with tests.
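To give a flavor of what a class-level unit test looks like, here is a hedged sketch using googletest, the framework Chromium's unit tests are built on; the class under test is made up for the example:

```cpp
#include "testing/gtest/include/gtest/gtest.h"

namespace {

// A made-up class standing in for whatever a change touches.
class VisitCounter {
 public:
  void RecordVisit() { ++visits_; }
  int visits() const { return visits_; }

 private:
  int visits_ = 0;
};

// Per our basic rule, a change adding RecordVisit() would land
// together with a test like this one.
TEST(VisitCounterTest, RecordVisitIncrementsCount) {
  VisitCounter counter;
  counter.RecordVisit();
  EXPECT_EQ(1, counter.visits());
}

}  // namespace
```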
We use a public buildbot that continuously runs our test suite against new changes to the code. We adhere to a “green tree” policy: if a change breaks a test, it is immediately reverted, and the developer must fix the change and re-land it. We do not leave such breaking changes in the tree, because:
- It makes it easier to accidentally land even more breaking changes: if the tree is red, no one notices when it becomes even redder.
- It slows down development, since everyone has to work around what is broken.
- It encourages developers to make careless quick fixes just to get the tests passing.
- It does not allow us to release!
To help developers keep the tree green, we have try bots, which let you run a change against all the tests and configurations before landing it. The results are emailed to the developer. We also have a commit queue, which tests a change and lands it automatically if all the tests pass. I like to use it after a long night of hacking: I press the button, go to bed, and wake up some time later hoping that my change has landed.
Thanks to all this automated testing, we can get by with a minimal amount of manual testing on our dev channel, and none at all on canary.
Ruthless refactoring
Since we have fairly extensive test coverage, we can afford to be aggressive about refactoring. On the Chrome project, refactoring is going on all the time in several major areas; as of 2013, those were Carnitas and Aura.
At our scale and pace, it is crucial to keep the code base clean and easy to understand. For us, these qualities are more important than preventing regressions. Engineers across the entire Chrome project have the right to make improvements anywhere in the system (although module owners can require a mandatory review). If something eventually breaks as a result of a refactoring and the tests do not catch it, then from our point of view it is not the fault of the engineer who did the refactoring, but of the one whose feature was not sufficiently covered by tests.
DEPS
WebKit evolves at an equally fast pace. And just as we cannot afford feature branches that suddenly merge into the mainline one day, we cannot afford to take a month's worth of WebKit changes all at once, because that would destabilize the tree for days.
Instead, we continuously build Chrome against the very latest version of WebKit (almost always no more than half a day old). At the root of the Chrome source tree is a file that records the version of WebKit that Chrome currently builds against successfully. When you check out a working copy or update the Chrome source code, the gclient tool automatically fetches the version of WebKit specified in that file.
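The file in question is the DEPS file that gclient reads; here is a hedged sketch of what pinning WebKit to a known-good revision looks like in it (the URL and revision number are illustrative, not a real snapshot):

```python
# A sketch of a gclient DEPS entry: each path is synced to the
# pinned revision. The URL and revision here are illustrative.
deps = {
  "src/third_party/WebKit":
      "https://svn.webkit.org/repository/webkit/trunk@80000",
}
```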
Several times a day, every day, an engineer bumps this version number, checks whether new integration problems have appeared, and assigns bugs to the appropriate engineers. As a result, we always take WebKit changes in small batches, and the effect on our source tree is usually minimal. We have also added bots to WebKit's buildbot, so when WebKit engineers land a change that breaks Chrome, they find out about it immediately.
The great advantage of the DEPS system is that we can pick up changes to our web platform very quickly. A feature that lands in WebKit becomes available to Chrome users on the canary channel in just a few days. This encourages us to make improvements directly upstream in WebKit, where they benefit everyone who uses WebKit in their applications, rather than applying them locally in Chrome. In fact, our basic rule is that we make no local changes to WebKit at all (nor to the other projects Chrome depends on).
Problems
Thorough testing remains an unsolved problem. In particular, flaky integration tests have become a constant headache for us. Chrome is large, complex, asynchronous, multi-process, and multi-threaded, so it is easy for integration tests to fail now and then because of subtle synchronization problems. On a project of our size, a test that fails 1% of the time is guaranteed to fail several times a day.
As soon as a test becomes flaky, the team quickly gets into the habit of ignoring it, and as a result it is easy to miss a genuine failure of that same test for the same part of the code. So we tend to disable flaky tests, losing a few percent of coverage and making it easier to ship basic regressions to users.
Another problem is that at this speed it becomes hard to sweat the small details. It seems to me that it is easier for a team to get all the details right in rare “big” releases than to try to keep every little thing in focus indefinitely. Since small details like the spacing between toolbar buttons are often hard to test, errors easily creep into such places.
Finally, stress strikes me as a very real problem. With all this code constantly changing, even if a person tries to focus only on their own area of responsibility, that does not mean they are untouched by what happens in other parts of the project. If you constantly try to keep your part of Chrome in working order, sooner or later you start to feel that you are living on a volcano and cannot afford a single moment of peace.
We are dealing with the latter problem by breaking the code base into major modules. The engineers on Carnitas are trying to establish clearer and more robust interfaces between some of our main components. Much of the code has already become cleaner and clearer, but it is too early to say how much this will reduce stress overall.
In closing
So why haven't the “wheels” of this “bus” fallen off yet? In short: thanks to life without branches, runtime switches, tons of automated tests, ruthless refactoring, and staying as close as possible to the HEAD of our dependencies.
These techniques will be most useful for large projects with fast-moving upstream dependencies, but some of them may well be applicable to smaller projects too.