A while ago, at the excellent maintainerati conference, I chatted with a few fellow maintainers about scaling really large open source projects and how GitHub pushes projects toward a particular way of scaling. The Linux kernel has a completely different model, which maintainers who host their projects on GitHub don't understand. I think it's worth explaining why and how it works, and how it differs.
Another reason for writing this text was the discussion on HN about my talk "Maintainers Don't Scale", where the most popular comment boiled down to the question "Why don't these dinosaurs use modern development tools?". Several well-known kernel developers vigorously defended mailing lists and patch submission through a mechanism similar to GitHub pull requests, but at least a few graphics developers would like to use more modern tools that are much easier to automate with scripts. The problem is that GitHub does not support the way the Linux kernel scales to a huge number of contributors, and therefore we simply cannot switch to it, even for a few subsystems. And it's not about hosting the data in Git; that part is clearly fine. The issue is how pull requests, bug discussion, and forks work on GitHub.
GitHub-style scaling
Git is cool because everyone can fork very easily, create their own branch, and change the code. And if you end up with something worthwhile, you create a pull request against the main repository, and it gets reviewed, tested, and merged. And GitHub is cool because it came up with a UI that makes these complex tasks nice and easy to discover and learn, so it's much easier for newcomers to get up to speed.
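As a reference point, here is a minimal sketch of that GitHub contribution flow in plain git commands; the repository name and URL are made up for illustration:

```sh
# Clone your fork of the project (hypothetical URL):
git clone https://github.com/you/project.git
cd project

# Do the work on a feature branch:
git checkout -b my-feature
# ...edit, test...
git commit -a -m "Add my feature"

# Push the branch to your fork; the pull request itself is then
# opened against the main repository in the GitHub web UI:
git push origin my-feature
```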
But once a project becomes extremely successful and no amount of labels, milestones, bots, and automation can keep on top of all the pull requests and issues in the repository, the time comes to split it again into more manageable pieces. More importantly, at a certain size and age, different parts of a project need different rules and processes: a shiny new experimental library has different stability and CI criteria than the core code, and maybe you have a dumping ground of legacy plug-ins that are no longer supported but cannot be deleted. One way or another, you have to split a huge project into sub-projects, each with its own process and merge criteria, and its own repository with its own pull requests and issue tracking. Usually it takes somewhere between a few dozen and a few hundred full-time contributors before the pain of doing everything in one repository grows so large that splitting up becomes necessary.
Almost all projects on GitHub solve this by splitting the single source tree into many different projects, each with its own separate functionality. Usually this results in a number of projects considered the core, plus a bunch of plug-ins, libraries, and extensions. Everything is tied together by some kind of plugin or package manager, which in some cases fetches the code directly from GitHub repositories.
Since almost every large project is structured this way, I don't think we need to dwell on the advantages of this approach. But I would like to highlight some of the problems it creates:
- Your community fragments more than necessary. Most contributors will only deal with the code and the repository they directly contribute to, ignoring everything else. That's great for their focus, but it makes duplicated effort and parallel solutions across different plug-ins and libraries much less likely to be noticed. And people who want to steer the community as a whole have to deal with a pile of repositories managed either by a script, or by Git submodules, or by something worse. Moreover, they drown in pull requests and issues if they subscribe to everything. Any topic (maybe you have shared build tooling, or documentation, or anything else) that doesn't fit neatly into one repository becomes a headache for the maintainers.
- Even when you do notice an opportunity for refactoring and code sharing, there are more bureaucratic obstacles: first you need to release a new version of the core library, then go through all the plug-ins and update them, and only then, maybe, can you delete the old code in the shared library. But since everything is spread out, that last step is easy to forget. Of course, none of this is all that much work, and many projects manage it just fine. But it's still more effort than a simple pull request to a single repository. Very simple refactorings (such as sharing a single new function) happen less often, and over time those costs add up. Except, of course, if you follow the example of node.js, create a repository for every function, and thereby essentially replace Git with npm as your source code management system, which also seems odd.
- A combinatorial explosion of theoretically supported version combinations that are de facto unsupported. Users end up doing the integration testing. In the project, it all comes down to blessed version combinations, or at least that's what happens de facto, since developers simply close bug reports with "please update all modules first". Which again means you actually have a monorepo, except perhaps not in Git. Well, unless you use submodules, and I'm not sure that counts as Git...
- Reorganizing how the overall project is split into sub-projects is painful, because it means reorganizing the Git repositories and how they're divided. In a single repository, changing maintainers comes down to a simple update of the OWNERS or MAINTAINERS file, and if your bots are in order, the new maintainers get tagged automatically. But if scaling means splitting repositories into separate sets, any reorganization is as painful as the initial step from one repository to a group of split repositories. This means your project will stay stuck in a bad organizational structure for far too long.
Interlude: why pull requests exist
The Linux kernel is one of the few projects I know of that isn't split this way. Before we look at how it works (the kernel is a gigantic project and simply cannot function without some sub-project structure), it's interesting to ask why Git needs pull requests at all. On GitHub, pull requests are the only way developers get patches into the common code. But kernel changes arrive as patches on mailing lists, even long after Git was introduced and widely adopted.
Yet the very first version of Git already supported pull requests. The audience of those first, fairly rough releases were kernel maintainers; Git was written to solve Linus' maintainer problems. Obviously Git was needed and useful, but not for processing changes from individual developers: even today, and even more so back then, pull requests are used to forward the changes of an entire subsystem, to synchronize refactored code, or for similar cross-cutting changes between different sub-projects. As an example, the 4.12 networking pull request from Dave Miller, submitted to Linus, contains more than 2000 commits from 600 developers and a pile of merges of pull requests from subordinate maintainers. But almost all of the patches themselves are committed by maintainers after being picked from the mailing lists, not by the authors themselves. It is a peculiarity of kernel development that authors generally do not commit patches to shared repositories, and this is why Git tracks the author of a patch and the committer separately.
GitHub's innovation and improvement was using pull requests for everything, right down to individual patches. But that is not what pull requests were originally created for.
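For concreteness, this is how a maintainer-to-maintainer pull request is generated by mail with git's built-in request-pull command; the tree URL and tag names below are illustrative:

```sh
# Generate the text of a pull request: a diffstat, a shortlog, and the
# URL/branch to fetch from (repo and branch names are made up):
git request-pull v4.12-rc1 \
    git://git.kernel.org/pub/scm/linux/kernel/git/example/net.git \
    net-for-linus
# The output is pasted into a mail to the upstream maintainer, who
# reviews it and then simply runs "git pull" on the advertised branch.
```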
Linux kernel scaling
At first glance, the kernel looks like a single repository, where everything lives in one place in Linus' repository. But that is far from the case:
- Almost no one runs the mainline kernel from Linus Torvalds' repository. If they run anything upstream at all, it's usually one of the stable kernels. But far more likely they run a kernel from their distribution, which usually carries additional patches and backports and isn't even hosted on kernel.org; it's an entirely different organization. Or they run a kernel from their hardware vendor (for SoCs and almost everything Android-related), which often differs significantly from anything hosted in one of the "main" repositories.
- No one (except Linus himself) does development directly on top of Linus' repository. Every subsystem, and often even major drivers, has its own Git repository, with its own mailing list for tracking patches and discussing issues, completely in parallel with everyone else.
- Cross-subsystem work is done on top of the linux-next integration tree, which contains several hundred Git branches from about as many Git repositories.
- All this madness is managed through the MAINTAINERS file and the get_maintainer.pl script, which for any given piece of code can tell you who the maintainer is, who should review the code, where the right Git repository lives, which mailing lists to use, and how and where to report bugs. The information is based on more than just file location: it also matches code patterns, to make sure that cross-subsystem topics like device-tree handling or the kobject hierarchy are handled by the right experts. A short sketch of how this looks in practice follows below.
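A minimal sketch, with an illustrative MAINTAINERS entry (the field tags M:, L:, T:, F: are the real ones; the names and addresses are made up) and two typical invocations of the script from a kernel checkout:

```sh
# A MAINTAINERS entry maps people (M:), a mailing list (L:), a Git tree
# (T:) and file patterns (F:) onto a chunk of the kernel. Illustrative:
#
#   INTEL DRM DRIVERS
#   M:  A. Maintainer <maintainer@example.com>
#   L:  intel-gfx@lists.freedesktop.org
#   T:  git git://anongit.freedesktop.org/drm-intel
#   F:  drivers/gpu/drm/i915/
#
# Ask "who takes care of this code?" for a path:
./scripts/get_maintainer.pl -f drivers/gpu/drm/i915/
# Or ask the same question for a patch you are about to send:
./scripts/get_maintainer.pl 0001-my-change.patch
```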
At first glance this just looks like a sophisticated way to fill everyone's disk with a pile of stuff they don't care about, but there is a set of compounding minor advantages:
- It is trivially easy to reorganize and carve things out into a sub-project: just update the MAINTAINERS file, and you're done. The rest is a bit more involved than it should be, because you may need to create a new repository, new mailing lists, and a new bugzilla. But that is just a UI problem, one that GitHub elegantly solved with its neat little fork button.
- It is very, very easy to move the discussion of pull requests and issues between sub-projects: you simply adjust the Cc: list in your reply. Coordinating work across subsystems is likewise much easier, because one pull request can be submitted to several sub-projects while having only one shared discussion (the Message-Id: headers in the mailing list threads are the same for everyone; see the header sketch right after this list), even though the mails themselves are archived in a pile of different mailing list archives, travel through different mailing lists, and sit in thousands of different mailboxes. Easy discussion of topics and code across sub-projects prevents fragmentation and makes it easier to spot where shared code and refactoring would be useful.
- Cross-subsystem work needs no release dance whatsoever. You just change the code, since all of it is in your single tree. Note that this is much more powerful than what's possible with split repositories. For really invasive refactorings, you can simply split the work across several releases, for example when there are so many users that you can't change them all at once without causing too many coordination problems.
- The huge advantage is that refactoring and code sharing become easier: there's no need to drag a pile of legacy garbage along with you. This is explained in detail in the kernel's document on why a stable API is nonsense.
- Nothing prevents you from creating your own experimental additions, which is one of the key advantages of the multi-repository setup. Add your code in your own fork and leave it there; no one will ever force you to push the code back, or push it into one common repository, or even move it into the main organization, simply because there are no central repositories. This works really well, maybe too well, as evidenced by the millions of lines of out-of-tree code sitting in the various repositories of Android hardware vendors.
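Here is the header sketch promised above: a hypothetical mail cross-posted to two lists. Every archived copy carries the same Message-Id, so replies thread into one conversation no matter which list they come back through. All addresses and IDs are made up:

```
Subject: [PATCH 1/3] example: share the new helper
To: dri-devel@lists.freedesktop.org
Cc: alsa-devel@alsa-project.org, maintainer@example.com
Message-Id: <20170727100537.1234-1-author@example.com>

(a reply, arriving via either list, references that same id:)
In-Reply-To: <20170727100537.1234-1-author@example.com>
```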
In general, I think this is a strictly more powerful model, because you can always fall back to doing things exactly the way you would with multiple disconnected repositories. Heck, there are even kernel drivers that live in their own repository, separate from the main kernel tree, like the proprietary Nvidia driver. Granted, that's just source-code glue around a blob, but since for legal reasons it cannot contain any kernel code, it's a good example.
This sounds like a monorepo horror story!
Yes and no.
At first glance, the Linux kernel looks like a monorepo, because it contains everything. And many people know from personal experience that monorepos cause lots of problems, because past a certain size they simply cannot scale.
But if you look closer, this model is very, very far from a single Git repository. The upstream subsystem and driver repositories alone give you a few hundred. If you look at the entire ecosystem as a whole, including hardware vendors, distributions, other Linux-based operating systems and individual products, you can easily count several thousand main repositories and many more additional ones. And this is without taking into account the Git repositories purely for personal use by individual developers.
The key difference is that Linux has a single file hierarchy as a common namespace for everything, but many different repositories for all the various needs and projects. It's a mono-tree with many repositories, not a monorepo.
Examples please!
Before I explain why GitHub is currently incapable of supporting this workflow, at least if you want to keep the benefits of the GitHub UI and integration, let's look at a few examples of how it works in practice. In short, everything runs on Git pull requests between maintainers.
The simple case is changes propagating up the maintainer hierarchy until they eventually land in the tree where they belong. This is easy, because the pull request always goes from one repository to another, so it could be done with the current GitHub UI.
Cross-subsystem changes are much more fun, because then the flow of pull requests stops being an acyclic graph and becomes a mesh. The first stage is to review the changes and test them with all the subsystems and maintainers involved. In a GitHub workflow this would mean simultaneous pull requests to multiple repositories, with a single discussion thread shared between them. In kernel development, this stage is handled by submitting the patch to a bunch of different mailing lists, with the maintainers as recipients.
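In practice that submission is a single git send-email invocation; the recipients below are what a hypothetical graphics-plus-audio patch series might list (the get_maintainer.pl script from earlier produces this list for you):

```sh
# Cross-post one patch series to every affected subsystem at once
# (lists and addresses are illustrative):
git send-email \
    --to=dri-devel@lists.freedesktop.org \
    --cc=alsa-devel@alsa-project.org \
    --cc=maintainer@example.com \
    outgoing/*.patch
```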
Merging is a separate step from patch review. Here one of the subsystems is chosen as the primary one: it takes the pull request, and all the other maintainers agree to that merge path. Usually the subsystem most affected by the changes is picked, but sometimes it's the one where other work is already underway that conflicts with the pull request. Sometimes an entirely new repository with an entirely new maintainer team is created; this often happens for functionality that spans the whole tree and isn't very neatly contained in a few files and directories in one place. A recent example is the DMA mapping tree, which attempts to consolidate work that is currently spread across drivers, platform maintainers, and architecture support groups.
But sometimes numerous subsystems conflict with a set of changes and all need to resolve a non-trivial merge conflict. In that case the patches are not applied directly (the equivalent of a rebasing pull request on GitHub); instead, a pull request containing only the necessary patches, based on a commit common to all subsystems, is used, and it is merged into all the subsystem trees. Such a common baseline is important to avoid polluting a subsystem tree with unrelated changes. Since these further pull requests deal only with a specific topic, the branches are called topic branches.
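A minimal sketch of the mechanics, with made-up branch and URL names: because all the affected trees merge the exact same commits from the same baseline, Git recognizes them as identical when the trees later converge in mainline:

```sh
# Build the topic branch on a baseline both subsystem trees contain:
git checkout -b topic/shared-refactor v4.12-rc1
# ...apply only the patches belonging to this topic, nothing else...

# Each affected subsystem maintainer then merges the same branch:
git pull https://git.example.org/topic.git topic/shared-refactor
# Run once in each affected tree, this adds identical commits to all
# of them; when Linus eventually pulls the trees, Git sees the shared
# history and nothing is duplicated.
```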
As an example, I can cite the support for audio over HDMI; I was directly involved in that process. It concerns both the graphics subsystem and the sound driver subsystem. The same commits from the same pull request were merged into the Intel graphics driver tree as well as into the audio subsystem tree.
A completely different kind of example, showing that this is not insanity: the only other OS project in the world of comparable scale also chose a single tree with commit flow, just like Linux. I'm talking about the folks with a tree so gigantic that they had to write GVFS, an entirely new virtual file system, to support it...
Dear GitHub
Unfortunately, GitHub does not support such a workflow, at least not natively in the GitHub UI. Of course it can be done with the plain Git toolkit, but then you're back to patches on mailing lists and pull requests over email, executed by hand. In my view, this is the only reason the kernel developer community would gain nothing from switching to GitHub. There's also the small matter that several leading maintainers are flat-out opposed to GitHub as a whole, but that is no longer a technical issue. And it isn't just the Linux kernel: essentially all gigantic projects on GitHub have scaling problems, because GitHub does not actually let them scale to multiple repositories tied to one mono-tree.
So I have a request for just one GitHub feature:
Please implement pull requests and bug tracking that span the different repositories of one mono-tree.
Simple idea, huge consequences.
Repositories and organizations
First, it needs to be possible to have multiple forks of the same repository in one organization. Just look at git.kernel.org: most of its repositories are not personal. And even if you supported separate organizations, say one per subsystem, requiring an organization for every repository is silly and wasteful, and it needlessly complicates access and user management. For example, in the graphics subsystem we would have one repository for each userspace test suite, a shared userspace library, and a shared set of tools and scripts used by maintainers and developers; this GitHub supports. But then you would add a repository for the whole subsystem, plus a repository for the subsystem's core functionality, and additional repositories for each major driver. These are all forks of the same repository, which GitHub does not support. And each of these repositories would have a bunch of branches: at least one for feature work and another for bugfixes for the current release. Merging all the branches into one repository is not an option, since the whole point of splitting by repository is to also split the pull requests and bugs. A sketch of the resulting layout follows below.
A related issue: you need to be able to establish fork relationships after the fact. For new projects that have always lived on GitHub this isn't a problem, but Linux could move at most one subsystem at a time, and there are already plenty of Linux repositories on GitHub that are not forks of each other.
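To make that concrete, here is an illustrative layout for one subsystem's organization, loosely modeled on how the graphics community arranges its trees on freedesktop.org (the names are indicative, not a proposal):

```
drm/igt-gpu-tools      userspace test suite        (ordinary repo)
drm/libdrm             shared userspace library    (ordinary repo)
drm/maintainer-tools   maintainer scripts          (ordinary repo)
drm/drm                whole-subsystem tree        (fork of the kernel)
drm/drm-misc           subsystem core tree         (fork of the kernel)
drm/drm-intel          major driver tree           (fork of the kernel)
```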
Pull requests
Pull requests need to be attachable to multiple repositories at the same time, while keeping one shared discussion thread. You can already reassign a pull request to another branch of a repository, but not to several repositories at once. Reassigning pull requests really matters, because new contributors will simply create pull requests against what they think is the main repository. Bots can then shuffle them to all the repositories listed in the MAINTAINERS file for the set of files the pull request touches. When I spoke with GitHub staff, I first suggested they implement this directly, but I think it can all be automated with scripts, so it would be better to leave it to individual projects, since there is no single standard here.
There is still a rather ugly UI problem, because the list of patches may differ depending on the branch the pull request targets. But that is not always a user error, since some repositories may already have some of the patches applied.
In addition, the state of a pull request must be separate for each repository. One maintainer may close it without accepting it, having decided that another subsystem will take it, while another maintainer merges it and closes the issue. Yet another tree may even close the pull request as invalid, because it doesn't apply to the older version or vendor fork in question. Even more fun, a pull request may be merged multiple times, with different commits in each subsystem.
Bugs
Like pull requests, bugs can apply to multiple repositories, and you need to be able to move them. As an example, take a bug first reported against the kernel repository. After triage it becomes clear that it's a driver bug that is still present in the latest development branch, and therefore it applies to that repository, plus the main upstream branch, and maybe a few others.
Again, the statuses must be separate, because a bugfix landing in one repository doesn't immediately make it available in all the others. It may even need to be backported to older kernel versions and distributions, and someone may decide the bug isn't worth it and close it as WONTFIX, even though it's marked as resolved in the relevant subsystem repository.
Conclusion: mono-tree, not monorepo
The Linux kernel is not moving to GitHub. But adopting the Linux model of scaling, a mono-tree with multiple repositories, would be a good concept for GitHub, and it would help all the really large projects already hosted there. I think it would give them a new and more effective way to solve their scaling problems.