
Why GitHub can't host the Linux kernel

Some time ago, at the excellent Maintainerati conference, I talked with a few fellow maintainers about scaling really large open source projects and about how GitHub pushes projects toward one particular way of scaling. The Linux kernel uses a completely different model, one that maintainers of GitHub-hosted projects tend not to understand. I think it is worth explaining why that model works, how it works, and how it differs.

Another reason for writing this text was the discussion on HN about my talk “Maintainers Don't Scale”, where the most popular comment boiled down to the question “Why don't these dinosaurs use modern development tools?”. Several well-known kernel developers vigorously defended mailing lists and patch submission by mail over anything like GitHub pull requests, but at least a few graphics developers would like to use more modern tools that are much easier to automate with scripts. The problem is that GitHub does not support the way in which the Linux kernel scales to a huge number of contributors, and therefore we simply cannot switch to it, not even for a few subsystems. This is not about hosting the data in Git, that part is clearly fine. It is about how pull requests, issue discussions and forks work on GitHub.

GitHub style scaling


Git is cool because anyone can fork very easily, create their own branch and change the code. If you end up with something worthwhile, you create a pull request against the main repository, where it is reviewed, tested and merged. And GitHub is cool because it provides a UI that makes these complex things easy to discover and learn, so newcomers get up to speed much faster.

But if the project eventually becomes extremely successful, and no amount of tagging, labelling, sorting, bots and automation can keep up with all the pull requests and issues in the repository, then the time comes to split the project into more manageable parts. More importantly, beyond a certain size and age, different parts of the project need different rules and processes: a shiny new experimental library has different stability and CI criteria than the main code, or maybe you have a legacy dumping ground with a bunch of plug-ins that are no longer maintained but cannot be deleted. One way or another, you will have to split the huge project into subprojects, each with its own process and criteria for merging patches and its own repository with its own pull requests and issue tracking. It usually takes somewhere between several dozen and several hundred full-time contributors before the pain grows to the point where such a split becomes necessary.

Almost all projects on GitHub solve this problem by splitting a single source tree into many different projects, each with its own separate functionality. Usually this results in a number of projects that are considered the core, plus a bunch of plug-ins, libraries and extensions. Everything is tied together by some kind of plug-in or package manager, which in some cases fetches the code directly from the GitHub repositories.

Since almost every large project is designed in this way, I do not think we should delve into the advantages of this approach. But I would like to highlight some of the problems that arise in this situation:


Interlude: why pull requests exist


The Linux kernel is one of the few projects I know of that is not split up in this way. Before we look at how it works (the kernel is a giant project and simply could not function without some subproject structure), I think it is interesting to see why Git has pull requests at all. On GitHub, they are the only way for developers to get patches into the shared code. But changes to the kernel arrive as patches on mailing lists, even long after the introduction and widespread adoption of Git.

Yet even the very first version of Git supported pull requests. The audience for those first, fairly raw releases was kernel maintainers: Git was written to solve Linus's maintainer problems. Clearly Git was needed and useful, but not for handling changes from individual developers. Even today, and even more so back then, pull requests are used to forward the changes of a whole subsystem, to synchronize refactored code, or for similar cross-cutting changes between different subprojects. As an example, the 4.12 networking pull request from Dave Miller, committed by Linus, contains more than 2000 commits from 600 developers and a bunch of merges for pull requests from subordinate maintainers. But almost all of the patches themselves were committed by maintainers who picked them up from mailing lists, not by their authors. This is a peculiarity of kernel development: authors generally do not commit their patches to the shared repositories, and that is why Git tracks the author of a patch and the committer separately.

GitHub's innovation and improvement was to use pull requests for everything, right down to individual patches. But that is not what pull requests were originally created for.
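The author/committer split mentioned above is visible in every raw commit object. The following is a minimal sketch that parses the header layout printed by `git cat-file -p <commit>`; the hashes, names, addresses and timestamps are invented for illustration and do not come from a real kernel commit.

```python
import re

# Raw commit header in the layout printed by `git cat-file -p <commit>`.
# All values below are made up for illustration.
RAW_COMMIT = """\
tree 0123456789abcdef0123456789abcdef01234567
parent 89abcdef0123456789abcdef0123456789abcdef
author Alice Author <alice@example.org> 1493000000 +0200
committer Mike Maintainer <mike@example.org> 1493600000 -0700

net: fix an example bug
"""

# "<role> <name> <email> <unix timestamp> <timezone offset>"
IDENT = re.compile(r"^(author|committer) (.+) <(.+)> (\d+) ([+-]\d{4})$")


def parse_identities(raw):
    """Return {'author': (name, email), 'committer': (name, email)}."""
    identities = {}
    for line in raw.splitlines():
        if not line:
            break  # a blank line ends the header; the message follows
        match = IDENT.match(line)
        if match:
            role, name, email, _timestamp, _tz = match.groups()
            identities[role] = (name, email)
    return identities


ids = parse_identities(RAW_COMMIT)
print("written by:  ", ids["author"][0])
print("committed by:", ids["committer"][0])
```

In a patch picked up from a mailing list, the two identities usually differ: the author wrote the change and the maintainer who applied it is recorded as the committer.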

Linux kernel scaling


At first glance, the kernel looks like a single repository, with everything in one place in Linus's repository. But this is far from the case:


At first glance, this just looks like a sophisticated way to fill everyone's disk with a bunch of stuff they do not care about, but there are many small accompanying advantages that add up:


In general, I think this is a much more powerful model, because you can always fall back to doing everything the same way as with multiple disconnected repositories. Heck, there are even kernel drivers that live in their own repository, separate from the main kernel tree, like the proprietary Nvidia driver. Well, that one is just a bit of source glue code around a blob, but since it may not contain any parts of the kernel for legal reasons, it is a great example.

This looks like a monorepo horror movie!


Yes and no.

At first glance, the Linux kernel looks like a mono-repository, because it has everything in it. And many people know from personal experience that mono-repositories cause many problems, because beyond a certain size they simply stop scaling.

But if you look closer, this model is very, very far from a single Git repository. The upstream subsystem and driver repositories alone give you a few hundred. If you look at the entire ecosystem as a whole, including hardware vendors, distributions, other Linux-based operating systems and individual products, you can easily count several thousand main repositories and many more additional ones. And this is without taking into account the Git repositories purely for personal use by individual developers.

The key difference is that Linux has a single file hierarchy as a common namespace for everything, but a lot of different repositories for all sorts of needs and projects. It is a mono-tree with numerous repositories, not a mono-repository.

Examples please!


Before I explain why GitHub cannot currently support such a workflow, at least not if you want to keep the benefits of the GitHub UI and integrations, we need to look at some examples of how this works in practice. In short, everything is done through Git pull requests between maintainers.

The simple case is changes flowing up through the maintainer hierarchy until they eventually land in the tree where they are needed. This is easy, because each pull request always goes from one repository to another, so it could be done with the current GitHub UI.

Cross-subsystem changes are much more fun, because then the pull requests turn from an acyclic graph into a mesh. In the first stage, the changes have to be reviewed and tested by all the subsystems and maintainers involved. In the GitHub workflow, this would mean pull requests submitted to multiple repositories simultaneously, with a single discussion thread shared between them. In kernel development, this stage is accomplished by submitting the patch to a bunch of different mailing lists, with the maintainers as recipients.

Merging works differently from review. Here one of the subsystems is chosen as the main one, it takes all the pull requests, and all the other maintainers agree to this merge path. Usually the subsystem most affected by the changes is chosen, but sometimes it is the one where work is already under way that conflicts with the pull request. Sometimes an entirely new repository and maintainer team is created. This often happens for functionality that spans the whole tree and is not neatly contained in a few files and directories in one place. A recent example is the DMA mapping tree, which tries to consolidate work that has been spread across drivers, platform maintainers and architecture support groups.

But sometimes a set of changes conflicts with numerous subsystems, and all of them would have to resolve a non-trivial merge conflict. In this case the patches are not applied directly (what GitHub calls rebasing a pull request). Instead, a pull request is used that contains only the necessary patches, based on a commit that is common to all the subsystems, and it is merged into all the subsystem trees. Such a common base is important to avoid polluting a subsystem tree with unrelated changes. Since these pull requests deal with one specific topic only, the branches are usually called topic branches.

As an example, I can cite audio-over-HDMI support, where I was directly involved. It touches both the graphics subsystem and the sound driver subsystem. The same commits from the same pull request were merged into the Intel graphics driver as well as into the audio subsystem.

A completely different example shows that this is not insane: the only other comparable OS project in the world also chose a single tree with a flow of commits, just like Linux. I'm talking about the folks with a tree so giant that they had to write a completely new virtual file system, GVFS, to support it ...
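To see why the common base matters, here is a small sketch that models history as a plain dictionary from commit to parents. The commit names and the shape of the history are hypothetical; real Git does the equivalent work with `git merge-base` and `git merge`.

```python
def ancestors(parents, commit):
    """Return the commit plus everything reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def common_bases(parents, heads):
    """Newest commits that are contained in every one of the given heads."""
    common = set.intersection(*(ancestors(parents, h) for h in heads))
    # Drop every common commit that is a proper ancestor of another one.
    return {c for c in common
            if not any(c != o and c in ancestors(parents, o) for o in common)}


def pulled_in(parents, tree_head, branch_head):
    """Commits that merging branch_head would add to the tree at tree_head."""
    return ancestors(parents, branch_head) - ancestors(parents, tree_head)


# Hypothetical history: both subsystem trees forked from release "v1".
PARENTS = {
    "v0": [],
    "v1": ["v0"],
    "gfx1": ["v1"], "gfx2": ["gfx1"],         # graphics tree
    "snd1": ["v1"],                           # sound tree
    "topic1": ["v1"], "topic2": ["topic1"],   # topic branch on the shared base
    "bad1": ["gfx2"],                         # same work based on the graphics tree
}

base = common_bases(PARENTS, ["gfx2", "snd1"])
clean = pulled_in(PARENTS, "snd1", "topic2")
polluted = pulled_in(PARENTS, "snd1", "bad1")
print(sorted(base), sorted(clean), sorted(polluted))
```

A branch started from the shared base brings only its own commits into each tree, while the same work started from the graphics tree would drag unrelated graphics commits into the sound tree.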

Dear GitHub


Unfortunately, GitHub does not support such a workflow, at least not natively in the GitHub UI. Of course it can be done with plain Git tooling, but then you are back to patches on mailing lists and pull requests sent by mail and processed by hand. I believe this is the only reason why the kernel developer community would gain nothing from switching to GitHub. There is also the small problem that several leading maintainers are totally opposed to GitHub in general, but that is not a technical issue. And this is not just about the Linux kernel. In principle, all gigantic projects on GitHub have trouble scaling, because GitHub does not actually let them scale to multiple repositories tied together in a mono-tree.

So, I have a request for only one feature on GitHub:

Please implement pull-requests and bug tracking covering various repositories of the same mono tree.

Simple idea, huge consequences.

Repositories and Organizations


First, it must be possible to have numerous forks of the same repository within the same organization. Just look at git.kernel.org: most of the repositories there are not personal. And even if you use different organizations, for example one per subsystem, requiring an organization for each repository is silly and redundant, and it complicates access and user management for no good reason. For example, in the graphics subsystem we would have one repository each for the userspace test suite, the common userspace library, and a common set of tools and scripts used by maintainers and developers. This much GitHub supports. But then you add a common subsystem repository, plus a repository for the core functionality of the subsystem and additional repositories for each major driver. These are all forks, which GitHub does not support. And each of these repositories will have a bunch of branches: at least one for feature work and another for bug fixes for the current release.

Merging all these branches into a single repository is not an option, since the point of splitting into repositories is to also split up pull requests and bugs.

A related issue: you need to be able to establish fork relationships between repositories after the fact. For new projects that have always been on GitHub, this is not a problem. But Linux would be able to move at most one subsystem at a time, and GitHub already hosts a lot of Linux repositories that are not forks of each other.

Pull requests


Pull requests must be attachable to multiple repositories at the same time, while keeping a single shared discussion thread. You can already retarget a pull request to another branch of a repository, but not to several repositories at once. Retargeting pull requests is really important, since new contributors will simply open pull requests against whatever they consider to be the main repository. Bots can then route them to all the repositories listed in, for example, a MAINTAINERS file for the set of files and changes the pull request contains. When I talked with GitHub staff, I first suggested that they implement this directly. But I think everything here can be automated with scripts, so it is better left to individual projects, since there is no single standard.

There is still a rather ugly UI problem, because the list of patches may differ depending on the branch the pull request targets. But this is not always the submitter's mistake, because some of the repositories may already have some of the patches applied.

In addition, the status of the pull request must be tracked separately for each repository. One maintainer may close it without merging, because it has been agreed that another subsystem will take it, while another maintainer merges it and closes the issue. In yet another tree, the pull request may even be closed as invalid, because it does not apply to that older version or vendor fork. Even more fun, a pull request can be merged several times, with different commits in each subsystem.
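The routing a bot would do can be sketched in a few lines. The pattern table below is a hypothetical, heavily simplified stand-in for a MAINTAINERS file, and the repository labels and file paths are illustrative only; this is not the kernel's actual file format or tooling.

```python
from fnmatch import fnmatchcase

# Hypothetical, simplified mapping: (path pattern, repository label).
# A real MAINTAINERS file carries far more information than this.
MAINTAINERS = [
    ("drivers/gpu/drm/i915/*", "drm-intel"),
    ("drivers/gpu/drm/*", "drm"),
    ("sound/*", "sound"),
]


def route(changed_files):
    """Return every repository whose patterns match a changed file."""
    repos = []
    for path in changed_files:
        for pattern, repo in MAINTAINERS:
            # fnmatchcase lets "*" match across "/" as well, so a
            # directory pattern covers everything beneath it.
            if fnmatchcase(path, pattern) and repo not in repos:
                repos.append(repo)
    return repos


# Illustrative file list for a change touching graphics and sound.
targets = route([
    "drivers/gpu/drm/i915/intel_audio.c",
    "sound/pci/hda/patch_hdmi.c",
])
print(targets)
```

A change that touches both graphics and sound files ends up attached to every matching repository, which is exactly the multi-repository pull request that is being asked for here.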
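One way to picture this is a single pull request object with one discussion thread and a separate state per repository. This is a sketch of a possible data model only; the class, field and repository names are assumptions for illustration and do not describe any existing GitHub API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    MERGED = "merged"
    CLOSED = "closed"
    INVALID = "invalid"


@dataclass
class PullRequest:
    title: str
    status: dict = field(default_factory=dict)     # repository -> Status
    merged_as: dict = field(default_factory=dict)  # repository -> commit ids
    comments: list = field(default_factory=list)   # one thread for all repos

    def attach(self, repo):
        """Track this pull request in one more repository."""
        self.status.setdefault(repo, Status.OPEN)

    def merge(self, repo, commits):
        """Record a merge; commit ids may differ from tree to tree."""
        self.status[repo] = Status.MERGED
        self.merged_as[repo] = list(commits)

    def close(self, repo, status=Status.CLOSED):
        """Close in one repository without touching the others."""
        self.status[repo] = status


pr = PullRequest("audio over HDMI")
for name in ("drm-intel", "sound", "vendor-fork"):
    pr.attach(name)
pr.comments.append("Agreed: this goes in through the graphics tree.")
pr.merge("drm-intel", ["c0ffee1", "c0ffee2"])
pr.close("sound")
pr.close("vendor-fork", Status.INVALID)
print({repo: state.value for repo, state in pr.status.items()})
```

The discussion stays shared while each tree records its own outcome, including which commits the change became there.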

Bugs


Like pull requests, bugs can apply to many repositories, and you need to be able to move them. As an example, take a bug that is first reported against one kernel repository. After triage, it becomes clear that it is a driver bug that is still present in the latest development branch and is therefore relevant to that repository, plus the main upstream branch and maybe a couple of others.

Again, statuses must be tracked separately, because once a bug fix lands in one repository it is not immediately available in all the others. It may even need to be backported to older kernels and distributions, and someone may decide that the bug is not worth the effort and close it as WONTFIX, even though it is marked as successfully resolved in the relevant subsystem repository.

Conclusion: mono-tree, not mono-repository


The Linux kernel is not going to move to GitHub. But adopting the Linux scaling model, a mono-tree with multiple repositories, would be a good concept for GitHub and would help all the very large projects that are already hosted there. I think it would give them a new and more effective way to solve their unique problems.

Source: https://habr.com/ru/post/336470/

