
The design history of the Git Virtual File System (GVFS)

Hi Habr! Here is a translation of the article Git Virtual File System Design History. To be continued…

The Git Virtual File System (hereinafter GVFS) was created to solve two main problems:


In our case, the main use case for GVFS is the Windows repository: 3 million files in the working directory, 270 GB in total. Cloning this repository means downloading a 100 GB packfile, which takes several hours. Even if you managed to clone it, local git commands such as checkout (3 hours), status (8 minutes) and commit (30 minutes) would take far too long, because their running time grows linearly with the number of files. Despite all these difficulties, we decided to migrate all of the Windows code to Git. At the same time, we tried to leave Git practically untouched, since Git's popularity and the amount of publicly available information about it were among the main reasons for the migration.
It should be noted that we considered a large number of alternative solutions before deciding to create GVFS. We will describe how GVFS works in more detail in the following articles; here we will concentrate on the options we considered and why we ended up building a virtual file system.

Background


Why a monolithic repository?


Let's start with the simplest question: why would anyone even need a repository of this size? Just limit the size of your repositories and everything will be fine, right?
It is not that simple. Many articles have already been written about the benefits of monolithic repositories. Several large teams at Microsoft have tried to break their code into many small repositories, and as a result they came to the conclusion that a monolithic repository is better.

Splitting up a large codebase is not easy, and moreover, it does not solve every problem. It would solve the scaling problem within each individual repository, but at the same time it would complicate making changes across several repositories at once, and as a result, releasing the final product would become more laborious. It turns out that, with the exception of the scaling problem, the development process in a monolithic repository is much simpler.

VSTS (Visual Studio Team Services)


VSTS consists of several related services. So we decided that by placing each of them in a separate Git repository we would immediately get rid of the scaling problem, and at the same time create physical boundaries between different parts of the code. In practice, these boundaries did not lead to anything good.

First, we still had to change code in several repositories at the same time. Managing dependencies and following the correct sequence of commits and pull requests took a lot of time, which in turn led to the creation of a huge number of complex and fragile utilities.

Second, our release process became much more complicated. In parallel with releasing a new version of VSTS every three weeks, we release a boxed version of Team Foundation Server every three months. For TFS to work correctly, all VSTS services have to be installed on a single machine, which means every service must know which versions of the other services it depends on. Gathering together services that had been developed completely independently over the previous three months proved to be a daunting task.

In the end, we realized that it would be much easier for us to work in a monolithic repository. There, every service depends on the same version of every other service, and making a change to one service simply means updating all the services that depend on it. So a little more work up front saved us a lot of time at release. Of course, this also meant we had to be more careful about creating new dependencies and managing existing ones.

Windows


For roughly the same reasons, the team working on Windows decided to switch to Git. The Windows code consists of several components which, in theory, could be split into several repositories. However, this approach had two problems. First, even though most of those repositories would be small, one of them (OneCore) would still be about 100 GB, so we would still have to solve the scalability problem. Second, such an approach would do nothing to make changes that span several repositories easier.

Design philosophy


Our philosophy in choosing development tools is that the tools should support the way we want to organize our code. If you think your team will be more productive working in several small repositories, your tools should help you do that. If you think your team will be more productive working in a monolithic repository, your tools should not stand in the way.

Alternatives we considered


Over the past few years, we have spent a lot of time trying to make Git work with large repositories. Below are some of the solutions we considered.

Git submodules


We first tried using submodules. Git allows you to reference any repository as part of another repository: for each commit in the parent repository you can specify which commits of the sub-repositories it depends on and where exactly those sub-repositories should be placed in the parent's working directory. This looks like the perfect mechanism for splitting a large repository into several small ones, and we spent several months working on a command-line utility for working with submodules.
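For readers less familiar with submodules, here is a minimal sketch of how a repository is referenced as a submodule and pinned to a specific commit; the URL and path are placeholders, not names from our codebase:

```shell
# Register another repository as a submodule of the current one;
# the parent records the submodule's exact commit, not its contents
git submodule add https://example.com/libs/feature-x.git libs/feature-x
git commit -m "Add feature-x as a submodule"

# After cloning the parent, populate all submodules
git submodule update --init --recursive
```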

The main scenario for submodules is consuming the code of one repository inside another. In a sense, a submodule plays the same role as an npm or NuGet package, i.e. a library or component that does not depend on the parent. Any change is made first at the submodule level (after all, it is an independent library with its own development, testing and release process), and only then does the parent repository switch to the new version.

We decided we could use this idea by breaking our large repository into several small ones and then gluing them back together into one super-repository. We even built a utility that allowed running "git status" and committing code across all the repositories at once.

In the end, we abandoned this idea. First, it became clear that we were only making developers' lives harder: every commit now turned into two or more commits, since both the parent and each of the affected submodule repositories had to be updated. Second, Git has no atomic mechanism for committing to several repositories at once. We could, of course, designate one server to be responsible for the transactionality of commits, but in the end that again runs into the scalability problem. And third, most developers do not want to be version control experts; they would rather the tools handled it for them. To work with Git, a developer already has to learn to think in terms of a directed acyclic graph (DAG), which is not easy, and here we were asking them to work with several loosely coupled DAGs at once and to also keep track of the order of checkout / commit / push across them. That is too much.
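To illustrate the "every commit becomes two commits" problem described above, this is roughly what a single logical change looks like with submodules; the names and paths are hypothetical:

```shell
# Commit 1: inside the submodule itself
cd libs/feature-x
git commit -am "Fix bug in feature X"
git push

# Commit 2: in the parent repository, to move its recorded
# submodule pointer to the new commit
cd ../..
git add libs/feature-x
git commit -m "Bump feature-x submodule to the bug-fix commit"
git push
```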

Multiple repositories stitched together


If submodules did not work out, maybe several repositories glued together would? A similar approach is used by Android with its repo tool (repo.py), so we decided to try it as well. Nothing good came of it. Working within a single repository became easier, but making changes that span several repositories became much harder. And since commits in different repositories are completely unrelated to each other, it is unclear which commits from which repositories should be picked for a specific version of the product. That would require yet another version control system on top of Git.
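For context, this is roughly how Android's repo tool drives a set of independent Git repositories from a single manifest; the manifest URL and project names below are placeholders:

```shell
# Create a client: the manifest repository lists every Git repository
# and the revision of each one to check out
repo init -u https://example.com/platform/manifest

# Clone or update all repositories listed in the manifest
repo sync

# Start a topic branch in the selected projects
repo start my-feature project-a project-b
```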

Git alternates (alternate object stores)


Git has a notion of an alternate object store. Every time Git looks up a commit, tree or blob, it starts with the .git\objects folder, then checks the packfiles in .git\objects\pack, and finally, if configured, it searches the alternate stores.
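A minimal sketch of how an alternate object store is wired up in stock Git; all paths here are placeholders:

```shell
# Option 1: add an alternate to an existing repository.
# Each line of this file is a path to another objects directory.
echo "/mnt/share/big-repo.git/objects" >> .git/objects/info/alternates

# Option 2: set it up at clone time by borrowing objects
# from a local reference repository
git clone --reference /mnt/share/big-repo.git https://example.com/big-repo.git
```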

We decided to try using network shares as alternate stores, to avoid copying a huge number of blobs from the server on every clone and fetch. This approach more or less solved the problem of how many objects had to be copied from the server, but not the problem of the size of the working directory and the index.

This was another unsuccessful attempt to use a Git feature for something it was not designed for. Alternates exist in Git to avoid re-cloning objects: when cloning the same repository a second time, you can reuse the object store of the first clone. The assumption is that all objects and packfiles are local, access to them is instant, and no extra caching is needed. Unfortunately, this does not hold when the alternate store lives on another machine on the network.

Shallow clones


Git can limit the number of commits it clones. Unfortunately, this is not enough for huge repositories like Windows, since a single commit together with its trees and blobs takes up to 80 GB. Moreover, in most cases we do not need the full contents of even one commit to work.
In addition, a shallow clone does nothing about the large number of files in the working directory.
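For reference, a shallow clone in stock Git looks roughly like this; the URL is a placeholder:

```shell
# Clone only the most recent commit of the default branch
git clone --depth 1 https://example.com/big-repo.git

# Later, pull in more history if it turns out to be needed
git fetch --deepen=100   # add up to 100 more commits of history
git fetch --unshallow    # or convert to a full clone
```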

Sparse checkout


On checkout, Git by default places every file of the commit into your working directory. However, in the .git/info/sparse-checkout file you can restrict which files and folders are allowed to appear in the working directory. We had high hopes for this approach, since most developers work with only a small subset of the files. As it turned out, sparse checkout has its drawbacks:


Despite all of the above, sparse checkout became one of the fundamental building blocks of our approach.
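As a reference point, this is how a sparse checkout was typically enabled in the Git of that time (before the dedicated "git sparse-checkout" command appeared); the folder names are placeholders:

```shell
# Allow the index to mark files as "skip-worktree"
git config core.sparseCheckout true

# List the folders that should actually appear in the working directory
cat > .git/info/sparse-checkout <<'EOF'
/src/component-a/
/src/common/
EOF

# Re-populate the working directory according to those patterns
git read-tree -mu HEAD
```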

Large File Storage (Git LFS)


Every time a large file is changed, a full copy of it ends up in Git history. To save space, Git LFS replaces these large blobs with small pointer files and moves the contents to a separate store. Thus, during a clone you download only the pointers, and LFS downloads the actual contents only for the files you check out.
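A minimal sketch of setting up Git LFS in a repository; the file pattern and paths are just examples:

```shell
# One-time setup of the LFS hooks for this repository
git lfs install

# Store matching files as pointers; the pattern is recorded in .gitattributes
git lfs track "*.vhd"

git add .gitattributes disks/base-image.vhd
git commit -m "Track disk images with Git LFS"
```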

Getting LFS to work with the Windows repository was not easy. In the end we succeeded, and that allowed us to significantly reduce the total size of the repository. But it did not solve the problems of the huge number of files and the size of the index, so we had to abandon this approach.

Virtual file system


Here are the conclusions we came to after all the experiments described above:


As a result, we decided to focus on the idea of a virtual file system, whose main advantages include:


However, it did not come without difficulties:


In the next article we will discuss how these problems were solved in GVFS.

Source: https://habr.com/ru/post/330056/

