Hi Habr! I present to you a translation of the article Git Virtual File System Design History. To be continued…
The Git Virtual File System (hereinafter GVFS) was created to solve two main problems:
- Download only the files the user actually needs.
- Local Git commands should consider only the files the user works with, not the entire working directory.
In our case, the main use case for GVFS is the Windows repository, with its 3 million files in a working directory totaling 270 GB.
To clone this repository you would have to download a 100 GB packfile, which takes several hours. Even if you managed to clone it, local git commands such as checkout (3 hours), status (8 minutes) and commit (30 minutes) would take far too long, because their running time grows linearly with the number of files. Despite all these difficulties, we decided to migrate all of the Windows code to git. At the same time, we tried to leave git itself practically untouched, since the popularity of git and the amount of publicly available information about it were among the main reasons for the migration.
It should be noted that we considered a large number of alternative solutions before deciding to build GVFS. We will describe how GVFS works in more detail in the following articles; for now we will concentrate on the options we considered and on why we ended up building a virtual file system.
Background
Why a monolithic repository?
Let's start with the simplest question: why would anyone even need a repository of this size? Just limit the size of your repositories and everything will be fine, right?
Not so simple. Many articles have already been written about the benefits of monolithic repositories. Several large teams at Microsoft had already tried to split their code into many small repositories, and in the end they concluded that a monolithic repository works better.
Splitting up a large codebase is not easy, and it does not solve every problem either. It would address scaling within each individual repository, but it would complicate making changes that span several repositories at once, and as a result releasing the final product would become more laborious. It turns out that, apart from the scaling problem, development in a monolithic repository is much simpler.
VSTS (Visual Studio Team Services)
VSTS consists of several related services. So we decided that by placing each of them in a separate git repository we would immediately get rid of the scaling problem and, at the same time, create physical boundaries between different parts of the code. In practice, these boundaries did not lead to anything good.
First, we still had to change code in several repositories at the same time. Managing dependencies and keeping commits and pull requests in the right order took a lot of time and led to a large collection of complex and fragile utilities.
Second, our release process became much more complicated. In parallel with releasing a new version of VSTS every three weeks, every three months we ship a boxed version of Team Foundation Server. For TFS to work correctly, all VSTS services must be installed on a single machine, which means every service has to know exactly which versions of the other services it depends on. Gathering together services that had been developed completely independently over the previous three months turned out to be a daunting task.
In the end, we realized it would be much easier to work in a monolithic repository: every service then depends on the same version of every other service, and changing one service means updating all the services that depend on it. A little more work up front saved us a lot of time at release. Of course, it also meant we had to be more careful about creating new dependencies and managing existing ones.
Windows
For roughly the same reasons, the team working on Windows reached the same conclusion when switching to Git. Windows code consists of several components which in theory could be split into separate repositories. However, this approach had two problems. First, although most of those repositories would be small, one of them (OneCore) would be about 100 GB, so we would still have to solve the scalability problem. Second, splitting would do nothing to make changes that span several repositories any easier.
Design philosophy
Our philosophy for choosing development tools is that the tools should support the way we want to organize our code. If you think your team will be more effective working in several small repositories, your tools should help you do that. If you think the team will be more effective with a monolithic repository, your tools should not stand in your way.
Alternatives we considered
Over the past few years we have spent a lot of time trying to make Git work with large repositories. Below are some of the approaches we considered.
Submodules
We first tried using submodules. Git allows you to reference any repository as part of another repository: for each commit in the parent repository you can record which commits of the sub-repositories it depends on and where exactly those sub-repositories should be placed in the parent's working directory. This looks like the perfect mechanism for splitting a large repository into several small ones, and we spent several months building a command-line tool on top of submodules.
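For reference, a minimal sketch of the submodule workflow, assuming hypothetical repository URLs and paths:

```
# Add a sub-repository as a submodule of the parent repository
# (URL and path are hypothetical examples)
git submodule add https://example.com/libs/component-a.git src/component-a
git commit -m "Pin component-a at its current commit"

# In a fresh clone of the parent, fetch the pinned submodule commits as well
git clone --recurse-submodules https://example.com/parent.git
git submodule update --init --recursive
```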
The main scenario for submodules is consuming the code of one repository inside another. In a sense, a submodule is like an npm or NuGet package: a library or component that does not depend on its parent. Changes are made primarily at the submodule level (it is, after all, an independent library with its own development, testing and release process), and the parent repository then switches to the new version.
We decided we could use this idea by splitting our large repository into several smaller ones and then gluing them back together into one super-repository. We even built a tool that let you run "git status" and commit across all the repositories at once.
In the end, we abandoned this idea. First, it became clear that we were only making developers' lives harder: every commit now turned into two or more commits, since both the parent and each affected submodule repository had to be updated. Second, Git has no mechanism for committing atomically across several repositories. We could, of course, make one server responsible for keeping commits transactional, but that once again runs into the scalability problem. And third, most developers do not want to be version control experts; they would rather have their tools handle that for them. To work with git a developer already has to learn to think in terms of a directed acyclic graph (DAG), which is not easy, and here we were asking them to work with several loosely coupled DAGs at once while also keeping track of the right order of checkout / commit / push across them. That is too much.
Several repositories glued together
If submodules didn't work out, maybe several repositories glued together would? Android takes a similar approach with repo.py, so we decided to try it as well. Nothing good came of it. Working within a single repository became easier, but making changes across several repositories at once became much harder. And since commits in different repositories are now completely unrelated to each other, it is unclear which commits from which repositories make up a particular version of the product. Answering that question would require yet another version control system on top of Git.
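For comparison, a sketch of how Android's repo tool is typically driven; the manifest URL here is a hypothetical placeholder:

```
# Initialize a workspace from a manifest that lists many Git repositories
repo init -u https://example.com/platform/manifest.git
# Clone or update every repository listed in the manifest
repo sync
# Show working-tree status across all repositories at once
repo status
```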
Git alternates
Git has a notion of an alternate object store. Whenever git looks up a commit, tree or blob, it first searches the .git\objects folder, then the pack files in .git\objects\pack, and finally, if configured, any alternate object stores.
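A minimal sketch of how an alternate object store is configured; the paths below are hypothetical examples:

```
# List an additional object directory in .git/objects/info/alternates
# (one path per line); git will also look there when resolving objects
echo "/mnt/shared/windows.git/objects" >> .git/objects/info/alternates

# Alternatively, git sets this up itself when cloning with a reference repo
git clone --reference /mnt/shared/windows.git https://example.com/windows.git
```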
We decided to try using network folders as alternates, to avoid copying a huge number of blobs from the server on every clone and fetch. This more or less solved the problem of how much data had to be copied from the server, but not the size of the working directory and the index.
This was another unsuccessful attempt to use a Git feature for something it was not designed for. Alternates exist in Git to avoid re-cloning objects: when cloning the same repository a second time, you can reuse the object store of the first clone. The assumption is that all objects and pack files are local, access to them is essentially instant, and no extra caching is needed. Unfortunately, none of that holds when the alternate store lives on another machine on the network.
Shallow clones
Git can limit how many commits are cloned. Unfortunately, that is not enough for a repository the size of Windows, since each commit together with its trees and blobs takes up to 80 GB. Moreover, in most cases we do not need the full contents of even one commit.
In addition, shallow cloning does nothing about the large number of files in the working directory.
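For reference, a shallow clone that fetches only the most recent commit looks roughly like this (the URL is a hypothetical placeholder):

```
# Download only the latest commit with its trees and blobs
git clone --depth 1 https://example.com/windows.git
# History can later be deepened on demand
git fetch --deepen 100
```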
Sparse checkout
On checkout, Git by default places every file from the commit into your working directory. However, in the .git\info\sparse-checkout file you can restrict which files and folders are allowed into the working directory. We had high hopes for this approach, since most developers work with only a small subset of the files. As it turned out, sparse checkout has its drawbacks:
- Sparse checkout affects only the working directory, not the index. Even if you limit the working directory to 50 thousand files, the index still contains all 3 million;
- Sparse checkout is static. If you included directory A, and someone later adds a dependency on directory B, your build is broken until you add B to your sparse checkout list;
- Sparse checkout has no effect on what is downloaded during clone and fetch, so even if your checkout never touches 95% of the files, you still have to download them;
- The user experience of sparse checkout is inconvenient.
Despite all of the above, sparse checkout became one of the fundamental building blocks of our approach.
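A minimal sketch of the classic sparse checkout setup, assuming a hypothetical directory name:

```
# Enable sparse checkout and restrict the working directory to one component
git config core.sparseCheckout true
echo "src/component-a/*" >> .git/info/sparse-checkout
# Re-read the current commit so the new rules take effect
git read-tree -mu HEAD
```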
Git LFS (Large File Storage)
Every time a large file changes, a full copy of it ends up in Git history. To save space, Git LFS replaces these large blobs with small pointer files and keeps the contents in a separate store. During cloning you download only the pointers, and LFS then fetches only the files you actually check out.
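A minimal sketch of enabling Git LFS; the file patterns are hypothetical examples:

```
# Install the LFS hooks and start tracking large binary files
git lfs install
git lfs track "*.dll" "*.lib"
# The tracking rules live in .gitattributes and must be committed
git add .gitattributes
git commit -m "Track large binaries with Git LFS"
```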
Getting LFS to work with the Windows repository was not easy. In the end we succeeded, which significantly reduced the total size of the repository. But it did nothing about the sheer number of files or the size of the index, so we had to abandon this approach as well.
Virtual file system
Here are the conclusions we came to after all the above experiments:
- A monolithic repository is the only way forward for us.
- Most developers need only a small subset of the repository's files to do their work, but they must be able to make changes in any part of the repository.
- We want to keep using the existing git client without making a huge number of changes to it.
As a result, we decided to focus on the idea of a virtual file system, the main advantages of which include:
- Only the minimum required set of blobs is downloaded. For most developers this means roughly 50-100 thousand files and their change history; most of the repository is never copied at all.
- With a few tricks, git can be made to consider only the files the developer actually works with, so operations like git status and git checkout run much faster than they would across all 3 million files.
- Using sparse checkout, we can place only the files in use into the working directory. More importantly, each checkout is limited to just the files the developer needs.
- All the development tools we use keep working without changes; the file system makes sure the requested files are always available.
Of course, this approach is not without its difficulties:
- Developing a file system is not easy.
- Performance has to be excellent. We may be forgiven a short delay the first time a file is accessed, but repeated access to the same file must be almost instant.
- Delays in accessing files can and should be reduced by preloading. As a rule, large numbers of small objects cause the biggest delays; a good example is the huge number of very small git tree objects.
- We still had to figure out how to make git itself work with a virtual file system. How do we keep git from walking all 3.5 million files when, as far as git can tell, every one of them is actually on disk? Do git commands behave well with the sparse checkout settings? Will git end up touching too many blobs?
In the next article we will discuss how these problems were solved in GVFS.