
Git scaling (and some background history)


A few years ago, Microsoft decided to begin the long process of overhauling its engineering system across the whole company. We are a big company, with many teams - each has its own products, priorities, processes, and tools. There are some "common" tools, but there are many different ones - and a VERY large number of one-off tools developed internally (and by teams I mean whole organizations, thousands of engineers).

This has a number of downsides:

  1. A lot of redundant investment in teams building similar tools.
  2. The inability to fund any one tool set to "critical mass".
  3. Friction for employees moving around the company because of different tools and processes.
  4. Difficulty sharing code across organizations.
  5. Friction for new hires getting started because of the abundance of MS-only tools.
  6. And so on...

We kicked off an initiative we call "One Engineering System", or 1ES for short. Just yesterday we had a 1ES day, where thousands of engineers gathered to celebrate progress, learn about the current state of affairs, and discuss plans. It was a surprisingly good event.
Let's get back to the topic... You might ask: hey, you've been telling us for years that Microsoft uses Team Foundation Server - have you been lying to us? No, I have not. More than 50 thousand people use TFS regularly, but they do not necessarily use it for all of their work. Some use it for everything. Some only for work item tracking. Some only for version control. Some for builds... We have internal versions (and in many cases more than one) of almost everything TFS does, and someone somewhere uses each of them. It is a bit chaotic, to be completely honest. But if you add it all up and weigh it, it is safe to say that TFS has more users than any other comparable set of tools.

I also want to note that when I say "engineering system", I use the term VERY broadly. It includes, but is not limited to:

  1. Source code management
  2. Work management
  3. Builds
  4. Releases
  5. Testing
  6. Package management
  7. Telemetry
  8. Incident management
  9. Localization
  10. Security scanning
  11. Accessibility
  12. Compliance management
  13. Code signing
  14. Static analysis
  15. and many, many others

So back to the story. When we embarked on this path, there were some fierce debates about where we were headed, what should come first, and so on. You know, developers never have opinions. :) There was no way to tackle everything at once without failing, so we agreed to start with three problems:

  1. Work planning
  2. Source control
  3. Builds

I don't want to go into the reasons in detail, except to say that these three are foundational - so much else is integrated with them and built on top of them that it made sense to start there. I will also note that we had HUGE problems with build times and reliability because of the size of our products - some of them consist of hundreds of millions of lines of code.

Over time, these three main themes have expanded, so the 1ES initiative now touches almost every aspect of our development process to some degree.

We made some interesting bets. Among them:

The cloud is the future - Most of our infrastructure and tools were on premises (including TFS). We agreed that the future is in the cloud: mobility, manageability, evolution, elasticity, all the reasons you can think of. A few years ago this was very controversial. How could Microsoft move all of its intellectual property to the cloud? What about performance? What about security? Reliability? Compliance and governance? What about... It took time, but eventually a critical mass of people came around to the idea. Over the years the decision has only become more clearly right, and now everyone is excited about the move to the cloud.

First party == third party - This is an expression (1st party == 3rd party) we use internally. It means that we strive to use our commercial products ourselves - and, conversely, to ship the products we use internally. It does not always work out 100%, and it is not always simultaneous, but that is the direction of travel: the default assumption unless there is a good reason to do otherwise.

Visual Studio Team Services is the foundation - We bet on Team Services as the backbone. We need a fabric that ties our whole engineering system together - a central hub where you learn about everything and get everything done. The hub has to be modern, rich, extensible, and so on. Every group should be able to contribute and share its specific pieces of the engineering system. Team Services fit that role well. Over the past year, the audience for these services at Microsoft has grown from a couple of thousand people to more than 50,000 regular users. As with TFS, not every group uses them for everything, but the momentum in that direction is strong.

Work planning in Team Services - Choosing Team Services made it fairly natural to also adopt its work planning capabilities. We onboarded teams like Windows, with many thousands of users and many millions of work items, into a single Team Services account. To make that work we had to do a lot of performance and scale work along the way. At this point, almost every group at Microsoft has made the transition, and all of our development is managed through Team Services.

Team Services Build orchestration and CloudBuild - I won't dig into this topic too deeply because it is gigantic in its own right. I will only say that we chose Team Services Build as our build orchestration system and Team Services Build management as our user interface. We have also developed a new "make engine" (not yet released) for some of the largest code bases; it supports fine-grained caching at large scale, parallel execution, and incrementality. We have seen builds that used to take many hours shrink to minutes. More on this in a future article.
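To make "fine-grained caching and incrementality" a bit more concrete, here is a minimal Python sketch of content-addressed build caching. It is purely illustrative and is not the actual CloudBuild engine; the cache directory, function names, and example command are all hypothetical.

```python
import hashlib
import os
import shutil
import subprocess

# Hypothetical sketch of content-addressed build caching - the core idea behind
# "fine-grained caching + incrementality": a build step is re-run only when the
# hash of its inputs changes; otherwise its cached output is reused.

CACHE_DIR = ".build-cache"  # hypothetical local cache location


def inputs_digest(input_paths, command):
    """Hash the command line plus the contents of every input file."""
    digest = hashlib.sha256(command.encode())
    for path in sorted(input_paths):
        with open(path, "rb") as f:
            digest.update(f.read())
    return digest.hexdigest()


def cached_build(input_paths, command, output_path):
    """Run `command` only if this exact (inputs, command) pair has never been built."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cached = os.path.join(CACHE_DIR, inputs_digest(input_paths, command))
    if os.path.exists(cached):                       # cache hit: reuse the old output
        shutil.copyfile(cached, output_path)
        return "cache hit"
    subprocess.run(command, shell=True, check=True)  # cache miss: build for real
    shutil.copyfile(output_path, cached)             # remember the result for next time
    return "built"


# Example (hypothetical): cached_build(["main.c"], "cc -c main.c -o main.o", "main.o")
```

Because each step's cache key is derived only from its own inputs, unrelated steps can run in parallel and an edit to one file invalidates only the steps that consume it - which is what turns multi-hour builds into minutes when most of the tree is unchanged.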

And now, after all that background, the most important part.

Git for source code management


Probably the most controversial decision we made was the choice of source control system. We had an internal system called Source Depot that absolutely everyone used in the early 2000s. Over time, TFS and its Team Foundation Version Control became popular within the company, but they never managed to win over the largest engineering teams, such as Windows and Office. I think there are many reasons. One of them is that for such large teams the cost of migration was extremely high, and the two systems (Source Depot and TFS) were not different enough to justify it.

But version control systems generate intense loyalty - more than any other developer tool. So the fight between supporters of TFVC, Source Depot, Git, Mercurial, and others was fierce, and to be honest, we made a call without reaching consensus - it just had to happen. We decided to standardize on Git, for many reasons. Over time, that decision has gained more and more supporters.

There were also plenty of arguments against choosing Git, but the most iron-clad one was scaling. There are not many companies with a code base of our size. Windows and Office in particular (there are others) are massive. Thousands of developers, millions of files, thousands of build machines constantly running. Honestly, it is mind-boggling. To be clear, when I mention Windows here I mean all versions - Windows for PC, Mobile, Server, HoloLens, Xbox, IoT, and so on. And Git is a distributed version control system (DVCS). It copies the entire repository and its entire history to your local machine. Doing that with the Windows code base would be laughable (and we did a lot of laughing about it early on). Both TFVC and Source Depot had been carefully tuned and optimized for large code bases and specific engineering teams. Git had never been applied to a problem of this size (or even within the same order of magnitude), and many argued it would never work.

The first big debate was about how many repositories to have - one for the whole company, or one for each small component? A huge range. Git had proven to work exceptionally well for a very large number of modest repositories, so we spent a lot of time thinking about breaking our monolithic code bases into many moderately sized repositories. Hmmm. Ever worked with a huge code base for 20 years? Ever tried to go back and break it into small repositories? You can guess the answer we arrived at. That code is very hard to pull apart. The cost would be too high. The risks from that level of churn would be enormous. And we really do have scenarios where a single engineer needs to make sweeping changes across a very large amount of code. Coordinating that across hundreds of repositories would be very problematic.

After a lot of arm wrestling, we decided that our strategy should be "the right number of repositories, based on the nature of the code". Some code is separable (like microservices) and is ideal for isolated repositories. Some code cannot be split apart (like the Windows core) and should be treated as a single repository. And I want to stress that this is not only about the difficulty of breaking the code apart. Sometimes, with large, interconnected code bases, it really is better to treat the code base as a whole. Maybe someday I will tell the story of the Bing team's attempts to split the key components of the Bing platform into separate packages - and the versioning problems they ran into. They are moving away from that strategy now.

So we had to start scaling Git to work on code bases with millions of files and hundreds of gigabytes, used by thousands of developers. Incidentally, even Source Depot never scaled to the entire Windows code base. It had been split across more than 40 repositories so that it could scale at all. But a layer was built on top of it so that, in most cases, the code base could be treated as a whole. That abstraction was not perfect and definitely caused some friction.

We made at least two failed attempts to scale Git. Probably the most significant was trying to use Git submodules to stitch many repositories together into a single "super repository". I won't go into the details, but after 6 months of work on the project we realized it was not going to function - too many edge cases, too much complexity, too fragile. We needed a proven, reliable solution that would be well supported by virtually all Git tooling.

Almost a year ago, we went back to the drawing board and focused on the question of how to actually scale Git to a single repository containing the entire Windows code base (including projected growth and history), and how to support all the developers and build machines.

We tried "virtualizing" Git. Normally Git downloads everything when you clone. But what if it didn't? What if we virtualized the storage underneath it so that it downloads only the parts it needs? Then cloning a 300 GB repository becomes very fast. As I issue read/write commands, the system quietly pulls content down from the cloud (and then stores it locally, so future accesses to that data are local). The only downside is the loss of offline support. To work offline you have to "touch" everything once so that it is present locally; other than that, nothing changes - you still get the full Git experience. And for our huge code bases this virtualization trade-off was acceptable.
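As a rough illustration of that "fetch on first read, then serve locally" behavior, here is a minimal Python sketch. It is not GVFS; the cache directory, endpoint URL, and function name are all made up for the example.

```python
import os
import urllib.request

# Hypothetical sketch of "virtualized" content access: an object is fetched from
# the server only on first use and then cached locally, so later reads are local.

LOCAL_CACHE = os.path.expanduser("~/.lazy-objects")   # hypothetical cache directory
REMOTE_BASE = "https://example.com/objects/"          # hypothetical object endpoint


def read_object(object_id: str) -> bytes:
    """Return the object's bytes, downloading them only the first time they are needed."""
    local_path = os.path.join(LOCAL_CACHE, object_id)
    if not os.path.exists(local_path):                 # first access: go to the cloud
        os.makedirs(LOCAL_CACHE, exist_ok=True)
        with urllib.request.urlopen(REMOTE_BASE + object_id) as resp:
            data = resp.read()
        with open(local_path, "wb") as f:              # cache for future local reads
            f.write(data)
    with open(local_path, "rb") as f:                  # every later access is local
        return f.read()
```

The offline limitation mentioned above falls straight out of this model: anything not yet in the local cache simply cannot be read without the server.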

It was a promising approach, and we started building a prototype. We called the project Git Virtual File System, or GVFS. We set a goal of making minimal changes to git.exe. We certainly did not want to fork Git - that would be a disaster. And we did not want to change it in ways the community would never accept. So we chose a middle path in which the maximum amount of change happens "underneath" Git - in a virtual file system driver.

The virtual file system driver basically virtualizes two things:

  1. The .git folder, where the pack files, history, and so on are stored. By default, everything lives there. We virtualized it so that we pull down only the files we need, and only when we need them.
  2. The "working directory" - the place where you actually edit your source code, build it, and so on. GVFS monitors the working directory and automatically "checks out" any file you touch, which makes it look like all the files are really there, without consuming any resources unless you actually access a particular file (a toy model of this projection is sketched below).
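The sketch below is a toy model of that projection, purely illustrative: real GVFS does this inside a kernel-level file system driver, not in Python, and the class and callback names here are invented. The point is the shape of the idea - the file listing comes from a manifest, and content is materialized only on first access.

```python
# Toy model of working-directory projection: the file *listing* comes from a
# manifest, while file *contents* are materialized only on first access.
# Purely illustrative; GVFS implements this inside a file system filter driver.

class ProjectedWorkingDirectory:
    def __init__(self, manifest, fetch_content):
        self.manifest = manifest            # {path: object_id} for every file in the repo
        self.fetch_content = fetch_content  # callback that downloads a blob on demand
        self.hydrated = {}                  # files whose real content is present locally

    def listdir(self):
        """Every path appears to exist, even though almost nothing has been downloaded."""
        return sorted(self.manifest)

    def open(self, path):
        """First access 'checks out' the file; later accesses are purely local."""
        if path not in self.hydrated:
            self.hydrated[path] = self.fetch_content(self.manifest[path])
        return self.hydrated[path]

    def modified_paths(self):
        """Only hydrated files can have local edits, which keeps status-like scans small."""
        return set(self.hydrated)
```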

As our work progressed, as you can imagine, we learned a lot. Among other things, we learned that the Git server has to be smart. It has to pack the Git files in an optimal way so that it does not send the client more than it really needs - think of it as optimizing locality of reference. So we made a lot of improvements to the Team Services/TFS Git server. We also found that Git has many scenarios in which it touches files it really should not. Previously that never mattered, because everything was local and Git was used on moderately sized repositories, so it was very fast - but touching everything when that means downloading or scanning 6,000,000 files from the server is no joke. So we spent a lot of time optimizing Git's performance. Many of the optimizations we made will benefit "normal" repositories to some degree, but they are critical for mega-repositories. We have submitted many of these improvements to the Git OSS project and have enjoyed a good collaboration with them.
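To see why "touching everything" hurts at this scale, compare a naive status-style scan that hashes every file in the tree with one that only examines the files the driver reported as written. This is a hypothetical sketch for illustration, not Git's actual implementation.

```python
import hashlib
import os

# Hypothetical sketch of why "touch everything" is ruinous at mega-repo scale:
# a naive scan walks and hashes every file, while a driver that records which
# files were actually written only has to examine that (tiny) set.


def naive_status(root, index):
    """Hash every file under `root` and compare with the index: O(all files)."""
    changed = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            if index.get(path) != digest:
                changed.append(path)
    return changed


def tracked_status(touched_paths, index):
    """Only re-hash files the file system driver reported as written: O(touched files)."""
    changed = []
    for path in touched_paths:
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        if index.get(path) != digest:
            changed.append(path)
    return changed
```

With millions of files, the first approach means millions of stat/hash operations (and, in a virtualized repo, potentially millions of downloads); the second touches only what actually changed.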

So, fast forward to today. It works! We have packed all the code from the 40+ Windows Source Depot servers into a single Git repository hosted in VS Team Services - and it performs well. You can enlist in a couple of minutes and do all of your normal Git operations in seconds. And in every sense it is transparent. It's just Git. Your developers keep working the way they always did, using the tools they always used. Your builds just work, and so on. It's simply amazing. Magic!

As a side benefit, this approach handles large binary files well. It does not bolt a separate mechanism onto Git the way LFS does. You can work with large binary files just like any other files, and only the blobs you actually touch get downloaded.

Git Merge


At the Git Merge conference in Brussels, Saeed Noursalehi shared with the world what we are doing - including the gory details of the work we have done and what we have learned. At the same time, we released all of our work as open source. We also published several additions to the server protocols that needed to be introduced. You can find the GVFS project, and all the changes we made to Git.exe, in Microsoft's repositories on GitHub. GVFS relies on a new Windows filter driver (the moral equivalent of the FUSE driver on Linux), and we worked with the Windows team to release that driver early so that you can try GVFS. For more information and links to additional resources, see Saeed's post. You can study it all, and you can even install GVFS and try it out.

While I am pointing out how well GVFS performs, I want to emphasize that there is still much to do. We are not done. We think we have proven the concept, but a lot of work remains to put it into practice. We are making the announcement and publishing the source code now in order to engage the community in working on it together. Together, we can scale Git to the largest code bases.

Sorry for the long post; I hope it was interesting. I am excited about the work we have done - both within the 1ES initiative at Microsoft and on scaling Git.

Source: https://habr.com/ru/post/325116/

