Traditionally, the Windows development team signs a poster (in this case, a DVD image) to mark the release of a new version of Windows. By the end of the release party, it carries hundreds, if not thousands, of signatures.

"Experience is something you get just after you need it." - Steven Wright
I enjoyed Terry Crowley's thoughtful blog post, "What Really Happened with Vista". Terry worked in the Office group and did a fantastic job describing the intricate intrigues around Windows Vista and the related but abandoned Longhorn project from the perspective of an outside observer.
He correctly identified many of the problems that plagued the project, and I don't want to repeat them here. I just thought it would be fair to present an insider's view of the same events. I can't hope to be as eloquent or exhaustive as Terry, but I do hope to shed some light on what went wrong. Ten years have passed since the release of Windows Vista, but these lessons seem more relevant now than ever.
Windows is a monster: thousands of developers, testers, program managers, security experts, UI designers, architects, and so on. And that's not counting the HR personnel, recruiters, marketing folks, salespeople, lawyers, and, of course, the many managers, directors, and vice presidents in each of those disciplines. The whole group is surrounded by many thousands more at our partners (both inside and outside Microsoft), who supplied everything from hardware and device drivers to the applications running on the platform.
Aerial photo of the Windows development team on the soccer field at Microsoft

At the time, Windows was organizationally divided into three groups: Core, Server, and Client. The Core group delivered the "framework": all the key components of the operating system shared by every version of Windows (the kernel itself, the storage stack, security, the networking stack, device drivers, the installation and update model, Win32, and so on). The Server group focused on technologies for the server market (terminal services, clustering and high availability, enterprise management tools, and the like), while the Client group was responsible for the desktop and consumer experience (the web browser, the media player, graphics, the shell, and so on).
Of course, there were many reorganizations, but this basic structure persisted even as Windows grew more popular and the groups themselves swelled in size. It's also fair to say that, culturally and organizationally, the Core group was closer to the Server group than to the Client group; at least, that was the case before Vista shipped.
By the time I joined Microsoft in early 1998, Windows meant Windows NT: architecturally, organizationally, and as a product. The Windows 95 code base had largely been abandoned, and Windows NT underpinned every flavor of Windows, from laptops to clustered servers. The Windows 95/98 code base was resurrected two years later for one final release, the much-maligned Windows ME, but that project was run by a small group while the vast majority worked on the NT code base. I was lucky enough to spend more than a decade in the belly of the beast, starting in the middle of Windows 2000 development and staying through the completion of Windows 7.
I spent the first seven years in the groups responsible for storage, file systems, high availability/clustering, file-level network protocols, distributed file systems, and related technologies. Later, I spent a year or two in Microsoft's security division, which covered everything from security technologies in Windows to antivirus products, security marketing, and emergency responses such as security patches. This was toward the end of the Vista life cycle, when viruses and worms were bringing Windows to its knees and Microsoft's reputation as a maker of secure software was taking a very public beating.
For the last three or four years, during the development of Windows 7, I ran all of development for the Core group in Windows. That meant owning virtually all the technologies that run under the hood and are used by both the Server and Client groups. After Vista shipped, the entire Windows team was organized by discipline into triads (Dev, Test, PM) at every level of the organization, so I had two partners in crime: I led the development teams, while they led the test and program management teams, respectively.
The Windows team had a history of taking on ambitious, massive projects that were abandoned or repurposed years later. An earlier example was the ambitious Cairo project, which was eventually gutted; only some of its pieces survived to ship in Windows 2000.
In my humble opinion, the biggest problem with Windows releases up to that point was the length of each release cycle. On average, each release took about three years from start to finish, but only 6-9 months of that time was spent developing "new" code. The rest went to integration, testing, and the alpha and beta stages, each lasting several months.
Some projects needed more than six months of core development, so they were built in parallel branches and merged into the main code base on completion. This meant the main branch was perpetually in limbo as large chunks of functionality were added to it or changed within it. During Windows 7 development we imposed much tighter controls to keep the code base continuously healthy and functional, but earlier releases sat in an unhealthy, unstable state for months at a time.
The chaotic nature of development often tempted teams into dangerous games of schedule chicken. They convinced themselves and others that their code was in better shape than the other projects, and that they could "polish off" the remaining pieces just in time, so they were allowed to check their component in half-finished.
The three-year release cycle meant we rarely knew what the competitive landscape and the surrounding ecosystem would look like when the release finally shipped. If a feature wasn't finished in time, it was either dropped entirely (since it would hardly make sense six years after development began) or, worse, "sent to Siberia": development on the component continued, largely ignored by the rest of the organization and doomed to failure or irrelevance, because the team or its leadership simply couldn't bring themselves to kill it. I was personally responsible for several such projects. Hindsight is 20/20.
Given that every team was busy pushing its own agenda and its own feature set into the release, each skimped on integration with other components, on the user interface, on end-to-end testing, and on unpleasant, tedious chores like upgrade, leaving the hard parts for later. That, in turn, meant a few teams quickly became the bottleneck for the whole release, and at the last minute everyone piled in to help finish the UI or test the upgrade scenarios.
At any point in time, several major releases were in flight, along with numerous side projects. Different teams owned code bases in different states of readiness, which ultimately produced a dynamic where the rich got richer and the poor got poorer. Teams that fell behind, for whatever reason, usually stayed behind.
As a project neared completion, program managers would start writing requirements for the next release, and the "healthy" (rich) teams would start writing new code while most of the organization (the poor) was still slogging through the current release. Test teams, in particular, were rarely freed up until a release shipped, so new code written early in a project went largely untested. The "unhealthy" teams perpetually lagged, putting the finishing touches on the current release and falling further and further behind. These were the teams with the most burned-out developers and the lowest morale, which meant new hires inherited fragile code they hadn't written and therefore didn't understand.
For almost the entire Vista/Longhorn development period, I was responsible for storage and file system technologies. That meant I was involved in the WinFS project, although it was mostly driven by the SQL database team, a sister organization to the Windows team.
Bill Gates was personally involved in the project at a very detailed level; he was even jokingly called "the WinFS PM". Hundreds, if not thousands, of person-years went into developing an idea whose time, in hindsight, had simply passed: what if we combined the query capabilities of a database with the file system's ability to stream and store unstructured data, and exposed the result as a programming paradigm for building unique new rich applications?
In hindsight, it's obvious that Google neatly solved the same problem by providing transparent, fast indexing of unstructured data. And they did it for the entire Internet, not just your local disk. You didn't even need to rewrite your applications to benefit from it. Even if WinFS had succeeded, it would have taken years for applications to be rewritten to take advantage of it.
When Longhorn was canceled and Vista was hastily assembled from its smoldering embers, WinFS was dropped from the OS release. The SQL group carried it on as a separate project for several more years. By then, Windows had gained a built-in indexing engine and integrated search, implemented entirely on the side with no application changes required. So the case for WinFS grew even murkier, yet the project continued.
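To make the paradigm concrete, here is a minimal sketch in Python of the "query your files like a database" idea: index file metadata into SQLite, then run ad-hoc SQL over it. Everything here (the table layout, column names, and query) is invented for illustration; in spirit it is much closer to the side-mounted desktop search engines that won out than to the real WinFS, which embedded SQL Server technology into the OS itself.

import os
import sqlite3
import time

def build_index(root: str) -> sqlite3.Connection:
    """Walk a directory tree and index basic file metadata into SQLite."""
    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE files (
                      path TEXT, name TEXT, ext TEXT,
                      size INTEGER, mtime REAL)""")
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            try:
                st = os.stat(full)
            except OSError:
                continue  # skip files that vanish or are unreadable mid-walk
            db.execute(
                "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
                (full, name, os.path.splitext(name)[1].lower(),
                 st.st_size, st.st_mtime))
    db.commit()
    return db

if __name__ == "__main__":
    db = build_index(os.path.expanduser("~/Documents"))
    # The payoff WinFS promised: ad-hoc structured queries over
    # unstructured storage, e.g. the largest documents touched this week.
    week_ago = time.time() - 7 * 86400
    query = ("SELECT path, size FROM files "
             "WHERE ext IN ('.doc', '.docx', '.pdf') AND mtime > ? "
             "ORDER BY size DESC LIMIT 10")
    for path, size in db.execute(query, (week_ago,)):
        print(f"{size:>12,}  {path}")

Note what the sketch leaves out, which is exactly where the hard part lived: keeping the index transactionally in sync with the file system, and getting applications to adopt the new programming model.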
The massive security-related architectural changes in Longhorn continued as part of the Windows Vista project after the Longhorn "reset". We had learned a great deal about security in the rapidly expanding universe of the Internet and wanted to apply that knowledge at the architectural level of the OS to improve security for all users.
We had no choice. Windows XP had shown that we were victims of our own success: a system designed for usability clearly fell short of the security demands of the Internet era. Addressing those problems required a parallel project, Windows XP Service Pack 2, which (despite its name) was a huge undertaking that siphoned thousands of people away from Longhorn.
In our next major OS release we absolutely could not take a step backward on security. So Vista was far more secure than any OS Microsoft had ever shipped, but in the process it broke application and device driver compatibility on a scale the ecosystem had never seen. Users hated it because their applications didn't work, and our partners hated it because they felt they hadn't been given enough time to update and certify their drivers and applications; Vista was rushed out the door to compete with a resurgent Apple.
Many of these security changes required deep architectural changes in third-party applications, and most ecosystem vendors weren't willing to invest that much in their legacy code. Some of them had taken the shortcut of patching kernel data structures and even kernel instructions to implement their functionality, bypassing the official APIs and the multiprocessor locking protocols, which often wreaked havoc on the system. At one point, roughly 70% of all Windows "blue screens" were caused by these third-party drivers and their refusal to use the official APIs. Antivirus vendors were especially fond of this approach.
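As a user-mode toy (an analogy only; kernel-mode patching fails in far nastier ways, up to and including blue screens), the sketch below shows why bypassing an owner's locking protocol corrupts shared state. One thread updates a shared structure under its lock, while a second "patcher" thread writes to it directly; the well-behaved thread's updates are silently destroyed. All names are invented for the sketch.

import threading
import time

lock = threading.Lock()
table = {"handlers": 0}  # stand-in for some shared OS structure

def well_behaved(n: int) -> None:
    """Update the structure under the owner's locking protocol."""
    for _ in range(n):
        with lock:
            table["handlers"] += 1

def patcher(n: int) -> None:
    """Bypass the lock: read, yield the CPU, write a stale value back."""
    for _ in range(n):
        current = table["handlers"]      # unsynchronized read
        time.sleep(0)                    # yield, widening the race window
        table["handlers"] = current + 1  # clobbers concurrent updates

if __name__ == "__main__":
    threads = [threading.Thread(target=well_behaved, args=(50_000,)),
               threading.Thread(target=patcher, args=(50_000,))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # If both writers honored the protocol, the total would be 100,000;
    # the unlocked writer silently loses many of the locked writer's updates.
    print("final:", table["handlers"], "expected:", 100_000)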
In my security role at Microsoft, I personally spent years explaining to antivirus vendors why we would no longer let them patch kernel instructions and data structures in memory, why this was a security risk, and why they needed to move to the official APIs going forward: we would no longer support their legacy programs with deep hooks into the Windows kernel, the very techniques hackers used to attack users' systems. Our "friends" the antivirus vendors turned around and sued us, claiming we were depriving them of their livelihood and abusing our monopoly position! With friends like these, who needs enemies? They simply wanted their old solutions to keep working, even if that meant weakening the security of our mutual customers, the very people they were supposed to protect.
Over those years, the computing industry went through wave after wave of cardinal change: the arrival of the Internet, the proliferation of mobile phones, the emergence of cloud computing, new advertising-based business models, the viral growth of social media, the inexorable march of Moore's law, and the popularity of free software. These are just some of the forces that attacked Windows from every side.
The response was entirely logical for a wildly successful platform: stay the course and keep improving the existing system incrementally. It was the innovator's dilemma in a nutshell. But the more code we added, the more complex the system became, the larger the staff grew, the bigger the ecosystem got, and the harder it became to catch up with competitors gaining ground on us.
As if that pressure weren't enough, at one point whole armies of our engineers and program managers spent countless hours, days, weeks, and months with Department of Justice representatives and corporate lawyers, documenting the existing APIs from previous releases to comply with government antitrust rulings.
The harsh reality was that a major Windows release took about three years, and that was far too long for a fast-moving market. WinFS, security, and managed code were just a few of the massive projects on the Longhorn agenda. And there were hundreds of smaller bets.
When you have an organization of many thousands and literally billions of users, you have to cater to everyone. The same OS release that was supposed to run on tablets and smartphones also had to run on your laptop, on servers in the data center, and on embedded devices such as network-attached storage and other "Powered by Windows" boxes, not to mention running on top of the hypervisor (Hyper-V) in the cloud. These requirements pulled the team in opposing directions as we tried to make progress in every market segment at once.
Longhorn and Vista can't be judged in isolation. They make sense only alongside the releases immediately before and after them, Windows 2000 and XP on one side and Windows Server 2008 and Windows 7 on the other, and with a full understanding of the broader industry context in retrospect.
Windows became a victim of its own success. It conquered too many markets, and the business in each of those segments exerted its own influence on the design of the operating system, pulling it in different, often mutually incompatible directions.
An architecture that was exceptionally successful in the 1990s simply buckled a decade later, because the world was changing too fast while the organization struggled to keep up. To be clear, we saw all of these trends and tried hard to respond, but, if you'll forgive a mixed metaphor, it's hard to turn an airliner around when you're two years pregnant with a three-year release.
In short, what we knew three or four years earlier, when we planned a given OS release, was laughably outdated and sometimes flat-out wrong by the time the product shipped. The best response would have been to switch to gradual, painless delivery of new cloud services to an ever-simpler device. Instead, we kept adding features to a monolithic client system that required many months of testing before every release, slowing us down exactly when we needed to speed up. And of course we never bothered to remove old functionality, which was needed for compatibility with applications written for older versions of Windows.
Now imagine supporting the same OS for a decade or more for billions of users, millions of companies, thousands of partners, hundreds of scenarios, and dozens of form factors, and you begin to appreciate the support and servicing nightmare.
In hindsight, Linux has been more successful on this front. Open source, and the development approach it implies, was undoubtedly part of the answer. The modular, pluggable architecture of Unix/Linux is also a significant architectural improvement in this respect.
Sooner or later, every organization starts shipping its org chart as a product, and Windows was no exception. Open source doesn't have that problem.
"War Room" Windows, later renamed the "bridge" (ship room)If you want, add to this internal organizational dynamics and personality. Each of us had our favorite features, partners from our own ecosystem were pushing us to support new standards to help them get certified on the platform, to add an API for their specific scenarios. Everyone had ambitions and a desire to prove that our technology, our idea would win ... if only we included it in the next release of Windows and instantly delivered to millions of users. We believed in it strongly enough to conduct battles at daily meetings in our military rooms. Each also had a manager who was eager to increase and expand his sphere of influence - or the number of his employees as an intermediate step along the way.
Development teams and test teams were often at odds. The former pushed to get their code checked in, while the latter were rewarded for finding ever more complex and esoteric test cases that bore no real resemblance to customer environments. The internal dynamics were complicated, to put it mildly.
As if that weren't enough, the company went through a massive reorganization at least once a year, and with it came new organizational dynamics to absorb.

By the way, none of this should be taken as an apology or an excuse. That's not the point.
Did we make mistakes? Yes, plenty.
Did we knowingly make bad decisions? No, I can't think of a single one.
Was this an incredibly complex product with an incredibly huge ecosystem (the largest in the world at the time)? Yes, it was.
Could we have done better? Absolutely.
Would we make different decisions today? Yes.
Hindsight is 20/20. We didn't know then what we know now.
Should we look back in disappointment or regret? No. I prefer to take away the lessons and learn from them. I'm sure none of us repeated the same mistakes in subsequent projects. We learned from that experience, which means we'll make entirely different mistakes next time. To err is human.