
Hi, my name is Arthur de Haan, and I am responsible for testing and system design in Windows Live. I would like to give you a glimpse behind the scenes of Hotmail and tell you more about what it takes to build, deploy, and run Windows Live Hotmail on such a global scale.
Storing your mail and data (and our own data) on our servers is a big responsibility, and we pay a lot of attention to quality, performance, and reliability. We make significant investments in engineering and infrastructure so that Hotmail works 24 hours a day, day after day, year after year. You will rarely hear about these efforts; they usually only come up in those rare cases when something goes wrong and the service has a problem.
Hotmail is a gigantic service in all dimensions. Here are some of the main ones:
- Our service is used all over the world: Hotmail is available in 59 regional markets and 36 languages.
- We provide over 1.3 billion mailboxes (some users have multiple mailboxes)
- More than 350 million people actively use Hotmail every month (according to comScore, August 2009)
- We process over 3 billion messages per day and filter over 1 billion spam emails
- Data grows by 2 petabytes per month
- We currently have over 155 petabytes of data stored (70% of that is attachments, mostly photos)
- We run the world's largest SQL Server 2008 deployment, monitoring and managing many thousands of SQL servers
You can imagine that the user interface of Hotmail is just the tip of the iceberg; most of the innovation happens inside the service and is not visible to the user. In this post I will give a high-level overview of the architecture of the entire system. We will dive deeper into some of the features in later posts. (Translator's note: if the community likes this article, I can translate those follow-up posts.)
Architecture
Hotmail and other Windows Live services run in several data centers around the world. Hotmail is organized into logical, scalable units called clusters. In addition, each data center has infrastructure shared across its clusters:
- Servers for incoming and outgoing mail processing
- Spam filters (Translator's note: if the community likes this article, I can also translate the blog post about spam filtering in Hotmail)
- Storage of user data and data obtained from our monitoring systems
- Incident monitoring and response infrastructure
- Infrastructure for automated code deployment and configuration updates
Each cluster holds several million users (how many depends on the age of the hardware) and has a self-contained set of servers, including:
- Frontend servers - servers that scan messages for viruses and host the code that talks to your browser or email client over protocols such as POP3 and DeltaSync
- Backend servers - SQL servers, file servers, spam filters, storage for monitoring data, monitoring agents, directory servers, and inbound and outbound mail processing
- Load balancers - hardware and software that spread load evenly across servers to improve overall performance
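
To make the load-balancing idea a little more concrete, here is a minimal, purely illustrative sketch in Python of how a balancer might spread requests across a pool of frontend servers while skipping unhealthy ones. The class, method, and server names are hypothetical and not part of Hotmail's actual implementation.

```python
import itertools

class FrontendPool:
    """Toy round-robin balancer that skips servers marked unhealthy.

    Purely illustrative; real load balancers are dedicated hardware and
    software with far richer health checks and traffic policies.
    """

    def __init__(self, servers):
        self.servers = list(servers)          # hypothetical frontend hostnames
        self.healthy = set(self.servers)      # updated by monitoring probes
        self._cycle = itertools.cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def pick(self):
        # Walk the ring until a healthy server turns up.
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy frontend servers available")

pool = FrontendPool(["fe01", "fe02", "fe03"])   # made-up server names
pool.mark_down("fe02")
print([pool.pick() for _ in range(4)])          # fe02 is skipped: ['fe01', 'fe03', 'fe01', 'fe03']
```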
Preventing outages and data loss is our highest priority, and we take every precaution to keep them from happening. We designed the service to handle failures gracefully, on the assumption that anything that can fail eventually will. We do see hardware failures: among the hundreds of thousands of hard drives we use, some inevitably fail. Fortunately, thanks to the architecture and timely handling of failures, customers rarely notice this kind of failure.
Here are some of the ways we guard against failures:
- Redundancy - we use a combination of SQL servers and storage arrays to store your data, with both active and passive failover. That is a fancy way of saying we have many servers and many copies of your data that are kept constantly in sync. In general, we keep four copies of your data on different disks and servers to minimize the chance of data loss in case of a hardware failure (see the sketch after this list).
- Another advantage of this architecture is that we can perform scheduled maintenance, such as deploying updates or applying security fixes, without downtime.
- Monitoring - we have an extensive monitoring system for both software and hardware. Thousands of servers monitor the health of the service, its transactions, and overall system performance. Because the service is so large, we track performance and uptime in aggregate as well as per cluster and per geographic region. We want to make sure your individual experience is not lost when we look at the overall numbers; we care about every one of our users. In future posts we will talk more about monitoring and performance.
- Response team - we have a 24/7 response team that watches our global monitoring systems and takes action whenever there is a problem. We also have an escalation process that can bring in our engineers within minutes when necessary.
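
To illustrate the redundancy idea from the list above — several synchronized copies of your data with active/passive failover — here is a small Python sketch. The four-copy count mirrors what was described above, but the class, store names, and API are hypothetical, not how Hotmail actually stores mail.

```python
class ReplicatedMailbox:
    """Toy model of keeping several synchronized copies of a user's data.

    Illustrative only: the real system uses SQL servers, storage arrays,
    and automated failover, not an in-memory dict.
    """

    def __init__(self, replica_names, copies=4):
        # e.g. four stores on different disks/servers, as described in the post
        self.replicas = {name: {} for name in replica_names[:copies]}
        self.active = replica_names[0]          # the active copy serves traffic

    def write(self, message_id, body):
        # A write is applied to every copy so no single failure loses data.
        for store in self.replicas.values():
            store[message_id] = body

    def fail_over(self):
        # Passive copies are already in sync, so promotion is just a switch.
        survivors = [n for n in self.replicas if n != self.active]
        self.active = survivors[0]

    def read(self, message_id):
        return self.replicas[self.active].get(message_id)

box = ReplicatedMailbox(["store-a", "store-b", "store-c", "store-d"])
box.write("msg-1", "hello")
box.fail_over()                 # the active store "fails"; a passive copy takes over
print(box.read("msg-1"))        # 'hello' - the data survives the failure
```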
Process
I have talked a little about our architecture and the steps we take to keep the service running without interruption. But the service is not static: besides growing with use, we update it regularly. So our processes are just as important as the architecture in delivering an uninterrupted service. We take certain precautions whenever we deploy new code, from patches and small updates to major releases.
Testing and deployment. For every developer we have a test engineer who works hand in hand with that developer: contributing to the design and writing of specifications, building test infrastructure, writing automated tests for new features, and ensuring quality. When we talk about quality, we mean not just stability and reliability, but also ease of use, performance, security, accessibility (for users with disabilities), privacy, scalability, and correct behavior in all browsers.
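
For a flavor of what "writing automated tests for new features" can look like, here is a tiny, hypothetical example in Python. The function under test and its behavior are invented purely for illustration; they are not Hotmail code.

```python
import unittest

def classify_attachment(filename):
    """Hypothetical helper a feature team might ship: bucket attachments by type."""
    photo_ext = {".jpg", ".jpeg", ".png", ".gif"}
    suffix = filename[filename.rfind("."):].lower() if "." in filename else ""
    return "photo" if suffix in photo_ext else "other"

class ClassifyAttachmentTests(unittest.TestCase):
    # The kind of small, automated check a test engineer would add alongside a feature.
    def test_photos_are_detected(self):
        self.assertEqual(classify_attachment("vacation.JPG"), "photo")

    def test_non_photos_fall_through(self):
        self.assertEqual(classify_attachment("report.pdf"), "other")
        self.assertEqual(classify_attachment("README"), "other")

if __name__ == "__main__":
    unittest.main()
```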
Since we are a free, advertising-funded service, we have to be extremely efficient. The deployment, configuration, and maintenance of our systems are therefore highly automated. Automation also reduces the risk of human error.
Code deployment and change management. We have thousands of servers in a test lab where we deploy and test code long before it reaches customers. In the data centers we also have clusters specifically reserved for "dogfood" testing and beta versions in the final stage of development. We validate every change in our labs, whether it is a hardware or software update or a security fix.
When all engineering teams (including test engineers) sign off on a release, we begin a gradual deployment of the update to clusters around the world. We usually spread this over several months, not only because it simply takes time, but also to make sure it does not affect the quality and performance of the service.
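
As an abstract illustration of a gradual, cluster-by-cluster rollout, here is a short Python sketch. The wave sizes, cluster names, and health check are assumptions made up for the example; the real deployment tooling and its safeguards are far more involved.

```python
def staged_rollout(clusters, wave_sizes, deploy, is_healthy):
    """Deploy to clusters in growing waves, stopping if health degrades.

    Illustrative only: wave sizes, pacing, and health criteria here are
    assumptions, not Hotmail's actual deployment policy.
    """
    done = 0
    for size in wave_sizes:
        wave = clusters[done:done + size]
        for cluster in wave:
            deploy(cluster)
            if not is_healthy(cluster):
                # Halt the rollout; remaining clusters keep the old version.
                return f"halted after {cluster}: rollback/investigation needed"
        done += size
    return f"rollout complete on {done} clusters"

# Hypothetical usage: 1 cluster first, then 3, then the remaining 6.
clusters = [f"cluster-{i:02d}" for i in range(10)]
result = staged_rollout(
    clusters,
    wave_sizes=[1, 3, 6],
    deploy=lambda c: print(f"deploying to {c}"),
    is_healthy=lambda c: True,   # stand-in for real monitoring signals
)
print(result)
```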
We can also turn individual features on or off independently. Sometimes we deploy updates but hold off on enabling them. In rare cases we disable certain features for security or performance reasons.
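
The pattern of deploying code but enabling features separately is commonly implemented with feature flags. Here is a minimal, hypothetical sketch of that pattern in Python; the flag name and in-memory storage are inventions for illustration, not Hotmail's actual mechanism.

```python
class FeatureFlags:
    """Tiny feature-flag store: code ships dark and is switched on later.

    Illustrative only; the flag names and in-memory set are hypothetical.
    """

    def __init__(self):
        self._enabled = set()

    def enable(self, flag):
        self._enabled.add(flag)

    def disable(self, flag):
        # e.g. turning a feature off for security or performance reasons
        self._enabled.discard(flag)

    def is_enabled(self, flag):
        return flag in self._enabled

flags = FeatureFlags()

def render_inbox(flags):
    # The new code is already deployed, but the feature only appears once the flag is on.
    if flags.is_enabled("new-inbox-view"):
        return "rendering the new inbox view"
    return "rendering the classic inbox view"

print(render_inbox(flags))        # classic view: the flag is still off after deployment
flags.enable("new-inbox-view")
print(render_inbox(flags))        # new view, enabled without redeploying any code
```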
Conclusion
This post should give you a sense of the scale of engineering that goes into Hotmail. We are committed to technical excellence and to continuously improving our services for you. We keep learning as the service grows, and we listen to your feedback; seriously, you can leave your thoughts and questions for me in the comments here. I am a passionate fan of our services, as is the whole Windows Live team: we may be engineers, but we use the services ourselves, alongside millions of our users.