Box.net and Zynga recently gave presentations on how they use public computing clouds in their infrastructure. The topic interested me, especially in light of the April 2011 failure of several Amazon EC2 availability zones, which made a number of large websites and Facebook games unavailable for several days. The presentations were brief, and the speakers did not disclose specific implementation details. But even this superficial information is of interest.
Box.net provides a business-grade remote storage service. More than 2500 virtual machines, over 500 of which run MySQL servers, serve 300 million documents and more than 100 TB of disk space. Box.net uses Scalr to manage and scale its cloud. Opscode and Puppet are used to manage software versions and configuration.
Scalr handles monitoring, load balancing, and the addition of new virtual machines. The virtual machines are distributed across three public clouds (Amazon EC2, Rackspace, and OpenStack), which allows Box.net to survive the failure of any two clouds. Scalr adds copies of virtual machines automatically through each cloud's API. The hardest part of scaling a site is scaling the database, and Scalr handles this as well. If a MySQL replica fails in one of the clouds, it is simply re-created in the same cloud by copying another replica. If the MySQL master fails, the application is put into read-only mode, after which one of the replicas clones itself and then declares itself the master. All remaining replicas switch to the new master, and the application resumes normal operation.
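The two recovery paths described above can be sketched as a toy in-memory model. This is only an illustration of the sequence of steps from the talk, not Scalr's actual implementation: the `Cluster` class, its method names, and the server names are all hypothetical, and no real MySQL or cloud API is called.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of one MySQL master with replicas (all names are hypothetical)."""
    master: str
    replicas: list[str] = field(default_factory=list)
    read_only: bool = False
    log: list[str] = field(default_factory=list)

    def replace_replica(self, failed: str) -> str:
        """A replica failed: re-create it by copying a healthy replica."""
        self.replicas.remove(failed)
        if not self.replicas:
            raise RuntimeError("no healthy replica left to clone from")
        source = self.replicas[0]
        clone = f"{failed}-clone"
        self.replicas.append(clone)
        self.log.append(f"cloned {source} -> {clone}")
        return clone

    def fail_over_master(self) -> str:
        """The master failed: go read-only, promote a replica, repoint the rest."""
        if not self.replicas:
            raise RuntimeError("no replica available for promotion")
        self.read_only = True                 # application stops accepting writes
        candidate = self.replicas[0]
        # The chosen replica clones itself first, so the replica count
        # is unchanged once it has been promoted.
        self.replicas.append(f"{candidate}-clone")
        self.replicas.remove(candidate)
        old_master, self.master = self.master, candidate
        self.log.append(f"promoted {candidate}, replaced {old_master}")
        self.read_only = False                # normal operation resumes
        return candidate
```

For example, starting from `Cluster(master="db-master", replicas=["db-r1", "db-r2"])`, calling `replace_replica("db-r1")` and then `fail_over_master()` leaves `db-r2` as the master with two replicas still attached. In a real deployment each step would involve actual replication commands and health checks, none of which were described in the presentation.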
Zynga's speaker, CTO Allan Leinwand, began his presentation by describing the company's basic infrastructure requirement: lightning-fast scaling after the launch of a new game. The group least delighted by FarmVille's success in 2009 was Zynga's operations department. In the first 26 weeks after the game's launch, the number of virtual farmers increased by a million, instead of the expected 200 thousand. Zynga simply ran out of space in its data center, and there was nowhere left to grow. At that time the company already had some groundwork in place that allowed it to quickly move the application to virtual servers in the Amazon EC2 cloud. This, together with automatic scaling in Amazon EC2, allowed the number of users to grow to 70 million, making FarmVille one of the most popular online games.
The downside of fame was huge bills from Amazon EC2. The company decided to move the popular application back to its own data centers but, taking into account the experience it had gained, to run it in its own cloud modeled on Amazon EC2. The requirements for this in-house cloud, ZCloud, were as follows:
ZCloud should run on the x86 architecture.
Support at least 1000 servers.
Use of widely adopted virtualization technologies (XenServer, KVM).
Use ONLY one virtual machine per physical server.
CentOS support.
Support for availability zones, similar to Amazon Availability Zones.
Integration with RightScale, which was already in use at the time.
Operation of the cloud over an IP routed network, that is, no dependence on the inter-rack VLANs that are traditional in data centers.
All of these requirements were implemented in ZCloud, which operates in two data centers: one on the east coast of the United States and one on the west coast. The data centers are loosely coupled: the unavailability of one should not affect the availability or performance of the application. Allan declined to answer a direct question about the number of servers in ZCloud, revealing only that the company once had to bring 1000 new servers into the cloud within 24 hours.
As with Box.net, a third-party application is used to control and scale the cloud, in this case RightScale. Zynga apparently implemented load balancing and monitoring itself; at least, I was unable to find out any details about this.
Zynga continues to use Amazon EC2, launching new applications there first and studying their traffic and popularity. Successful games that reach a certain level of traffic are, for the most part, moved to ZCloud, which reduces costs and improves application performance.
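The placement rule described above can be expressed as a very small sketch. The function name, the traffic metric, and the threshold value are all assumptions made for illustration; the presentation did not disclose which metric or cutoff Zynga actually uses.

```python
def choose_platform(daily_users: int, threshold: int = 1_000_000) -> str:
    """Pick where a game should run, based on its observed traffic.

    The threshold is a hypothetical placeholder; Zynga did not publish
    the real criterion for moving a game out of the public cloud.
    """
    if daily_users < 0:
        raise ValueError("daily_users must be non-negative")
    # New or small games stay in the public cloud, where capacity is elastic.
    # Games with proven, sustained traffic move to the private cloud.
    return "ZCloud" if daily_users >= threshold else "Amazon EC2"


# Hypothetical games with made-up traffic figures.
games = {"new-game": 50_000, "hit-game": 5_000_000}
placement = {name: choose_platform(users) for name, users in games.items()}
```

Here `placement` maps the new game to Amazon EC2 and the established one to ZCloud. A real policy would presumably also consider how stable the traffic is over time, not just its current level.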
Finally, Allan shared his thoughts on the future of public clouds: they still have room to grow and improve. Performance in a public cloud, as a rule, leaves much to be desired. On the other hand, running your own data center or cloud makes sense only once a certain level of traffic has been reached, since it requires capital investment in hardware as well as the cost of developing the cloud itself.
For my part, I would only add that the hybrid model combining a public and a private cloud struck me as quite interesting. There is also another option that sits in the middle in terms of cost, scalability, and performance: renting servers (dedicated server hosting).