
According to recent EMC research, 2.8 zettabytes (10^21 bytes) of data were generated in 2012, and by 2020 this figure will grow to 40 zettabytes, 14% more than previous forecasts. It is safe to say that the "great data flood" is already upon us, and one of the responses to it is the growing share of the largest data centers, often called "mega-data centers": by various estimates, they account for about 25% of the modern server market.
Just as in Formula 1 racing, where innovations are born precisely where extreme performance is required and soon find their way into ordinary production cars, mega-data centers are one of the main sources of innovation in information technology. Many companies borrow the approaches of super-large data centers in their own solutions for "big data" processing, private clouds, and computing clusters. Mega-data centers are the proving grounds where cutting-edge solutions for scaling, efficiency, and cost reduction are tested.
Most often, mega-data centers are built by giant companies such as Apple, Google, and Facebook (more interesting and rarer examples are Chinese companies such as Tencent and Baidu), so these data centers are not highly specialized. Their servers handle data storage, database maintenance, web serving, and more company-specific tasks: search, analysis of search queries, analytics, and so on.

The scale of such data centers is astonishing: they typically contain from 200,000 to 1,000,000 servers with up to 10 million drives installed. Depending on a server's tasks, it may contain only boot disks, unprotected disks, or protected RAID arrays for critical data. Flash-based hybrid solutions, such as LSI Nytro, which I already covered in the previous article [link], are often used to speed up the disk subsystem.
Servers are usually grouped into clusters of roughly 200 to 2,000 nodes. Such clusters are designed so that a problem node can be quickly taken out of service in case of failure and its load redistributed among the remaining ones; this is usually done at the software level, as sketched below.
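
A minimal sketch of this software-level failover, assuming a hypothetical cluster of interchangeable worker nodes (the node names, cluster size, and hashing scheme are illustrative, not taken from any specific data center stack):

    import zlib

    class Cluster:
        def __init__(self, nodes):
            self.healthy = set(nodes)   # nodes currently accepting work
            self.failed = set()         # nodes taken out of rotation

        def report_failure(self, node):
            # A node that fails its health check is removed from rotation.
            if node in self.healthy:
                self.healthy.discard(node)
                self.failed.add(node)

        def assign(self, task_id):
            # Spread tasks across whatever healthy nodes remain.
            if not self.healthy:
                raise RuntimeError("no healthy nodes left in the cluster")
            nodes = sorted(self.healthy)
            return nodes[zlib.crc32(task_id.encode()) % len(nodes)]

    cluster = Cluster(f"node-{i:04d}" for i in range(2000))
    cluster.report_failure("node-0042")       # health check marks a node as bad
    print(cluster.assign("user-session-17"))  # the task lands on a surviving node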

Since in a mega-data center a single application often runs on thousands or even hundreds of thousands of nodes, the speed of data transfer between nodes becomes critical. To address this, large data centers use 10 GbE and 40 GbE technologies. Because mega-data center networks are mostly static (which also helps reduce transaction processing time), software-defined networking (SDN) is often used.
Virtualization is rarely used, mainly to simplify the deployment and cloning of images. The software, most often open source, allows finer tuning and customization (mega-data centers are, after all, built for the specific purposes of specific companies).
In such data centers, the question of cutting operating costs by eliminating everything superfluous (up to a reasonable limit, of course) is extremely acute. The goal of optimization is to remove everything that does not serve the main tasks, even if it comes "for free", since even an initially free component can lead to ongoing operating costs. A simple example: adding an unnecessary LED to each of 200,000 servers would cost about $10,000 in LEDs alone, and even if the LEDs were "free", power consumption would increase by about 20 kW; the quick calculation below makes this concrete.
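
A back-of-the-envelope check of that example (the per-LED price and power draw are my assumptions, chosen only to reproduce the figures above):

    servers = 200_000
    led_unit_cost = 0.05   # assumed ~$0.05 per LED
    led_power_w = 0.1      # assumed ~0.1 W per LED

    total_cost = servers * led_unit_cost
    total_power_kw = servers * led_power_w / 1000

    print(f"Hardware cost: ${total_cost:,.0f}")      # ~$10,000
    print(f"Added load:    {total_power_kw:.0f} kW") # ~20 kW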

Mega-data center problems
In terms of problems, such data centers are similar to their "younger brothers": they also need to run heavy applications at maximum speed, and scaling and cost optimization matter just as much. The difference is that, because of the scale, any mistake, problem, or inefficiency costs far more.
One of these problems is disk failures. Despite the low cost of replacing an individual drive, mass failures are the most serious problem, causing severe malfunctions in individual clusters and sometimes in the entire data center. The archival storage usually used to address this consumes a lot of electricity and other resources, even when the information on it is rarely accessed. This becomes especially noticeable as data volumes grow, sometimes measured not in petabytes but in exabytes.

Lessons from mega-data centers
As I wrote above, many architectural solutions of ultra-large data centers find their way into smaller data centers, since they achieve remarkable efficiency in the ratio of resources spent to processing power. What are these principles?
The first, and perhaps the main one, is to make the infrastructure as homogeneous as possible: such an infrastructure is much easier to maintain and scale. Cutting costs where they are not critical frees up funds that can be invested in more advanced architectural solutions, for example, ones that allow maintenance with minimal intervention.
The second principle is that trying to maintain reliability at the level of "five nines" is expensive for a large data center and, in general, unrealistic. Instead, the infrastructure should be designed so that subsystems can fail while the system as a whole keeps working, as in the sketch below. The necessary software and hardware solutions are already on the market, but for now they are not typical of corporate systems.
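
As an illustration of that principle, here is a minimal sketch in which a read is retried against several replicas instead of depending on any single node being highly available; the replica addresses and the fetch() stub are hypothetical:

    import random

    REPLICAS = ["10.0.1.11", "10.0.2.12", "10.0.3.13"]  # assumed replica set

    def fetch(replica, key):
        # Stand-in for a real RPC; it fails randomly to simulate node outages.
        if random.random() < 0.3:
            raise ConnectionError(f"{replica} unavailable")
        return f"value-of-{key}@{replica}"

    def resilient_read(key):
        # Try the replicas in a shuffled order; give up only if all of them fail.
        last_error = None
        for replica in random.sample(REPLICAS, len(REPLICAS)):
            try:
                return fetch(replica, key)
            except ConnectionError as err:
                last_error = err   # tolerate the failed subsystem and move on
        raise RuntimeError("all replicas failed") from last_error

    print(resilient_read("user:42"))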