Two documents come up most often when discussing data center standards: the TIA-942 standard and the Tier classification from the Uptime Institute. Both define levels (Tiers), which often leads to confusion: Tier III under TIA-942 and Tier III under the Uptime Institute are two very different things.
TIA vs Uptime
TIA-942 (Telecommunications Industry Association) - Telecommunications Infrastructure Standard for Data Centers:
- This standard was developed by the US Telecommunications Industry Association and primarily covers the organization of structured cabling systems in the data center; fault tolerance and other engineering subsystems are addressed to a lesser extent.
- It is advisory in nature.
- It provides step-by-step instructions and recommended schemes (help for the engineer): "Do as written here and you will get a good result."
- Compliance with the standard is declared by the owner of the facility or the project contractor (at the level of "I did what you said, honest").
- Usually only the project documentation is checked for compliance with the standard.
- Once implemented, the facility does not lose its level.
Uptime Institute - Tier Classifications Define Site Infrastructure Performance
- This document is not a standard but a methodology, developed specifically to standardize data center fault tolerance. Telecommunications infrastructure, for example, is barely considered.
- It is mandatory (if you want a certificate, of course).
- There are no step-by-step instructions (they become obsolete quickly), but basic design principles and approaches are formulated: "Design according to these principles and you will get a fault-tolerant facility."
- Certification is performed only by the Uptime Institute.
- Both the design and the result (the running site) are certified.
- What is checked is the end result, without much emphasis on how it was achieved; flexibility is allowed in the design for a particular situation (if it serves the result).
- First the design is certified (Tier Certification of Design Documents), then the finished site (Tier Certification of Constructed Facility), and then the operating facility is re-checked at regular intervals, say once a year, every three years, or every five (Operational Sustainability Certification). The latter evaluates operations, tracks equipment life, and other things that change over time.
At the same time, the classification of levels in TIA-942 was originally proposed by the Uptime Institute, and in essence the two are very similar. The principles of evaluation, however, are fundamentally different. Once again: TIA says, "Do exactly as written, and everything will be OK"; the Uptime Institute says, "Everything must be OK by whatever means, in accordance with the given principles, and we will then verify that it works."
Levels I-IV
Essentially, the classification by levels is the same for both the TIA-942 standard and the Uptime Institute methodology. Roughly, the levels can be described as follows:
- Tier I - no redundancy. Availability of 99.671%.
- Tier II - redundant critical nodes. Availability of 99.741%.
- Tier III - redundancy of critical nodes, power feeds, and coolant delivery routes. Any node can be taken offline for maintenance while the facility as a whole remains fully functional. Availability of 99.982%.
- Tier IV - the most fault-tolerant level, which tolerates one accident (as opposed to a planned shutdown of a node) at a time; a critical human error is one example of such an accident. In essence, these are two Tier II systems built in the same building around the server racks. Availability of 99.995%, which means downtime of just 26 minutes per year.
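The availability percentages above translate directly into allowed downtime per year. A quick sketch of that arithmetic (the percentages come from the list above; everything else is plain math):

```python
# Convert Tier availability targets into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

tiers = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}

for name, availability_pct in tiers.items():
    downtime_min = (1 - availability_pct / 100) * MINUTES_PER_YEAR
    print(f"{name}: {availability_pct}% -> ~{downtime_min:.0f} min/year"
          f" (~{downtime_min / 60:.1f} h)")
```

For Tier IV this gives about 26 minutes a year, matching the figure quoted above; for Tier III it is roughly an hour and a half, and for Tiers I and II it is measured in full days.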
As an example: in a system that delivers coolant through pipes, Tier III requires a double ring, while Tier II can get by with one. The redundancy level of the chillers and fan coils may be the same in both. The same goes for power supply and other systems. Tier IV goes further: for example, UPSs and power lines must not just be duplicated but also physically separated into different rooms. If the first unit explodes (an accident, not a planned stop), the second must not be affected; if a pipe bursts somewhere, it must not affect the duplicate electrical equipment in any way. The systems are physically separated.
Putting it very crudely, the levels look like this: the first works but can fail; the second generally works and withstands the most common failures; the third survives any non-critical conditions; the fourth is fit for wartime conditions.
In the United States, the cost of the facility runs roughly 30K, 50K, 65K, and 100K dollars per rack for the four levels (very approximate figures, useful only for comparing the cost ratio between levels). In Russia it is usually even higher. So if you are choosing between Tier II and Tier III, the budget does not grow all that much, but uptime grows considerably. And the real question is not even cost, but how well everything is designed and protected from operational problems on site.
Why do we need these standards?
Thinking about data center classification standards began in the early 90s, when the Uptime Institute started writing down on paper the basic principles of building fault-tolerant facilities. Its task was to study the methodology of building reliable high-tech facilities and to investigate every problem that had resulted in a data center failure. By the time it launched, the organization had documented experience in building data centers and their "warm vacuum-tube" predecessors going back to the 1970s, and those computer centers were very large and fully fault-tolerant. These centers had also accumulated statistics on the main problems, from the famous moth to all sorts of minor repairs.
As a result, around 1995, a classification of data centers by fault-tolerance level was proposed, so that customers could choose infrastructure matching the needs of their task. Roughly speaking, if the customer is building a call center, they need not think about four-nines availability (99.99% uptime); but if the data center hosts systems critical to a bank's business, then yes, it is worth it. This classification was incorporated into the first edition of TIA-942.
In 1996, the first document appeared describing requirements for the engineering infrastructure of computer centers according to the Uptime Institute methodology. The four main levels were introduced based on failure statistics and the organization's experience. The fault-tolerance level indicated a possible uptime, with no intermediate steps: there was no "II+" or "III+". Even if a facility fell just short of Tier III because of a single non-redundant valve on a not-so-important backup system, it was still assigned Tier II. Levels are assigned the same way today, so any talk of "Tier II+" is a personal fantasy of the owner and has nothing to do with the standard itself.
The basic concepts the documents operate with are redundancy, the ability to service nodes without stopping the facility as a whole, and resistance to failures and accidents. At the same time, some things are postulated that look quite unusual to us: for example, according to the Uptime methodology, utility power from the city grid may serve as the main source of electricity at levels I and II, but not at levels III and IV. At those levels the city grid suddenly ceases to be considered reliable and is treated only as a cost-effective additional power source, while the diesel generator system must sustain operation at full load with no duration limits.
The purpose of TIA is to help design engineers avoid inventing something of their own and instead design the way the standard suggests, drawing on the experience of building many large facilities. The standard illustrates and describes the best techniques and solutions. Uptime, for its part, focuses on principles whose implementation achieves a given level of fault tolerance.
That is the difference: TIA shows in great detail how to organize structured cabling, data links, and other engineering systems (which is logical, since best-practice advice matters a great deal in these things). Uptime, by contrast, does not focus on SCS or power supply as such, but examines how all engineering systems together affect the fault tolerance of the equipment in the data center as a whole. Or again, refuting one of the most common misconceptions: Uptime does not actually regulate site selection; there is only an appendix along the lines of "we have noticed that Tier IV data centers usually have such-and-such sites, Tier III such-and-such, and so on."
Practice
In our practice of preparing data centers for Uptime certification, several surprises have surfaced. For example, when we had our own data center certified for Tier III, we had to organize the synchronization management of the diesel generators in a quite specific way (the
details are right here ); in fact, few people in Russia had even thought about it. Another surprise: when designing uninterruptible power systems, people usually look at battery type, capacity, sealing, maintainability, and so on, that is, only the basic battery parameters. In reality, data center design must also account for more subtle characteristics. For example, batteries have different discharge curves (roughly speaking, different effective capacities at different discharge rates): at partial load everything is fine, but at full load the system cannot hold the required time, the diesel generator set does not have time to reach the required mode, and a failure occurs.
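The discharge-curve effect just described can be approximated with Peukert's law: effective runtime falls faster than linearly as the discharge current grows. A minimal sketch (the exponent and the battery ratings are illustrative assumptions, not values from any real datasheet):

```python
def runtime_hours(capacity_ah: float, rated_hours: float,
                  current_a: float, peukert_k: float = 1.15) -> float:
    """Estimate runtime at a given discharge current via Peukert's law:
    t = H * (C / (I * H)) ** k, where H is the rated discharge time,
    C the rated capacity, I the actual current, k the Peukert exponent."""
    return rated_hours * (capacity_ah / (current_a * rated_hours)) ** peukert_k

# A hypothetical 100 Ah battery rated over 20 hours (i.e. at 5 A):
light_load = runtime_hours(100, 20, 5)   # ~20 h, matches the rating
full_load = runtime_hours(100, 20, 50)   # well under the naive 100/50 = 2 h
print(f"{light_load:.1f} h at 5 A, {full_load:.2f} h at 50 A")
```

This is exactly the failure mode described above: a UPS sized on nameplate capacity alone holds fine at partial load but runs out at full load before the generators reach a stable mode.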
And here is an example from a customer's practice: on paper, nobody digs into the condition of the diesel fuel. Roughly speaking, there are generators, there are backup routes for fuel delivery, and diesel is just diesel, the main thing is to refill it on time. Such a data center can be rated TIA-compliant. But in practice,
diesel fuel in our country has a couple of magical properties, and
diesel engines can easily choke on it. This discrepancy is caught at the operational level. Roughly speaking, TIA will never ask "what if there is water in the tank instead of diesel fuel?" or "when did you last change the fuel?". The Uptime Institute has a debug team whose job is to test exactly such things in practice. They have taken this into account and now know not only that fuel can suddenly fail (the methodology covers that), but also exactly how.
Of course, not everything can be checked. For example, there is always the
human factor, which creates extremely unpredictable situations. Engineers tell a story that back in the two-thousands, in Israel, one of the data centers of a large IT company was stopped on New Year's Eve thanks to a compatriot of ours. He celebrated the holiday, drank right on shift, then kept going. After midnight, city power went down and the diesels kicked in (no human involvement was required; the automation worked). But the noise greatly disturbed our hero, and he manually shut down all the generators via emergency stop so he could continue resting in comfort. There is no official confirmation of the story, but for some reason I believe it, at least as an example of an extremely wild and illogical situation.
Automation
And finally: the standards contain no recommendations on automation triggered in emergencies, nor on organizing personnel such as emergency response teams. At home we use the good old "Soviet" approach, where everything is done very simply and reliably, almost at the relay level: no sophisticated microcontrollers with their own logic and no "rise of the machines". We apply automation where the situation is unambiguous and where the required speed exceeds human reaction time. Everything that calls for a weighed decision is left to manual control. A specific example: automation switches from city power to diesel, but switching back from diesel to the city (with the diesel shut down) is done strictly by hand at the installation, not by a click in an interface. The goal is to keep important actions off "autopilot": many accidents happen precisely because people act first and think later. Actually, I believe that a professional in the data center who does their job well matters much more, and, more importantly, is more reliable than the smartest engineering solutions.
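The split just described, automatic failover to diesel but a strictly manual return to city power, can be sketched as a tiny state machine (the class and method names are my own illustration, not any real controller's API):

```python
class TransferSwitch:
    """Sketch of the policy above: fail over automatically, fail back manually."""

    def __init__(self) -> None:
        self.source = "utility"  # "utility" (city power) or "generator"

    def on_utility_lost(self) -> None:
        # Unambiguous situation where speed beats human reaction: automate it.
        if self.source == "utility":
            self.source = "generator"

    def return_to_utility(self, operator_confirmed: bool,
                          utility_stable: bool) -> None:
        # A weighed decision: only an operator at the installation may do this,
        # never an "autopilot" or a remote click in an interface.
        if operator_confirmed and utility_stable and self.source == "generator":
            self.source = "utility"

ats = TransferSwitch()
ats.on_utility_lost()  # blackout: the switch changes over by itself
ats.return_to_utility(operator_confirmed=False, utility_stable=True)
print(ats.source)      # still "generator": no automatic failback
```

The point of the asymmetry is exactly the one made above: failover must beat human reaction time, while failback is a calm decision that deserves a human in the loop.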
Summary
So why can a certified data center still go down? Because behind the same level name (for example, Tier II) there is a huge difference between certifying a design without on-site verification and certifying a working site with a concrete on-site check. If you do not fully understand exactly how a data center was certified (by TIA or by Uptime), you should check the certification
here .
Oh, and geek porn featuring our high-responsibility data center in the title role can
be found here . Even if you are already familiar with the topic, you may notice a couple more details after the explanation of what was done, and why, according to the Uptime Institute methodology.