We received Gold for operating a TIER III data center - the final step after TIER III for the design and TIER III for the constructed facility

Data centers are ranked by fault tolerance level, from I to IV. The levels exist in two versions: TIA (no verification required, simply self-declared) and Uptime Institute (with strict certification). TIER III means the facility keeps working when any single node fails anywhere in the infrastructure. If there is a refrigerant pipe, there must be a second identical one. If there is a fuel tank, there must be a spare. For cooling, the chillers must be reserved N+1, and so on.

First, compliance with the TIER III level was established for the design. We defended the documentation: roughly speaking, the Uptime engineers "crossed out" any node and checked whether the rest could keep working. Many pass this quest.
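The "crossing out" check can be illustrated with a minimal sketch. It assumes a very simplified model that is not part of Uptime's methodology: each subsystem is just a set of identical-purpose units with a capacity, plus the capacity needed to carry the full load. The names `Subsystem`, `survives_any_single_failure` and `weak_points`, and all the numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subsystem:
    name: str
    unit_capacities: tuple  # capacity of each installed unit (hypothetical units)
    required: float         # capacity needed to carry the full load


def survives_any_single_failure(sub):
    """Return True if the load is still carried with any one unit "crossed out"."""
    units = sub.unit_capacities
    total = sum(units)
    # An empty subsystem cannot carry anything, so it fails the check.
    return bool(units) and all(total - u >= sub.required for u in units)


def weak_points(subsystems):
    """Names of subsystems where losing one unit drops capacity below the load."""
    return [s.name for s in subsystems if not survives_any_single_failure(s)]


# Hypothetical example: three chillers for a load that needs two (N+1),
# and a single fuel tank with no spare.
chillers = Subsystem("chillers", (500, 500, 500), 1000)
fuel = Subsystem("fuel tanks", (20000,), 20000)
print(weak_points([chillers, fuel]))
```

A real review also looks at distribution paths and how the units are interconnected, which this capacity-only sketch deliberately ignores.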
The next step is certification of the constructed facility, that is, confirming that what was actually built matches the documentation and the fault tolerance principles. In Russia this is the hardest part, because declaring something in the design and actually building it are two very different things. Customers who had already moved their production systems onto the site added a special charm to the process. So passing this test is very cool.

The third step: we have now received certification for operation. That is, Uptime confirmed that the team and all processes comply with its principles. There are only 2 such data centers in Russia.

What else you need to know about these certificates

TIA TIER 3 is obtained "just like that", on the strength of a statement that "our project complies with the recommendations of TIA". So we do not consider this type further and talk only about TIER III from the Uptime Institute.

There are three types of certificates. The first is for the design (issued once per project, expires after two years). The second is for the facility (issued for a constructed facility, it confirms that what was actually built is still TIER III and not, say, TIER II). The facility certificate does not expire. The third type is the certificate of operation, under which the data center's level is checked regularly.

It is checked once every 1-3 years, depending on the level of readiness shown at the previous check. This frequency follows from the general rule that, on average, 70-90% of downtime is caused by the human factor. In other words, a 10-year-old data center without a recent confirmation of its certificate of operation can spring any kind of surprise. Certificates of operation come in three grades: Gold, Silver and Bronze. If you pass the quest without a single hitch, you get Gold, which requires a re-check after 3 years. If you pass with remarks, roughly a "B", you get Silver and a re-check every 2 years. Bronze is the worst: a "satisfactory" pass with a certificate valid for 1 year.
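The re-check intervals described above can be written down as a small lookup. This is only an illustrative sketch: the function name `next_recheck` and the example date are hypothetical, and the intervals are the ones stated in this article, not an official Uptime rule table.

```python
from datetime import date

# Years until the next check for each grade, as described in the text.
RECHECK_YEARS = {"Gold": 3, "Silver": 2, "Bronze": 1}


def next_recheck(award, certified_on):
    """Return the date by which the certificate of operation must be re-checked."""
    target_year = certified_on.year + RECHECK_YEARS[award]
    try:
        return certified_on.replace(year=target_year)
    except ValueError:
        # 29 February in a non-leap target year: fall back to 28 February.
        return certified_on.replace(year=target_year, day=28)


# Hypothetical certification date, for illustration only.
print(next_recheck("Gold", date(2015, 10, 1)))
```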

We got Gold.

How the test went

The guys from Uptime first came to us to certify the facility (after we had built it according to the certified design). At that point it was too early to go for the third certificate, the one for operation: in my opinion, it takes about a year after a data center is launched to settle all the processes and fully train the operations team.

A little later we invited them again for a pre-certification audit. The point of the audit is to find what is wrong and what needs improving, and to hand over a pile of recommendations for improving the work. That is exactly what happened in our case.

Ten months later they came again, for three days. For the first few hours they simply walked around the facility, got their bearings, looked into various corners, ran their fingers over hard-to-reach places and rejoiced in every possible way. Then the whole crowd settled into our admins' room (a warm office with a kitchen) and buried themselves in documentation. For two days they did nothing but check that the papers matched each other and that people knew what was in them.

Another kind of activity: they called in specific engineers (a dispatcher, for example) and asked: "Such-and-such an accident has happened, what are you going to do?" The engineer answered according to the procedure and was let go.

What is generally checked during certification

Workload on staff. For example, they spent quite a long time combing through our dispatcher schedules to make sure that each dispatcher worked no more than the Labor Code allows for such a position. They checked every shift and the signatures in the logbooks (to confirm that this particular person really was on shift), and then added up the hours worked per month.
Knowledge of emergency procedures (who does what).
Whether formal certificates, diplomas and so on match the positions held. Who is responsible for fire safety, for first aid, etc., and whether their knowledge is up to date.
Job descriptions and whether they are current, descriptions of all processes and procedures, instructions for every case.
Procedures for testing equipment and for maintenance in general: all instructions must be strictly followed and must cover the processes needed for the specific facility. In our case, that all instructions correspond to the actual location of the units and cover all situations. Procedures for opening and closing shifts, entering equipment data, testing procedures, etc.
How staff are trained and how regular emergency drills are conducted.
How the internal library is updated with "operating experience", how capacity expansion for power and cooling is handled, how equipment is brought in, etc.
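The staff workload item from the list above, adding up each dispatcher's monthly hours from the shift log, can be sketched as follows. Everything here is an assumption made for illustration: the log format, the names, and especially `MONTHLY_LIMIT_HOURS`, which is a placeholder and not a figure from the Labor Code or from Uptime.

```python
from collections import defaultdict
from datetime import datetime

# Placeholder limit, for illustration only; use the real legal norm in practice.
MONTHLY_LIMIT_HOURS = 168.0


def monthly_hours(shift_log):
    """Sum hours per (person, year, month) from (person, start, end) records.

    Simplification: a shift is attributed entirely to the month it starts in.
    """
    totals = defaultdict(float)
    for person, start, end in shift_log:
        hours = (end - start).total_seconds() / 3600
        totals[(person, start.year, start.month)] += hours
    return dict(totals)


def overworked(shift_log, limit=MONTHLY_LIMIT_HOURS):
    """Return sorted (person, year, month) keys whose total exceeds the limit."""
    return sorted(k for k, h in monthly_hours(shift_log).items() if h > limit)


# Hypothetical log: 12-hour day shifts in October 2015.
log = []
for day in range(1, 16):  # 15 shifts = 180 hours
    log.append(("Ivanov", datetime(2015, 10, day, 8), datetime(2015, 10, day, 20)))
for day in range(1, 15):  # 14 shifts = 168 hours
    log.append(("Petrov", datetime(2015, 10, day, 8), datetime(2015, 10, day, 20)))

print(overworked(log))
```

In the actual audit the same totals were also cross-checked against the signatures in the logbooks, which a script like this cannot replace.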

In our case, the personnel data and the shift logbooks mattered most. At this certification the equipment itself is barely touched: it is assumed that all of that was covered when the Facility certificate was obtained.

Tips

As I said, it is better to apply about a year after a new team starts operating the site, because part of the test is how people found flaws in the design (or developed the data center beyond the design), how they learned the equipment, and what they fixed "live" on the already running data center.

As for the shortcomings: for example, during certification it became clear that instructions have to be as detailed as possible. We have, for instance, 6 identical subsystems. The first had a detailed instruction for switching over in case of an accident. The second just said "do the same as for 1". That has to be changed: you write out exactly the same instruction again, but as that subsystem's own document, so that nothing gets mixed up on site.

It is also important to document all improvements properly, including the upgrade log. You need to keep in mind that some changes can reduce the reliability of the data center as a whole.

We had some special surprises during the inspection. There is a list of requirements that must be studied carefully, and you should imagine that each item will be picked apart by three paranoids at once. They dig into the paperwork very hard, which is, on the whole, right: it is just that in ordinary inspections nobody cross-checks different documents against each other, and here the cross-checks go quite deep.

For example, after the tour they asked us to export an exact map of how and where they had walked around the facility. This is reconstructed from the access control system and the video surveillance.

Some more links about our data center:

A tour of our "Compressor" facility, where a train used to stop by
How one TIER differs from another, and TIA from UI
Stages of data center construction
How to operate a mission-critical data center
And my mail for questions is AAshavskiy@croc.ru.

Actually, if you are preparing for such a check, I will be happy to answer questions in the comments.

Source: https://habr.com/ru/post/268419/
