
An increasing number of domestic companies today face the problem of selecting a data center that meets all the needs of their business, whether for renting IT infrastructure or for hosting and centrally maintaining their own equipment. Of course, each company has its own data center reliability criteria. They are similar in some respects and different in others, but one requirement is universal: all components of the IT infrastructure must work stably. Otherwise the company will function inefficiently at best, and at worst many business processes will simply stop.
In this article I want to talk about what you need to pay special attention to when choosing a data center, and which questions to ask to get a fairly complete picture of its level of reliability without relying on the operator's statements about compliance with Tier standards.
The Tier classification itself defines four levels of data center reliability.
| Data Center Reliability Level | Data Center Availability | Data Center Downtime per Year |
| --- | --- | --- |
| Level I | 99.671% | 28.8 hours |
| Level II | 99.749% | 22 hours |
| Level III | 99.982% | 1.6 hours |
| Level IV | 99.995% | 0.4 hours |
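The downtime column follows directly from the availability column. As a minimal sketch, assuming a non-leap year of 8,760 hours (the function and constant names here are illustrative, not taken from any standard), the relationship can be checked like this:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Availability figures as listed in the table above.
TIER_AVAILABILITY = {"I": 99.671, "II": 99.749, "III": 99.982, "IV": 99.995}

for tier, availability in TIER_AVAILABILITY.items():
    hours = annual_downtime_hours(availability)
    print(f"Level {tier}: {availability}% -> {hours:.1f} hours per year")
```

The same function can be used in reverse when reading an operator's service level agreement: any promised availability figure translates into a concrete number of hours per year that your equipment may be unavailable.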
Of course, when choosing a data center it is best to turn to a consulting company, which will audit the sites you have shortlisted and conclude whether each data center is suitable for your business. In Russia this type of consulting is becoming increasingly popular, but the vast majority of companies still prefer to save on this business-critical service and survey data centers on their own.
Infrastructure Fault Tolerance
As a rule, data center operators limit themselves to a general statement about the fault tolerance of their facility, although often not every data center system and subsystem actually has the stated redundancy scheme. Of course, in data centers that have successfully passed Uptime Institute certification, the reliability level of all engineering systems fully complies with the established standard. At the time of writing, however, only two data centers in Russia had officially certified both their design and their implemented engineering solutions (both at the Tier III fault tolerance level): Sberbank's "South Port" data center and "DataSpace". Even so (and this is important to understand), in Russia even certification by the respected American Uptime Institute does not guarantee continuity of services, especially in the event of an accident. But that is a subject for a separate conversation; today we will talk about the hundreds of Russian data centers that have not been certified but freely use Tier terminology to declare a high level of infrastructure reliability.
To understand how closely the reliability of a data center matches what the operator declares, create a table listing the key components of the data center infrastructure and send it to the candidates you have shortlisted to fill in.
Below is a brief list of questions that I recommend putting to the data center operator.
Architectural part:
- the owner of the building (premises) in which the data center is located, and the lease term;
- floor load capacity;
- finishing materials used on walls and ceilings;
- availability of a freight elevator and a loading and unloading area;
- fire resistance of walls and doors.
Power supply system:
- the number of feeds from the transformer substation, their capacity and category;
- the number of feeds from different transformer substations and the load on each;
- diesel generator set: availability, power, runtime without refueling, start-up time and time to reach full power, availability of fuel supply contracts, level of redundancy;
- UPS: availability, battery runtime, level of redundancy;
- how the air conditioners are connected to the power supply.
Air conditioning systems:
- air conditioners used: manufacturer, quantity and level of redundancy;
- temperature conditions;
- availability of smoke removal systems and pressure relief valves.
Automatic fire extinguishing system:
- availability of an automatic fire extinguishing system, type of extinguishing agent, availability of reserves;
- availability of a fire alarm system, number and types of sensors.
Security systems:
- availability of an access control system;
- availability of a video surveillance system;
- access to the site.
Technical support:
- the number of specialists and engineers present on site during and outside working hours;
- the working schedule of technical support staff;
- response time to a request;
- availability of a multichannel phone line, a ticket system, a web interface.
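One way to prepare the table mentioned above is to generate it as a CSV file with one answer column per candidate, so that the responses can later be compared side by side. The sketch below is only an illustration: the questions are abbreviated from the list above, and the operator names and the function name `build_questionnaire_csv` are hypothetical.

```python
import csv
import io

# Sections and questions abbreviated from the list above.
QUESTIONNAIRE = {
    "Architectural part": [
        "Building owner and lease term",
        "Floor load capacity",
        "Fire resistance of walls and doors",
    ],
    "Power supply system": [
        "Number, capacity and category of substation feeds",
        "Diesel generator set: power, runtime, redundancy",
        "UPS: battery runtime, redundancy",
    ],
    "Technical support": [
        "Staff on site during and outside working hours",
        "Response time to a request",
    ],
}


def build_questionnaire_csv(sections: dict, operators: list) -> str:
    """Return CSV text with one row per question and one empty column per operator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Section", "Question"] + list(operators))
    blanks = [""] * len(operators)
    for section, questions in sections.items():
        for question in questions:
            writer.writerow([section, question] + blanks)
    return buffer.getvalue()


# Hypothetical candidate names, for illustration only.
print(build_questionnaire_csv(QUESTIONNAIRE, ["Operator A", "Operator B"]))
```

Keeping every operator's answers in the same row makes gaps obvious: an empty or evasive cell is itself useful information about the candidate.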
Emergency plan
After receiving a description of the infrastructure from all the data center operators you are interested in, you need to visit each facility and see everything with your own eyes. Agree in advance that the people accompanying you will include a competent representative of the operator's technical service who can answer most of your questions.
During the tour, do not hesitate to ask questions about what the duty staff do in normal and emergency situations. Model various emergencies and ask to be told, minute by minute, what the engineers on duty will do in each case, both during and outside working hours. This will help you understand how well prepared and trained the operator's technical specialists are.
An important condition for confirming the declared reliability class is that the operator has step-by-step instructions for the duty personnel in emergency situations. Be sure to read these instructions: they will show you how long it takes to resolve typical and non-standard emergencies.
Having visited many data centers as a potential customer, I regret to say that data center operators pay very little attention to preparing such emergency plans. Very few have the appropriate documentation, and fewer still keep it up to date and consistent with the actual staffing table.
Ask whether the facility has a 24-hour technical support service, how many specialists it has, and what responsibilities are assigned to them. Most often, the engineers on site can perform only elementary actions, such as pressing a button to reboot a server or connecting a KVM, while for more serious tasks outside working hours the on-call personnel are summoned from home. As you can understand, this extends the time needed to resolve an accident by at least the time it takes a competent employee to reach the data center.
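To see why travel time matters, it helps to compare a single outage with the annual downtime figures from the Tier table. The sketch below is a rough illustration only: the detection, travel and repair times are hypothetical numbers chosen for the example, not measurements from any real data center.

```python
# Annual downtime budgets in hours, taken from the Tier table in this article.
ANNUAL_BUDGET_HOURS = {"I": 28.8, "II": 22.0, "III": 1.6, "IV": 0.4}


def outage_hours(detect_min: float, travel_min: float, repair_min: float) -> float:
    """Total duration of one incident in hours."""
    return (detect_min + travel_min + repair_min) / 60


def budget_share(outage: float, tier: str) -> float:
    """Fraction of the annual downtime budget consumed by one outage."""
    return outage / ANNUAL_BUDGET_HOURS[tier]


# Hypothetical incident: 5 minutes to detect, 30 minutes to repair.
on_site = outage_hours(5, 0, 30)   # a competent engineer is already on site
on_call = outage_hours(5, 60, 30)  # the engineer needs an hour to arrive

for label, hours in (("on site", on_site), ("on call", on_call)):
    share = budget_share(hours, "III")
    print(f"{label}: {hours:.2f} h, {share:.0%} of the Level III annual budget")
```

Under these assumed numbers, a single incident handled by an engineer summoned from home consumes almost the entire yearly allowance of a Level III facility, which is why the staffing question deserves a precise answer.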
Emergency response drills
Of course, a written procedure that is not backed by practical experience is unlikely to be useful in an emergency. Such documents should be continually improved and updated based on the results of comprehensive drills and training in preventing and eliminating emergencies, which should ideally be held at least two or three times a year.
Regular staff training and simulation of various emergencies are a direct indicator of how well prepared the data center personnel are and how responsibly the operator approaches the operation of the facility. Operators who have developed workable regulations for personnel actions in emergencies are rare, and operators who run drills regularly are rarer still: many limit themselves to a test start of the diesel generator set once a month.
In recent years the number of new data centers has been growing exponentially, while competent specialists with real practical knowledge of data center operation are still few. The owners of new sites therefore sometimes try to acquire the necessary knowledge in the course of operation, which inevitably leads to data center outages. In Russia, for some reason, it is customary to cope with most problems on one's own and to bring in professionals only after an emergency has already occurred.
Preventive Repair and Maintenance
Routine preventive maintenance of the infrastructure minimizes the risk of accidents.
Make sure that the data center operator carries out preventive maintenance as required by the regulations. To do this, ask to see the logs in which all events taking place in the data center are noted and routine equipment maintenance is recorded. There should be several such logs:
- Duty shift handover log for the data center.
- Data center visitor log.
- Log of equipment and material assets brought in and taken out.
- Daily inspection log, including sections:
a) an external inspection of the technological equipment of the data center (doors, hatches, turnstiles, raised floors, technological platforms and corridors, the appearance of IT equipment);
b) control of environmental parameters (temperature, humidity);
c) control of energy consumption (recording of meter readings at the input and of ammeter readings on the busbars per phase);
d) control of water consumption (recording of meter readings at the input).
- ITIS maintenance log for the data center, which contains information on equipment malfunctions, inspections, maintenance and repair of all infrastructure systems, grouped by main component:
a) security systems complex (KSB):
- security and alarm system (SOTS) - information about the planned (monthly) performance check, records of false alarms during operation, records of the replacement of failed elements;
- access control system (ACS) - records of access denials and false alarms during operation, records of the replacement of failed elements;
- inspection equipment (DT) - information about the planned (monthly) performance check, records of false alarms during operation, records of the replacement of failed elements;
- television surveillance system (STN) - records of the replacement of failed elements;
- central dispatching post (CDP) - records of operating failures, records of the replacement of failed elements;
b) fire protection systems complex (KSPZ):
- automatic fire alarm system (SAPS) - information about the planned (monthly) performance check, records of false alarms during operation, records of the replacement of failed elements;
- fire warning and evacuation management system (SGO) - information about the planned (monthly) performance check, records of the replacement of failed elements;
- automatic gas fire extinguishing system (SAPP) - information about the planned (monthly) performance check, pressure monitoring data for the system, records of the refilling of IHL and the replacement of failed elements;
- smoke removal and air overpressure system (DPS) - information about the planned (monthly) performance check, records of the replacement of failed elements;
- personal respiratory protective equipment (RPE) - information about the (monthly) check of the manufacturer's seals on self-rescuers, records of replacement after the expiration date;
c) communication and telecommunication systems complex (KSVTS):
- structured cabling system (SCS) - the log of cable connections and information about changes to it;
- electric clock system (MF) - records of the replacement of failed elements;
d) electrical equipment systems complex (CSE):
- protective and technological grounding system (SZ) - data from the planned (annual) measurement of parameters, information about the tightening of connections (performed as needed, but at least once a year);
- dedicated power supply system (SHE) - busbar temperature monitoring data, measurements of electrical cable parameters (insulation), information about the tightening of connections (performed as needed, but at least once a year);
- guaranteed power supply system (as part of the power supply system) - records of maintenance, which is carried out by a specialized organization (outsourcing) according to its own maintenance schedule;
- backup power supply system (as part of the AE system) - records of maintenance, which is carried out by a specialized organization (outsourcing) according to its own maintenance schedule;
- main electric lighting system (COO) - illumination monitoring data, records of the replacement of failed elements;
- emergency (standby) electric lighting system (SAO) - illumination monitoring data, records of the replacement of failed elements;
e) a complex of engineering and technical systems (KITS):
- precision air conditioning (microclimate) system in the data center (SPM) - monitoring data for temperature, humidity and pressure in the system, records of air filter replacement and of preventive maintenance of steam generators;
- ventilation and air conditioning system in data center premises with permanent workplaces (ICS) - monitoring data for temperature, air velocity and pressure in the system, records of air filter replacement and air duct cleaning;
- process water preparation system (SPV) - water quality control data, records of refilling filters with reagents and of filter changes.
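When reviewing the logs, the simplest test is whether each check was actually performed within its planned interval. The sketch below shows one way to do this, assuming you have copied the last recorded date of each check from the logs. The check names reuse abbreviations from the list above, while the dates, the tolerances of 31 and 366 days, and the function name `overdue_checks` are assumptions made for illustration.

```python
from datetime import date, timedelta

# Maximum allowed gap between checks, following the intervals named above
# (monthly performance checks, annual grounding measurement).
CHECK_INTERVALS = {
    "SOTS performance check": timedelta(days=31),
    "SAPS performance check": timedelta(days=31),
    "SZ grounding measurement": timedelta(days=366),
}


def overdue_checks(last_done: dict, today: date) -> list:
    """Return the checks that are missing from the log or older than their interval."""
    overdue = []
    for name, interval in CHECK_INTERVALS.items():
        last = last_done.get(name)
        if last is None or today - last > interval:
            overdue.append(name)
    return overdue


# Hypothetical dates copied from an operator's maintenance log.
log_entries = {
    "SOTS performance check": date(2024, 1, 10),
    "SAPS performance check": date(2024, 3, 1),
    "SZ grounding measurement": date(2023, 6, 1),
}
print(overdue_checks(log_entries, today=date(2024, 3, 15)))
```

A check that is absent from the log is treated the same as an overdue one, since for the customer an unrecorded inspection is indistinguishable from one that never happened.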
Customer Notification
No matter how well qualified the duty shift is, and no matter how thoroughly the operating procedures of the data center have been worked out, emergencies are still unavoidable. For you as a customer, it is important to be notified in time of any accident in the data center that may adversely affect the operation of your equipment. Timely notification reduces the time needed to restore the IT infrastructure. Find out which communication channels (Internet, telephone) are used, on what principle clients are notified, which information systems the data center operator uses for this, where they are located, and how quickly you will be notified of an emergency.
It is also worth asking how the systems that generate customer notifications are made redundant, and how the alert will be delivered if the entire data center loses power.
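The question about a complete loss of power can be reduced to a simple check: at least one notification channel must not depend on the data center itself. A minimal sketch of that check follows, in which the channel list and the `hosted_outside_data_center` attribute are hypothetical and would be filled in from the operator's answers.

```python
from dataclasses import dataclass


@dataclass
class NotificationChannel:
    name: str
    hosted_outside_data_center: bool  # hypothetical attribute, from the operator's answers


def channels_surviving_blackout(channels: list) -> list:
    """Return the names of channels that keep working if the whole site loses power."""
    return [c.name for c in channels if c.hosted_outside_data_center]


# Hypothetical answers from one operator.
channels = [
    NotificationChannel("on-site ticket system", False),
    NotificationChannel("externally hosted status page", True),
    NotificationChannel("mobile phone calls from the duty shift", True),
]

survivors = channels_surviving_blackout(channels)
print(survivors or "no channel survives a full power loss")
```

If the resulting list is empty, the operator has no way to tell you about exactly the kind of accident you most need to hear about.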
Of course, following all of the above recommendations for selecting a data center takes serious study and a good deal of time, but once you have worked through all the nuances you will very likely be able to identify a good site for hosting your IT infrastructure.
I wish you success in this difficult search!
Author: Alexey Degtyarev,
TsODy.RF journal, issue 1