
Data Center and Digital Transformation

The power and cooling infrastructure of a data center today generates about three times more data and messages than it did 10 years ago. Traditional remote monitoring tools were not designed for such a flow of information. Let's look at how to extract valuable information from this large volume of data, how to make data center operation more efficient, and what potential this opens up.



Earlier monitoring tools differ significantly from modern ones, which use cloud computing, analytics, and mobile applications. Navigation devices and fitness trackers have become familiar gadgets, yet most data centers still do not use modern technologies such as big data analytics and machine learning, although these could revolutionize how data centers are operated. By analogy with the now popular term “digital transformation”, we will call this new generation of monitoring digital monitoring.

| Function | Traditional remote monitoring | Digital remote monitoring |
|---|---|---|
| Online mode | No | Yes |
| Remote diagnostics | Usually not | Yes |
| Network Operations Center (NOC) | Yes | Yes |
| Incident tracking | Seldom | Yes |
| Analytics | No | Yes |
| Mobile application with live notifications | No | Yes |
| Chat | No | Yes |
| Real-time monitoring | No | Yes |
| Secure network connection | No network | Yes |
| Cloud storage | No | Yes |
| Status "at performance" | No | Yes |
| Supported devices | Usually UPS | All SNMP devices |

The main difference between digital monitoring and the traditional kind is a permanent connection, over a dedicated channel or the Internet, and the use of the most modern technologies, from machine learning to the Internet of Things. Traditional monitoring is not an online service and does not work in real time. It only notifies you of a change in status, usually by email.

Digital monitoring works online: a permanent connection to the data center (usually through a gateway) allows you to work in real time. In addition, it uses such IT services as cloud storage and data analytics.
Earlier monitoring tools were PC-based, could collect and present only a limited amount of data, and essentially allowed operators only to react to a situation, depending on how they interpreted the information received. Digital remote monitoring removes these limitations.

Who owns the information ...


The following trends affect data center monitoring today: high-performance, cost-effective embedded systems, cybersecurity, cloud computing, big data analytics, mobile computing, and machine learning.

Embedded systems control the operation of almost all data center equipment, including cooling systems, UPSs, remote controls, chillers, and so on. They also provide the data for monitoring. In recent years their computing, communication, and data storage capabilities have improved significantly, and at the same time they have become cheaper. As a result, devices in the data center generate much more data, at least three times more than a decade ago.


The more data there is, the more useful information it carries.

However, cybersecurity is becoming an increasingly serious problem. This concerns not only the vulnerability of IT equipment but also the data center infrastructure systems. Digital remote monitoring and other cloud services must account for these risks from the outset, from the design phase through to security policies. Typically, a gateway (usually a software gateway) serves as the network entry point, and all devices work through it.


Recommended digital monitoring architecture.

Clouds are a highly scalable method of storing and processing data. Cloud computing is the basis of remote monitoring services. Services such as predictive analytics and machine learning can operate in the cloud, unlocking the potential of remote data center monitoring and providing it with more valuable features.


Using machine learning, you can, for example, model the PUE of a very complex data center, such as a Google data center.

Big data analytics may seem exotic, but it is already used in services such as preventive maintenance and capacity planning. The need for it arises when data volumes grow to the petabyte scale, when the data becomes unstructured, or when it requires real-time processing. Data analytics goes hand in hand with machine learning methods, which build forecasts from previously obtained results.


Automation and mobile applications make the work of data center administrators easier and allow them to do more with less.

Do not drown in the sea of data


With the increasing amount of data and information flow, it becomes more difficult for data center administrators to make the right decisions. Here are just some of the problems:



A unified monitoring platform simplifies problem determination and troubleshooting.

The digital remote monitoring service helps to overcome these problems and provides the following benefits:


Monitoring center


The task of a monitoring center is to reduce the risk of downtime by identifying and resolving one situation before it leads to another. In this context, the digital remote monitoring service should meet the following requirements:


Network Operations Center (NOC). It employs data center support experts.



The service engineer must know exactly what needs to be replaced or fixed, so as not to have to visit the site a second time.

What should a remote digital monitoring service be like?


The following requirements will help the remote digital monitoring service to improve work efficiency and help its staff concentrate on the most important tasks.


Chats and messengers can also be useful: they not only help you work as a team but also let you quickly reach the experts in the NOC.

Quick start-up means that within about 30 minutes you can install a gateway, set up automatic device discovery, register software, configure the application, and start monitoring the data center.

Configuring every monitored device manually wastes a lot of time and increases the likelihood of errors. To discover critical infrastructure devices automatically, a digital monitoring system uses the SNMP protocol. Modbus TCP devices, however, are usually not recognized automatically: they need a Device Definition File (DDF). As a rule, the gateway scans a specified range of IP addresses, recognizes the devices it finds, and presents this data to the user.
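
The address-range scan described above can be sketched roughly as follows. This is a minimal illustration, not the behavior of any particular gateway: the `probe` callback, the `fake_probe` stand-in, and the sample inventory are all hypothetical, and a real gateway would issue an SNMP query where the stand-in does a dictionary lookup.

```python
import ipaddress
from typing import Callable, Dict, Optional


def discover_devices(
    cidr: str,
    probe: Callable[[str], Optional[str]],
) -> Dict[str, str]:
    """Scan every host address in `cidr` and return {ip: description}.

    `probe` is a hypothetical callback: it queries one address and returns
    a description string, or None when nothing answers.
    """
    found: Dict[str, str] = {}
    for host in ipaddress.ip_network(cidr).hosts():
        description = probe(str(host))
        if description is not None:
            found[str(host)] = description
    return found


# Stand-in inventory for illustration only (assumed values).
FAKE_INVENTORY = {
    "10.0.0.2": "UPS, 40 kVA",
    "10.0.0.5": "In-row cooling unit",
}


def fake_probe(ip: str) -> Optional[str]:
    # A real gateway would send an SNMP request here instead.
    return FAKE_INVENTORY.get(ip)


devices = discover_devices("10.0.0.0/29", fake_probe)
for ip, description in devices.items():
    print(ip, description)
```

Passing the probe in as a parameter keeps the scan logic independent of the protocol, so the same loop could in principle drive a Modbus TCP probe backed by a definition file.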

Events are processed by priority - the most critical first. This practice reduces the burden on data center operators, who know that NOC experts will be warned and will understand the situation when several events occur.
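
One common way to process the most critical events first is a priority queue. The sketch below assumes a three-level severity scale and made-up event messages; neither comes from a specific monitoring product.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List

# Assumed severity scale: a lower number means a more critical event.
SEVERITY = {"critical": 0, "warning": 1, "info": 2}


@dataclass(order=True)
class Event:
    priority: int
    sequence: int  # arrival order, breaks ties between equal priorities
    message: str = field(compare=False)


class EventQueue:
    def __init__(self) -> None:
        self._heap: List[Event] = []
        self._counter = itertools.count()

    def push(self, severity: str, message: str) -> None:
        event = Event(SEVERITY[severity], next(self._counter), message)
        heapq.heappush(self._heap, event)

    def pop(self) -> str:
        # Always returns the most critical, earliest-arrived event.
        return heapq.heappop(self._heap).message

    def __len__(self) -> int:
        return len(self._heap)


queue = EventQueue()
queue.push("info", "Firmware update available")
queue.push("critical", "UPS on battery")
queue.push("warning", "Rack inlet temperature high")
queue.push("critical", "Chiller offline")

handled = [queue.pop() for _ in range(len(queue))]
for message in handled:
    print(message)
```

The sequence counter ensures that events of equal severity are handled in arrival order.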

Analyzing the correlation and causes of events makes it possible to evaluate multiple alarms, narrow down the possible causes, and propose solutions. This correlation process can be carried out by NOC experts or implemented as a combination of machine learning and expert assessment.

Alarm consolidation turns several messages from one device into a single incident, so no time is wasted on several identical messages. Moreover, for each incident you can automatically generate a repair request, see who is currently handling the issue and what has been done so far, and track the progress of the work until final resolution.

The alarm context can contain useful information such as the source of the alarm (for example, a rack number), which systems are affected, and what exactly should be checked. All of this is available in the mobile application.

Anyone who has searched the Internet for a solution to a problem knows how many posts from different users must be read to find the right answer. Such “crowdsourcing” takes a lot of time. Remote digital monitoring services should be complemented by their own online communities.
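
Consolidating messages from one device into a single incident can be illustrated with a short sketch. The `Incident` structure, the device names, and the alarm texts are assumptions made for the example; a real service would also attach ticket and assignee data.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Incident:
    device: str
    messages: List[str] = field(default_factory=list)  # distinct messages
    count: int = 0  # total alarms received, including repeats


def consolidate(alarms: Iterable[Tuple[str, str]]) -> List[Incident]:
    """Group (device, message) alarms into one incident per device."""
    incidents: Dict[str, Incident] = {}
    for device, message in alarms:
        if device not in incidents:
            incidents[device] = Incident(device)
        incident = incidents[device]
        incident.count += 1
        if message not in incident.messages:
            incident.messages.append(message)
    return list(incidents.values())


# Hypothetical alarm stream: the UPS repeats the same message three times.
alarms = [
    ("ups-01", "On battery"),
    ("ups-01", "On battery"),
    ("crac-03", "High return air temperature"),
    ("ups-01", "On battery"),
    ("ups-01", "Battery low"),
]

incidents = consolidate(alarms)
for incident in incidents:
    print(incident.device, incident.count, incident.messages)
```

Five raw alarms collapse into two incidents, which is the reduction in noise that the consolidation step is meant to deliver.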




A downtime incident usually involves not a single event but a sequence of them.

Energy efficiency


The more devices monitoring covers, the more opportunities there are to improve the energy efficiency of the data center. However, to draw useful conclusions about efficiency you must, at a minimum, measure the load at the UPS output. Without a baseline for the power consumed by IT equipment, it is impossible to determine its cooling needs. For example, if a chiller starts consuming more energy, it is not clear whether this is a chiller problem or a consequence of an increased IT load. With more complete data, you can compare the total power consumption of all devices against the cooling parameters to identify anomalies.
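
The chiller example can be made concrete with a rough sketch. It assumes, as a deliberate simplification, that chiller power scales roughly in proportion to IT load, so that a rise in the chiller-to-IT ratio beyond some tolerance points at the chiller rather than at the load. The function name, the 10% tolerance, and the sample readings are all hypothetical.

```python
def classify_chiller_increase(
    baseline_it_kw: float,
    baseline_chiller_kw: float,
    it_kw: float,
    chiller_kw: float,
    tolerance: float = 0.10,  # assumed allowable drift in the ratio
) -> str:
    """Compare the current chiller-to-IT power ratio with a baseline ratio."""
    baseline_ratio = baseline_chiller_kw / baseline_it_kw
    current_ratio = chiller_kw / it_kw
    if current_ratio > baseline_ratio * (1 + tolerance):
        return "possible chiller problem"
    return "consistent with IT load"


# Baseline: 200 kW of IT load measured at the UPS output, 60 kW chiller.
# Case 1: chiller draws 90 kW, but IT load has also grown to 300 kW.
print(classify_chiller_increase(200.0, 60.0, 300.0, 90.0))
# Case 2: chiller draws 80 kW while IT load is unchanged at 200 kW.
print(classify_chiller_increase(200.0, 60.0, 200.0, 80.0))
```

Without the IT load reading, both cases would look identical: the chiller simply consumes more energy than before.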


The PUE (power usage effectiveness) ratio makes it possible to quantify the excess power consumed at a given IT load.

An even more effective method is real-time PUE measurement. Implemented properly, this approach gives you reports on energy efficiency trends and generates messages when conditions change. Moreover, an effective system allows you to identify the sources of problems and correct the situation. Monitoring in this case can be performed by NOC personnel.
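
PUE is the ratio of total facility power to IT equipment power. A minimal sketch of real-time PUE tracking with a threshold alert might look like this; the readings, the timestamps, and the 1.6 threshold are illustrative assumptions, and the IT load is taken to be the value measured at the UPS output.

```python
from typing import Iterable, List, Tuple


def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def pue_alerts(
    readings: Iterable[Tuple[str, float, float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (timestamp, PUE) for each reading above the threshold."""
    alerts = []
    for timestamp, total_kw, it_kw in readings:
        value = pue(total_kw, it_kw)
        if value > threshold:
            alerts.append((timestamp, value))
    return alerts


# Hypothetical readings: (timestamp, total facility kW, IT load kW).
readings = [
    ("10:00", 150.0, 100.0),
    ("10:05", 155.0, 100.0),
    ("10:10", 180.0, 100.0),
]

alerts = pue_alerts(readings, threshold=1.6)
for timestamp, value in alerts:
    print(f"{timestamp}: PUE {value:.2f} exceeds threshold")
```

In this example only the 10:10 reading raises an alert, because total power rose while the IT load stayed the same.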

Real-time PUE monitoring.

Scalability


Scalability is the ability of a remote digital monitoring system to control an increasing number of devices (nodes). Depending on the system architecture, there may be thousands of devices. For small data centers with IT loads up to 500 kW, scalability is usually not a problem, unlike large data centers, where the number of devices can reach hundreds of thousands, and readings are taken every few seconds.

In this case, the monitoring system should use a horizontally scalable cloud architecture. A cloud service can automatically add compute nodes to process the additional data. Internet of Things (IoT) technology is a promising direction here.

New approaches to operation


In the future, the data center will be much less dependent on the "human factor" - possible errors. Automation and machine learning will help. The more data is collected about the causes of downtime, the better the monitoring system will be able to predict the likelihood of downtime and recommend steps to prevent it.


The efficiency of data center operation can be improved due to more accurate models and the accumulation of data on the actual operations of different data centers.

A data center model using machine learning will have enough information to fine tune the cooling system and minimize power consumption. The simulation will also predict power consumption.
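
As a deliberately simple stand-in for such a model, the sketch below fits an ordinary least-squares line to past measurements and uses it to predict total power consumption at a new IT load. A production model would use many more inputs and a richer learning method; the data points here are invented for illustration.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: IT load (kW) versus total facility power (kW).
it_load_kw = [100.0, 200.0, 300.0, 400.0]
total_kw = [160.0, 300.0, 440.0, 580.0]

slope, intercept = fit_line(it_load_kw, total_kw)
predicted_total_kw = slope * 500.0 + intercept
print(f"Predicted total power at 500 kW IT load: {predicted_total_kw:.1f} kW")
```

With these sample points the fitted line has a slope of 1.4 and an intercept of 20 kW, which can be read as a fixed overhead plus a per-kilowatt cost of supporting the IT load.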

Through the mobile application, the data center administrator will be notified if something goes wrong and will see what steps to take to correct the situation. For more complex procedures, virtual reality technology can be used.

Collecting a variety of data will allow the data center to move from scheduled maintenance to condition-based maintenance. Numerous sensors and algorithms will help predict component failures, make the generated messages easier to understand, and ultimately reduce maintenance costs. And big data analytics will allow manufacturers to make their components more reliable.


The digital remote monitoring service will automatically generate work orders for field engineers.

The life support systems of a data center consist of sophisticated equipment and require special attention. They include cooling and air conditioning, fire suppression, power supply, telecommunications, and structured cabling. In a data center built to Tier III requirements, any infrastructure element can be maintained or repaired without stopping the data center or reducing its operating capacity: all equipment has N+1 redundancy, which allows us to speak of 99.982% availability for the facility.
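
To put the 99.982% figure in perspective, it can be converted into permitted downtime per year. The short calculation below assumes a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Hours per year a facility may be down at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


downtime = annual_downtime_hours(99.982)
print(f"{downtime:.2f} hours per year ({downtime * 60:.0f} minutes)")
```

An availability of 99.982% therefore corresponds to roughly 1.6 hours of downtime per year.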


Ultimately, all this translates into less data center downtime and greater reliability.

The data center monitoring system helps improve operational efficiency by providing information support for the IT service. The task of a modern monitoring system is not just to record an emergency and promptly report it, but to enable proactive observation and analytics that prevent incidents. For example, if an equipment component fails, such a system immediately initiates its replacement, up to and including a purchase request for a new one if necessary. The digital remote monitoring service will allow you to use your valuable analytics and situational services at an even higher level. This future will come very quickly.

Source: https://habr.com/ru/post/314612/

