In my work I often run into resource shortages in corporate data centers, typically phrased like this: "We don't have enough physical space for the equipment," "We don't have enough power," and so on. The head-on solution is obvious: shut down and decommission part of the IT equipment, or replace it with hardware that is more efficient in terms of performance, consumption, or physical footprint.
In most cases it turns out that resources are actually plentiful, but they are used, to put it mildly, wastefully. The cause is often plain sloppiness, or a data center that has grown expansively, by inherited habit. Decisions are never checked for efficient use of the available resources, organizations have no methods for such checks, and as a result we get what we get.
If you have decided for yourself that you cannot go on living like this, I recommend starting with the blogs of companies such as CROC, Beeline, and DataLine. There you will find articles where they share their experience with energy efficiency. Their methods work: the PUE of commercial sites sits in the range of 1.3-1.4 (for some, even lower), which is an excellent result for a TIER III facility. At some point, though, you will realize that they play in their own league, with megawatts, reserves, and experienced staff, and that you don't belong at that party.
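For reference, PUE (Power Usage Effectiveness) is simply total facility power divided by IT load, so it is easy to estimate for your own site. A minimal sketch with invented numbers:

```python
# PUE = total facility power / IT equipment power.
# 1.0 is the theoretical ideal; 1.3-1.4 is a strong commercial result.
# The figures below are invented for the example.

def pue(total_facility_kw, it_load_kw):
    """Power Usage Effectiveness of a site."""
    return total_facility_kw / it_load_kw

# A small corporate site: 200 kW of IT load plus 120 kW of cooling,
# distribution losses, and lighting. An uncontained room often lands here.
print(round(pue(200 + 120, 200), 2))  # prints 1.6
```

Anything your measurement shows above ~1.5 usually means the "green" practices below will pay off quickly.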
So what are mere mortals to do, whose data center is 10 racks and 200 kW, and who never have enough hands or time?
Ideally, you need an easy-to-understand checklist that you can take in hand and walk around your site with, ticking boxes. It should also help you estimate, at least roughly, the effect of each proposed measure on efficiency (since you have neither the experience nor the best practices yourself). And it would be good if the measures were grouped by life-cycle stage: say you are about to purchase servers and storage, you open the relevant section, and there are the recommendations for the parameters of the hardware you are buying.
I won't keep you in suspense: such a document exists, and it is called the "EU Code of Conduct on Data Centres". I'll say right away that I have almost never met anyone who uses it in practice, which genuinely surprises me. It is freely available.
So, what is this document, and why will it be useful to you:
- It is a collection of best practices for improving data center efficiency, written by experts from various fields.
- It is well structured by stages of the data center life cycle, so you can easily prepare for, say, a replacement of IT equipment.
- It is also structured by subsystem, so if you have a server maintenance team, they can easily assess their own contribution.
- Every practice carries an estimate of its potential impact (from 1 to 5, where 1 is minor and 5 is maximum). This lets you weigh each measure at the planning stage against its implementation cost and expected return.
Let me run through the document, show how to work with it, and look at a couple of examples.
But first, a word of warning. Reliability and energy efficiency are two parameters that often pull your data center in opposite directions (not always, but often). Take raising the room temperature: it reduces the air conditioners' consumption, but at the same time the cooling fans in the servers spin up, which increases server consumption (oops...). It also eats into the fans' service life, and when that runs out, the fans stop, taking the temperature-sensitive servers behind them down too. So approach any change carefully, track its effect on adjacent systems, and always have a plan for rolling back to the original state.
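The trade-off above can be sanity-checked with a back-of-the-envelope calculation. A minimal sketch, where the per-degree coefficients are rule-of-thumb assumptions (not measurements from the manual) and fan power is taken to scale with the cube of fan speed, per the fan affinity laws:

```python
# Rough sketch of the reliability-vs-efficiency trade-off described above.
# All coefficients are illustrative assumptions, not measured values.

def net_savings_kw(cooling_kw, it_fan_kw, delta_t,
                   chiller_gain_per_deg=0.03, fan_speed_gain_per_deg=0.05):
    """Estimate net power change from raising the room setpoint by delta_t degrees.

    Assumptions:
    - chiller consumption drops ~3% per degree of setpoint increase;
    - server fan speed rises ~5% per degree once servers compensate;
    - fan power scales with the cube of fan speed (affinity laws).
    """
    cooling_saved = cooling_kw * chiller_gain_per_deg * delta_t
    fan_speed_ratio = 1 + fan_speed_gain_per_deg * delta_t
    fan_extra = it_fan_kw * (fan_speed_ratio ** 3 - 1)
    return cooling_saved - fan_extra

# 60 kW of cooling, 8 kW of server fans, raise the setpoint by 2 degrees:
print(round(net_savings_kw(60, 8, 2), 2))   # a modest net saving
# ...but push it too far and the cube law wins:
print(round(net_savings_kw(60, 8, 10), 2))  # negative, i.e. a net loss
```

The point is exactly the one in the paragraph: past a certain setpoint, the cubic growth of fan power wipes out the chiller savings, so measure before and after every step.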
So, we pick up the manual and start reading. Go straight to section 2.2 on page 3, where the color coding of the practices is explained.
Green - approaches, audits, monitoring, and so on. These are the most effective items in terms of return on investment: most of them assume either minimal spending (5.1.4, installing blanking panels in cabinets) or no spending at all, just a change in operating practice (4.3.1, auditing unused equipment).
Red - deployment of new software. Mostly trivia such as "make sure background processes don't hang and load the CPU," which you can safely skip. Although, if you run hundreds of applications...
Yellow - what to look for when purchasing new IT equipment.
Blue - what to do at the next refurbishment or maintenance. This includes examples of so-called "retrofit," i.e. upgrades to existing devices. For example, when replacing UPS batteries, swap the lead-acid ones for Li-Ion, which lets you drop the dedicated battery-room air conditioning and free up floor space. Or, while servicing an air conditioner, fit it with a fan speed controller.
White - optional practices; compliance with them is not required of candidates.
A small digression is needed here. The manual was created for operators who want to join the voluntary program "The European Code of Conduct for Data Centres," which is why the term "candidate" appears throughout the document; don't let it confuse you. The "white" practices still offer good recommendations on operating and building a data center.
Next, jump straight to chapter 3 on page 9. From there, read sequentially: the subsystems are described in order of their impact on data center power consumption (IT equipment, cooling, power supply, and so on).
Let's try to apply and mentally test the practices of different colors from different subsystems.
Green, clause 4.3.1. Impact: 5. It recommends auditing the equipment in use, where it is installed, and what services it provides. As ridiculous as this may sound, in many organizations I have seen every engineer shrug at the question "what is this server for?", and that in a server room of thirty machines, at most. And that is not counting the servers running a service used by three people in the whole organization. Seriously, especially if you joined the company recently, take a look at the server fleet from this angle.
Clause 4.3.2 follows naturally. Impact: 5. "Decommission unused equipment and audit regularly for idle devices."
Clause 4.3.8 is remarkable. Impact: 4. "Audit equipment for its environmental requirements. Flag such equipment for replacement or relocation." Suppose you have a few fresh servers, say for ERP, and some older ones with strict temperature requirements: no higher than 25 degrees. They sit there working away, but they prevent you from raising the temperature in the machine room. Then one day the ERP running on the fresh servers grows and demands more powerful hardware. A new server is bought to replace the previous pair. Here the manual recommends not shipping the replaced servers off to eBay, but using them to retire the ancient, temperature-limited machines. In effect, you migrate not one service to new hardware but several, decommissioning the oldest iron along the way, even though the upgrade was done for the ERP. In short: look deeper and further.
"Green" clause 5.1.4: installing blanking panels in cabinets, together with 5.1.7 and 5.1.8. At minimal cost, you can seriously reduce the mixing of hot and cold air and increase cooling efficiency.
Now for the section on mechanical systems (cooling). Clause 5.1.2. Impact: 5. It suggests separating hot and cold air flows through aisle containment. This is a "blue" practice, i.e. a retrofit. Although the manual recommends carrying out upgrades during planned downtime, this particular work can be done on a live data center, since you only touch the cabinet structures; there are now containment solutions that go up with virtually no tools and no drilling. And once again, remember the interconnections: once containment is in place, revisit the air conditioner settings; at a minimum, you will probably be able to raise the supply air setpoint. While you are at it, make a note against clauses 5.4.2.4 (impact: 2) and 5.5.1 (impact: 4): fit the indoor units with variable-speed fans and compressors.
The "yellow" practices are concentrated almost entirely in chapters 4.1 and 4.2 and mostly concern IT procurement. Engineering systems tend to last at least 10 years, so what you have now you can only retrofit (i.e. "blue" practices). IT equipment turns over much faster, so you may get to apply "yellow" practices as soon as next quarter. Two example recommendations. "When writing the requirements for new hardware, pay attention to its operating temperature range." This lays the groundwork for energy management methods, free of the limitations your servers, storage, and so on would otherwise impose. "Require built-in monitoring of power consumption and of temperature at the server air intake." This lets you gradually move from capacity assessments based on nameplate data to assessments based on real-time data. Naturally, all of this will require changes to your monitoring and reporting, which are covered in chapter 9.
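The payoff of moving from nameplate to measured data can be sketched in a few lines. The inventory and readings below are invented for the illustration; in practice the measured peaks would come from whatever built-in sensors your servers expose:

```python
# Illustrative sketch: compare nameplate power against measured peaks
# to see how much UPS capacity is actually stranded by paper planning.
# All figures below are invented for the example.

inventory = [
    # (server, nameplate watts, measured peak watts over the last month)
    ("erp-01",    750, 410),
    ("db-01",     900, 520),
    ("backup-01", 600,  80),  # suspiciously idle: a candidate for clause 4.3.1
]

nameplate_total = sum(nameplate for _, nameplate, _ in inventory)
measured_total = sum(peak for _, _, peak in inventory)
stranded_w = nameplate_total - measured_total

print(f"nameplate: {nameplate_total} W, measured peak: {measured_total} W")
print(f"capacity freed by planning on real data: {stranded_w} W")
```

Leave yourself a safety margin, of course, but even a crude comparison like this usually shows that the room is far less "full" than the nameplate sums suggest.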
I won't consider the "red" practices, given my dismissive attitude towards them. I'd be glad if someone in the comments could demonstrate their effectiveness.
The "white" practices are absolute hardcore for a corporate data center. They are full of slogans like "Give us ASHRAE class A4!", "Blow in air straight from the street!", "Real men don't use a UPS!". This is where the pursuit of energy efficiency starts eating into reliability.
Summary:
- The proposed practices are fairly simple to understand and implement; this is not rocket science. You can start right now.
- Start with the "green" techniques. They have high impact, are simple and cheap, and will change your approach to planning and operation, which in most neglected cases produces a quick, visible effect.
- Naturally, work from the most influential practices (5) down to the least (1).
- Make a plan. Implementing the "green" techniques will give you a complete picture of what you have now, including the technologies you use. Draw up a modernization plan for every subsystem, referencing the relevant items of the manual. Estimate the budget for each change, apply correction factors based on each technique's impact rating, and you will end up with a prioritized action plan.
- Don't forget the interconnections between systems and monitor their mutual influence. To do that, start monitoring everything you can reach.
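The prioritization step in the plan above amounts to ranking practices by impact per unit of cost. A toy sketch, where the clauses come from the manual but the costs, impact-per-cost scoring, and ordering logic are my own illustrative assumptions:

```python
# Toy prioritization: rank candidate practices by impact per unit of cost.
# Costs are invented; the scoring function is an assumption, not the manual's.

practices = [
    # (clause, description, impact 1-5, estimated cost in currency units)
    ("4.3.1",   "audit unused equipment", 5,    0),  # costs time only
    ("5.1.4",   "blanking panels",        4,  500),
    ("5.1.2",   "aisle containment",      5, 8000),
    ("5.4.2.4", "variable-speed fans",    2, 3000),
]

def priority(impact, cost):
    # Treat zero-cost items as a nominal cost of 1 so they sort first.
    return impact / max(cost, 1)

ranked = sorted(practices, key=lambda p: priority(p[2], p[3]), reverse=True)
for clause, name, impact, cost in ranked:
    print(f"{clause:8} impact={impact} cost={cost:5}  ({name})")
```

Zero-cost audits naturally float to the top, which matches the advice to start with the "green" practices.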
And I almost forgot about the case from the title.
Company X asked us to budget an expansion of its corporate data center into additional space. They needed to install 2 high-density racks. According to them, there was no physical room for the racks in the existing hall, no spare cooling capacity, and the UPS was already running at 85% of capacity at peak, which was not enough. We estimated the budget; it came to a tidy pile of money. Then we went to look at the site. The inspection revealed the following:
- 1. The machine room of 40 racks used underfloor air distribution, but there was no air containment, and the cabinets were full of empty units without blanking panels. That more or less settled the question of the existing system's cooling capacity, and at the same time suggested a solution to the physical placement problem.
- 2. We looked at the UPS logs and saw the load rising at night. Logically, it should fall, or at least stay roughly flat. It looked a lot like backups, or some database or application update job. But it turned out that applications were updated only on weekends, the databases lived their own lives, and backups had been streaming in real time to another site for two years now. In theory. In practice, somebody had failed to decommission part of the infrastructure once responsible for backups. Right there on the spot we calculated that switching off the unneeded hardware would free up the required kilowatts.
- 3. We asked: "Will you order an audit, or have you figured it all out yourselves?" "Figured it out, figured it out," they replied, and disappeared for a long while.
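The log check in point 2 boils down to comparing night load against day load. A minimal sketch of the idea, with made-up hourly readings and my own (assumed) definition of "night" as 22:00-06:00:

```python
# Flag a UPS whose average night load exceeds its average day load.
# The 24 hourly readings (kW) are invented for the illustration.

hourly_load_kw = [42, 41, 44, 45, 46, 44, 38, 35, 33, 32, 31, 32,
                  33, 32, 31, 32, 33, 34, 35, 37, 39, 41, 42, 43]

night = [kw for hour, kw in enumerate(hourly_load_kw) if hour < 6 or hour >= 22]
day = [kw for hour, kw in enumerate(hourly_load_kw) if 6 <= hour < 22]

night_avg = sum(night) / len(night)
day_avg = sum(day) / len(day)

if night_avg > day_avg:
    print(f"suspicious: night avg {night_avg:.1f} kW > day avg {day_avg:.1f} kW")
```

In the real case, exactly this pattern pointed at a forgotten backup infrastructure that nobody had decommissioned.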
After our conversation, the customer, with two of his engineers and a couple of weeks, cleaned up a mess that had been piling up for two years. Aisle-containment structures and cabinet blanking panels were ordered and manufactured. The backup iron was physically decommissioned, and in the process several more unused servers turned up. The cabling under the raised floor was tidied. As a result, they got the kilowatts and rack units they needed, with margin to spare. Our expenses came to 3,131 rubles for gasoline, plus working time, but we didn't bill the customer for them; that would have been uncivilized.
And those high-density racks? They never did install them.