
Sorry to disappoint you, colleagues, but my answer is: no way. Or, as a well-known Russian commercial put it, "Son, that's science fiction." Yes, Tier IV data centers exist in nature, as do providers that sign and actually honor SLAs, but this article is not about them (their services cost, excuse the pun, astronomical money). It's about the providers of dedicated physical and virtual servers and related services that most mere mortals, including us at Pixonic, currently use.
I deliberately won't name the specific companies I've worked with, both at Pixonic and before it, in order to reinforce the point already made: one way or another, serious "nuances" turn up everywhere. Someone may want to argue, but my conclusion rests on more than 8 years of experience dealing with technical support and data center management at various levels around the world.
A hedgehog is a proud bird: it won't fly until you kick it
I don't think I'm wrong in summarizing the problem like this: until every frontline worker at every data center whose services you use is personally interested in tightening their own particular nut, things will be bad for your business. But the situation can be improved, and your life made easier, quite simply.
Perhaps I'm about to state the obvious, but before Pixonic I had never seen any of this actually put into practice. Here things turned out better, although at the very beginning we lacked a systematic approach. Eventually we formulated one, implemented it, and life really did become easier.
Let me start with a simple question: when your provider sends you a link to a survey about the quality of its services, asking you to spend 5 minutes filling it out, what goes through your head? As a rule, "I'm tired of this" is the most innocent thought, after which the request is simply ignored.
That's because you've paid your money, you have your own business and your own tasks, and you pay precisely so that things just work. You have no time for their nonsense, after all! Sound familiar?
Let me tell you a secret: they aren't perfect, but they are trying. That's why they run these surveys and build interfaces and APIs of varying degrees of convenience, monitoring, notifications, and so on. If they provide tools, why not try to use them for the benefit of your business? Start by studying the documentation and knowledge base for the web UI and the API. Fill in the contacts in your profile settings so that important information (such as warnings about scheduled maintenance, or abuse reports that threaten to get you blocked) doesn't land in a shared mailbox nobody reads, but reaches the specific people who can respond adequately and in time. For example, route technical, network, and abuse matters to your administrators, and all billing questions to accounting and financial control.
Write letters

Next, don't be lazy: dig into each problem and write to support. At first, if the problem isn't urgent, just write it up so that everything is on record. This is useful not only so that, when something happens, you can point the provider's next on-duty engineer at it, but also so that, if necessary, you can show the ticket to a colleague. If the people writing these tickets don't know English well enough, do something about it: bring in colleagues, train them, hire a translator (Google Translate won't cut it!), or, in the end, do it yourself.
When communicating with a provider's representative in writing, keep the following points in mind:
- as a rule, engineers of varying skill levels work on a problem over time, and they are constrained by their company's internal regulations; often the engineer going off shift doesn't hand over information about the problem to the one coming on, so be prepared to patiently repeat and re-explain the same thing many times;
- if you mention a time in correspondence, always specify both your time zone and the UTC equivalent;
- if the root cause is obvious to you, write step-by-step instructions down to the last comma and, in communication with the manager (more on managers below), insist that they be carried out without question; otherwise, because of the regulations mentioned above, the engineer is obliged to ask you for strange things, like collecting and providing MTR statistics for 1000+ iterations from different hosts on the same subnet;
- the provider's support is often keen to investigate an incident, which can drag on for weeks; in such cases, feel free to ask for radical measures: a physical server replacement, or moving it to another rack / PDU / switch;
- if the ticket form lets you set the severity of the problem, set it honestly, so as not to end up like the boy who cried wolf;
- if a dedicated person handles the "debriefing" with providers but is not a decision maker, they shouldn't use a corporate signature stating their position; if they limit themselves to something like "Regards, Name", the provider will take them for a decision maker and work on the problem more actively;
- in general, in some situations, if it's acceptable to you, it can be useful to build a certain e-mail persona: among the formal HTML messages there appears a character who writes bluntly, in plain text, without a signature; the goal is the same;
- whenever you deem it necessary, use the abbreviation ASAP and variations of the phrase "business critical"; support "loves" these, especially the managers, who should either be CC'd or, if that's impossible, mentioned by full name right in the ticket, to show support that their superiors are already aware of the problem;
- if by the third time these magic phrases haven't worked, threaten to leave them tomorrow, without further talk and without paying; this "invigorates" them even more.
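Since requests like "collect MTR statistics for 1000+ iterations" tend to come up anyway, it can save a round-trip to gather the data before support asks. A minimal sketch in Python, assuming the standard `mtr` utility is installed on the hosts you run it from:

```python
import subprocess

def mtr_report_cmd(host: str, cycles: int = 1000) -> list[str]:
    """Build an mtr invocation that sends `cycles` probes and prints a plain-text report."""
    return ["mtr", "--report", "--report-wide",
            "--report-cycles", str(cycles), host]

def collect_mtr(host: str, cycles: int = 1000) -> str:
    # Run one report per target; repeat from several machines in the affected
    # subnet and attach the outputs to the ticket.
    result = subprocess.run(mtr_report_cmd(host, cycles),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

A 1000-cycle report takes roughly 15-20 minutes per host, so kick these off in parallel from each vantage point rather than sequentially.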
And if that doesn't help?
In fact, in practice only once did part of the infrastructure of Pixonic's flagship product actually move away from provider X (may the devils torment them).
As a rule, unless you only have 5-10 servers for the whole project, the provider is very interested in keeping you. And this is when you should start asking them for things. Not penalty payments, by the way: of all the providers we've argued with in raised voices over a business-critical incident, only one immediately asked how much money to refund. Because, hand on heart, you don't really have a mechanism for calculating the financial losses from a cloud provider's failed backup, a temporary DNS outage, or a temporary Internet split caused by problems on backbone links, do you? So demand the maximum elsewhere! The simplest options are discounts on new servers or instances, free rent or upgrades of existing ones, improved technical support, and the like. But the most effective thing to demand is a dedicated manager who will, yes indeed, personally motivate their team to take a heightened interest in your business.
Communicating with such a manager also has its own nuances, namely:
- sometimes the situation gets so heated that you want to write in the ticket in choice Russian; go ahead and write, but in English: angry yet restrained, detailed, and on the merits, mentioning not only the problem itself but also the communication failures inside the provider's support and the outright blunders of its engineers; such a ticket will certainly be escalated to your manager, who will either try to call you or offer in writing to get on a call, promising to solve everything; assess the situation soberly and don't waste time on the manager if you can reach the technical decision maker directly, say, the head of the third support line;
- in a critical situation, a voice call is good only when it actually helps solve something; a bad real-life example: once a conference call was assembled with 3 managers from different sides and one Indian engineer, just so he could dictate a password to me; don't waste time on that;
- nevertheless, calling, talking by voice, and meeting in person whenever possible is always useful: personal contact increases your chances of heightened interest in your business and, as a result, of all sorts of "perks": exclusive discounts, special compensation terms, exceptions to inconvenient regulations, and the like.
30 hours of pain
To avoid being unsubstantiated, here are some examples from practice during my time at Pixonic. To appreciate the scale of the problem, it's worth clarifying: we have points of presence in six regions of the planet, served with varying degrees of success by five providers, and problems like these arise regularly and everywhere. So let's go!
Once, an entire hall of our servers was switched off! On investigation, it turned out the provider had earlier given us discounts on the servers in that hall, but the manager had forgotten to enter this into a certain database. The servers ran like that for several months, until one of the provider's developers discovered, and fixed, a dormant script that walks through the equipment based on the database records and disconnects "debtors".
We've run into this kind of automation more than once with different providers. Once we were given two servers that would periodically shut down on their own. The first suspect, the automation, proved correct and was disabled. But one server kept going down anyway. The engineers went all out and checked everything they could. They even moved it to another rack, forgetting, however, to power it back on; nothing helped. In the end the server simply had to be replaced.
This one is more typical of office sysadmins ("the cleaning lady did it"), but it happens in reputable data centers too: suddenly connectivity to several servers in one rack is lost, and when an engineer investigates the incident, it turns out the connectors had "accidentally" popped out of their ports. Sometimes they even claim the connectors were simply worn out; how they had held on before the incident remains a mystery.
Speaking of cables: there have been cases where hard drives behaved strangely because of poorly seated cables. Support answered with a straight face: "We pushed them in better and it worked."
It happens that the provider's engineers mix up which server to work on. It also often happens that the people doing the work and their managers get the task confused and do something completely unexpected. Just like in the joke about the amputation: "I said the left one! I said the arm!"
At one point we needed to upgrade the hardware of 50+ production servers in one region. To minimize the impact on production, we upgraded them 5 at a time. And in every batch there was one faulty server! Either a problematic power supply, or dead disks, or a burnt-out motherboard. As a result, the upgrade took twice as long as planned.
Also about upgrades, but of operating systems this time. The provider offers a web interface and an API for server operations, including OS reinstallation. In our case, it turned out that for 80% of the servers that needed updating these tools simply didn't work! The provider's developers promised to look into it, and in the meantime support offered to reinstall the OS manually. The first manual Windows reinstall took the engineer two days!
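Provider APIs differ wildly, but driving such operations through the API rather than clicking in the UI at least leaves you artifacts (requests, responses, timestamps) to attach to a ticket when something fails. A hypothetical sketch; the base URL, endpoint path, payload shape, and auth scheme here are all invented for illustration, so consult your provider's API reference for the real ones:

```python
import json
import urllib.request

API_BASE = "https://api.provider.example/v1"  # hypothetical endpoint, not a real provider

def build_reinstall_request(server_id: str, image: str, token: str) -> urllib.request.Request:
    """Build a POST request asking the provider to reinstall a server's OS.

    Everything about this call (path, payload, bearer auth) is illustrative.
    """
    return urllib.request.Request(
        f"{API_BASE}/servers/{server_id}/reinstall",
        data=json.dumps({"image": image}).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Sending it, and logging the response instead of trusting the web UI:
# with urllib.request.urlopen(build_reinstall_request("srv-42", "debian-12", TOKEN)) as resp:
#     print(json.load(resp))
```

Keeping request/response pairs like these is exactly the "everything on record" habit from the ticketing advice above.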
More about interfaces: sometimes a "click" there actually does nothing, although support insists otherwise. And it inspires mixed feelings when the engineers of a well-known provider go in and tweak something by hand.
Once, engineers at that same cool provider asked us to downgrade our production virtual machines so they could figure out why monitoring of several metrics didn't work correctly on them. This came about because we ourselves had happened to notice problems with their monitoring, and that the metrics vanished when we planned to scale down resources for one of the VMs.
And finally, perhaps the most "fun" story. For a long time we had a "floating" network problem with some of the virtual machines in one region: at different times of day, for varying periods, client TCP sessions would drop abruptly. The provider insisted this couldn't be happening because it couldn't be happening, and we couldn't supply proof due to the unpredictability of the problem and the fact that everything looked normal on the machines themselves; the problem was visible only in monitoring. Eventually we arrived at the hypothesis that these VMs most likely sat on the same physical hypervisor. We passed the hypothesis to the provider along with a suggestion to check the host itself, after which something magical happened. First we were told the machines were on different hosts; then that yes, they were on one host after all, and the engineers could see the problem; then they cheerfully reported the problem solved, after which all our virtual machines immediately went down. As it turned out, the host simply couldn't handle the load, since our machines "generated too much traffic", and these simple folks had decided to just switch them off, "so as not to affect other customers"!
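Disputes like this are much easier to win when you record a few kernel counters on your side. A minimal, Linux-only sketch that reads the cumulative TCP retransmission counter from `/proc/net/snmp` (field names as documented for the Linux SNMP counters); sampled periodically and graphed as deltas, spikes in it line up with exactly the kind of session drops described above:

```python
def parse_tcp_retrans(snmp_text: str) -> int:
    """Extract the cumulative RetransSegs counter from /proc/net/snmp-style text."""
    # /proc/net/snmp has two "Tcp:" lines: a header row and a value row.
    tcp_lines = [line for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    header, values = tcp_lines[0].split()[1:], tcp_lines[1].split()[1:]
    return int(dict(zip(header, values))["RetransSegs"])

def sample_retrans() -> int:
    # Sample this on a timer and store the delta between samples: a spike
    # correlated with session drops is evidence a provider can't wave away.
    with open("/proc/net/snmp") as f:
        return parse_tcp_retrans(f.read())
```

This is a sketch of the idea, not a monitoring system; in practice you'd feed the samples into whatever metrics stack you already run.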
If these examples aren't enough, ask questions in the comments and I'll try to recall more instructive stories. Better yet, share your own.
Conclusion
To repeat: the tips in this article are simple enough to follow. They minimize the problems that arise from the gap between your expectations of data centers and reality, and let you focus on business goals. But some of those problems will be there regardless. To protect yourself from them as much as possible, build fault tolerance and flexibility into your infrastructure at the application-architecture level, as we do at Pixonic. And be sure to set up your monitoring properly, so that when talking to the provider's representatives you have something to point to instead of collecting 1000+ MTRs.