
Supporting 10 data centers around the world: my experience and the rakes I stepped on


This is a 2-petabyte backup

We have 14 data centers around the world, and I look after ten of them. About five years ago I assumed that over there, abroad, everything glitters: support is attentive and polite and only gets the small things wrong. My illusions evaporated quickly.

Here is an example. Our servers sit in racks; in essence they are disk shelves meant for “slow” backup data. They ran out of space. Each server had 24 disks in 36 slots, so we decided to add 12 more HDDs. I sent the tickets, explained what we were doing and why, and added that the disks had to go into the unlit slots.
Ten minutes later, monitoring showed that a disk had dropped out of the first server. “Wow, our colleagues are really on fire,” we thought. Probably bumped it, or something like that... But then the second and third disks fell out almost at once. I started calling German support, and a colleague from India answered.

By the time we managed to stop his Greek colleague, this “terminator” had pulled 12 disks out of five servers and was about to start on the sixth. The system went into a frantic rebuild. When the “sapper” realized what exactly had gone wrong, they started putting the disks back into the servers, and mixed up the order just a little. That added to the rebuild frenzy. Fortunately, thanks to the detailed explanations, we managed to avoid restoring from backup, which would have interrupted the service for half an hour.

That is how I found out who exactly works in support, and how. Part of the fault is mine: I was counting on the kind of second-line support that understands me back in Russia from half a word, and did not account for cultural and language differences. Since then we have been writing extremely detailed step-by-step instructions in the spirit of:
  1. Go to server such-and-such.
  2. Make sure it is that server by checking such-and-such number.
  3. Count down to the fourth disk from the top.
  4. Find the eighth disk from the bottom.
  5. If it is one and the same disk, carefully remove it.

In general, we assume that any place where something can be done or understood incorrectly will be exploited, like a vulnerability in code. Colleagues still occasionally surprise us with unconventional readings of ordinary actions, and we extend the standard templates. By now I have a two-page instruction for every routine operation. It helps a lot.
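The redundant counting in steps 3 and 4 above is the whole point: the two counts only agree when they land on the same physical slot. A toy cross-check of that arithmetic (the slot counts here are purely illustrative, not from a real shelf):

```python
def same_slot(from_top, from_bottom, slots_in_column):
    """True if 'N-th from the top' and 'M-th from the bottom' name the same slot."""
    # In a column of K slots, the n-th from the top is the (K + 1 - n)-th from the bottom.
    return from_top + from_bottom == slots_in_column + 1

# Illustrative numbers only: a 12-slot column, the target disk counted both ways.
print(same_slot(from_top=4, from_bottom=9, slots_in_column=12))  # True  -> safe to pull
print(same_slot(from_top=4, from_bottom=8, slots_in_column=12))  # False -> stop and ask
```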

St. Louis, United States


Our first data center is in St. Louis, in the USA. That is where we placed users' cloud backups at the very beginning. Given how popular the service was back then, and the general lack of understanding that a backup should not only be made but also stored off-site (Dropbox was barely a year old, and advanced users were still mostly burning backups to discs), we did not think much about architecture and scaling. As it turned out, not nearly enough. The load started growing faster than we expected, and our hoster PlusServer AG could not take in new hardware at the required pace.

Broadly, we have two types of data centers: ones where we rent floor space (they provide racks, cooling, power and security) and ones where we effectively take a very large colocation (they provide a section of the machine room, connectivity and support). In the first type our own local engineers do the work; in the second we have no direct access to the hardware and the data center's support team handles it. PlusServer AG is something of an intermediate case, and we mostly use the services of their engineers. I cannot recall any difficulties or awkward moments with them. Knock on wood...

Today our section of the St. Louis data center is half idle and waiting for migration: there is a lot of old hardware there, used only by the testers.

Strasbourg, France


This is our second data center, and it also holds a lot of “mature” hardware; I think there was even a pair of Core i3 machines that the testers had pulled out of the main infrastructure for “abuse” during crash tests.

It is the same PlusServer, but communicating with support here is surprisingly difficult. Sometimes it is very hard to explain anything to them. As you have already read above, if something needs explaining, it takes half an hour to cover every possible scenario. An instruction of fewer than 30 steps for restarting a server will most likely be interpreted incorrectly.

During tests of a 10G switch we asked them to configure the network on a new server, and the moment the ticket was executed the entire data center dropped off our monitoring. It turned out that the person doing the configuration had swapped the gateway and the server's IP address, so all the other servers tried to reach the network through that one server.
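A mix-up like that is cheap to catch before the config is applied. Here is a minimal pre-flight check of the kind one could attach to such a ticket; it is a sketch with made-up addresses, using Python's standard ipaddress module:

```python
import ipaddress

# Hypothetical values taken from a network-configuration ticket.
server_ip = ipaddress.ip_interface("203.0.113.42/24")  # address to assign to the server
gateway = ipaddress.ip_address("203.0.113.1")          # default gateway for that subnet

# The gateway must sit in the server's subnet...
assert gateway in server_ip.network, "gateway is outside the server's subnet"
# ...and must not be the server's own address: the swap described above would
# make every other host try to route its traffic through this one server.
assert gateway != server_ip.ip, "gateway and server IP are swapped or identical"

print("looks sane:", server_ip, "via", gateway)
```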

Tokyo, Japan


Our third data center is in Tokyo; the facility provider is Equinix and the Internet provider is Level 3. At the very start the two of them could not agree on cross-connects. We needed a burstable channel, that is, a line used at roughly 10% of its capacity now, which we planned to grow tenfold within two years. At the point of contact (the MMR, Meet-Me Room, where provider circuits enter the data center) there was no link at all.

Level 3 said they had done everything correctly and had no plans to redo anything. So for a week and a half I first figured out what exactly was wrong, and then persuaded both companies to do what was needed, gathering representatives of their various departments on a conference call. Each side was convinced it had done everything right and to the letter, and nobody wanted to admit a mistake. So I simply asked them "to meet us halfway and do a little more." They did.

The most pleasant thing about working with Japanese support is that they are incredibly diligent. That has a downside: the instructions need to be almost as detailed as in Strasbourg, because they are incredibly pedantic and ask very, very many questions. Once they spent 12 hours (!) installing controllers into one server. If a situation has two possible answers and a support engineer knows the first option is 95% likely to be correct, he will just do the logical thing almost anywhere in the world. Except Japan. In Japan he will stop, describe the dilemma in detail, and patiently wait for your answer. For example, if there is more than one free slot inside a server, they always stop the process and ask which one to install into.

Frankfurt am Main, Germany


Here it is Equinix again, with full support from their side. The data center was planned as a small auxiliary node in the CDN, but it grew into a serious site. These are the guys who pulled 12 disks out of our servers on a Friday evening.

The checklist is:
  1. Break instructions into short steps.
  2. Communicate in short, simple sentences.
  3. Try not to leave room for decisions "on the spot", that is, spell out all the options in detail.

Then everything works just fine. I have to say there have been no further incidents since we introduced these rules.

Well, actually, there was one more story. We were running out of space and bought a whole pile of boxes of disks at once, not from a local supplier (he could not deliver that quickly) but from London, shipped out of a warehouse in the Netherlands. The truck arrived the same day. Then a letter came from the supplier: we delivered the disks, the recipient refused them, we are taking them back. It turned out that the valiant security guards could not find anything on the boxes saying who the disks were for, and turned them away. Since then we always ask for the boxes to be labeled properly when shipping to a fully managed data center.

By the way, Seagate is very quick off the mark: out of the goodness of their hearts they decided to return the disks to the sender's warehouse as fast as possible, since the customer had obviously gotten the city wrong and the disks were surely needed urgently in some other part of the planet. We caught up with the shipment only once it was on a plane; it had to take another flight back. The second delivery attempt succeeded.

Singapore, Singapore


The fifth data center is also fully managed, only the provider is SoftLayer. In all this time there has not been a single story, not a trace of a misunderstanding. No problems at all, except for the price.

Working with them is very simple: you say what you need, they send a bill and provide the infrastructure. Their prices are among the highest, but you can and should bargain; different resellers may offer different terms for the same services, for example. Judging by the ticket responses they have a lot of staff, and every single one of them is competent.

Sydney, Australia


We wanted to run the sixth data center with our own engineer. It turned out that finding a specialist of the required level in Australia is quite hard: roughly speaking, we needed a quarter-time freelance administrator who would come in a couple of times a month for routine work and also turn out in case of an incident. We usually look for such candidates through specialized agencies that send us three or four dozen profiles of specialists already suited to this kind of work. From those we shortlist up to 10 and interview them over Skype. In the end one person joins the staff, plus another one or two stay on standby so they can replace him if anything happens, for example illness.

The problem at the Sydney data center was the 72-disk servers. They needed a hell of a lot of power: there were 6 such servers per rack, each drawing up to 0.9 kW, while the rack budget ranged from 6 to 8 kW. A colleague says that if you walk in after a shower, your clothes are completely dry within 10 minutes.

London, England


In London we share the site with Acronis Disaster Recovery. This is the most boring data center of all: nothing has ever happened there. The hardware has been in place for a year now, and still nothing happens. Knock on wood.

Boston, United States


Boston is our largest data center. There are plans to move it.

In Boston we experimented with 72-disk servers in a 4U chassis. We had our fill of problems, because the server is simply "magical". Our guru admin sits in the Boston office, but he actually has a different job, and dragging him across the whole city every time a disk needs replacing does not feel right. It is expensive, too.

So we write tickets to local support. But that support works not for the data center itself but for a third-party company that provides our racks inside this huge facility. And nobody else is allowed on site: either our guru goes, or their support does. On their own they can insert a USB drive into a machine for initialization, swap failed disks and reboot a server. That is all. Once we needed specific disks pulled out of a 72-disk server. It uses double sleds, with two disks sitting one behind the other; it is hard to figure out which is which, so they still sometimes touch the wrong disks. Our guy had to go in after all.

At some point after launch an electrician came running to us with a long letter. The bottom line was that 7 servers at 0.9 kW each in one rack is a bit too much: the rack was drawing 115% of its rated power and needed to be offloaded. The other racks there were the typical local kind, with two PDU strips at the back where the power actually plugs in. Our servers are 20 centimeters longer than usual, and those exact 20 centimeters covered the power sockets. We diluted the rack with "short" servers. I remember that while we were playing this game of Tetris, people were shuffling 60-pound machines around, carrying them in pairs and getting quite a workout.
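The arithmetic behind the electrician's letter is easy to reproduce. A quick sketch using the figures above (the rack's power rating is my back-calculation from the reported 115%, not a number from the facility):

```python
# Power-budget check for a rack of "fat" storage servers.
servers_per_rack = 7
draw_per_server_kw = 0.9                               # up to 0.9 kW each
total_draw_kw = servers_per_rack * draw_per_server_kw  # 6.3 kW

reported_load_pct = 115                                # what the electrician measured
# Back-calculate the rack's rated power from the reported overload
# (an assumption -- the rating itself is never stated in the text).
rated_kw = total_draw_kw / (reported_load_pct / 100)   # ~5.5 kW

print(f"draw ~{total_draw_kw:.1f} kW against a ~{rated_kw:.1f} kW rating "
      f"-> {total_draw_kw / rated_kw:.0%} load")
```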

Moscow, Russia


We store Russian users' data strictly within the Russian Federation. If you are expecting wild stories about Russian support, don't. Although, of course, a couple of surprises from DataPro have made it into the collection. In the US the usual practice for maintenance work is that a letter arrives a week in advance, along the lines of "We have planned such-and-such work, it is necessary and useful for the whole data center and for you, it will take place within this window. You will not be affected." In Russia the notifications look a little different, but you probably know how it goes. Still, I have to say, the service has never been interrupted.

Before that we were hosted in Tver. When the hardware was moved from Tver to Moscow, we completed a hot migration of 15 servers in two weeks. We wanted to do it in one go, but decided not to risk the downtime. We moved 2 servers a day: I got up at 6 am and gave the go-ahead for packing, the servers were delivered, and around 11 pm I would get word that they had arrived, been installed and checked. We brought up a virtual network between Moscow and Tver over a good link, so the servers thought they were still on the same physical network with the same addressing as before. And so it went, two machines at a time: recovery, rebalance, check, two more servers.

Ashburn, United States


The Boston hardware is only just heading to this site, so there is nothing to tell yet.

We are moving everything from Boston to Ashburn using the scheme worked out on the Tver move: again we bring up a 10G link and keep the machines in the same network with the same addressing. The idea is that you bring the hardware up at the new site and wait for the rebalance: if, say, half the disks die in transit, you have to sit out their lengthy recovery rather than haul over the next batch.
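In pseudocode terms, the whole move boils down to a batch loop with a health gate between iterations. A rough sketch with stand-in helper functions (these are assumptions for illustration, not our real tooling):

```python
import time

BATCH_SIZE = 2            # servers moved per iteration, as on the Tver migration
CHECK_INTERVAL_S = 600    # how often to poll cluster health (arbitrary)

# The three helpers below are placeholders for real tooling
# (shipping, remote hands, storage-cluster health checks).
def ship_and_rack(batch):
    print(f"shipping and racking: {batch}")

def cluster_healthy():
    return True           # in reality: recovery and rebalance both finished

def verify(batch):
    print(f"smoke-testing: {batch}")

def migrate_in_batches(servers, batch_size=BATCH_SIZE):
    """Move servers in small batches, waiting for the storage cluster to
    recover and rebalance before hauling over the next batch."""
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        ship_and_rack(batch)
        while not cluster_healthy():
            time.sleep(CHECK_INTERVAL_S)
        verify(batch)

migrate_in_batches([f"srv{n:02d}" for n in range(1, 16)])  # 15 servers, as in the Tver move
```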

More cases


Europe sometimes has its quirks with customs. When we urgently needed to replace one burned-out server, we shipped a machine from Boston: two days instead of the three weeks from a supplier's warehouse in Europe. But we had not reckoned with the customs officers. The VAT number was missing, and because of the language barrier (French) they could not sort it out with the accounting department in the United States. Everything was sent back. Since then we order for Europe from within Europe.

In Boston we ran into the problem that a 36-disk server, which is about 200 terabytes, fills up in a week and a half, while ordering a new one also takes two weeks or more. So on the wave of a more than successful launch of one of our products, we simply could not order servers fast enough. We then adopted new data-packing principles and partial distribution across other data centers, and changed a lot in the architecture. For my part, I had to rework the purchasing and supplier procedures: since then we can take large lots faster under framework agreements and pay later.
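For a sense of scale, that fill rate works out to a hefty sustained ingest stream. This is my own back-of-the-envelope calculation from the figures above:

```python
# Roughly how fast data had to arrive to fill a ~200 TB, 36-disk server
# in about a week and a half.
capacity_tb = 200
fill_days = 10.5                       # "a week and a half"

tb_per_day = capacity_tb / fill_days   # ~19 TB/day
# Convert to a sustained line rate (decimal units: 1 TB = 10**12 bytes).
gbit_per_s = tb_per_day * 10**12 * 8 / (24 * 3600) / 10**9

print(f"~{tb_per_day:.0f} TB/day, i.e. a sustained ~{gbit_per_s:.1f} Gbit/s of ingest")
```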

Once we took in a server for "slow" data for testing in an incomplete configuration: only one processor instead of two, plus a shelf of disks and a 10G network card. We powered it on, and the card did not work: the server simply did not see it. We read the manual, and there in the fine print: the PCIe slots are split between the processors, odd slots on the first CPU, even slots on the second. All slots work only when both processors are installed, but power is supplied to every slot regardless. So the card blinked and glowed, yet the server could not see it. We moved it to another slot, although it does seem the manufacturer should have caught this in its own testing.
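On Linux you can at least check whether a PCIe device is visible at all, and which CPU socket it hangs off, before declaring the card dead. A minimal sketch (the bus address is a made-up placeholder):

```python
from pathlib import Path

# Check which NUMA node (roughly, which CPU socket) a PCIe device is attached to.
# The bus address below is a placeholder -- take the real one from `lspci`.
dev = Path("/sys/bus/pci/devices/0000:81:00.0")

if not dev.exists():
    print("device not on the PCI bus at all, e.g. the slot is wired to a missing CPU")
else:
    node = (dev / "numa_node").read_text().strip()
    # -1 means the kernel could not determine the node; 0, 1, ... map to sockets.
    print(f"device present, attached to NUMA node {node}")
```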

At DataPro in Tver, trees once fell on the fiber: the public network went down, and they explained over the phone what had happened. One line had been taken out by a couple of poplars, and the backup path was not working: the network equipment failed to switch over to the backup channel.

In Germany in 2015, Level 3 was setting up its own routing and slightly touched our equipment a couple of network levels down. Connectivity dropped for half an hour. At that time the European data center was the primary one, so this meant a service outage in parts of Germany. We have changed the architecture since then, but my colleagues can tell that story better.

There was one case in the US, probably the funniest I have seen in all this time. A server needed repair, so the manufacturer's engineers were called in to replace the motherboard and power supply. 72 disks, gross weight 80 kilograms. These fine fellows started pulling it out of the rack with all its guts still inside. They got it only halfway out before it started to tip and fall. They tried to hold it up and keep pulling, and bent the sled. They tried to push it back in, but the already bent rails would not let them. So they just left it like that and said they would come back in a week with a replacement.

So, as you can see, in 5 years there was only one even conditionally dangerous situation, when the question of restoring from a backup at another site came up. Even that worked out. Everything else was resolved locally and reasonably smoothly. The minor rough edges are the usual human factor, and it would probably be strange if we had fewer such stories.

Source: https://habr.com/ru/post/278199/

