
20 years of technical support: how the world has changed around us

Image: us in 1995.

A colleague of mine once had to rush power supplies to Murmansk: a plant there had gone down on December 30, and no delivery service was working. Sending them with a train conductor or a pilot was not an option either. On top of that, nobody had stocked this hardware for five years. By some miracle we found the units through a friend and loaded them into a car, and off he went. On the way the car broke down, and he repaired it at -25 in a strong wind. He arrived, they inserted the units into the rack, and the units burned out on the spot.

So grab some tea and come inside for some warm, good stories. We support hundreds of customers, including the largest companies in the country. All sorts of things have happened. I'll take my time telling you how we started and what we do now.

Introduction


I'll say right away that I'll be talking strictly about servicing "heavy", expensive equipment: RISC servers from IBM, HP and Sun (now Oracle); large storage systems from EMC, IBM and Hitachi; core-level switches from Cisco, Nortel, HPN, and so on. Systems like these determine, for example, whether bank call centers work and whether we can withdraw money from an ATM, check in for a flight, or reach the right person at the right time.

Cockroaches in a home PC are something everyone understands. The fact that they live in servers too is a bit more interesting, and the responsibility of a system engineer for such equipment is much higher. For obvious reasons I have no right to name the customers, but those in the know will guess. In half of the cases they will guess wrong, because the same stories tend to repeat. Also, where the most critical bugs were involved, I changed some minor details at the request of the security people. But let's go.

Image: this is what we do.

My most memorable story is this. We once set up a cluster at a bank, with a database on it. Each part worked fine on its own, but as soon as you ran a real query, it failed. Pings, meanwhile, went through normally. We tried from another location: all was well. We stared at it for several days, then started dissecting packets. In the end we found a network printer on the users' subnet that had a small bug. Like a time bomb, it had waited its whole long working life for its moment of glory. And it got it: it was catching specifically our packets. We unplugged it, and everything worked as it should.

A year later, the Killer Packets story repeated itself, this time at another customer. They had been beating their heads against the wall for a couple of days before calling us in to figure out what was wrong. We looked: according to the documentation all the network nodes were identical, but in reality one of the boxes was a different color. We pulled it out, and traffic flowed immediately. It turned out to be a non-original module installed during a repair, with different firmware. We simply replaced it with a proper one.

How did things develop?


At first it was like this. In the 90s, to build a computer you had to bring together on one motherboard parts that each worked fine on their own, and any kind of surprise could await you. Compatibility had to be checked, and hardware tolerances were such that two identical parts could behave in completely different ways.

Image: testing computers in 2002-2003.

At first, tech support was work on the edge of microelectronics. Then hardware compatibility problems came to the fore, and they were solved by replacing entire modules. Now the main work is administration, because everything has become very complex and most incidents are software-related.

Or say an original piece of hardware arrives with its original power supply. But for some reason the power supply won't fit into its bay no matter how hard you push. This can't be happening, but you can't argue with reality. The system has to go live in the morning, and you could file down one corner of the PSU slightly so that it slides in as it should. Do you do it?

Yes? What if you have misunderstood something and are about to void a warranty worth hundreds of thousands of dollars? Or, on the other hand, what if the vendor made the mistake? You file off that corner now, and tomorrow the supplier's service engineers will have your hide: "Who allowed you to do this?"

Or a server arrives and one driver is missing. There is a distribution floating around the RuNet, but it is unofficial. Do you take it, or wait a week? If you take it, you will be using non-certified drivers, which can become a serious problem if anything goes wrong. If you don't, you miss the deadline. I had such a case, by the way: the hardware manufacturer released a new driver with a patch for one exotic situation, but the supplier had not included it in the package because their test program takes a couple of months. I wrote an official letter asking whether we could install it to get things working. Fortunately, they met us halfway and confirmed that we could, though until the tests were finished they could not promise there would be no sudden reboots.

Sometimes people even scold us: "Even my neighbor could fix this, you just replace the fuse." We know that too, but the case cannot be opened without a certified engineer because of the warranty. From the user's point of view, though, we are evil people who have decided that the letter of the instructions is the most important thing in the world.

By the way, until recently we had one specialist who still soldered. In most cases whole units have to be replaced, but he soldered. Sometimes there were cases when, for example, a part could not be obtained from the vendor. Or a factory has a device that is already 15 years old and badly needed there. It breaks down, and spare parts cannot be found at all, paid or free. They bring it to him, he looks it over, checks the circuits with a tester. Well, why not solder it? So he did. That kind of romance is gone now, and vendors are very strict about anything that is not a module replacement.

Today, hardware repair looks like this: an LED lights up on the front of the faulty module, an SMS arrives from monitoring, you take a module from the warehouse (it is usually there, since that is our job), pull out the broken one and put in the new one. Repair done.

But 10 years ago there was still a memorable story. An engineer calls me at 2 a.m. and says:
- Oh, the customer called, something is wrong with their local network.
- What exactly?
- I don't know (he was a student, our intern, and did not yet know or understand everything).
- Do we have a service contract with them?
- No, there is no contract.

I called a colleague and asked what to do: help or not? He said:
- What do you mean! By morning everyone there will be fired, it's their reporting period right now. If we can help, let's help.
I called the engineer back and explained what to do and how. Basically it was "phone sex": we worked out the problem, found the necessary modules, checked that the memory and software were there, and put it all together. The customer understood everything and was very glad that at least someone was helping, so he sent a car. "Meet a young man in glasses, in an acid-colored sweater with streaks, combat boots, Beatles-style trousers, hair standing on end, next to our security guard... can you see him?" They found him and brought him to the site. He replaced everything, and the system started up.

In the morning my colleague comes in and says: "Look, this morning they handed me signed contracts. I hadn't been able to get them signed for more than a month, and this morning I came in and there they were." It turned out that we were not the first they had called, but the only ones who came at all. Back then it was a feat; now it is written into the SLA that we rush out and do something.

Even now we have to work at the micro level, only with a decompiler instead of a soldering iron. Here is an example. At the end of every month, one call center would go down. During an outage, up to 3-4 thousand calls could be missed. We started investigating and found a couple of bugs, but whether they were the cause was unclear. The vendor checked all the hardware: clean as well. But no, the next month the call center went down again. Suspicion fell on the virtualization server; it was swapped for a physical one, and that turned out not to be the cause either. And if it went down again the following month, the losses would be huge. We had to decompile the adjacent systems. It turned out that one place had an incorrect timeout. We changed it, and everything worked. Decompiling, by the way, is not always possible either: everything requires permission.

Sometimes you have to reverse-engineer hardware. It happens that you need to fit a new component into an old system, but there is no control software for it: you have to work out the protocol and add the functionality yourself. Or, the other way around, old equipment needs a new part. There was a case when at one factory the punch-card programming unit was stolen from a laser machine (by mistake; the thieves thought they were taking the machine itself). That one is still simple: you need to figure out what to feed in instead of a punch card, and it works again. Reverse engineering in all its glory.

Or new hardware comes out, speeds change, and timings drift. What used to be practically constant may turn out differently on the new equipment. And where the problem lies is unclear: maybe the hardware is immature, maybe the software is buggy, or maybe somewhere at the boundary between them an error accumulates in an almost random situation.

What else has changed?


Well, there is probably more prevention now. A breakdown usually means not only direct losses but also reputational risks. Imagine, for example, that one bank's ATMs go down for a day because of an accident in the data center. The losses are huge. Accordingly, all critical systems run self-diagnostics and, where possible, are designed to be replaced before they fail. Like flash drives in server storage: they know their moment of failure in advance.

Also, we used to travel all over the country, whereas now a lot is administered remotely. Still, we have flown urgently to Yuzhno-Sakhalinsk, and in Yakutia we rode reindeer sleighs to reach the hardware.

An old story: once we came to look at a customer's hardware that ran critical mail, and mice had settled in the power circuit. It is warm for them there, and tasty: they devour the insulation. Cockroaches are also common. We have seen servers standing in basins (so that the floor would not get flooded). By the way, in winter small animals do run into server rooms at plants and large warehouses. They gnaw through something small, inexplicable glitches begin, and you can only diagnose it on site. Or, even better, the small animals act as conductors. Not very good ones, but conductors. Hence the hard-to-reproduce problems. Cockroaches sometimes play the part of spontaneous fuses.

There are also many new services. Previously companies simply "traded bodies": they sent admins out on site. Now we have round-the-clock consulting, a hotline, dedicated service engineers (guys who, like firefighters, are always waiting for an urgent call-out), spare parts warehouses for specific sites, "detectives" (who investigate incidents), planned hardware replacements, many responsibilities around software upgrades, complex database administration, detailed reporting, financial planning, documentation, monitoring, audits, inventories, test benches for new hardware, hardware rental for hot swaps, and so on. We help with relocations, and we have "raised" and trained dozens of technical teams at large companies.

What is important to the customer?


When we started: that at least someone did something. Now: quality and speed. Customers are demanding. Earlier, if a bearded guy came and spoke an incomprehensible language, he was treated with understanding, because there was no one else. Now there must always be a person who understands the process as a whole. For example, if a plant is down, he must be able to quickly outline remedial measures for the financial director or say what could speed up the repair. Afterwards he has to explain what happened, why, and whether it might happen again. And whose hands should be torn off so that it never does.
The situation is also changing for the vendors themselves, who are striving to meet ever higher service standards. For example, Cisco has arranged replacement of failed equipment within 4 hours, not only in Moscow but in other regions too. Vendor specialists also work 24x7.

That is why, by the way, good support often begins with writing an emergency plan. We have specially trained paranoids who find the most likely or most dangerous points of failure, and we build in redundancy for them. If something goes wrong, we switch over to the backup data center, for example, and rush to investigate. The plans, by the way, are multi-level. For example, a spare part arrives according to the standard plan, and it gets dropped while being unloaded. What now? On the spot they open the envelope marked "things are very bad", which says what to do if Plan A has fallen through.

Where there is no plan, our colleagues have a hard time. As an outside company we bear financial responsibility, but the psychological pressure on an ordinary in-house sysadmin is often very strong. "Nothing works, everything is broken, we'll fire you," and so on. Not only is there a problem and everyone is shouting, but at that very moment he has to make correct, competent decisions. Later, in an hour, he can get the brandy out from under the raised floor. But right now he has 30 seconds to flip a switch or press a button, and if it is the wrong call, that is a couple of million dollars.

Or here is a situation very typical of our time. A new piece of hardware arrives. Fresh, in hefty packaging from a Chinese plant. You run the tests, then carefully put it into production at night. It works fine for 20 minutes and then starts failing irrecoverably. You swear, stop the service for 10 minutes, but manage to pull it out of the system without losses. What happened? God knows. The manufacturer tested the finished solution under load for a couple of months, tried out various scenarios, and gave it to real companies; about the only thing they didn't do was let children at the machine. The test program was very extensive. And then, bang, everything stops on your site. The Americans find a bug. The Chinese throw together a patch, and you roll it out. And immediately problems pour in from all sides, in places where everything used to work. Rolling back changes nothing. You are, as they say, well and truly stuck.

Why? Because software plus hardware is a very complicated thing. Take the new Airbus 370, for example. It is a hefty plane with a bunch of subsystems. Everything there is duplicated and reliable; the critical units work almost purely by one piece of metal acting on another. It is checked before every flight. Can you picture it? It is a very complex structure with both software and hardware parts that have been developed over decades. Bugs there cost hundreds of lives, and every part of the aircraft really is well tested. Yet bugs happen. And OS-level software packages can be far more complex than such an object.

Now look at any new piece of hardware or hardware-software system being deployed somewhere. Either way, all of it has to be fine-tuned, maintained, and checked. As a result special people appear, shamans of a sort, who know where to hit. That's us.

It is important to diagnose very quickly. Downtime is often counted in seconds, so experience matters a great deal here. There is simply no time to run tests for hours. We need to know thousands of situations from hundreds of customers so that we can arrive on site and immediately look in the right place. This, by the way, is another reason we are considered shamans. Like Feynman: we come, we poke, everyone is amazed. Only he pointed at the schematic at random, and we know. At some especially critical sites our SLA gives us, for example, 15 minutes to solve a problem from the moment a specialist arrives. Or 30 minutes from the moment the incident is registered. Where does that matter? Take your pick: a cellular operator outage, a bank's problems, and so on. Of course everything has redundancy several times over, but anything can happen.
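To make the SLA arithmetic concrete, here is a minimal sketch of how such deadlines could be tracked. Only the two figures (15 minutes from arrival, 30 minutes from registration) come from the story above; the names `SLA_WINDOWS`, `deadline` and `seconds_left` and the overall structure are assumptions made purely for illustration, not a description of our actual tooling.

```python
from datetime import datetime, timedelta

# Illustrative SLA windows. The two figures come from the text;
# the keys and the structure are hypothetical.
SLA_WINDOWS = {
    "from_arrival": timedelta(minutes=15),       # clock starts when the specialist arrives
    "from_registration": timedelta(minutes=30),  # clock starts when the incident is registered
}


def deadline(kind: str, started_at: datetime) -> datetime:
    """Return the moment by which the problem must be solved."""
    return started_at + SLA_WINDOWS[kind]


def seconds_left(kind: str, started_at: datetime, now: datetime) -> float:
    """Remaining time in seconds; a negative value means the SLA is breached."""
    return (deadline(kind, started_at) - now).total_seconds()


if __name__ == "__main__":
    registered = datetime(2013, 11, 1, 2, 0)
    now = datetime(2013, 11, 1, 2, 20)
    print("Deadline:", deadline("from_registration", registered))
    print("Seconds left:", seconds_left("from_registration", registered, now))
```

With windows this short, the remaining time is the number worth watching from the moment the clock starts, which is why there is no room for hours of tests.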

That's all for now. I think you have plenty of stories from your own practice too. Please share the most interesting ones in the comments.

Source: https://habr.com/ru/post/200830/

