
A sysadmin with an automation mania and a big overhaul of processes



A couple of weeks ago, an employee from the wholesale office came to the IT department and asked us to add a small feature to her workstation. The request, as expected, went into the queue.

She was a little offended and said:
- It's your off-season now, and you still have no time. I'd like to see what happens to you when New Year comes!

The implication was that, with this way of doing things, the IT department would be finished once the New Year rush hit. Nobody explained to the nice girl that our IT season is the summer. Because later, when the retail season arrives, it will be far too late to scramble.

And this summer was a hot one. Our system administrator Valera, obsessed with automation, added a special flavor to the processes. So much so that he even tracked currency rates and the weather in Zabbix. So let me tell you how we spent the summer. And did a general cleanup. There is nothing especially remarkable here, but as always I have a couple of useful rakes we stepped on, relevant for medium-sized businesses.

Cleaning time


In short, we have about a hundred sites in total where channels and equipment need to be monitored. These are stores across the country (in franchises the owners look after the hardware themselves, and we provide the software), warehouses, production, and offices. The main problem was that we grew through successive "molts", so the hardware fleet is not homogeneous, the processes all emerged "as is" in the course of evolution, and instead of a systematic approach we were solving whatever problem was current. I wrote about the whole structure in more detail here.

At some point Karina, the head of our internal IT service, decided that we had had enough of this extravaganza and that it was time for a spring cleaning mixed with refactoring. In life, in 1C, in the hardware, and in the approach in general. As soon as the decision was made, she simply wrote down all the current minor tasks and postponed them for two weeks. Because there had never been a time when our IT people were not swamped with all sorts of small stuff that ended up eating a lot of time. Well, and some of the tasks had been sitting unfinished since 2012: the ones we always set to low priority for "sometime later", hoping that "later" would never come.

The logic is very simple: when 60% of your work is routine, you leave no time for strategic development. Karina took this measure expecting a flurry of user discontent. And the users - surprise! - hardly noticed. First, when IT is at its peak, the office is in the middle of the low season, a relaxed summer. Second, it turned out that many urgent but unimportant things can wait. In fact, if you think about it, there are no urgent unimportant things at all: a thing is either urgent or unimportant.

Outsourcers


We use the services of external IT companies often and a lot, so as not to travel around the chain's stores ourselves. Until the beginning of this year, splitting responsibility between external companies and the internal IT department justified itself perfectly.

But it so happened that we got our own admin. The previous model assumed that any glitch, for example at the Tyoply Stan store, could be handled by the outsourcer's field administrator. We also had laptops repaired by them. Once they even took apart a ransomware virus for us and recovered the data. In short, there was no reason not to trust them, as they say.

Of course, it was not without slip-ups: people in the field sometimes behaved in extremely unexpected ways. One example that comes to mind: a guy had a task to install Windows on a store server. The server had 8 GB of RAM, and under the previous ticket it had been upgraded to 16. The guy arrived, started installing, and it would not install. He reported that the server was unusable and left. His diagnosis: problems with the HDD. OK, our team took a look: the HDD was fine, the RAM was glitching. We answered politely: guys, the memory there does not seem to be in good shape. Maybe install two identical DIMM modules? The field tech dug in: I am a pro, he said, I know everything and can do everything, so go take a walk in the forest, dear friends. And he installed Windows. Only the 32-bit version, so it saw just the first 4 GB. And closed the ticket. The team solved the problem like this: they called and said that our marketing had dug its heels in and wanted identical memory sticks in every server. This is our corporate standard now. Can you do it? (I should say here that "marketing said so" is a universal excuse for all occasions.) Well, they replaced the memory, then installed a normal 64-bit OS, and everything worked.

Or a second example: an employee in a regional store could not open a document. The Office trial had expired. The junior tech connected remotely and installed a crack for her. Fortunately, we caught it in time. As the outsourcer's representative put it, "we have no right to do this, but excesses in the field do happen".

In short, we got our own admin. An old acquaintance, by the way: it was he who somehow stayed late in the office to help us six months ago and ate three liters of creamy mushroom sauce from the refrigerator overnight. Slava had just been testing a recipe. The sauce did not kill Valera but, apparently, made him stronger.

Down with 20% of the routine


Most of our resources were taken up by routine: tickets that repeated from time to time. Users complain about everything, to the point of asking us to replace the paper in the printer because they are afraid to approach it themselves (a real case in the office). The more often they reach out to us, the more we think about how to automate.

The most annoying items were emergency trips to stores and tickets about connectivity failures. As for the Internet, where we can, we have two uplinks into a store. As a rule, it is the shopping center's monopoly provider plus someone's radio relay link, or a "street" FttB cable from the building's common switch plus a cellular modem. Where that is impossible, the local server saves us. The most frequent problem is not a channel going down but slowing down to the point of real lags in the accounting system and at the cash registers.

Valera put all the traffic under monitoring and began to track ping (or rather, one-way delay) and the utilization level (where that did not work out, just the raw traffic). When critical parameters are reached, such as 60 Kb/s and a ping three times the norm, three things happen:

  1. The store is automatically switched to the backup channel.
  2. The IT team at the head office is notified on the urgent-events screen and by mail.
  3. An automatic letter goes to the provider, written in human language, like: "Dear colleagues, at 18:03 Moscow time in store such-and-such, where your line is connected under contract such-and-such, an abnormally long ping and a very low speed of 53 Kb/s were recorded. Please clarify the situation, take action and recalculate the monthly bill due to this problem."

Of course, such a letter is sent no more than once every few hours. A simpler option is also possible: an export from the monitoring system. It is also important that the grounds for a recalculation are often not a short outage but a shutdown lasting a day or more. Some providers (the monopolists) do not recalculate at all.
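The trigger and the rate-limited letter described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Zabbix or router configuration: the `Sample` and `ProviderNotifier` names, the three-hour interval, and the choice to require both conditions at once (low speed and tripled delay) are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional

# Thresholds from the text; the letter interval is an assumed value
# standing in for "once every few hours".
MIN_SPEED_KBPS = 60.0
MAX_DELAY_FACTOR = 3.0
LETTER_INTERVAL = timedelta(hours=3)


@dataclass
class Sample:
    """One measurement for a store's main channel (hypothetical structure)."""
    store: str
    taken_at: datetime
    speed_kbps: float
    delay_ms: float
    normal_delay_ms: float


def is_degraded(s: Sample) -> bool:
    """True when speed is below the threshold AND delay is at least 3x the norm."""
    too_slow = s.speed_kbps < MIN_SPEED_KBPS
    too_laggy = s.delay_ms >= MAX_DELAY_FACTOR * s.normal_delay_ms
    return too_slow and too_laggy


class ProviderNotifier:
    """Builds the letter text, but at most once per interval for each store."""

    def __init__(self, interval: timedelta = LETTER_INTERVAL) -> None:
        self.interval = interval
        self._last_sent: Dict[str, datetime] = {}

    def letter_for(self, s: Sample, contract: str) -> Optional[str]:
        if not is_degraded(s):
            return None
        last = self._last_sent.get(s.store)
        if last is not None and s.taken_at - last < self.interval:
            return None  # a letter for this store went out recently
        self._last_sent[s.store] = s.taken_at
        return (
            f"Dear colleagues, at {s.taken_at:%H:%M} Moscow time in store {s.store}, "
            f"where your line is connected under contract {contract}, "
            f"an abnormally long ping and a very low speed of "
            f"{s.speed_kbps:.0f} Kb/s were recorded. Please clarify the situation, "
            "take action and recalculate the monthly bill due to this problem."
        )
```

Actually sending the mail and switching the channel are left out here; the sketch only decides whether a letter is due and what it says.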

It is the switching of the store that removes the problem on site. The end user does not even have time to realize that something went wrong. Which means they do not call the helpdesk and do not distract anyone. The letters remove a pile of routine work. All that remains is to check the provider's response and, if necessary, call them and close out the ticket.

By the way, one of the contractors claimed that there was no fault on his side. We collected the graphs and sent them over. He was very surprised to see that the same problem also showed up on the three upstream providers he connects through, and apologized profusely.

A dozen sites have MGTS as the main channel. This is an outfit that is surprisingly good for legal entities (it charges, in effect, home-user prices) and provides channel quality close to ideal. Compared with other providers, that is. But, of course, to submit a request to them, for example for a new connection, you have to write it on paper in triplicate at their office and go through a short version of bureaucratic hell. And our every call is very important to them. Although, of course, we now have a personal manager, plus we set up electronic document exchange through Diadoc. Now everything is simple, except that to disconnect a site one application is enough rather than three, but you still have to bring it in person. Over an erroneous payment (we paid twice for one site and nothing for another) we were also sent to the office.

They also have a radio channel: a little more expensive than the regular one, but it helps a lot in places where there is only one provider.

UPD: after the post we were contacted by their director of corporate affairs and carefully asked what could be improved. Very cool.

Valera then rushed off to automate everything. For example, our website contractor could not detect when one of their cache servers went down. The admin chuckled, did not argue, and simply put monitoring on it, with an automatic letter as well. While he was coding, the food in his fridge nearly spoiled, so he rummaged in his bag of sensors, put a thermometer in there, hooked it up to the general office telemetry, and set a valid range and alerts. We solved a lot of problems that would have come back sooner or later. These are, for example, slowdowns caused by file system fragmentation and clock discrepancies between the servers, terminals, and cash register equipment (the norm is now 15 seconds, the alarm goes off at 3 minutes, and the law allows up to 5 minutes before a fine; before that we had problems with receipt returns because of desynchronization). Everything, once solved, goes into Zabbix and is displayed with the alerts right in the office.
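The clock thresholds mentioned above (15 seconds as the norm, alarm at 3 minutes, 5 minutes as the legal limit) map naturally onto a small classification function. The sketch below is in Python purely for illustration; the function and status names are hypothetical, and the intermediate "warning" band between 15 seconds and 3 minutes is an assumption, since the text only names the norm, the alarm, and the legal limit.

```python
from enum import Enum

# Values taken from the text.
NORM_SECONDS = 15             # acceptable drift
ALARM_SECONDS = 3 * 60        # the alarm is raised from here
LEGAL_LIMIT_SECONDS = 5 * 60  # the law allows up to this before a fine


class DriftStatus(Enum):
    OK = "ok"
    WARNING = "warning"  # assumed band: above the norm, below the alarm
    ALARM = "alarm"
    OVER_LEGAL_LIMIT = "over_legal_limit"


def classify_drift(server_ts: float, register_ts: float) -> DriftStatus:
    """Classify the clock difference (in seconds) between server and cash register."""
    drift = abs(server_ts - register_ts)
    if drift <= NORM_SECONDS:
        return DriftStatus.OK
    if drift < ALARM_SECONDS:
        return DriftStatus.WARNING
    if drift <= LEGAL_LIMIT_SECONDS:
        return DriftStatus.ALARM
    return DriftStatus.OVER_LEGAL_LIMIT
```

The point of alarming at 3 minutes rather than at 5 is the margin: there is still time to resynchronize before the drift becomes a legal problem.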

These alerts have a story of their own. For each type of alarm (server down, cash register down, channel degraded) the guys recorded the cry of a particular animal. One more sound was needed to signal that everything was in order; they wanted a duck for that, but could not find one. On a Sunday, an alarmed cleaning lady called: while she was washing the floor in the IT department, the monitor on the wall said "quack!" several times in Tim's familiar voice. This put her very much on guard. So much so that she raised the alarm. Then, of course, we explained what kind of zoo it was.

Stores: sorting out hardware and the network


The stores were more interesting. First, a lot of things had to be calculated. For example, we had some old UPSs that were not monitored. It turned out that replacing their batteries once a year is almost 3 times cheaper than upgrading. By the way, it also turned out that at 4 sites our expensive outsourcers had plugged the machines into the Surge Protection sockets rather than the Battery Backup ones. No, of course we will gradually replace the UPSs with "smart" ones, but only as they fail.

Second, we bought new routers. Naturally, these turned out to be MikroTik devices, the very ones that are already overkill for home use but cannot cope in big business. For our segment they are just right: quite flexible in terms of settings and at the same time quite cheap. The channel switchover scripts run on them, in fact. Oh, and before the network upgrade, one of the store seniors repaired a router himself (!), though not very successfully. A broadcast storm covered the segment.

Third, backups. Incremental backups of the essentials are made at intervals from 15 minutes to an hour. How many trips this has saved us is beyond counting. On top of that, a sea of routine used to go into "it was definitely in the document" cases: if a document has no versioning, you can restore a backup and pull the text out. These backup databases also help a lot in testing.
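For the "it was definitely in the document" case, the first step is picking which backup to restore: the latest one taken no later than the moment the user remembers. A minimal sketch in Python, assuming the backup timestamps are already available as a sorted list (the `latest_backup_before` name is hypothetical and not tied to any particular backup tool):

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Sequence


def latest_backup_before(
    backup_times: Sequence[datetime], moment: datetime
) -> Optional[datetime]:
    """Return the newest backup taken at or before `moment`.

    `backup_times` must be sorted in ascending order.
    Returns None when every backup is newer than `moment`.
    """
    index = bisect_right(backup_times, moment)
    if index == 0:
        return None
    return backup_times[index - 1]
```

With backups every 15 minutes to an hour, the restored copy is at most that much older than the moment the user is asking about.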

Then we put the documentation in order. We deployed DokuWiki (because it has good role separation).



We introduced a new process for rolling out workstation releases. Previously we rolled out to everyone at once; now it goes smoothly and gently, but inevitably. Just like a laxative. For example, when the operator workstation was changed, first the head of the call center worked on the new version for two weeks, and once he got used to it, we rolled it out to everyone. With a detailed letter on what had changed, why it was done, and why it would be better. And the operators took their questions not to the helpdesk but straight to him: he sits next to them and knows everything. Of course, we caught some tickets right away anyway (the best one came from a user who flatly demanded the return of a rarely used function, and we knew for certain that it had not worked before).

We add features gradually, one at a time. Once a week a newsletter about the main changes goes out to the whole country, with instructions and explanations of the main points in plain human language. Again, with live contacts in reply-to.

Previously the problem was that even a small task was hard to finish. Each time we had to explain not only what had to be done, but also why and what for, so that a junior tech or an external administrator would understand the logic precisely. For example, the ticket about the test server rose like a zombie 4 times, and each time the specifics had to be explained to a new person. Also, developers in retail need to talk to the admin often. If only because a failed overnight update is a job for two. Sometimes you need to restart a server quickly, sometimes to add disk space. All in all, it is good when everyone is close by.

For the first month Valera was literally stunned by the fact that here people get talked to. And the scariest part is our policy that many things have to be explained to users in a human way. Not just "we will change this thingy tomorrow", but what exactly we will change, why, how it will make things better for them, and what risks it removes. That way they become friends and understand that they are being taken care of. In general, almost everyone on our IT team knows exactly what happens on the front line and how. Tim (the second admin) worked as a store administrator (of a shop, not a system one). The second Sasha was a salesperson. Max (helpdesk) was an operator. Karina herself went through almost all the departments in her time.

But back to the field. After Valera worked out what was what, there were about 4 times fewer trips. There are three reasons:


As a result, over the last month the second admin, covering 25 stores in Moscow, did not even use up a 60-ride pass on ticket call-outs.

There is another important example about "not being afraid to talk to your own people". We have operators, and many of them are the sweetest girls. It seems they are a little afraid of IT people. And believe infinitely in their superpowers. Specifically, in their omniscience. One of them, for example, suffered with a bug for two months, put up with it, and worked around it. Then, when it got really bad, she asked: so when will it be fixed? It turned out that she was the only one who used that hotkey, so the other operators had not filed a ticket. And she just waited and waited. Perhaps somewhere in your company there is also a user like this, suffering and thinking that someday help will come.

The best story so far is the St. Petersburg server. After the network upgrade, a server had to be picked up from a utility room of a store in St. Petersburg and delivered to Moscow. We decided not to risk a shipping company: the hardware is expensive. We asked a good friend who was going there for a business conference. He called, said that we were a bit nuts, but that he had taken the server. We met him at the station. He emerged from the train with a heavy and unbelievably old tower PC (slightly more powerful than a Raspberry Pi) and handed it to the stunned admin. It turned out that the same improvised server room in St. Petersburg doubled as a storage for unneeded parts. And a couple of years earlier a desktop made by the same company as the server had ended up there. The salespeople decided that a small shiny box simply could not be a server. No way. They know that any server is always more powerful than a regular computer... and pointed our friend to the big decommissioned tower. With full honors, the veteran was taken away, carefully packed, and delivered.

And finally, a scattering of non-standard tickets. For example, when a store was moving, miracle installers cut the sensors off at the root, and the facilities manager miraculously filed tickets and talked the helpdesk into dealing with it.



The prepress guy found a rare bug in InDesign and sent it to Adobe. They swore: it had been closed back in 2008, they said, and now the rascal had surfaced again.

CEO Andrei disconnected the external monitor from his laptop while printing something. It turned out that a one-page document prints on two pages with the external screen connected and on one (as expected) with it disconnected. Why, we do not know yet.

At one of the buildings, the Internet cable coming down from the roof had been cracking for a year, and water was pouring into it. At some point the router literally floated away.

A user complained that the machine was slow: for 5 years, Windows on it had been steadily cluttered up by employees from various departments. We wiped everything, installed the OS from scratch, and took out half the RAM. The user is happy now and says it works much faster.

In general, the echo of the "cleanup" has already died down, and we are almost ready for the New Year. Now we are doing something important for the business: building out financial planning together with IT.



Valera is not stopping, and we are watching him with great interest. In the photo: a Venus flytrap, a carnivorous plant of the sundew family, his idea for automating the capture of insects that fly into the office when it is aired in the summer. It proved successful and was put into operation.

Source: https://habr.com/ru/post/312258/

