How we disaccustomed outsourcing to throw the ball to the internal IT department

We use both outsourcing and our internal IT resources. On one physical server there can be a service for which external employees are responsible, and a service for which we are responsible. And from the season, these services can migrate inside the company or go outside.

The story began with the fact that we needed a centralized system with a farm of terminals. At that time, we had about 10 stores, and each of them had its own database, the data from which were used to compile a cumulative report at the end of the period or upon request.
')

Centralization

The described scheme turned out to be very inconvenient - it was necessary to know all the time where and what is in warehouses, plus display the availability of goods in real time on the site. Accordingly, we centralized the base and made the stores work through the terminal access to it. Prior to that, we did not want to do this because of the requirements for the communication channel.

Naturally, there was a problem with turning off the Internet: if the channel weakened or fell (which happens almost always - a matter of time), the store was left without a cash register. That is, it was still possible to punch checks (at that time, the division of IT and physical cash register was used), but accounting went to hell until correction. It was impossible to admit this, so we decided to go in a simple way - by reserving channels. They brought in each store two physical cables from different providers, purchased “USB-whistles”, but, in general, the situation from this, if it became easier, was not much. Anyway, someone fell sooner or later.

The next step was a distributed database. The master database is stored in the data center, and each store has its own version of the database (more precisely, a fragment relating to this particular store). In case of failure of the Internet channel, the store continues to work in its version of the database, and when a connection appears, an asynchronous update of the main base occurs. In the presence of a stable channel updates go in real time. If the channel falls - the data is “achieved” during restoration, just the store leaves the network for a while. By the way, the same availability of goods is already considered at the central instances according to expectation - for example, at the speed of sales of 5 boxes of the game per day and there are 8 boxes in the store's warehouse without communication for dinner, there will not be “Available”, but “end, call”. The goal is not to make it so that the person arrives, but there is no game.

Iron

Iron has been changing all this time. The first server was in a fairly cheap data center, but from there we very quickly left because it was impossible to work. Then we followed the “everything home” way and tried to deploy a server in the office, from where, in fact, replication of the store databases was done. The problem is the same - when the Internet drops in the office, all the stores are left without IT. Despite the two channels, this happened at least once, when they decided to dig a hole near the office to change pipes.

Accordingly, the next step was the transition to a fault-tolerant data center. Chose a long time. Now we have 10 physical servers on which virtualka are deployed. In case of failures, services can migrate between machines, there is a fine load balancing. In the season, the required power grows dramatically, we buy more cars, then in a month we rent a rental - very convenient in terms of savings. There is no storage system for the base, it is spinning on one of the servers where we finished off the shelf with SSD disks. When we started to run into long miscalculations for the period, we tried a cluster with midrange storage, but it turned out to be quite expensive for our capacities so far, they refused.

As a result, there was independence from the office infrastructure in terms of critical processes: if something happens, our designer’s work will disappear to the last backup, a couple of XLS tablets - that's all. All commercial processes will be restored almost as soon as a new iron appears. By the way, about iron - for the New Year we duplicate all the nodes not only in the IT structure from the server side, but also physically. If the terminal fails in the store, the warehouse has exactly the same pre-configured system unit in the warehouse, which you can simply connect in place of the failed and you can’t figure out what's going on - the Windows, the capacitor on the motherboard or the failed settings.

Outsourcing

We have our own IT department and an external outsourcing company that complements it. The most important thing in the scheme of work - managed to realize a situation where there is no transfer of the ball between us and external employees.

For a start, we clearly divided the services. For example, if there is a physical server, then on it:

Virtualization system (administration and monitoring on outsourcing)
Win-machines (administration and monitoring on outsourcing)
On one of the machines there is a DBMS (administration and monitoring on our IT specialists)
Base 1C (administration and monitoring inside us)
Backup databases (outsourcing).

And so on. Then we have very clearly prescribed every price for each node and SLA to it. For example, if a computer goes down in a store:

At the peak of the season - there must be a pre-configured system unit, which you just need to reconnect.
The rest of the time - SLA 4 hours to restore the operation of this node in the database.

Services are divided into levels:

Services of workplace level (for example, support of an automated workplace of the storekeeper).
Site level (network printing service in the company's office).
There are company level services (terminal server in DC).

Each service has its own indicators: reaction time, readiness time, cost, “fine”. It is a lot of services, and indicators accordingly too. The system has been refined over the years, and a simple listing of services takes about 30 pages. The guys argue that such an approach allowed them to get rid of the chaos in their heads and to offer exactly what was needed to other companies.

An example of how an active SLA works (note that when it comes to a legal entity, it became possible to set deductions for jambs):

SLA for an extremely important service corresponds to the following indicators: operating mode 12/7, reaction time 6 minutes, ready time 1 hour.
1 hour of downtime for super-important service - minus 100% of the monthly cost of service.
2 hours of downtime for super-important service - minus 200% of the monthly cost of service.
Further deduction from the bonus continues to grow linearly, but not so fast - 8 hours - 300%, and so on up to 400%. At the same time, the cost of the service is the payment for it to the outsourcers, and not the cost of the service for the company. For example, if servicing a single computer in a store costs 700 rubles, its failure for a shift is 2100 rubles.

Despite the fact that the store lacks profit during this time (the second ticket office does not work), this has nothing to do with IT outsourcing and its payment - the guys simply cannot be responsible for such things. Before such logic, we walked for a rather long time, and we had to stand in our place and in their place in order to understand this.

Vova (our former admin, who just founded bitmanager.ru) says that thanks to this kind of work, his quality indicators for all customers have increased. More precisely - he understood well what retail needed. No, at first, of course, he beat his head against the wall, of course, but then he learned to control the processes from within with the help of a balanced scorecard. All tickets are summarized weekly, SLA indicators are calculated monthly for all tickets, and a penalty or bonus is calculated if necessary.

Then it became even more interesting. The fact is that depending on the seasonality of the business, the load on our internal IT department changes. And it turned out that some parts of the infrastructure can be outsourced, for example, on New Year's Eve, when IT is needed in other areas, and then taken back “under your wing” at the end of the peak. Because it saves costs. As a result, at the beginning of each month, outsourcers recalculate all nodes and services that are under their responsibility and invoice them. Accordingly, this approach to evaluating each service and node also means that no one in the divisions has “free” hardware or licenses — everything that needs support, money or an unused asset is returned “to the base”, from where it can be more efficiently use.

I must say that they were helped by the fact that at the initial stage, when we were just building our relationship, the external support thought that it was enough to have an on-site admin service. But violations of the SLA caused fines or deprivation of bonuses, so they adapted their processes to support retail. Now they work like a clock - each ticket is placed in the support panel, the necessary resources are allocated for it. The last exit from the SLA was over a year ago.

PS Well, to tell further our adventures , or well, what the hell is this IT-infrastructure of medium-sized business, which is already familiar to everyone?

Source: https://habr.com/ru/post/237579/

All Articles

How we disaccustomed outsourcing to throw the ball to the internal IT department

Centralization

Iron

Outsourcing

More articles: