Refactoring an insurance company's server room: when there is less physical space than data

Insurance companies are the third-largest consumers of heavy IT hardware, after banks and mobile operators. When we started work, the situation was this: one company had its central server room in its office (it resembled a small data center's machine room), and overall everything worked fine.

The problem was that the server room had run out of space for racks (and inside the racks themselves) two years earlier. The company's other two data centers had room, but this one did not.
The second problem was that the main production database was spread over 149 volumes and, physically, looked like Swiss cheese across the storage. This happened because whenever the database had to grow, someone found the first free hole on the physical disks and put the new piece there. Between the database volumes there could be databases of other projects, software, assorted temporary files, and so on. In short, it was time to restore order.
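The "first free hole" habit described above is essentially first-fit allocation. A minimal sketch, using a made-up disk layout and sizes chosen purely for illustration (none of these numbers come from the project), shows how successive growth steps of one database end up scattered between other data:

```python
def first_fit(free_holes, request_gb):
    """Place a request into the first hole that is large enough.

    free_holes is a list of (start_gb, size_gb) tuples sorted by start.
    Returns the start offset of the new extent, or None if nothing fits.
    """
    for i, (start, size) in enumerate(free_holes):
        if size >= request_gb:
            remaining = size - request_gb
            if remaining > 0:
                # Shrink the hole: the new extent takes its beginning.
                free_holes[i] = (start + request_gb, remaining)
            else:
                del free_holes[i]
            return start
    return None


# Hypothetical layout: holes left between other projects' data.
holes = [(100, 50), (400, 30), (900, 200)]
requests = [40, 30, 60, 20]  # successive growth steps of one database

placements = [first_fit(holes, r) for r in requests]
print(placements)  # [100, 400, 900, 960]
print(holes)       # [(140, 10), (980, 120)]
```

The four extents land at three unrelated locations, and a 10 GB sliver is left behind that no later request of this size can use.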

Another interesting detail: whenever new data appeared, it needed a new volume (LUN). The volume was carved out, and it immediately became a bottleneck. The explanation is simple: in a production database, the most heavily loaded area is precisely the newest data. When that data physically sits on a single disk, the disk's maximum read/write speed effectively limits the entire system.

The options were as follows:
• Upgrade the existing arrays (considered and rejected: expensive, inefficient, and there was nowhere to put them);
• Move to a new generation of the same arrays (high maintenance cost, constant tuning);
• Look for alternative solutions (flash: calculations and tests).

Changes

We decided to start with a small piece of data: a separate subproject with its own database. The plan was to test on it and then move slowly and smoothly to the new system. Naturally, everything there already worked before we started; we just wanted it to work faster and more cleanly. This data lived on a Symmetrix DMX-3.

We brought in 3 units of Violin 6264 and started tinkering. First, we sat down with the Oracle DBAs to work out the optimal physical structure of the database. Taking into account common sense, OS specifics, and the database architecture, we decided that 27 volumes were needed instead of 149. Two more were added later, for a total of 29. Along the way we estimated how much space sat empty on the DMX: if the free gap between volumes is too small to carve out anything new, it is simply skipped. As a result, about 15% of the free space was lost to such “gaps” between independent pieces of data.
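One way to think about the “gaps” estimate: a free hole smaller than the smallest volume anyone would carve is effectively lost. A minimal sketch, assuming hypothetical hole sizes and a hypothetical minimum volume size (the article does not give the real DMX layout):

```python
def wasted_share(hole_sizes_gb, min_volume_gb):
    """Fraction of free space sitting in holes too small to hold a new volume."""
    total_free = sum(hole_sizes_gb)
    if total_free == 0:
        return 0.0
    unusable = sum(size for size in hole_sizes_gb if size < min_volume_gb)
    return unusable / total_free


# Hypothetical free holes (GB) and smallest volume worth creating (GB).
holes_gb = [5, 12, 300, 8, 150, 20, 5]
share = wasted_share(holes_gb, min_volume_gb=50)
print(f"{share:.0%} of free space is unusable")  # 10% of free space is unusable
```

Consolidating 149 volumes into 29 removes most of the boundaries where such slivers can form.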

And that is before any performance optimization. On the DMX, one volume could be spread over 7 physical disks. On the Violin, thanks to the controller architecture and the algorithms that lay data out across the fabric, a LUN is “smeared” across the entire storage, which gives maximum performance even if the number-crunching servers all decide to hammer one specific region a couple of gigabytes in size.
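To illustrate the difference, here is a simplified round-robin striping model. It is not Violin's actual placement algorithm, which the article does not describe; the stripe sizes and device counts are assumptions picked for the example. It counts how many devices end up serving a hot 2 GB region under a coarse layout and a fine one:

```python
def devices_touched(offset_mb, length_mb, stripe_mb, device_count):
    """Return the set of devices that hold a byte range under round-robin striping."""
    first_stripe = offset_mb // stripe_mb
    last_stripe = (offset_mb + length_mb - 1) // stripe_mb
    return {stripe % device_count for stripe in range(first_stripe, last_stripe + 1)}


HOT_OFFSET_MB = 10_000   # hypothetical location of the "hot" new data
HOT_LENGTH_MB = 2_048    # a couple of gigabytes

# Coarse layout: 1 GB stripes over 7 disks (illustrative stand-in for the old volume).
coarse = devices_touched(HOT_OFFSET_MB, HOT_LENGTH_MB, stripe_mb=1_024, device_count=7)

# Fine layout: 4 MB stripes over 64 modules (illustrative stand-in for a flash array).
fine = devices_touched(HOT_OFFSET_MB, HOT_LENGTH_MB, stripe_mb=4, device_count=64)

print(len(coarse), len(fine))  # 3 64
```

With the coarse layout, the hot region is served by only a few devices, whose combined speed caps the whole workload; with the fine layout, every device shares the load.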

Of course, we had to agree on downtime. Sometimes there is no way around it and tough decisions have to be made. Much later, though, another, larger database was migrated via a standby: the main database is copied, then it is read-only for half an hour, and then writes are enabled again.

We took the database down late on a Saturday night, changed the structure, copied the data, and brought it up on the new hardware. We took measurements, and everything was fine. We put it into production: it worked, and on the whole worked well, but not as fast as we had expected. That is, the storage resources were not being fully used.

We started profiling the data flows, and it turned out that the compute servers themselves had become the bottleneck. Previously they had been waiting for responses from storage; now they were crunching at full capacity, even though the storage system could deliver more. As a result, the customer was very impressed and bought 2 new "boxes": one for this database, and a second one straight away "for growth", for the main database.
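The effect is the usual one for a chain of stages: end-to-end throughput is capped by the slowest stage, so speeding up storage simply exposes the next limit. A toy sketch with invented numbers (the article gives no actual throughput figures):

```python
def bottleneck(stage_capacity):
    """Return (name, capacity) of the slowest stage in a pipeline."""
    name = min(stage_capacity, key=stage_capacity.get)
    return name, stage_capacity[name]


# Hypothetical capacities in thousands of operations per second.
before = {"compute servers": 80, "storage": 30}
after = {"compute servers": 80, "storage": 400}

print(bottleneck(before))  # ('storage', 30)
print(bottleneck(after))   # ('compute servers', 80)
```

In this toy model the system still gets faster overall (from 30 to 80), but the remaining headroom on the storage side can only be used by adding compute.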

With the new server we got what we wanted. The test was over, the result was good, and the customer began thinking about scaling up, slowly and quietly buying hardware in stages.

What is especially pleasing is that we solved the problems with the database architecture, the question of growth in both speed and capacity, the problem of physical space in the server room, and the power supply issue that would have come up the following year. It was a little sad to remove the high-end DMX, which was only approaching its third year of operation, but such is the fate of all hardware. Perhaps it will find a new life somewhere else.

Why Violin?

Habr's favorite question is why we chose such astronomically expensive hardware. Yes, Violin, like any "real" flash array (that is, one not built from SSD shelves) without the overhead of HDD technology, is very expensive per unit. We are talking about hundreds of thousands of dollars for the whole system. On the other hand, if you can afford a flash array for serious data, you can definitely afford Violin, because in the long run it pays off very well. It is expensive to install but profitable to operate.

Naturally, there are cheaper solutions, but our task came with a couple of important requirements:

• 24x7 database availability. The piece we took for the test is the primary analytics of insurance claims. Roughly speaking, it estimates coefficients that are not needed in real time: data arrives, gets crunched, and the formulas are updated. It can be shut down and paused if necessary. In fact, during load peaks on the main database, the main database got priority and this part of the calculations was throttled. The production database, however, cannot be shut down under any circumstances; it simply has to work all the time. Even a couple of minutes of downtime can cost a couple of million rubles. Hence, high-end only.
• Proper support with a good SLA is essential. Such systems always use redundancy, and the failure of one of the duplicated circuits is treated as a serious incident. The database and services fail over, but someone has to go and repair the fault immediately, because if the backup equipment fails as well, that is the end. Hence the spare parts warehouses in Moscow, competent engineers, and guarantees that someone will come and fix it. In general, the usual. For an example, see my colleagues' story about servicing Dell servers.

What happened next

The stages were as follows:

Calculations and analysis (the most important question was that actual operating costs after implementation be low).
Testing at our site.
Purchase of the first array for OLAP.
The "wow" effect: Oracle runs at least 2 times better, and no dedicated specialists are needed for maintenance.
We wait six months. Over those six months of operation, the array requires no attention.
We recalculate the case. The cost per 1 TB is lower than for high-end solutions.

[Screenshot: Violin GUI]

[Screenshot: Oracle Database Performance Statistics Application]

Six months after the test, we began upgrading the second section. There was a tender; once again our Violin-based solution won and was deployed. As far as I know, the customer also considered VNX and Hitachi, plus some hybrid systems. Violin turned out to be the cheapest, even though it is all-flash. The server room still holds a lot of old storage systems, but everything important and live is already on flash.

As you can see, it is an interesting example. If you would like me to assess your own case, not necessarily only with Violin in mind, write to VBolotnov@croc.ru, and I will tell you whether it makes sense to try this kind of optimization in your situation.

Source: https://habr.com/ru/post/250389/
