Shared database: how we built 7 test environments on 1 database, each with its own increments and diff

Imagine an insurance company with a 30 TB production database. It sits on a large storage array and is served by a very, very powerful server. Everything is fine. Now imagine that you have written a feature or a piece of functionality and need to test it against the production data. For a number of reasons, you cannot carve out just a slice of the database.

What do you do? The traditional way is to take one more storage system of 30–35 TB (five times cheaper, slower, simpler, without redundancy) and replicate the database onto it. Then you work with the copy. Good plan?

No. When you have several development teams (in our case their number has grown from 4 to 10), you need 4 to 10 test environments accordingly, or even more. Buying that much hardware is simply unrealistic, so you need a solution that replicates the production database once and then "shows" it to each server as a separate test database, while keeping every environment's changes separate.

Here is how we deployed 7 test environments, isolated from each other, on the same node as the physical database.

How it was

There is a 30 TB production database and 4 groups of developers. Each time, they copy the database into a test environment, and each group needs 30 TB. Rolling out the database, and inevitably rolling it back (during testing the developers break it sooner or later), takes a long time and ties up all of the company's engineers. The storage engineers are involved, then the database admins and, if there is not enough space, the infrastructure administrators and then the procurement committee that buys the hardware. A single test cycle lasts several months. When the project started, the developers were queuing for a test environment and literally climbing the walls. The security team was not thrilled either, because they had little control over all this database juggling, but that was the least of the problems. The main thing is that testing was slow and painful, and it could take half a year or more from writing a feature to releasing it.

You cannot develop like this in modern business. The cycle should be: write, test, pilot, production. Ideally it takes from 3 days to a week.

How it works now

1. You have a production database and a test environment with its own hardware. In our case, we had 27 TB of test storage for 30 TB of production (the test database is compressed; this affected performance on the production database).

2. During installation, we copy the production database to the test storage once to create the test database (this takes a long time; it is 30 TB after all). After that, we set up replication to run once a day.

3. Now, if we connect several test servers to the test database, each one will make its own changes to the data, and sooner or later this will turn testing into hell. That is exactly what happened before the project started. So we need to "multiply" the database, storing the changes separately for each test server. This is what happens:

Here the server with Delphix software acts as a virtualization platform. It sits between the production server and the test storage system and gives each individual test server its own virtual version of the database. When a test server writes to this version, the data goes not into the database itself but into a separate diff: a small database that holds only the changes made by that particular test server. So from one large main test database and four small databases of changes, you get 4 different test environments.

Each database with its diff is also stored in incremental layers of 20 minutes: any test team can roll back its results in 20-minute steps. If the first team breaks its database, that affects only that team. The others do not see its diff or its states.

4. This is already good, but there is one more finishing touch that relieves the main database of unnecessary replication. Here it is:

We started taking the copy not from the main production database but from the standby (backup) copy of production.

Note: the transition from production to the standby copy is one-way, and so is the transition from the standby copy to the test database. Only the individual diffs can be broken.
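The scheme from step 3 (one shared base, a private diff per team, rollback to an earlier layer) can be sketched in a few lines of Python. This is only an illustration of the copy-on-write idea, not Delphix's implementation or API: the class names `SharedBase` and `VirtualDatabase`, the block-level dictionary and the manual `snapshot()` call are all assumptions made for the example.

```python
from typing import Dict, List, Optional


class SharedBase:
    """Read-only copy of production, shared by every test environment."""

    def __init__(self, blocks: Dict[int, str]) -> None:
        self._blocks = dict(blocks)

    def read(self, block_id: int) -> Optional[str]:
        return self._blocks.get(block_id)


class VirtualDatabase:
    """One team's view: the shared base plus a private diff of changed blocks."""

    def __init__(self, base: SharedBase) -> None:
        self._base = base
        self._diff: Dict[int, str] = {}             # blocks this team has changed
        self._snapshots: List[Dict[int, str]] = []  # saved states of the diff

    def read(self, block_id: int) -> Optional[str]:
        # The private diff wins; untouched blocks come from the shared base.
        if block_id in self._diff:
            return self._diff[block_id]
        return self._base.read(block_id)

    def write(self, block_id: int, value: str) -> None:
        # Writes never reach the shared base, only this team's diff.
        self._diff[block_id] = value

    def snapshot(self) -> None:
        # The article describes 20-minute layers; here it is called by hand.
        self._snapshots.append(dict(self._diff))

    def rollback(self) -> None:
        # Return to the latest snapshot, or to the clean base if none exists.
        self._diff = dict(self._snapshots[-1]) if self._snapshots else {}

    def diff_size(self) -> int:
        return len(self._diff)


base = SharedBase({1: "policy A", 2: "policy B", 3: "policy C"})
team_one = VirtualDatabase(base)
team_two = VirtualDatabase(base)

team_one.write(1, "policy A v2")
team_one.snapshot()
team_one.write(2, "broken")        # team one breaks its own copy
team_two.write(3, "policy C v2")   # team two works independently

print(team_two.read(2))   # team two still sees the value from the shared base
team_one.rollback()
print(team_one.read(2))   # team one is back to the snapshot state
```

The point of the design is that storage grows with the amount of changed data per team, not with the number of teams multiplied by the size of the base.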

What changed in the infrastructure

The production environment is unchanged. Test environment: we added more servers, former production machines (low-powered compared with the current production ones; this is hardware that was replaced and moved from production to test 3-4 years ago). The storage needs to be a little larger to hold the diffs. Plus a server with the Delphix virtual appliance. The virtual servers attach to the mirror. That is all.

The test servers are virtual machines on the same physical server where Delphix is running.

The best part is that the admins and storage engineers are now pulled in only for production incidents. Each development team plays in its own sandbox as it pleases. The teams have a GUI and a console that let them do everything themselves.

The admin has a separate view, and the teams can of course still go through the admin if something is complicated. A new instance, for example, is created through the admin.

Another nice option, which the security team unexpectedly liked, is masking the database. Delphix can not only make mirrors but also hash the data, add random noise to it, average it or mask it. This option is actively used by various telecoms, by the way.
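To make the four masking approaches named above concrete, here is a minimal Python sketch of each. It does not use or imitate the Delphix masking engine; the function names, the salt and the sample values are all hypothetical and chosen only for illustration.

```python
import hashlib
import random
from statistics import mean
from typing import List


def hash_value(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a stable, irreversible token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def add_noise(amount: float, spread: float, rng: random.Random) -> float:
    """Shift a number by a random factor within +/- spread (0.1 means 10%)."""
    return round(amount * (1 + rng.uniform(-spread, spread)), 2)


def average_out(amounts: List[float]) -> List[float]:
    """Replace every value in a group with the group mean."""
    avg = round(mean(amounts), 2)
    return [avg] * len(amounts)


def mask_tail(value: str, visible: int = 4, fill: str = "*") -> str:
    """Hide everything except the last `visible` characters."""
    if len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


rng = random.Random(42)                      # fixed seed so test runs repeat
print(hash_value("Ivan Petrov"))             # same input always gives same token
print(add_noise(1500.0, 0.1, rng))           # premium shifted by up to 10%
print(average_out([100.0, 200.0, 300.0]))    # every row gets the group mean
print(mask_tail("4111111111111111"))         # only the last 4 digits remain
```

Hashing keeps joins between tables working because equal inputs map to equal tokens, while noise and averaging keep numeric columns realistic enough for tests without exposing real amounts.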

The teams have been working under the new scheme since January of this year and are very satisfied. The number of development teams has grown.

In this story, almost everything went smoothly. As for surprises: at another customer, an old storage system in the test environment once fell apart, but the investigation showed that the problem was in the hardware, not in the software. During this implementation, the bandwidth was not enough, and the initial sync alone took 4 days. It was done slowly and sadly with the standard Oracle copy mechanism from the DRP mirror, and it was unpleasantly long because the standby copy lives in a separate data center. Together with the vendor, we found settings to tune (this took some network diagnostics, which uncovered a number of problems), tuned the compression and reduced the time to roll out the entire database to one day. The vendor on the whole knew what to do; for them this situation is quite familiar.

Now 7 teams use 27 TB: this is the database compressed during initialization (down to a third) plus their diffs for the year. I must say that they had a lot of char data, which is why it compressed so well. The customer plans to give access to three more groups; the capacity allows it.

The limitation, of course, is that you cannot run full load testing on such an infrastructure. If you need it, you need separate physical hardware.

And to answer the question of why a complete database is needed: first, there is logic in stored procedures, and second, many tests are completely unrepresentative on small samples. If the customer could have got by with a cut-down copy of 1000 records per table, they would of course not have gone to all this trouble.
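The saving is easy to check with the numbers from the article: 7 teams, 30 TB of production and 27 TB in total for the shared setup. The short calculation below compares that with the traditional approach of one full copy per team; the variable names are made up for the example.

```python
PRODUCTION_TB = 30      # size of the production database (from the article)
TEAMS = 7               # teams currently using test environments
SHARED_TOTAL_TB = 27    # compressed base plus all diffs for the year

full_copies_tb = PRODUCTION_TB * TEAMS        # one full copy per team
saved_tb = full_copies_tb - SHARED_TOTAL_TB   # storage that was not bought
ratio = full_copies_tb / SHARED_TOTAL_TB

print(f"Full copies: {full_copies_tb} TB")
print(f"Shared base with diffs: {SHARED_TOTAL_TB} TB")
print(f"Saved: {saved_tb} TB, about {ratio:.1f}x less storage")
```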

World experience

Such tasks are usually solved with a zoo of software and hardware: as a rule, some kind of Oracle GoldenGate or an analogue, then array-level snapshots, then a layer of custom scripts to glue it together. Our case is the first major Delphix deployment in Russia.

Naturally, before that we looked at how the software is used around the world, and here is what we found interesting. This is an infrastructure for MDM that saved one US company (the contract is under a partial NDA) 6 months:

Delphix Server: Sync Multiple Source Databases Down to the Second

There is also the NASA story.

Other interesting stories are City Index, Clean Harbors and SABMiller. They all boil down to the fact that the teams were able to develop very quickly: from the idea of a feature to its release in as little as two weeks. This is very important: Gref spoke about this, saying that our banks are lagging behind in speed of implementation.

Three weeks to launch

In January, the customer provided detailed technical requirements. They did not yet know exactly what they wanted, but the technical input was very good. In short, there was not enough space for 7 databases, and the plan was to buy storage systems. We started by looking for solutions that would save either on space or on storage hardware and, above all, would meet the requirements. Buying 7 times the hardware was not an option (expensive and incredibly slow).

We evaluated several solutions, but almost every one either had incomplete coverage or significantly increased the load on the production database. Any third-party script running on top of production directly affected the queues in the branches. When a snapshot slowed things down, all branches felt the lag immediately.

We settled on Delphix in this configuration and requested a trial. We discussed everything with the vendor and the customer's admins in a video conference, got the ISO image from the vendor and rolled it out to take a look.

We ran the trial for about three weeks. The product was very clear to the customer and almost entirely intuitive; from the interface they could see what to do. We spent more time on integrating it with the hardware than they spent learning it. After that came pilot operation, storage expansion and full implementation.

Summary

The tool is excellent: it is easy to implement and has almost no pitfalls. The only difficulty is the first synchronization. Highly recommended for similar tasks.