Hello, Habr readers! The topic of this article is the disaster recovery features of AERODISK Engine storage systems. Initially we wanted to cover both features — replication and the metrocluster — in a single article, but unfortunately it grew too large, so we split it into two parts. We will go from simple to complex: in this article we will set up and test synchronous replication, take down one data center, and also break the communication channel between the data centers and see what happens.
Our customers often ask us questions about replication, so before we proceed to setting up and testing our implementation, we will say a little about replication in storage systems in general.
Storage replication is a continuous process of keeping data identical on several storage systems at once. Technically, replication is performed by one of two methods.
Synchronous replication is the copying of data from the main storage system to the backup one, with mandatory confirmation from both storage systems that the data has been written. Only after confirmation from both sides (from both storage systems) is the data considered written, and only then can it be worked with. This guarantees data identity across all storage systems participating in the replica.
Advantages of this method:
- Guaranteed data identity on both sides at all times (RPO = 0).

Minuses:
- High cost and strict requirements for the communication channel (distance and latency limits).
- Write latency on the main system depends on the channel.
Asynchronous replication is also copying of data from the main storage system to the backup one, but with a certain delay and without waiting for write confirmation from the other side. You can work with the data immediately after it is written to the main storage system, and on the backup storage it will appear after some time. Data identity, in this case, is naturally not guaranteed: the data on the backup storage is always a bit "in the past."
Advantages of asynchronous replication:
- Low requirements for the communication channel and distance.
- No impact on write latency of the main storage system.

Minuses:
- The data on the backup storage lags behind (RPO > 0), so the most recent data can be lost in a disaster.
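To make the difference concrete, here is a minimal sketch in Python of the two write paths. This is purely illustrative, with made-up names — not AERODISK's implementation:

```python
# Conceptual sketch of the two replication modes (hypothetical names;
# not the actual AERODISK implementation).

def sync_write(primary, secondary, block):
    """Synchronous: the host gets an ack only after BOTH sides hold the data."""
    primary.append(block)
    secondary.append(block)       # must complete before we acknowledge
    return "ack"                  # RPO = 0: both copies are identical now

def async_write(primary, backlog, block):
    """Asynchronous: ack immediately, ship to the backup in the background."""
    primary.append(block)
    backlog.append(block)         # drained to the backup "some time later"
    return "ack"                  # host does not wait for the remote side

primary, secondary, backlog = [], [], []
sync_write(primary, secondary, b"data1")
assert primary == secondary       # identical after a synchronous write

async_write(primary, backlog, b"data2")
# until the backlog is drained, the backup is "in the past": RPO > 0
assert b"data2" not in secondary
```

The price of the guarantee is visible in the sketch: the synchronous path cannot return until the remote write completes, which is exactly why channel latency translates directly into write latency.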
Thus, the choice of replication mode depends on business objectives. If it is critical that the backup data center hold exactly the same data as the main one (i.e., a business requirement of RPO = 0), you will have to pay up and accept the limitations of a synchronous replica. If some lag in the data is acceptable, or there is simply no budget, then the asynchronous method is clearly the way to go.
Separately, we single out a mode (more precisely, a topology) called the metrocluster. In metrocluster mode synchronous replication is used, but, unlike the regular replica, the metrocluster lets both storage systems work in active mode. That is, there is no division into active and backup data centers: applications work simultaneously with two storage systems physically located in different data centers. Downtime in case of accidents in this topology is very small (RTO is usually minutes). We will not cover our metrocluster implementation in this article, since it is a large topic in its own right; we will devote the next article to it.
Also, when we talk about storage-based replication, many people quite reasonably ask: "Many applications have their own replication tools — why use replication on the storage system? Is it better or worse?"
There is no single answer, so here are the arguments FOR and AGAINST:
Arguments for replication storage:
Arguments AGAINST storage replication:
* — a controversial thesis. For example, a well-known DBMS vendor long officially claimed that their DBMS can be properly replicated only with their own tools, and that all other replication (including storage-based) is "not real." But life has shown otherwise. Most likely (though this is not certain) it was simply a not entirely honest attempt to sell customers more licenses.
As a result, in most cases storage-based replication is better, because it is the simpler and cheaper option; but there are complex cases where specific application functionality is required, and application-level replication has to be used.
We will set up the replica in our lab. In the laboratory we emulated two data centers (in reality, two racks standing next to each other that pretend to be in different buildings). The stand consists of two Engine N2 storage systems interconnected by optical cables. A physical server running Windows Server 2016 is connected to both storage systems via 10Gb Ethernet. The stand is quite simple, but that does not change the essence.
Schematically, it looks like this:
Logically, replication is organized as follows:
Now let's look at the replication functionality that we have now.
Two modes are supported: asynchronous and synchronous. Logically enough, the synchronous mode is limited by distance and by the communication channel. In particular, synchronous mode requires fiber as the physical medium and 10 Gigabit Ethernet (or higher).
The supported distance for synchronous replication is 40 kilometers, with an optical channel latency between data centers of up to 2 milliseconds. Strictly speaking, it will work with longer delays too, but then writes will slow down noticeably (which is also logical), so if you plan synchronous replication between data centers, check the quality of the optics and the latency.
The requirements for asynchronous replication are not as serious. More precisely, there are none at all: any working Ethernet connection will do.
Currently, AERODISK ENGINE storage systems support replication for block devices (LUNs) over Ethernet (copper or optics). For projects that require replication through a SAN fabric over Fibre Channel, we are adding the corresponding solution, but for now it is not ready; so in our case it is Ethernet only.
Replication can work between any ENGINE-series storage systems (N1, N2, N4), from junior models to senior models and vice versa.
The functionality of the two replication modes is completely identical.
There are many more minor features, but there is little point in listing them all; we will mention them as we go through the setup.
The setup process is quite simple and consists of three stages.
An important point: the first two stages must be repeated on the remote storage system; the third stage is performed only on the main one.
The first step is to configure the network ports over which replication traffic will travel. To do this, the ports must be enabled and assigned IP addresses in the Front-end adapters section.
After that we need to create a pool (in our case an RDG) and a virtual IP for replication (VIP). A VIP is a floating IP address tied to the two "physical" addresses of the storage controllers (the ports we just configured). It will be the main replication interface. You can also operate with VLAN-type VIPs instead if you need to work with tagged traffic.
The process of creating a VIP for the replica is not much different from creating a VIP for I/O (NFS, SMB, iSCSI). In this case we create a regular VIP (without VLAN), but we must indicate that it is for replication — without this flag we will not be able to add the VIP to the replication rule at the next stage.
The VIP must be in the same subnet as the IP ports between which it "floats."
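This requirement is easy to check with Python's standard `ipaddress` module. The addresses below are examples for illustration, not taken from the stand's actual configuration:

```python
import ipaddress

def vip_matches_ports(vip: str, port_ips: list, prefix: int) -> bool:
    """Return True if the VIP lies in the same subnet as every controller port IP."""
    vip_net = ipaddress.ip_network(f"{vip}/{prefix}", strict=False)
    return all(ipaddress.ip_address(ip) in vip_net for ip in port_ips)

# Example: a VIP floating between two controller ports in 192.168.3.0/24
print(vip_matches_ports("192.168.3.10", ["192.168.3.11", "192.168.3.12"], 24))  # True
# A VIP from another subnet would violate the rule:
print(vip_matches_ports("192.168.2.10", ["192.168.3.11", "192.168.3.12"], 24))  # False
```
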
We repeat these settings on the remote storage system — with different IP addresses, naturally.
The VIPs of different storage systems may be in different subnets, as long as there is routing between them. Our setup is exactly such an example (192.168.3.XX and 192.168.2.XX).
This completes the preparation of the network part.
Configuring the storage for a replica differs from the usual setup only in that the mapping is done through a special menu, "Mapping Replication." Otherwise everything is the same as in a regular setup. Now, step by step.
In the previously created pool R02, we need to create a LUN. We create it and name it LUN1.
We also need to create a LUN of the same size on the remote storage system. We create it; to avoid confusion, the remote LUN is named LUN1R.
If we wanted to use an already existing LUN, then while setting up the replica that production LUN would have to be unmounted from the host, and on the remote storage system we would simply create an empty LUN of identical size.
The storage configuration is complete; we proceed to creating the replication rule.
After the LUNs have been created on the storage system that is currently Primary, we configure the replication rule from LUN1 on SHD1 to LUN1R on SHD2.
The setup is done in the "Remote Replication" menu.
We create a rule. To do this, we specify the replica recipient, and in the same dialog set the connection name and the replication type (synchronous or asynchronous).
In the "remote systems" field we add our SHD2. To add it, you must use the storage system's management IP (MGR) and the name of the remote LUN to which we will replicate (in our case, LUN1R). The management IPs are needed only at the stage of adding the connection; replication traffic will not go through them — the previously configured VIPs will be used for that.
Already at this stage we can add more than one remote system for a "one to many" topology: just click the "add node" button, as in the figure below.
In our case there is only one remote system, so we limit ourselves to that.
The rule is ready. Note that it is added automatically on all replication participants (in our case there are two). You can create as many rules as you like, for any number of LUNs and in any direction. For example, for load balancing you can replicate some LUNs from SHD1 to SHD2, and others, conversely, from SHD2 to SHD1.
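Such a rule set can be pictured as a simple table. The sketch below is our own hypothetical model (the structure, the rule REPL2, and the LUN2/LUN2R names are invented for illustration; only REPL1 exists on the stand) of cross-replication for load balancing:

```python
# Hypothetical model of a replication rule set (illustration only;
# REPL2, LUN2 and LUN2R are invented names, not from the actual stand).
rules = [
    {"name": "REPL1", "source": ("SHD1", "LUN1"), "target": ("SHD2", "LUN1R"), "mode": "sync"},
    {"name": "REPL2", "source": ("SHD2", "LUN2"), "target": ("SHD1", "LUN2R"), "mode": "sync"},
]

def primaries(rules, system):
    """LUNs for which `system` currently holds the Primary role."""
    return [r["source"][1] for r in rules if r["source"][0] == system]

print(primaries(rules, "SHD1"))  # ['LUN1'] -- SHD1 is Primary for LUN1
print(primaries(rules, "SHD2"))  # ['LUN2'] -- SHD2 for LUN2: load is balanced
```
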
SHD1: synchronization began immediately after the rule was created.
SHD2: we see the same rule, but synchronization has already finished.
LUN1 on SHD1 is in the Primary role, i.e. it is active. LUN1R on SHD2 is in the Secondary role, i.e. it is standing by in case SHD1 fails.
Now we can connect our LUN to the host.
We will use an iSCSI connection, although FC is also possible. Setting up iSCSI LUN mapping for a replica is almost the same as the regular scenario, so we will not cover it in detail here; this process is described in the "Quick Setup" article.
The only difference is that we create the mapping in the "Mapping Replication" menu.
We configure the mapping and present the LUN to the host. The host sees the LUN.
We format it with a local file system.
That's it — the setup is complete. Next come the tests.
We will test three main scenarios.
To begin, we start writing data to our LUN (files with random data). We immediately check that the communication channel between the storage systems is being utilized; this is easy to see by opening the load monitoring of the ports responsible for replication.
Both storage systems now have “useful” data; we can begin the test.
Just in case, we take the hash sums of one of the files and write them down.
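Any checksum tool will do for this. As an example, the check can be scripted in Python (a generic sketch, not the specific tool used on the stand; the file path is hypothetical):

```python
import hashlib

def file_sha256(path: str) -> str:
    """Compute the SHA-256 digest of a file in chunks (works for large files)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Record the digest before the failover, recompute after, and compare:
# before = file_sha256("E:/testdata.bin")   # hypothetical path
# ... switch roles ...
# assert file_sha256("E:/testdata.bin") == before
```
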
The role-switch operation (changing the replication direction) can be performed from either storage system, but you still have to visit both, since mapping must be disabled on the Primary and enabled on the Secondary (which becomes the new Primary).
A reasonable question may arise here: why not automate this? The answer is simple: replication is a straightforward disaster recovery tool based entirely on manual operations. To automate these operations there is the metrocluster mode; it is fully automated, but its configuration is considerably more complicated. We will write about metrocluster setup in the next article.
On the main storage system we disable the mapping to make sure writes have stopped.
Then on either of the storage systems (it does not matter whether the main or the backup), in the "Remote Replication" menu we select our REPL1 connection and click "Change role."
After a few seconds, LUN1R (backup storage) becomes Primary.
We map LUN1R from SHD2 to the host.
After that the E: drive automatically reappears on the host, only this time it is served from LUN1R.
Just in case, we compare the hash sums.
They are identical. Test passed.
At the moment, after the regular switchover, the main storage system is SHD2 with LUN1R. To emulate an accident, we cut power to both controllers of storage system 2.
There is no more access to it.
Let's see what happens on SHD1 (currently the backup).
We see that the Primary LUN (LUN1R) is unavailable. Error messages appeared in the logs, in the information panel, and in the replication rule itself. Accordingly, the data is currently unavailable to the host.
We change the role of LUN1 to Primary.
We enable the mapping to the host.
We make sure that drive E: appears on the host.
We check the hash.
All good. The failure of the active data center was handled successfully. The approximate time we spent on "reversing" the replication and connecting the LUN from the backup data center was about 3 minutes. Clearly, in real production everything is much more complicated: besides the storage operations, many more steps are needed on the network, on the hosts, and in the applications, so in real life this window will be considerably longer.
Here we would like to write that the test has been completed successfully — but let's not rush. The main storage system is down, and we know that when it "fell" it was in the Primary role. What happens if it suddenly comes back up? Will there be two Primary roles, which means data corruption? Let's check.
We "suddenly" power the downed storage system back on.
It boots for several minutes and then, after a short synchronization, returns to service — but already in the Secondary role.
All is well: no split-brain occurred. We anticipated this situation — after a failure, the storage system always comes back up in the Secondary role, regardless of the role it held "in its previous life." Now we can say for certain that the data center failure test was successful.
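The decision rule can be sketched in a few lines (our own illustration of the logic, not the actual firmware code): a node recovering from a crash never assumes the Primary role on its own.

```python
def role_after_recovery(role_before_crash: str) -> str:
    """A recovered node always comes back as Secondary, whatever it was before.

    This avoids split-brain: two Primaries can never coexist, because the
    surviving node was promoted manually while the failed one was down.
    """
    return "Secondary"

# Whatever the role was "in a previous life":
print(role_after_recovery("Primary"))    # Secondary
print(role_after_recovery("Secondary"))  # Secondary
```
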
The main task of this test is to make sure the storage systems do not start misbehaving if the communication channels between them temporarily disappear and then come back.
So: we disconnect the cables between the storage systems (imagine an excavator dug through them).
On the Primary we see that there is no connection with the Secondary.
On the Secondary we see that there is no connection with the Primary.
Everything keeps working, and we continue writing data to the main storage system, which means its data is now guaranteed to differ from the backup's — the two sides have "diverged."
A few minutes later we "repair" the communication channel. As soon as the storage systems see each other, data synchronization starts automatically. Nothing is required from the administrator here.
After a while, the synchronization ends.
The connection is restored; the break in the communication channels caused no abnormal situations, and after the link came back, synchronization ran automatically.
We went through the theory: what is needed and why, and what the pluses and minuses are. Then we configured synchronous replication between two storage systems.
Next we ran the main tests: regular switchover, data center failure, and a break in the communication channels. In all cases the storage systems behaved well: there was no data loss, and administrative operations are kept to a minimum for a manual scenario.
Next time we will complicate the situation and show how all this logic works in an automated metrocluster in active-active mode — that is, when both storage systems are primary and the behavior in case of a storage failure is fully automated.
Please leave comments; we will be glad to receive sensible criticism and practical advice.
Until next time.
Source: https://habr.com/ru/post/456348/