
How I stopped worrying and loved Hyper-V replication

It may sound strange, but in the first working days after the New Year holidays, once everything that fell over during the break has been brought back to life, many of us feel the urge to sort out the information in our heads and put it into some systematic form. A good catalyst for this is realizing that you seem to carry a whole baggage of knowledge, yet you cannot explain it in simple words to a stranger off the street or to a six-year-old. As folk wisdom has it: if you can't explain it to a child, you don't really know it. In any case, a little defragmentation of information has never hurt anyone.
But this is not an applied psychology course, so today I will simply lay out, in a systematic arrangement of pixels, as much useful information as I can about the virtual machine replication feature in Hyper-V, using the current version, Windows Server 2012 R2, as the example.

So, here is what I am going to spend roughly an hour of your time on:


Act One. Overview.


The term "virtual machine replication" means exactly what "replication" usually means in IT: a copy of a VM from the primary host is created and kept up to date on another host.

Let's agree right away: replication is not backup! Just as snapshots are not backup, RAID is not backup, and in general nothing is backup except backup; for if grandpa were grandma...
But just in case, let me explain why it is "not backup": if the primary machine fails, you can always power on the replica with almost no delay, but if the failure was caused not by a momentary glitch but by a pile-up of accumulated problems at the OS or application level, they will all be faithfully reproduced on the replica, and nothing good will come of it. There are countless cases where a replicated VM, once powered on, runs for a few minutes and then dies after its parent with exactly the same symptoms.
So replication is a great tool for extending your disaster recovery plan, letting you bring all your services back into fighting shape with minimal delay, but you cannot shift all the responsibility onto it: nothing is perfect, and everything has its downsides.
As a process, replication of a Hyper-V virtual machine can be implemented in three ways:


As stated at the very beginning, we will consider only the first item: replication of Hyper-V machines using the built-in tools of Windows Server 2012 R2. I say R2 specifically because there is a functional gap between the first release and the second, and running the non-R2 version of the hypervisor in production is, by now, simply bad form.

So, what does Microsoft offer us out of the box:


A checklist, before it's too late




It is only fair to also mention a tool from Microsoft that lets you estimate, with a certain margin of error, the resources needed to replicate a particular virtual machine. It is called Capacity Planner for Hyper-V Replica. Of course, it will not give you exact figures for IOPS, network load, or CPU, but as an estimation tool it is quite decent and lets you analyze your infrastructure in advance.

When you run it, you are asked to specify the primary server, the replica server, the machines to be measured, and the measurement period. I recommend raising the default 30 minutes to at least an hour, and of course the best time to run it is at the height of the working day. The collected data is also great for scaring management into approving money for new toys... I mean, hardware.

Act Two. Setup.


And now the crucial moment has arrived! The certificates are in place, the network is configured, the Hyper-V role is running everywhere, the management tools have not been forgotten, and we can proceed.
The first thing to do is allow our host to act as a replica server and take machines on board. This is done through the standard Hyper-V settings window:

All the settings are self-explanatory, but I want to dwell a little on the bottom section, Authorization and storage. It is not critical, but I strongly recommend allowing replication only from specific hosts or groups of hosts. It does not happen often, but replication does get started by mistake or out of ignorance, and you are lucky if it lands on a spare host; it can just as easily fill up your production storage, with all the ensuing entertainment. Allowing replication from anywhere is best left to lab environments for testers and developers. Or to the truly brave =)
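
For those who prefer the console to the settings window, the same configuration can be expressed in PowerShell roughly as follows. This is a minimal sketch on my part, assuming Kerberos authentication over HTTP, a hypothetical primary host named HV-PROD01.contoso.local, and a storage path of D:\Hyper-V\Replicas:

    # Run on the host that will receive replicas: enable it as a replica server
    # and allow replication only from explicitly authorized primary hosts.
    Set-VMReplicationServer -ReplicationEnabled $true `
        -AllowedAuthenticationType Kerberos `
        -ReplicationAllowedFromAnyServer $false

    # Authorize a specific primary host (repeat per host, or use a wildcard such as *.contoso.local).
    New-VMReplicationAuthorizationEntry -AllowedPrimaryServer "HV-PROD01.contoso.local" `
        -ReplicaStorageLocation "D:\Hyper-V\Replicas" `
        -TrustGroup "Production"

    # Kerberos authentication uses the HTTP listener on port 80; let it through the firewall.
    Enable-NetFirewallRule -DisplayName "Hyper-V Replica HTTP Listener (TCP-In)"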

Calling in the broker


Since we agreed at the very beginning that the infrastructure is all grown up (that is, a cluster is set up and running happily), we need to add the Hyper-V Replica Broker role. If you do not have a cluster, feel free to skip this section.

The setup procedure is simple and consists of five Next buttons and one Finish. There is nothing to explain here: we open the cluster management console, select Configure Role, and go through the wizard, not forgetting to give the broker a NetBIOS-compatible name and an IP address.

A small hint for those who read the documentation first and only then act (even though real engineers never do that): everything described in the previous section can also be done directly from the broker, with the only difference that the settings are applied to the entire cluster at once, so you do not have to enable replication by hand on each server. As you can see, everything looks exactly the same:
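
If you would rather script the broker as well, the Failover Clustering cmdlets can do it along roughly these lines. The role name and IP address are made up, and the exact resource wiring is my assumption, so treat this as a sketch to verify in a lab:

    # Run on any cluster node: create the client access point for the broker,
    # add the replication broker resource, tie it to the role, and bring it online.
    Import-Module FailoverClusters

    Add-ClusterServerRole -Name "HVR-Broker" -StaticAddress 192.168.10.15
    Add-ClusterResource -Name "Virtual Machine Replication Broker" `
        -Type "Virtual Machine Replication Broker" `
        -Group "HVR-Broker"
    Add-ClusterResourceDependency "Virtual Machine Replication Broker" "HVR-Broker"
    Start-ClusterGroup "HVR-Broker"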


And a short explanation of the broker's role in the replication process: for machines that are not part of a highly available cluster, the broker is not involved at all. But for clustered machines, it takes full control of everything related to replication and clustering, preventing the cluster from making a wrong decision about machine availability. Hence the golden rule: from now on, perform all actions only through the Failover Cluster Manager console, otherwise you risk being left without a cluster. Even if a meteorite lands on the production host, the worst thing you can do in that situation is power on the replica machine through Hyper-V Manager.

The first one goes in


Now we are finally ready to replicate our very first machine. As with everything in Windows, we do it with the right mouse button:


Next, a fairly standard settings wizard opens. In the first steps it asks for the name of the server the machine will be replicated to and for the connection settings. More precisely, if the hosts are in the same domain, everything is filled in without our participation, but if the servers are not acquainted and you also need to encrypt the traffic, you will have to specify all the parameters manually. The only checkbox worth noting at this step is "compress the data that is transmitted over the network". Here we go back to the planning stage and decide what matters more to us: compressing the data to finish the transfer sooner (which inevitably puts extra load on the hosts), or not caring about the size and duration of the transfer because host performance is the priority. Two boring screenshots:




The next step is selecting the disks that will participate in replication. At the end of the article, when I get to general optimization, I will give a few tips, but for now it is worth remembering one detail: a disk that is not marked for replication will be completely absent on the receiving side, i.e. it is excluded from the virtual machine's configuration. If the machine cannot function without that disk but it holds nothing important (temporary files, for example), simply recreate the disk on the replica machine.


Then we go back to the planning stage once more and set the chosen replication interval. If, through some misunderstanding, you are still running Server 2012, you will not even be asked: it is fixed at 5 minutes. Over time Microsoft concluded that this was not entirely right, and in Server 2012 R2 they added a choice of 30 seconds, 5 minutes, or 15 minutes. Not exactly thrilling, but better than nothing.

And be very careful when choosing the 30-second interval: you will need a genuinely powerful host with a very fast network and very fast storage.


The next important step is specifying how many recovery points we want to keep; here we also specify how often VSS snapshots are created. In principle you can get by without them, but then nobody can guarantee data consistency, with all the ensuing consequences, especially for applications where it matters.
The example in the screenshot reads, in plain language, like this: create a recovery point every hour, keep them for 24 hours (the maximum), and take a VSS snapshot once every 4 hours. I agree it is not the most transparent and intuitive construct, but we work with what we have.
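
For reference, all of the wizard steps above collapse into a single PowerShell cmdlet run on the primary host. A sketch with hypothetical names (a VM called SRV-APP01, a replica server HV-REPLICA.contoso.local), mirroring the choices discussed so far: Kerberos over port 80, compression, a 5-minute interval, 24 recovery points, a VSS snapshot every 4 hours, and one disk excluded from replication:

    # ReplicationFrequencySec accepts 30, 300 or 900 seconds (Server 2012 R2 only).
    Enable-VMReplication -VMName "SRV-APP01" `
        -ReplicaServerName "HV-REPLICA.contoso.local" `
        -ReplicaServerPort 80 `
        -AuthenticationType Kerberos `
        -CompressionEnabled $true `
        -ReplicationFrequencySec 300 `
        -RecoveryHistory 24 `
        -VSSSnapshotFrequencyHour 4 `
        -ExcludedVhdPath "D:\VMs\SRV-APP01\Temp.vhdx"   # the disk we chose not to replicate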


Next comes a very useful item for those whose machines are very large or who simply cannot push that much data over the network. As we remember, on the first run the entire volume of the replicated machine has to be delivered to the replica host, and we have three options for how to do that:
  1. Send the initial copy over the network, either immediately or on a schedule (at night, for example).
  2. Export the initial copy to external media and import it on the replica side.
  3. Use an existing virtual machine on the replica server (a restored backup of it, for instance) as the initial copy.



Then we are shown a summary of all the settings we entered and asked to confirm our intent with the Finish button. We are told that everything went well, and offered the chance to change the network settings of the replica, since by default it is not connected to any network (I agree, a rather unexpected place for such an offer), but it seems to me that network questions are better explained with practical examples, which will come later. For now, let's move on to extended replication of Hyper-V machines.
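
The initial copy options have PowerShell counterparts too. A sketch with hypothetical names and paths, showing a network transfer deferred until night and the export-to-media variant:

    # Option 1: send the initial copy over the network, but start it at 23:00.
    Start-VMInitialReplication -VMName "SRV-APP01" -InitialReplicationStartTime (Get-Date "23:00")

    # Option 2: write the initial copy to removable media;
    # on the replica side it is then imported with Import-VMInitialReplication.
    Start-VMInitialReplication -VMName "SRV-APP01" -DestinationPath "E:\InitialCopy"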

Expanding the breadth of our depths


Like many other interesting features, extended replication of virtual machines appeared only in Windows Server 2012 R2. Extended replication lets you configure replication not just point-to-point but as whole chains: once replication from the primary server completes, the replica itself starts replicating (a replica of a replica — a tautology, I know) to a third host.

And if many people are not quite sure why replication is needed at all, the ability to build a replication chain will probably confuse even the most persistent of them. Still, here is a real, not invented, example. Suppose you have a reasonably large company with several server rooms in the same building, and you have configured replication every 30 seconds so that, if a sewage pipe bursts and floods a server room, you can quickly power on copies of your virtual machines with minimal data loss. It is an excellent scheme, but unfortunately it does not protect you in any way from the whole building losing power, or from an excavator biting through the optical links that feed it. In such cases you would very much like to have copies of the machines somewhere off-site, updated maybe not every 30 seconds, but at least every 15 minutes, so as not to fall flat on your face.

Here we should spell out the rules for extended replication of virtual machines:


The extended replication wizard is invoked, traditionally, by right-clicking the replicated machine and selecting Extend Replication. The rest of the configuration is exactly the same as for ordinary replication, so there is no point in covering it separately.
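
As far as I know, the PowerShell equivalent is to run Enable-VMReplication a second time, this time on the first replica server against the replica VM, pointing it at the third host. The names below are made up and the details are worth double-checking in a lab before trusting them:

    # Run on the FIRST replica server. The extended replication interval can only be 300 or 900 seconds.
    Enable-VMReplication -VMName "SRV-APP01" `
        -ReplicaServerName "HV-DR-SITE.contoso.local" `
        -ReplicaServerPort 80 `
        -AuthenticationType Kerberos `
        -ReplicationFrequencySec 900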


And so we have successfully configured, launched, and verified everything, so I suggest moving on to how things behave when disaster strikes, with a small stop along the way to talk about networks.

A little about networks


It is not known for certain whether this is excessive paranoia or not, but it is customary to connect all replicas to an isolated network that does not overlap with the production network. Often the administrator has no choice anyway, because the data center on the receiving side uses different subnets, and the replica has to get entirely different network settings.

And, as the screenshot below shows, Hyper-V lets us specify exact settings for each network adapter to be applied on emergency activation — which, by the way, is called failover, and which we will talk about right now.
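
For the record, those per-adapter failover settings can also be applied from PowerShell on the replica side; a small sketch with made-up addresses:

    # Run against the replica VM on the replica host.
    # The addresses are injected into the guest only when a failover actually happens.
    Set-VMNetworkAdapterFailoverConfiguration -VMName "SRV-APP01" `
        -IPv4Address "10.20.0.15" `
        -IPv4SubnetMask "255.255.255.0" `
        -IPv4DefaultGateway "10.20.0.1" `
        -IPv4PreferredDNSServer "10.20.0.10"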


That scary word, failover


I will begin by explaining the term failover, since no adequate translation into the language of Pushkin and Tolstoy has been invented yet. Failover is the process of correctly (read: in a controlled manner) starting, operating, and shutting down a replicated machine. An example of incorrect behavior: pressing the Start button on the machine from the host or cluster management console. In that case you get a guaranteed collapse of replication, followed by reconfiguring it from scratch, plus the whole set of amusing problems that come with having two identical machines in the same infrastructure.

So, failover comes in three flavors:
  1. Planned failover
  2. Test failover
  3. Emergency (unplanned) failover


Planned failover


Using a planned failover implies that you know about potential problems with the primary host in advance: for example, scheduled electrical work, a hurricane heading your way, the host needing to be shut down for maintenance, or workers deciding to dig dangerously close to your cable routes.

In this scenario there is a small window of downtime, equal to the time it takes to shut down the primary machine and boot the replica, but the fact that the switchover happens according to plan lets you pick the time most convenient for everyone.

The important point is that replication can then be continued in the reverse direction, i.e. all changes made on the replica side will be transferred back to the primary machine, which sits powered off. This lets you avoid data loss entirely.
So, how does a planned failover go:
  1. Shut down the primary virtual machine. This can only be done manually, to prevent accidental shutdowns; until the machine is fully powered off, the failover wizard will show a corresponding error.
  2. In the same place, on the primary host, right-click the stopped machine and select Planned Failover.
  3. By default, the Reverse the replication direction after failover checkbox, which enables reverse replication, is not selected; if you do not want to lose the data accumulated while the machine runs in failover mode, check it. An important note: the primary host must be allowed to accept replicas (as discussed at the very beginning), otherwise the data will simply not be accepted. (A PowerShell sketch of the whole sequence follows right after this list.)
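
Here is the same sequence as a PowerShell sketch (the VM name is hypothetical); the comments mark which host each command is run on:

    # --- On the PRIMARY host ---
    Stop-VM -VMName "SRV-APP01"                      # step 1: shut the machine down
    Start-VMFailover -VMName "SRV-APP01" -Prepare    # step 2: prepare the planned failover, sending the last changes

    # --- On the REPLICA host ---
    Start-VMFailover -VMName "SRV-APP01"             # finish the failover on the replica side
    Set-VMReplication -VMName "SRV-APP01" -Reverse   # step 3: reverse the replication direction
    Start-VM -VMName "SRV-APP01"                     # bring the former replica online for users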


We start the failover process and then check that the newly started machine is reachable over the network for users. The most frequent mistakes here are an incorrectly specified VLAN and a missing DNS record; the failover wizard checks neither, leaving both to the administrator's care.

The funny part is how you switch back: you have to repeat the failover, but this time from the second host's side, i.e. shut down the replica there and perform a planned failover of it. A more than strange decision, but it is what it is.

Test failover


This is a case where the name matches the functionality. Replicas, like backups, are something you want to test in order to sleep a little more soundly, and the best way to test a replica is to power it on. At first glance this may look like just another name for planned failover, but it is not.

When you perform a test failover on the replica side, a temporary machine is created on which you can run various checks: for example, telnet to a set of ports and, if they answer, be reasonably sure the services behind them started successfully. One caveat: by default, the virtual machine started by a test failover is not connected to any network, so the first step is to specify the failover network settings, reopen the wizard, and notice a new menu item:


Or a more interesting option: see how a business-critical application behaves after installing a new patch, not forgetting to attach the machine to a specially prepared, isolated network.

Naturally, a test failover is started on the replica side. The process completely mirrors a planned failover, with the only difference that once all the necessary checks are done, the test must be stopped. Otherwise the temporary machine will keep running until, sooner or later, it eats the whole disk.
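
In PowerShell a test failover looks roughly like this (hypothetical VM name), and the second command is the one people forget:

    # Run on the REPLICA host. Creates a temporary "SRV-APP01 - Test" machine from a recovery point.
    Start-VMFailover -VMName "SRV-APP01" -AsTest

    # ...run your checks against the test machine, then clean it up:
    Stop-VMFailover -VMName "SRV-APP01"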

Emergency failover


There is only one golden rule here: never run this failover unless it is truly necessary, i.e. unless there is a real emergency, use only the test and planned options. If you just want to see how it works, or to write documentation for your engineers, do all of it strictly in a test environment.

When you run this failover, the only option available is choosing the desired recovery point. After that, the machine will be started no matter what. Where the planned failover wizard would not let you shoot yourself in the foot by powering on two identical machines (it waits until the primary is completely off), here you only get a perfectly clear, yet easy to ignore, warning.


As a final barrier before the point of no return, you have to confirm completion of the failover with the Complete-VMFailover PowerShell cmdlet. All the extra recovery points are then deleted, and the failover process is logically complete.
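
For completeness, the emergency variant as a PowerShell sketch, run on the replica side and only when the primary is well and truly gone (names hypothetical):

    # Fail over to the latest recovery point...
    Start-VMFailover -VMName "SRV-APP01"

    # ...or inspect the available points first and pick a specific one:
    # $point = Get-VMSnapshot -VMName "SRV-APP01" -SnapshotType Replica | Select-Object -Last 1
    # Start-VMFailover -VMRecoverySnapshot $point

    Start-VM -VMName "SRV-APP01"

    # The final barrier: commit the failover, which also deletes the extra recovery points.
    Complete-VMFailover -VMName "SRV-APP01"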

Best Practice


Before turning to general advice, I want to touch on optimization tailored to a specific infrastructure. The only source of information from which far-reaching conclusions can be drawn is, of course, comprehensive monitoring. One can argue whether Operations Manager from the System Center suite is the best option for this or not, but since at the beginning we agreed not to consider additional software, let alone expensive software, we will skip that tool.

So, the first out-of-the-box tool, which greets us on every boot of Windows Server, hides behind the unassuming name Best Practices Analyzer (it sits at the very bottom of the Server Manager console).

By running BPA from time to time, you can get genuinely valuable recommendations about host settings, based on accumulated events, on performance monitoring of the various subsystems of your particular host, and on the knowledge Microsoft itself has accumulated.

For reasons unknown to me, the events for Hyper-V Replica were not placed in a separate subgroup: although they have their own unique IDs, they appear under the general Hyper-V label. The replica-related rules are numbered 37 through 54 inclusive.

Next in line is the Hyper-V Manager console itself. It is worth adding the extra Replication Health column to the standard machine list; as you might guess, it shows the current replication status.


And via the Replication menu you can bring up a very detailed report on the machine's replication state:
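
The same information is available from PowerShell, which is handy for quick checks or a scheduled report:

    # Replication state and health of every replicated VM on this host.
    Get-VMReplication

    # Detailed statistics for one machine: average transfer size, latency, errors, pending data, and so on.
    Measure-VMReplication -VMName "SRV-APP01"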




Now for the general tips:

Source: https://habr.com/ru/post/247779/

