
How I stopped worrying and loved Hyper-V replication

It may sound strange, but in the first working days after the New Year holidays, once everything that fell over during the break has been brought back to life, many of us feel the urge to sort out the information in our heads and put it into some systematic form. A good catalyst for this is realizing that you seem to carry a whole baggage of knowledge, yet you cannot explain it in simple words to a stranger off the street or to a six-year-old. As folk wisdom has it: if you can't explain it to a child, you don't really know it. In any case, a little defragmentation of information has never hurt anyone.
But this is not an applied psychology course, so today I will simply lay out, in a systematic arrangement of pixels, as much useful information as I can about the virtual machine replication feature in Hyper-V, using the current version, Windows Server 2012 R2, as the example.

So, here is what I am going to spend roughly an hour of your time on:


Act One. Overview.


The term "virtual machine replication" means exactly what "replication" usually means in IT: a copy of a VM from the primary host is created and kept up to date on another host.

Let's agree right away: replication is not backup! Just as snapshots are not backup, RAID is not backup, and in general nothing is backup except backup; for if grandpa were grandma...
But just in case, let me explain why it is "not backup": if the primary machine fails, you can always power on the replica with almost no delay, but if the failure was caused not by a momentary glitch but by a pile-up of accumulated problems at the OS or application level, they will all be faithfully reproduced on the replica, and nothing good will come of it. There are countless cases where a replicated VM, once powered on, runs for a few minutes and then dies after its parent with exactly the same symptoms.
So replication is a great tool for extending your disaster recovery plan, letting you bring all your services back into fighting shape with minimal delay, but you cannot shift all the responsibility onto it: nothing is perfect, and everything has its downsides.
As a process, replication of a Hyper-V virtual machine can be implemented in three ways:


As stated at the very beginning, we will consider only the first item: replication of Hyper-V machines using the built-in tools of Windows Server 2012 R2. I say R2 specifically because there is a functional gap between the first release and the second, and running the non-R2 version of the hypervisor in production is, by now, simply bad form.

So, what does Microsoft offer us out of the box:


A checklist, before it's too late




It is only fair to also mention a tool from Microsoft that lets you estimate, with a certain margin of error, the resources needed to replicate a particular virtual machine. It is called Capacity Planner for Hyper-V Replica. Of course, it will not give you exact figures for IOPS, network load, or CPU, but as an estimation tool it is quite decent and lets you analyze your infrastructure in advance.

When you run it, you are asked to specify the primary server, the replica server, the machines to be measured, and the measurement period. I recommend raising the default 30 minutes to at least an hour, and of course the best time to run it is at the height of the working day. The collected data is also great for scaring management into approving money for new toys... I mean, hardware.

Act Two. Setup.


And now the crucial moment has arrived! The certificates are in place, the network is configured, the Hyper-V role is running everywhere, the management tools have not been forgotten, and we can proceed.
The first thing to do is allow our host to act as a replica server and take machines on board. This is done through the standard Hyper-V settings window:

All the settings are self-explanatory, but I want to dwell a little on the bottom section, Authorization and storage. It is not critical, but I strongly recommend allowing replication only from specific hosts or groups of hosts. It does not happen often, but replication does get started by mistake or out of ignorance, and you are lucky if it lands on a spare host; it can just as easily fill up your production storage, with all the ensuing entertainment. Allowing replication from anywhere is best left to lab environments for testers and developers. Or to the truly brave =)
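
For those who prefer the console to the settings window, the same configuration can be expressed in PowerShell roughly as follows. This is a minimal sketch on my part, assuming Kerberos authentication over HTTP, a hypothetical primary host named HV-PROD01.contoso.local, and a storage path of D:\Hyper-V\Replicas:

    # Run on the host that will receive replicas: enable it as a replica server
    # and allow replication only from explicitly authorized primary hosts.
    Set-VMReplicationServer -ReplicationEnabled $true `
        -AllowedAuthenticationType Kerberos `
        -ReplicationAllowedFromAnyServer $false

    # Authorize a specific primary host (repeat per host, or use a wildcard such as *.contoso.local).
    New-VMReplicationAuthorizationEntry -AllowedPrimaryServer "HV-PROD01.contoso.local" `
        -ReplicaStorageLocation "D:\Hyper-V\Replicas" `
        -TrustGroup "Production"

    # Kerberos authentication uses the HTTP listener on port 80; let it through the firewall.
    Enable-NetFirewallRule -DisplayName "Hyper-V Replica HTTP Listener (TCP-In)"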

Calling in the broker


Since we agreed at the very beginning that the infrastructure is all grown up (that is, a cluster is set up and running happily), we need to add the Hyper-V Replica Broker role. If you do not have a cluster, feel free to skip this section.

The setup procedure is simple and consists of five Next buttons and one Finish. There is nothing to explain here: we open the cluster management console, select Configure Role, and go through the wizard, not forgetting to give the broker a NetBIOS-compatible name and an IP address.

A small hint for those who read the documentation first and only then act (even though real engineers never do that): everything described in the previous section can also be done directly from the broker, with the only difference that the settings are applied to the entire cluster at once, so you do not have to enable replication by hand on each server. As you can see, everything looks exactly the same:
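
If you would rather script the broker as well, the Failover Clustering cmdlets can do it along roughly these lines. The role name and IP address are made up, and the exact resource wiring is my assumption, so treat this as a sketch to verify in a lab:

    # Run on any cluster node: create the client access point for the broker,
    # add the replication broker resource, tie it to the role, and bring it online.
    Import-Module FailoverClusters

    Add-ClusterServerRole -Name "HVR-Broker" -StaticAddress 192.168.10.15
    Add-ClusterResource -Name "Virtual Machine Replication Broker" `
        -Type "Virtual Machine Replication Broker" `
        -Group "HVR-Broker"
    Add-ClusterResourceDependency "Virtual Machine Replication Broker" "HVR-Broker"
    Start-ClusterGroup "HVR-Broker"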


And a short explanation of the broker's role in the replication process: for machines that are not part of a highly available cluster, the broker is not involved at all. But for clustered machines, it takes full control of everything related to replication and clustering, preventing the cluster from making a wrong decision about machine availability. Hence the golden rule: from now on, perform all actions only through the Failover Cluster Manager console, otherwise you risk being left without a cluster. Even if a meteorite lands on the production host, the worst thing you can do in that situation is power on the replica machine through Hyper-V Manager.

The first one goes in


Now we are finally ready to replicate our very first machine. As with everything in Windows, we do it with the right mouse button:


Next, a fairly standard settings wizard opens. In the first steps it asks for the name of the server the machine will be replicated to and for the connection settings. More precisely, if the hosts are in the same domain, everything is filled in without our participation, but if the servers are not acquainted and you also need to encrypt the traffic, you will have to specify all the parameters manually. The only checkbox worth noting at this step is "compress the data that is transmitted over the network". Here we go back to the planning stage and decide what matters more to us: compressing the data to finish the transfer sooner (which inevitably puts extra load on the hosts), or not caring about the size and duration of the transfer because host performance is the priority. Two boring screenshots:




The next step is selecting the disks that will participate in replication. At the end of the article, when I get to general optimization, I will give a few tips, but for now it is worth remembering one detail: a disk that is not marked for replication will be completely absent on the receiving side, i.e. it is excluded from the virtual machine's configuration. If the machine cannot function without that disk but it holds nothing important (temporary files, for example), simply recreate the disk on the replica machine.


Then we go back to the planning stage once more and set the chosen replication interval. If, through some misunderstanding, you are still running Server 2012, you will not even be asked: it is fixed at 5 minutes. Over time Microsoft concluded that this was not entirely right, and in Server 2012 R2 they added a choice of 30 seconds, 5 minutes, or 15 minutes. Not exactly thrilling, but better than nothing.

And be very careful when choosing the 30-second interval: you will need a genuinely powerful host with a very fast network and very fast storage.


The next important step is specifying how many recovery points we want to keep; here we also specify how often VSS snapshots are created. In principle you can get by without them, but then nobody can guarantee data consistency, with all the ensuing consequences, especially for applications where it matters.
The example in the screenshot reads, in plain language, like this: create a recovery point every hour, keep them for 24 hours (the maximum), and take a VSS snapshot once every 4 hours. I agree it is not the most transparent and intuitive construct, but we work with what we have.
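
For reference, all of the wizard steps above collapse into a single PowerShell cmdlet run on the primary host. A sketch with hypothetical names (a VM called SRV-APP01, a replica server HV-REPLICA.contoso.local), mirroring the choices discussed so far: Kerberos over port 80, compression, a 5-minute interval, 24 recovery points, a VSS snapshot every 4 hours, and one disk excluded from replication:

    # ReplicationFrequencySec accepts 30, 300 or 900 seconds (Server 2012 R2 only).
    Enable-VMReplication -VMName "SRV-APP01" `
        -ReplicaServerName "HV-REPLICA.contoso.local" `
        -ReplicaServerPort 80 `
        -AuthenticationType Kerberos `
        -CompressionEnabled $true `
        -ReplicationFrequencySec 300 `
        -RecoveryHistory 24 `
        -VSSSnapshotFrequencyHour 4 `
        -ExcludedVhdPath "D:\VMs\SRV-APP01\Temp.vhdx"   # the disk we chose not to replicate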


Next comes a very useful item for those whose machines are very large or who simply cannot push that much data over the network. As we remember, on the first run the entire volume of the replicated machine has to be delivered to the replica host, and we have three options for how to do that:
  1. Send the initial copy over the network, either immediately or on a schedule (at night, for example).
  2. Export the initial copy to external media and import it on the replica side.
  3. Use an existing virtual machine on the replica server (a restored backup of it, for instance) as the initial copy.



Then we are shown a summary of all the settings we entered and asked to confirm our intent with the Finish button. We are told that everything went well, and offered the chance to change the network settings of the replica, since by default it is not connected to any network (I agree, a rather unexpected place for such an offer), but it seems to me that network questions are better explained with practical examples, which will come later. For now, let's move on to extended replication of Hyper-V machines.
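
The initial copy options have PowerShell counterparts too. A sketch with hypothetical names and paths, showing a network transfer deferred until night and the export-to-media variant:

    # Option 1: send the initial copy over the network, but start it at 23:00.
    Start-VMInitialReplication -VMName "SRV-APP01" -InitialReplicationStartTime (Get-Date "23:00")

    # Option 2: write the initial copy to removable media;
    # on the replica side it is then imported with Import-VMInitialReplication.
    Start-VMInitialReplication -VMName "SRV-APP01" -DestinationPath "E:\InitialCopy"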

Expanding the breadth of our depths


Like many other interesting features, extended replication of virtual machines appeared only in Windows Server 2012 R2. Extended replication lets you configure replication not just point-to-point but as whole chains: once replication from the primary server completes, the replica itself starts replicating (a replica of a replica — a tautology, I know) to a third host.

And if many people are not quite sure why replication is needed at all, the ability to build a replication chain will probably confuse even the most persistent of them. Still, here is a real, not invented, example. Suppose you have a reasonably large company with several server rooms in the same building, and you have configured replication every 30 seconds so that, if a sewage pipe bursts and floods a server room, you can quickly power on copies of your virtual machines with minimal data loss. It is an excellent scheme, but unfortunately it does not protect you in any way from the whole building losing power, or from an excavator biting through the optical links that feed it. In such cases you would very much like to have copies of the machines somewhere off-site, updated maybe not every 30 seconds, but at least every 15 minutes, so as not to fall flat on your face.

Here we should spell out the rules for extended replication of virtual machines:


The extended replication wizard is invoked, traditionally, by right-clicking the replicated machine and selecting Extend Replication. The rest of the configuration is exactly the same as for ordinary replication, so there is no point in covering it separately.
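
As far as I know, the PowerShell equivalent is to run Enable-VMReplication a second time, this time on the first replica server against the replica VM, pointing it at the third host. The names below are made up and the details are worth double-checking in a lab before trusting them:

    # Run on the FIRST replica server. The extended replication interval can only be 300 or 900 seconds.
    Enable-VMReplication -VMName "SRV-APP01" `
        -ReplicaServerName "HV-DR-SITE.contoso.local" `
        -ReplicaServerPort 80 `
        -AuthenticationType Kerberos `
        -ReplicationFrequencySec 900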


And so we have successfully configured, launched, and verified everything, so I suggest moving on to how things behave when disaster strikes, with a small stop along the way to talk about networks.

A little about networks


It is not known for certain whether this is excessive paranoia or not, but it is customary to connect all replicas to an isolated network that does not overlap with the production network. Often the administrator has no choice anyway, because the data center on the receiving side uses different subnets, and the replica has to get entirely different network settings.

And, as the screenshot below shows, Hyper-V lets us specify exact settings for each network adapter to be applied on emergency activation — which, by the way, is called failover, and which we will talk about right now.
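
For the record, those per-adapter failover settings can also be applied from PowerShell on the replica side; a small sketch with made-up addresses:

    # Run against the replica VM on the replica host.
    # The addresses are injected into the guest only when a failover actually happens.
    Set-VMNetworkAdapterFailoverConfiguration -VMName "SRV-APP01" `
        -IPv4Address "10.20.0.15" `
        -IPv4SubnetMask "255.255.255.0" `
        -IPv4DefaultGateway "10.20.0.1" `
        -IPv4PreferredDNSServer "10.20.0.10"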


That scary word, failover


I will begin by explaining the term failover, since no adequate translation into the language of Pushkin and Tolstoy has been invented yet. Failover is the process of correctly (read: in a controlled manner) starting, operating, and shutting down a replicated machine. An example of incorrect behavior: pressing the Start button on the machine from the host or cluster management console. In that case you get a guaranteed collapse of replication, followed by reconfiguring it from scratch, plus the whole set of amusing problems that come with having two identical machines in the same infrastructure.

So, failover comes in three flavors:
  1. Planned failover
  2. Test failover
  3. Emergency (unplanned) failover


Planned failover


Using a planned failover implies that you know about potential problems with the primary host in advance: for example, scheduled electrical work, a hurricane heading your way, the host needing to be shut down for maintenance, or workers deciding to dig dangerously close to your cable routes.

In this scenario there is a small window of downtime, equal to the time it takes to shut down the primary machine and boot the replica, but the fact that the switchover happens according to plan lets you pick the time most convenient for everyone.

The important point is that replication can then be continued in the reverse direction, i.e. all changes made on the replica side will be transferred back to the primary machine, which sits powered off. This lets you avoid data loss entirely.
So, how does a planned failover go:
  1. Shut down the primary virtual machine. This can only be done manually, to prevent accidental shutdowns; until the machine is fully powered off, the failover wizard will show a corresponding error.
  2. In the same place, on the primary host, right-click the stopped machine and select Planned Failover.
  3. By default, the Reverse the replication direction after failover checkbox, which enables reverse replication, is not selected; if you do not want to lose the data accumulated while the machine runs in failover mode, check it. An important note: the primary host must be allowed to accept replicas (as discussed at the very beginning), otherwise the data will simply not be accepted. (A PowerShell sketch of the whole sequence follows right after this list.)
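
Here is the same sequence as a PowerShell sketch (the VM name is hypothetical); the comments mark which host each command is run on:

    # --- On the PRIMARY host ---
    Stop-VM -VMName "SRV-APP01"                      # step 1: shut the machine down
    Start-VMFailover -VMName "SRV-APP01" -Prepare    # step 2: prepare the planned failover, sending the last changes

    # --- On the REPLICA host ---
    Start-VMFailover -VMName "SRV-APP01"             # finish the failover on the replica side
    Set-VMReplication -VMName "SRV-APP01" -Reverse   # step 3: reverse the replication direction
    Start-VM -VMName "SRV-APP01"                     # bring the former replica online for users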


We start the failover process and then check that the newly started machine is reachable over the network for users. The most frequent mistakes here are an incorrectly specified VLAN and a missing DNS record; the failover wizard checks neither, leaving both to the administrator's care.

The funny part is how you switch back: you have to repeat the failover, but this time from the second host's side, i.e. shut down the replica there and perform a planned failover of it. A more than strange decision, but it is what it is.

Test failover


This is a case where the name matches the functionality. Replicas, like backups, are something you want to test in order to sleep a little more soundly, and the best way to test a replica is to power it on. At first glance this may look like just another name for planned failover, but it is not.

When you perform a test failover on the replica side, a temporary machine is created on which you can run various checks: for example, telnet to a set of ports and, if they answer, be reasonably sure the services behind them started successfully. One caveat: by default, the virtual machine started by a test failover is not connected to any network, so the first step is to specify the failover network settings, reopen the wizard, and notice a new menu item:


Or a more interesting option: see how a business-critical application behaves after installing a new patch, not forgetting to attach the machine to a specially prepared, isolated network.

Naturally, a test failover is started on the replica side. The process completely mirrors a planned failover, with the only difference that once all the necessary checks are done, the test must be stopped. Otherwise the temporary machine will keep running until, sooner or later, it eats the whole disk.
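
In PowerShell a test failover looks roughly like this (hypothetical VM name), and the second command is the one people forget:

    # Run on the REPLICA host. Creates a temporary "SRV-APP01 - Test" machine from a recovery point.
    Start-VMFailover -VMName "SRV-APP01" -AsTest

    # ...run your checks against the test machine, then clean it up:
    Stop-VMFailover -VMName "SRV-APP01"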

Emergency failover


There is only one golden rule here: never run this failover unless it is truly necessary, i.e. unless there is a real emergency, use only the test and planned options. If you just want to see how it works, or to write documentation for your engineers, do all of it strictly in a test environment.

When you run this failover, the only option available is choosing the desired recovery point. After that, the machine will be started no matter what. Where the planned failover wizard would not let you shoot yourself in the foot by powering on two identical machines (it waits until the primary is completely off), here you only get a perfectly clear, yet easy to ignore, warning.


As a final barrier before the point of no return, you have to confirm completion of the failover with the Complete-VMFailover PowerShell cmdlet. All the extra recovery points are then deleted, and the failover process is logically complete.
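
For completeness, the emergency variant as a PowerShell sketch, run on the replica side and only when the primary is well and truly gone (names hypothetical):

    # Fail over to the latest recovery point...
    Start-VMFailover -VMName "SRV-APP01"

    # ...or inspect the available points first and pick a specific one:
    # $point = Get-VMSnapshot -VMName "SRV-APP01" -SnapshotType Replica | Select-Object -Last 1
    # Start-VMFailover -VMRecoverySnapshot $point

    Start-VM -VMName "SRV-APP01"

    # The final barrier: commit the failover, which also deletes the extra recovery points.
    Complete-VMFailover -VMName "SRV-APP01"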

Best Practice


Before turning to general advice, I want to touch on optimization tailored to a specific infrastructure. The only source of information from which far-reaching conclusions can be drawn is, of course, comprehensive monitoring. One can argue whether Operations Manager from the System Center suite is the best option for this or not, but since at the beginning we agreed not to consider additional software, let alone expensive software, we will skip that tool.

So, the first out-of-the-box tool, which greets us on every boot of Windows Server, hides behind the unassuming name Best Practices Analyzer (it sits at the very bottom of the Server Manager console).

By running BPA from time to time, you can get genuinely valuable recommendations about host settings, based on accumulated events, on performance monitoring of the various subsystems of your particular host, and on the knowledge Microsoft itself has accumulated.

For reasons unknown to me, the events for Hyper-V Replica were not placed in a separate subgroup: although they have their own unique IDs, they appear under the general Hyper-V label. The replica-related rules are numbered 37 through 54 inclusive.

Next in line is the Hyper-V Manager console itself. It is worth adding the extra Replication Health column to the standard machine list; as you might guess, it shows the current replication status.


And via the Replication menu you can bring up a very detailed report on the machine's replication state:
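
The same information is available from PowerShell, which is handy for quick checks or a scheduled report:

    # Replication state and health of every replicated VM on this host.
    Get-VMReplication

    # Detailed statistics for one machine: average transfer size, latency, errors, pending data, and so on.
    Measure-VMReplication -VMName "SRV-APP01"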




Now for the general tips:

Source: https://habr.com/ru/post/247779/

