Build Failover Systems Based on Exchange 2010

Good day!
This article describes my experience building a fault-tolerant mail service on Microsoft Exchange 2010 SP1.
It will be most useful to beginners who want to understand the theory.
I will not go into practical details; instead, I will try to lay out the theoretical basis you need when building a failover Exchange cluster.

So, if the task is to build a fault-tolerant mail system on Exchange 2010, the only acceptable option is to use a DAG (Database Availability Group).
In addition to the DAG, we will need several more servers to make the other required Exchange roles resilient. They are discussed below.

I needed to achieve the following goals:
• Fault tolerance for clients using Outlook on the internal network
• Partial fault tolerance for clients using Outlook via RPC over HTTPS from the Internet
• Partial fault tolerance for ActiveSync clients
• Minimize the chance of losing incoming (from the Internet) mail
Given these goals, I settled on the following scheme (one clarification: the link between the data center and the office is a symmetric 30 Mbit/s channel):

Let me remind you that in Exchange 2010 a DAG member does not have to be a dedicated Mailbox server and can host other roles as well, which greatly simplifies things compared with Exchange 2007.

The first step is to place the Edge Transport servers in the DMZ behind TMG on both sites.
In the DNS zone for my domain I create three MX records with different priorities: the first points to the TMG in the data center, the second to the gateway in the office, and the third to the backup channel in the office, which also terminates on TMG and is configured through ISP Redundancy (failover). This covers the fourth goal in my list: minimizing the chance of losing incoming mail.

If the data center loses its Internet connection or the server there goes down, all incoming mail arrives at the server in the office. Mail can still fail to arrive only if the Internet connection is lost simultaneously in the data center and in the office (on both channels!) for more than 30 minutes.
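To make the MX fallback concrete, here is a minimal Python sketch of how a sending server walks the MX records in order of preference. The host names and preference values are hypothetical, and I assume the data center record has the highest priority; real senders also queue and retry on their own schedule, which is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MxRecord:
    preference: int  # lower value is tried first
    host: str


def pick_delivery_host(records, reachable):
    """Return the first reachable host in ascending preference order, or None."""
    for record in sorted(records, key=lambda r: r.preference):
        if record.host in reachable:
            return record.host
    return None  # nothing reachable: the sender keeps the message queued


# Hypothetical names and preference values, for illustration only.
MX_RECORDS = [
    MxRecord(10, "mx-dc.example.com"),             # TMG in the data center
    MxRecord(20, "mx-office.example.com"),         # main channel in the office
    MxRecord(30, "mx-office-backup.example.com"),  # backup channel in the office
]
```

With the data center unreachable, `pick_delivery_host(MX_RECORDS, {"mx-office.example.com"})` returns the office host; with nothing reachable it returns `None`, which corresponds to mail waiting in the sender's queue.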

The next step is to provide fault tolerance for clients using Outlook 2010 on the internal network. This requires making both the Mailbox and Client Access roles fault tolerant.
On the two Exchange servers with the Mailbox role, located on different sites, two databases are created: one is active on the first site, the other on the second.

What this gives us:
1) Load distribution while everything is working normally.
2) If the server in the data center fails or becomes unavailable, we have a copy of the database in the office, which automatically becomes active. Users can continue to work.
3) The ability to perform server maintenance by switching the active database copies over on the fly.
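The layout with two databases, each active on a different site, can be sketched roughly as follows. This is a simplified Python model, not how Exchange itself decides: the server and database names are hypothetical, and the real activation logic also takes copy health and quorum into account.

```python
def active_copies(databases, online_servers):
    """Map each database to the server that holds its active copy.

    `databases` maps a database name to its copy holders in activation
    order (preferred server first). A database with no online copy
    holder maps to None, meaning it cannot be mounted.
    """
    result = {}
    for name, holders in databases.items():
        result[name] = next((s for s in holders if s in online_servers), None)
    return result


# Hypothetical names: one Mailbox server per site, one database preferred on each.
DATABASES = {
    "DB1": ["MBX-DC", "MBX-OFFICE"],
    "DB2": ["MBX-OFFICE", "MBX-DC"],
}
```

In normal operation each server carries one active database, which is the load distribution from point 1. If `MBX-DC` drops out of `online_servers`, both databases map to `MBX-OFFICE`, which is the failover from point 2.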

When creating a DAG from the GUI, you need to specify at least one witness server (File Share Witness). Put very roughly, it is a server with a shared folder that provides an additional vote in the quorum if one of the Mailbox servers fails. It is needed so that, when servers that were unavailable for a while come back online, they can synchronize (that is, replicate) changes from the other cluster members without confusion about which data they already have and which they lack.
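The role of the extra vote can be illustrated with a simple-majority calculation. This is a sketch under my own simplifying assumptions: a single witness vote that counts only when the number of DAG members is even. It is not a description of the actual Windows failover clustering implementation.

```python
def has_quorum(total_members, reachable_members, witness_reachable):
    """Return True if the reachable voters form a strict majority.

    Assumption: with an even number of members the witness adds one
    vote; with an odd number the members alone decide.
    """
    witness_counts = total_members % 2 == 0
    total_votes = total_members + (1 if witness_counts else 0)
    votes = reachable_members
    if witness_counts and witness_reachable:
        votes += 1
    return votes > total_votes // 2
```

For a two-member DAG, `has_quorum(2, 1, True)` is `True`: the surviving Mailbox server plus the witness hold two of three votes. Without the witness, `has_quorum(2, 1, False)` is `False`, so a single surviving server cannot tell whether it or its partner has dropped off the network.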

I created one witness server on each site to maintain the quorum, because if the link to one of the sites goes down, both the Mailbox server and the witness server on that site become unreachable for the remaining Exchange servers. However, the task of making Outlook clients resilient is not finished yet.

At this point a question arises: "But doesn't every Outlook client have a specific Client Access server configured? And if that server goes offline, where will the clients connect?"

Network Load Balancing (NLB) exists to prevent exactly this situation. I used the software NLB available in Windows Server 2008 R2, although Microsoft strongly recommends hardware load balancers.
To save resources, I used the witness servers as the NLB servers.

On NLB, I configured load balancing between two Exchange servers with the Client Access role.
Note that these Client Access servers are not the Mailbox servers! I strongly recommend placing these roles on separate physical or virtual servers.
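As a rough illustration of what the balancer does for clients, the Python sketch below maps each client IP to one of the Client Access servers that are currently alive. The server names are hypothetical, and the modulo rule is my own stand-in for illustration; it does not reproduce the algorithm Windows NLB actually uses.

```python
import ipaddress


def pick_cas(client_ip, cas_servers, offline=frozenset()):
    """Map a client to one live Client Access server.

    The same client IP always lands on the same server as long as the
    set of live servers does not change. Returns None if none is alive.
    """
    live = [s for s in cas_servers if s not in offline]
    if not live:
        return None
    return live[int(ipaddress.ip_address(client_ip)) % len(live)]


# Hypothetical server names, for illustration only.
CAS_SERVERS = ["CAS-DC", "CAS-OFFICE"]
```

When `CAS-DC` is passed in `offline`, every client is mapped to `CAS-OFFICE`, so Outlook keeps a working endpoint without any change on the client side.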

Now, for Outlook clients, the server specified when setting up an account is the NLB server. To provide even greater resiliency (in case one of the NLB servers fails), I created two DNS A records in the domain with the same name (for example, exchange.contoso.com), each pointing to a different NLB server.
Thanks to DNS round robin, if one of the NLB servers fails, about 50% of the clients will keep working. The DNS cache also plays a part here, of course, but you can read about it and about round robin separately in the Microsoft knowledge base.
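The "about 50%" figure follows directly from how round robin hands out addresses. Here is a small Python sketch with hypothetical IP addresses; it assumes each client uses the single address it was handed and does not retry the other one, and it ignores DNS caching entirely.

```python
from itertools import cycle


def assign_clients(addresses, client_count):
    """Hand out addresses in rotation, roughly as round-robin DNS does."""
    rotation = cycle(addresses)
    return [next(rotation) for _ in range(client_count)]


def share_still_working(addresses, failed, client_count):
    """Fraction of clients whose assigned address is not in `failed`."""
    assigned = assign_clients(addresses, client_count)
    working = sum(1 for address in assigned if address not in failed)
    return working / client_count


# Hypothetical addresses of the two NLB servers behind one DNS name.
NLB_ADDRESSES = ["10.0.1.10", "10.0.2.10"]
```

With two records and one failed server, `share_still_working(NLB_ADDRESSES, {"10.0.1.10"}, 1000)` gives `0.5`: half of the clients were handed the dead address. This is why round robin gives only partial fault tolerance.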

This completes the first goal in my list: fault tolerance for Outlook clients on the internal network.

A small addition: on NLB I configured the VIP addresses so that the entire internal subnet ALWAYS works with the Exchange server in the office (unless that server is unavailable), which reduces Internet traffic and improves speed. Clients switch to the other Client Access server only if one of the servers fails. (This rule also applies to mobile devices connected to the office Wi-Fi, because their DNS queries return IP addresses from the local network.) This mattered to me because I have no clients on the second site.

Two goals remain, both concerning clients outside the local network. This part is simple: in the public DNS zone for my domain I again create A records (two this time, to save traffic on the backup channel in the office) that point to the external IP address of Forefront TMG in the data center and to the external IP address in the office. Autodiscover services are active on both sites.
Again, thanks to DNS round robin, this provides partial resiliency for ActiveSync and RPC over HTTPS clients.

P.S. I welcome your questions and constructive criticism.

Source: https://habr.com/ru/post/140687/
