
Building a Failover Solution Based on Oracle RAC and AccelStor Shared-Nothing Architecture

A considerable number of enterprise applications and virtualization systems have their own mechanisms for building fault-tolerant solutions. In particular, Oracle RAC (Oracle Real Application Cluster) is a cluster of two or more Oracle database servers that work together to balance load and provide fault tolerance at the server/application level. Working in this mode requires shared storage, a role usually played by a storage system.


As we already discussed in one of our earlier articles, a storage system itself, despite having duplicated components (including controllers), still has points of failure, chiefly in the form of a single data set. Therefore, to build an Oracle solution with increased reliability requirements, the "N servers - one storage system" scheme needs to be complicated.




First of all, we need to decide which risks we are trying to insure against. In this article, we will not consider protection against threats like "a meteorite hit the building"; building a geographically dispersed disaster recovery solution remains a topic for one of the following articles. Here we look at the so-called Cross-Rack disaster recovery solution, where protection is built at the level of server cabinets. The cabinets themselves can be located in the same room or in different rooms, but usually within the same building.


These cabinets must contain the full set of equipment and software that keeps the Oracle databases running regardless of the state of their "neighbor". In other words, with a Cross-Rack disaster recovery solution we eliminate the risk of failure of:

  • Oracle servers
  • network switches
  • storage systems
  • an entire cabinet with all of the above equipment

Duplication of the Oracle servers follows from the very principle of Oracle RAC operation and is implemented at the application level. Duplicating the network switching is not a problem either. But duplicating the storage is not so simple.


The simplest option is to replicate data from the primary storage system to a backup one, synchronously or asynchronously, depending on the storage capabilities. With asynchronous replication, the question of ensuring data consistency with respect to Oracle immediately arises. And even if there is software integration with the application, a failure of the primary storage system will still require manual intervention by administrators to switch the cluster over to the backup storage.


A more complex option is a software and/or hardware storage "virtualizer", which eliminates the consistency problems and the need for manual intervention. But the complexity of deployment and subsequent administration, as well as the frankly indecent cost of such solutions, scares many away.


For scenarios like Cross-Rack disaster recovery, the AccelStor NeoSapphire™ H710 All Flash array with its Shared-Nothing architecture is a great fit. This model is a dual-node storage system that uses its proprietary FlexiRemap® technology for working with flash drives. Thanks to FlexiRemap®, the NeoSapphire™ H710 can deliver up to 600K IOPS @ 4K random write and 1M+ IOPS @ 4K random read, which is unattainable with classic RAID-based storage.


But the main feature of the NeoSapphire™ H710 is that its two nodes are built as separate enclosures, each holding its own copy of the data. The nodes are synchronized through an external InfiniBand interface. Thanks to this architecture, the nodes can be placed in different locations up to 100m apart, which is exactly what a Cross-Rack disaster recovery solution needs. Both nodes operate fully synchronously. From the hosts' point of view, the H710 looks like an ordinary dual-controller storage system, so no additional software or hardware options, and no particularly complex settings, are required.


If we compare all of the Cross-Rack disaster recovery approaches described above, the AccelStor variant stands out noticeably from the rest:


|                               | AccelStor NeoSapphire™ Shared-Nothing architecture | Software or hardware storage "virtualizer" | Replication-based solution |
| Availability                  |             |             |             |
| Server failure                | No downtime | No downtime | No downtime |
| Switch failure                | No downtime | No downtime | No downtime |
| Storage failure               | No downtime | No downtime | Downtime    |
| Failure of the entire cabinet | No downtime | No downtime | Downtime    |
| Cost and complexity           |             |             |             |
| Cost of solution              | Low*        | High        | High        |
| Deployment complexity         | Low         | High        | High        |

* AccelStor NeoSapphire™ is still an All Flash array, which by definition does not cost "three kopecks", especially since it carries a double reserve of capacity. However, when the final cost of a solution based on it is compared with similar offerings from other vendors, it can be considered low.


The connection topology of the application servers and the nodes of the All Flash array is as follows:



When planning the topology, it is also highly recommended to duplicate the management switches and the server interconnects.


From here on we will talk about connecting via Fibre Channel. If iSCSI is used, everything is the same, adjusted for the types of switches used and slightly different array settings.


Preparatory work on the array


Hardware and software used

Server and Switch Specifications


| Components                                              | Description                                       |
| Oracle Database 11g servers                              | Two                                               |
| Server operating system                                  | Oracle Linux                                      |
| Oracle database version                                  | 11g (RAC)                                         |
| Processors per server                                    | Two 16-core Intel® Xeon® CPU E5-2667 v2 @ 3.30GHz |
| Physical memory per server                               | 128GB                                             |
| FC network                                               | 16Gb/s FC with multipathing                       |
| FC HBA                                                   | Emulex LPe-16002B                                 |
| Dedicated public 1GbE ports for cluster management       | Intel Ethernet adapter RJ45                       |
| 16Gb/s FC switch                                         | Brocade 6505                                      |
| Dedicated private 10GbE ports for data synchronization   | Intel X520                                        |

AccelStor NeoSapphire™ All Flash Array Specification


| Components                | Description                                          |
| Storage system            | NeoSapphire™ high availability model: H710           |
| Image version             | 4.0.1                                                |
| Total number of drives    | 48                                                   |
| Drive size                | 1.92TB                                               |
| Drive type                | SSD                                                  |
| FC target ports           | 16x 16Gb ports (8 per node)                          |
| Management ports          | 1GbE Ethernet cable                                  |
| Heartbeat port            | 1GbE Ethernet cable connecting the two storage nodes |
| Data synchronization port | 56Gb/s InfiniBand cable                              |


Before you start using the array, you must initialize it. By default, the management address of both nodes is the same (192.168.1.1). You need to connect to them one at a time, set new (and different) management addresses, and configure time synchronization, after which the Management ports can be connected to a single network. The nodes are then combined into an HA pair by assigning subnets to the Interlink connections.



After the initialization is complete, you can manage the array from any node.


Next, create the necessary volumes and publish them for application servers.



It is highly recommended to create multiple volumes for Oracle ASM, as this increases the number of targets available to the servers, which ultimately improves overall performance (for more about queues, see another article).


Test configuration
| Storage Volume Name | Volume Size |
| Data01              | 200GB       |
| Data02              | 200GB       |
| Data03              | 200GB       |
| Data04              | 200GB       |
| Data05              | 200GB       |
| Data06              | 200GB       |
| Data07              | 200GB       |
| Data08              | 200GB       |
| Data09              | 200GB       |
| Data10              | 200GB       |
| Grid01              | 1GB         |
| Grid02              | 1GB         |
| Grid03              | 1GB         |
| Grid04              | 1GB         |
| Grid05              | 1GB         |
| Grid06              | 1GB         |
| Redo01              | 100GB       |
| Redo02              | 100GB       |
| Redo03              | 100GB       |
| Redo04              | 100GB       |
| Redo05              | 100GB       |
| Redo06              | 100GB       |
| Redo07              | 100GB       |
| Redo08              | 100GB       |
| Redo09              | 100GB       |
| Redo10              | 100GB       |


A few explanations about the array's operating modes and what happens in emergency situations



The data set of each node has a "version number" parameter. After the initial initialization it is the same on both nodes and equal to 1. If for some reason the version numbers differ, data is always synchronized from the node with the higher version number to the node with the lower one, after which the numbers are aligned, i.e. the copies become identical again. The versions can diverge, for example, when one of the nodes goes offline because of a planned reboot, a failure, or a loss of connectivity between the nodes.

In any case, the node that remains online increases its version number by one, so that its data set will be synchronized to the other node once the connection with it is restored.


If the Heartbeat Ethernet link is broken, the heartbeat temporarily switches to the InfiniBand link and switches back within 10 seconds once the Ethernet link is restored.


Host configuration


To ensure fault tolerance and increase performance, you need to enable MPIO support for the array. To do this, add the following lines to the /etc/multipath.conf file and then restart the multipath service:


devices {
    device {
        vendor                "AStor"
        path_grouping_policy  "group_by_prio"
        path_selector         "queue-length 0"
        path_checker          "tur"
        features              "0"
        hardware_handler      "0"
        prio                  "const"
        failback              immediate
        fast_io_fail_tmo      5
        dev_loss_tmo          60
        user_friendly_names   yes
        detect_prio           yes
        rr_min_io_rq          1
        no_path_retry         0
    }
}
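After editing multipath.conf, reload the configuration and make sure every LUN is visible through several paths. A minimal sketch, assuming a systemd-based Oracle Linux where the service is called multipathd:

# Restart the multipath daemon so the new device section is picked up
systemctl restart multipathd

# Each NeoSapphire LUN should show several active paths
# (one path group per array node)
multipath -ll

# Optionally list the block devices and their multipath parents
lsblk -o NAME,SIZE,TYPE,MOUNTPOINT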


Next, for ASM to work with MPIO via ASMLib, change the /etc/sysconfig/oracleasm file and then run /etc/init.d/oracleasm scandisks:



# ORACLEASM_SCANORDER: Matching patterns to order disk scanning
ORACLEASM_SCANORDER="dm"

# ORACLEASM_SCANEXCLUDE: Matching patterns to exclude disks from scan
ORACLEASM_SCANEXCLUDE="sd"
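With this scan order, ASMLib sees the multipath (dm-*) devices instead of the individual sd* paths. The LUNs are then stamped for ASM roughly as follows; the /dev/mapper names are examples only and depend on your multipath aliases:

# Rescan so that ASMLib picks up the multipath devices
/etc/init.d/oracleasm scandisks

# Stamp the LUN partitions for ASM (device names are hypothetical)
oracleasm createdisk DATA01 /dev/mapper/data01p1
oracleasm createdisk GRID01 /dev/mapper/grid01p1
oracleasm createdisk REDO01 /dev/mapper/redo01p1

# Verify that the disks are visible to ASMLib
oracleasm listdisks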


Note


If you do not want to use ASMLib, you can use UDEV rules, which are what ASMLib itself is based on.
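A minimal sketch of such a rule, assuming user_friendly_names multipath aliases like data01 and the usual grid:asmadmin ownership (file name, aliases and owners are assumptions to adapt):

# /etc/udev/rules.d/99-oracle-asmdevices.rules (example only)
KERNEL=="dm-*", ENV{DM_NAME}=="data01", OWNER="grid", GROUP="asmadmin", MODE="0660"
KERNEL=="dm-*", ENV{DM_NAME}=="grid01", OWNER="grid", GROUP="asmadmin", MODE="0660"
KERNEL=="dm-*", ENV{DM_NAME}=="redo01", OWNER="grid", GROUP="asmadmin", MODE="0660"

# Reload and apply the rules
udevadm control --reload-rules
udevadm trigger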


Starting with Oracle Database version 12.1.0.2, there is also the option of installing ASMFD (ASM Filter Driver) as part of the Oracle software.



Be sure to check that the disks created for Oracle ASM are aligned to the block size the array physically works with (4K). Otherwise performance problems may occur. Therefore, create the partitions on the volumes with the appropriate parameters:


parted /dev/mapper/device-name mklabel gpt mkpart primary 2048s 100% align-check optimal 1
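With many LUNs the same command can be wrapped in a loop; the alias names below are placeholders that should match your multipath configuration:

# Label, partition and alignment-check every ASM LUN (names are examples only)
for dev in data01 data02 grid01 redo01; do
    parted -s /dev/mapper/$dev mklabel gpt mkpart primary 2048s 100% align-check optimal 1
done

# Confirm that each partition starts on a 4K-aligned sector
parted /dev/mapper/data01 unit s print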


Distribution of the database across the created volumes for our test configuration:


| Storage Volume Name | Volume Size | Volume LUNs mapping | ASM Volume Device Detail                                         | Allocation Unit Size |
| Data01 - Data10     | 200GB each  | Map All Data Ports  | Redundancy: Normal; Name: DGDATA; Purpose: Data files            | 4MB                  |
| Grid01 - Grid03     | 1GB each    | Map All Data Ports  | Redundancy: Normal; Name: DGGRID1; Purpose: Grid: CRS and Voting | 4MB                  |
| Grid04 - Grid06     | 1GB each    | Map All Data Ports  | Redundancy: Normal; Name: DGGRID2; Purpose: Grid: CRS and Voting | 4MB                  |
| Redo01 - Redo05     | 100GB each  | Map All Data Ports  | Redundancy: Normal; Name: DGREDO1; Purpose: Redo log of thread 1 | 4MB                  |
| Redo06 - Redo10     | 100GB each  | Map All Data Ports  | Redundancy: Normal; Name: DGREDO2; Purpose: Redo log of thread 2 | 4MB                  |

Database Settings
  • Block size = 8K
  • Swap space = 16GB
  • Disable AMM (Automatic Memory Management)
  • Disable Transparent Huge Pages (see the sketch after this list)
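A minimal sketch of how the last two items are usually handled on Oracle Linux; the exact mechanism (tuned profile, GRUB parameter, init script) and the memory values are assumptions to adapt:

# Check and disable Transparent Huge Pages until the next reboot
cat /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# For a persistent setting add "transparent_hugepage=never" to the kernel
# command line (GRUB_CMDLINE_LINUX in /etc/default/grub) and rebuild grub.cfg

# Disable AMM: set MEMORY_TARGET to 0 and size SGA/PGA explicitly so that
# HugePages can be used (values are examples; the SGA must fit into vm.nr_hugepages)
sqlplus / as sysdba <<EOF
alter system set memory_target=0 scope=spfile;
alter system set sga_target=100G scope=spfile;
alter system set pga_aggregate_target=16G scope=spfile;
EOF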


Other settings

# vi /etc/sysctl.conf
fs.aio-max-nr = 1048576
fs.file-max = 6815744
kernel.shmmax = 103079215104
kernel.shmall = 31457280
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_default = 262144
net.core.wmem_max = 1048586
vm.swappiness = 10
vm.min_free_kbytes = 524288 # don't set this if you are using Linux x86
vm.vfs_cache_pressure = 200
vm.nr_hugepages = 57000
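After editing the file, the parameters can be applied and verified without a reboot (although allocating the HugePages pool may still require one if memory is already fragmented):

# Apply the new kernel parameters
sysctl -p

# Verify that the HugePages pool has been allocated
grep Huge /proc/meminfo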



# vi /etc/security/limits.conf
grid soft nproc 2047
grid hard nproc 16384
grid soft nofile 1024
grid hard nofile 65536
grid soft stack 10240
grid hard stack 32768
oracle soft nproc 2047
oracle hard nproc 16384
oracle soft nofile 1024
oracle hard nofile 65536
oracle soft stack 10240
oracle hard stack 32768
oracle soft memlock 120795954
oracle hard memlock 120795954
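The limits can be checked from a fresh session of the corresponding user, for example:

# Show max processes, open files, stack size and locked memory for oracle
su - oracle -c 'ulimit -u -n -s -l'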


sqlplus / as sysdba
alter system set processes=2000 scope=spfile;
alter system set open_cursors=2000 scope=spfile;
alter system set session_cached_cursors=300 scope=spfile;
alter system set db_files=8192 scope=spfile;
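Since these parameters are changed with scope=spfile, they only take effect after the instances are restarted; on RAC this is typically done with srvctl (the database name is a placeholder):

# Restart all instances of the RAC database so the spfile changes take effect
srvctl stop database -d <db_unique_name>
srvctl start database -d <db_unique_name>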




Failover test


For demonstration purposes, HammerDB was used to emulate the OLTP load. HammerDB configuration:


| Number of Warehouses        | 256           |
| Total Transactions per User | 1000000000000 |
| Virtual Users               | 256           |
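For reference, an equivalent setup can be scripted through the HammerDB CLI roughly as follows; the connection values are placeholders and the exact dict keys may vary between HammerDB versions:

# hammerdbcli script for the TPC-C workload against Oracle (sketch)
dbset db ora
diset connection system_password <password>
diset connection instance <scan-address>/<service-name>
diset tpcc count_ware 256
diset tpcc total_iterations 1000000000000
vuset vu 256
loadscript
vucreate
vurun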

As a result, a figure of 2.1M TPM was obtained, which is far from the performance limit of the H710 array but is the "ceiling" for the current hardware configuration of the servers (primarily the processors) and their number. The goal of this test is to demonstrate the fault tolerance of the solution as a whole rather than to reach maximum performance, so we will simply use this figure as the baseline.



Test of the failure of one of the storage nodes




The hosts lost some of their paths to the storage and continued working through the remaining paths to the second node. Performance dipped for a few seconds while the paths were rebuilt and then returned to normal. There was no interruption of service.


Test of the failure of an entire cabinet with all its equipment




In this case, performance also dipped for a few seconds while the paths were rebuilt and then settled at half of the original figure. The result was halved because one of the application servers was taken out of operation. There was again no interruption of service.


If you need to implement a Cross-Rack disaster recovery solution for Oracle at a reasonable cost and with little deployment and administration effort, then Oracle RAC combined with the AccelStor Shared-Nothing architecture is one of the best options. Instead of Oracle RAC there can be any other clustering software, another DBMS or a virtualization system, for example; the principle of building the solution stays the same. And the bottom line is zero RTO and RPO.

Source: https://habr.com/ru/post/448538/

