Productive fault tolerant geographically distributed cluster operating in Active-Active scheme on IBM zEnterprise EC 12 mainframe

In many areas of human activity, there are increased demands on the performance and availability of services offered by information technology. An example of such areas is, for example, banking. If a large bank in the country refuses card processing for several hours, it will affect the daily needs and concerns of millions of users across the country, which will lead to a decrease in their loyalty until a decision is made to refuse the services of such a credit institution. Similarly, the situation is with the performance and availability of many other information systems.

The solution to problems with performance and availability is known in principle: to duplicate the nodes that provide data processing, and cluster them. At the same time, to ensure maximum load of available resources and reduce system downtime when one of the nodes fails, the cluster should work according to the Active-Active scheme. Also, the level of availability provided by a cluster located entirely in one data center may not be sufficient (for example, during a power outage in entire districts of large cities). Then the cluster nodes need to be geographically distributed.

In this article, we will talk about the problems of building a fault-tolerant geographically distributed cluster and about solutions offered by IBM based on mainframes, and also share the results of our performance testing and high availability of real banking applications in a cluster configuration, with nodes separated by up to 70 km .

Problems in building a geographically dispersed cluster operating under the Active-Active scheme

')
Conventionally, problems when building any Active-Active cluster can be divided into the clustering problems of the application and the clustering problems of the DBMS / message queue subsystem. For example, for the application we are testing there was a requirement for strict adherence to the procedure for making payments from one participant. It is clear that ensuring compliance with this requirement in Active-Active cluster mode has led to the need to make significant changes to the architecture of the application. The problem was aggravated by the fact that, in addition to performance in Active-Active mode, it was necessary to ensure high availability using standard software tools of the intermediate layer or operating system. The details of how we achieved our goals are worth a separate article.

Modern approaches to clustering databases are usually based on the shared disk architecture. The main objective of this approach is to provide synchronized access to shared data. Implementation details are highly dependent on a specific DBMS, but the most popular approach is based on the exchange of messages between cluster nodes. Such a solution is purely software, receiving and sending messages requires interrupting applications (accessing the operating system's network stack), and with active access to shared data, overhead can become significant, which is especially evident if the distance between active nodes is increased. As a rule, the scalability factor of such solutions does not exceed ¾ (a cluster of four nodes, each of which has a performance of N, provides a total performance of N).

If cluster nodes are remote from each other, the distance and the allowable recovery time (RTO) of the solution have a significant impact on the performance. As is known, the speed of light in optics is approximately 200,000 km / s, which corresponds to a delay of 10 μs for each kilometer of optical fiber.

IBM mainframe approach

IBM mainframes use a different approach for clustering DB2 database and WebSphere MQ queue managers that is not based on software-based messaging between cluster nodes. The solution is called Parallel Sysplex and allows clustering up to 32 logical partitions (LPARs) running on one or more mainframes running on z / OS. If we consider that the new z13 mainframe can have on board more than 140 processors, then the maximum achievable cluster performance can not fail to impress.

The basis of Parallel Sysplex is Coupling Facility technology , which is a dedicated logical partition (LPAR), using processors that can perform special cluster software - Coupling Facility Control Code (CFCC). Coupling Facility stores in its memory shared data structures shared by Parallel Sysplex cluster nodes. Data exchange between the CF and the connected systems is organized on the principle of "memory-memory" (similar to DMA), while the processors on which the applications run are not involved.

To ensure high availability of CF structures, synchronous and asynchronous data replication mechanisms between them are used by means of the z / OS operating system and the DB2 DBMS.

Those. IBM's approach is not based on a software, but on a hardware and software solution that provides data exchange between cluster nodes without interrupting running applications. This achieves a scalability factor close to unity , which is confirmed by the results of our load testing.

Cluster load testing

To demonstrate the scalability of the system and assess the impact of distance on performance indicators, we conducted a series of tests, the results of which we want to share with you.

As a test bench, one physical server, the zEnterprise EC 12, was used. On this server, two logical partitions (LPAR) MVC1 and MVC2 were deployed, each of which was running the z / OS 2.1 operating system and contained 5 general-purpose processors (CP ) and 5 specialized zIIP processors designed to execute the DB2 DBMS code and run Java applications. These logical partitions were clustered using the Parallel Sysplex mechanism.

For the operation of the Coupling Facility, we used dedicated logical partitions (LPAR) CF1 and CF2, each containing 2 ICF processors. The distance between CF1 and CF2 over communication channels according to the test scenario varied discretely, ranging from 0 to 70 km.

The connection of MVC1 to CF1 was carried out through virtual channels like ICP, while for accessing CF2 physical channels of type Infiniband were used. The connection of MVC2 to CF1 and CF2 was carried out in a similar way: through ICP channels to CF2 and with the help of Infiniband to CF1.

The data storage was done using an IBM DS 8870 disk array, in which two sets of disks were created and synchronous replication of data between them was set up using Metro Mirror technology. The distance between sets of disks over communication channels according to the testing scenario was discretely varied, ranging from 0 to 70 km. The required distance between the stand components was ensured by using specialized DWDM devices (ADVA) and fiber optic cables of the required length.

Each of the logical partitions MVC1 and MVC2 was connected to disk arrays using two FICON channels having a bandwidth of 8 GB / s.

A detailed diagram of the stand (in the variant with the distance between cluster nodes of 50 km.) Is shown in the figure.

As a test application, a banking payment system was deployed on cluster nodes using the following IBM software: DB2 z / OS DBMS, WebSphere MQ message queue managers, WebSphere Application Server for z / OS application server.

The setup of the test bench and the installation of the payment system took about three weeks.

Separately, you should talk about the principle of the application and the load profile. All payments processed can be divided into two types: urgent and non-urgent. Urgent payments are made immediately after they are read from the input queue. In the process of processing non-urgent payments can be divided into three successive phases. The first phase - the reception phase - payments are read from the input queue of the system and are registered in the database. The second phase - the phase of multilateral netting - runs on a timer several times a day and runs strictly in single-threaded mode. At this phase, all non-urgent payments accepted before this are posted, taking into account the mutual obligations of the system participants (if Petya transfers 500 rubles to Vasya and 300 rubles to Sveta, and Vasya - 300 rubles to Sveta, it makes sense to immediately increase the value of Sveta’s account by 600 rubles and Vasya - 200 rubles, than to perform real wiring). The third phase - the phase of sending notifications - starts automatically after the completion of the second phase. At this phase, the formation and distribution of information messages (receipts) for all participating in the multilateral netting accounts. Each phase represents one or more global transactions, which are a complex sequence of actions performed by WebSphere MQ queue managers and DB2 database management systems. Each global transaction ends with a two-phase commit. It involves several resources (queue - database - queue).

An important feature of the payment system is the ability to make urgent payments against the background of multilateral netting.
The load profile during the tests repeated the load profile of a really working system. At the entrance, 2 million non-urgent payments were submitted, combined into packages of 500/1000/5000 payments, with each package corresponding to one electronic message. These packets were fed to the input of the system with an interval of 4 ms between the arrival of each subsequent message. In parallel, about 16 thousand urgent payments were submitted, each of which had its own e-mail. These e-mails were submitted with an interval of 10 ms. between the receipt of each following message.

Load testing results

When carrying out load testing, the distance through communication channels between sets of disks, as well as between logical partitions CF, changed discretely, ranging from 0 to 70 km. When performing tests, the following results were recorded:

It is interesting to compare the first two lines, showing the scalability of the system. It can be seen that the execution time of the first phase when using a cluster of two nodes is reduced by 1.4 times (the “non-linearity” of scaling of the first phase is explained by the peculiarities of the implementation of the requirement to observe the payment procedure from one participant), and the execution time of the third phase is almost 2 times . Multilateral netting is performed in one thread. Time spent in cluster mode increases due to the need to transfer DB2 locks to CF when performing bulk DML operations.

When the mechanism of replication of data on disks (Metro Mirror) is turned on, the execution time of each phase increases slightly, by 3–4%. With a further increase in the distance along the communication channels between the disks and between the logical partitions CF, the execution time of each phase increases more intensively.

It is interesting to compare the temporal characteristics of processing urgent payments in extreme cases: without a cluster and in a cluster whose nodes are separated by a distance of 70 km.

The maximum payment processing time increased by 40%, while the average changed by only 7.6%. At the same time, the utilization of the processor resource changes insignificantly due to an increase in the access time to the CF structures.

findings

In this article, we looked at the problems that arise when building a geographically distributed cluster operating in Active-Active mode, and found out how to avoid them by building a cluster based on IBM mainframes. Assertions about the high scalability of solutions based on Parallel Sysplex cluster are demonstrated by the results of load testing. In this case, we checked the behavior of the system at different distances between cluster nodes: 0, 20, 50 and 70 km.

The most interesting results are the demonstrated close to unity coefficient of scalability of the system, as well as the revealed dependence of payment processing time on the distance between cluster nodes. The smaller the distance affects the key performance indicators of the system, the farther you can spread the cluster nodes from each other, thereby ensuring greater disaster recovery.

If the topic of building a geographically dispersed cluster of mainframes arouses interest among habrazhiteley, then in the next article we will describe in detail the tests of high availability of the system and the results obtained. If you are interested in other issues regarding mainframes, or you just want to argue, then welcome to the comments. It will also be useful if someone shares their experience in building fault-tolerant solutions.

Source: https://habr.com/ru/post/250485/

All Articles