Probably everyone has already heard that, in accordance with Federal Law No. 242-FZ of July 21, 2014 "On Amendments to Certain Legislative Acts of the Russian Federation Regarding the Clarification of the Procedure for Processing Personal Data in Information and Telecommunication Networks", which has come into force, the storage and processing of Russian citizens' personal data must take place in Russia. The topic, of course, affected almost all foreign financial organizations represented in our country. The wheel turned and, as fate would have it, we won a project for a foreign bank to build the IT infrastructure for migrating its information systems (IS) to Russia. Sorry, the contract includes an NDA, so we cannot name the bank. But we can tell you how we implemented it all: what solution we proposed, the architecture, the data network, which vendors we chose. In short, we share all our experience below.
Personal data is any information relating to an individual or that can be used to identify that individual. This covers a wide range of data in information systems: practically any information about customers can, in one form or another, be classified as personal data.
This topic is relevant for foreign representative offices and subsidiaries of foreign organizations operating in Russia. Such representative offices and subsidiaries need integration to exchange data with the information systems of the parent company. They often use the parent company's IT infrastructure, which is consolidated in several data centers around the world. These data centers act almost like internal hosting, running the information systems of many of the company's representative offices across the countries of an entire region.
So, on the one hand, in order to comply with Law FZ-242, foreign companies faced the need to transfer the information systems that process personal data to the territory of the Russian Federation. On the other hand, these companies have business-critical information systems running abroad that contain their most important data and support business-critical processes. Clients tried to keep such systems abroad for the sake of consolidation and as protection against the notorious risks in Russia, such as raider takeovers, unexpected inspections by various agencies, or the leak of confidential information to competitors.
The issue of uninterrupted operation of these IS is particularly acute for banks. Banks not only provide a variety of services to private and corporate clients in 24x7x365 mode, but must also submit a large amount of reporting to the supervisory bodies on a daily, weekly, monthly and quarterly basis. When these IS are down, banks suffer both direct losses, such as penalties for failing to file reports, and indirect ones, for example, reputational damage caused by loss of loyalty and an outflow of customers.
Next, we turn to the specific task that we had to solve for a foreign bank in light of this Federal Law coming into force.
So, the task the bank set for us can be formulated as follows: create the IT infrastructure needed to run the migrated business-critical information systems with the specified performance and reliability indicators.
1. The new IT infrastructure for the migrated information systems had to ensure business continuity both in the case of local failures and in the event of a disaster. In other words, it was necessary to provide high availability (HA) and disaster recovery (DR).
In terms of target indicators:
RPO (Recovery Point Objective: how much data may be lost during recovery). The customer wanted RPO = 0 for local failures and an RPO approaching 0 in the event of a disaster at the main data center. In other words, a local failure should cause no data loss at all, and a global failure should cause minimal data loss.
RTO (Recovery Time Objective: the time within which the IT system must be restored). The customer wanted RTO <= 1 hour for the worst local failure and RTO <= 2 hours in the event of a disaster at the main data center.
Technical solutions that guarantee such parameters are far from cheap, and the prices bite, but a single day of downtime at a subsidiary of a major international bank can amount to losses of millions of dollars, which speaks for itself.
2. Sufficient performance of the information systems on the new equipment, no worse than before the move. This had to be ensured from every point of view: both in terms of computing resources and in terms of storage.
Part of the customer's IT systems ran on the IBM Power platform. For this platform we collected resource utilization statistics, taking both average and peak loads into account. When sizing, it is important to know how far the indicators can deviate from the average over a day, a week or a year, so that the IT systems maintain their performance even in the worst case of maximum load, for example, at quarter close.
Performance can also be expressed in conventional synthetic indicators, such as the number of IOPS for a given block size and the read-to-write ratio. We took these metrics into account when sizing the new equipment.
To be fair, the customer's business representatives are more interested in realistic IS-level indicators, such as the execution time of typical operations before and after migration. These indicators were measured on the old platform before the migration began and were used as a reference when the project was handed over to the customer. The task was to ensure that on the new system, within a limited budget, these indicators did not degrade, and that there was also headroom for performance growth.
3. Efficient investment of money and efficient use of equipment. From a business point of view, the customer required that requirements No. 1 and No. 2 be met within a minimal budget. At the same time, unlike many projects for Russian customers, not only the initial investment (CAPEX) was taken into account, but also the OPEX: the cost of supporting and maintaining the solution for 5 years.
When a customer talks about continuity in the face of both local and global failures, a backup data center is needed: if something burns down in the main data center, the systems will be able to resume operation at the other site after a while.
In our case, the IT infrastructure had to be ready no later than 2 hours after a failure. Therefore, a cold standby in the form of an alternative site, at best with empty servers, did not suit us.
Accordingly, either a warm or a hot standby was required. The customer, like typical foreigners, demanded at least 100 km between the data centers to exclude any common impact (a power outage or, for example, a citywide disaster in Moscow). From an economic point of view, synchronous replication was not advisable, since it would have required significant investment in the channel between the data centers, the traffic on which had to be encrypted. From a technical point of view, latency over such a distance could already start to affect the performance of the information systems, so the option with asynchronous replication between the data centers was chosen.
For the information systems running on the RISC architecture, IBM Power E870 servers were chosen. They are designed to host business-critical workloads with the highest level of availability and provide a full set of RAS (Reliability, Availability and Serviceability) functionality.
These servers are virtualized at the hardware level using IBM PowerVM, and virtual server partitions (LPARs) are created on them. Processor cores are allocated to an LPAR to run its workload: they can be dedicated to the LPAR, without any oversubscription of resources, or placed in a shared pool used by a group of virtual servers. You can also cap the maximum amount of resources a virtual server may consume from the shared pool during peak periods. The architecture of the Power subsystem is shown in the figure below.
Fig. IBM Power Subsystem Architecture
Any server, even one as reliable as the IBM Power E870, where almost everything is duplicated, can fail. Therefore, high availability (HA) software is used to protect against server failure. In our case, the most suitable cluster software was Veritas InfoScale. This software has a significant advantage over simple HA solutions: it allows you to build a local cluster (HA) between servers on the same site and, at the same time, a cluster between sites (DR). As a result, the customer is insured both against a local failure and against the failure of the entire main data center.
Veritas InfoScale also makes it possible to organize 3-way data replication: the data is duplicated in two places on the same site, while continuous replication of the IS to the backup data center runs in parallel. Technically, it could have been done more simply and cheaply, but a significant advantage of Veritas InfoScale is that if one of the replicas on the local site fails, replication does not have to be reconfigured manually. As a result, the customer's data remains protected at all times, even in the event of a local failure.
The block diagram of the target architecture of the disaster-tolerant solution for the bank, based on two clusters with external logical volumes, is shown in the figure below.
Fig. Block diagram of the target architecture of the disaster-tolerant solution
E870 servers are expensive. On these servers, unlike simpler ones, processor core activations are licensed separately. In view of this, it was tempting to take simpler S824-class machines, but they are less reliable and, most importantly, offer less vertical scalability. The customer has one functional task whose execution could already take up an entire S824 server: at first such a server would cope, but after a couple of years its performance would no longer be enough.
However, on IBM Power high-end servers (including the E870) you can make maximum use of processor and memory activations by combining the servers into a common Enterprise Pool. Activations from the pool can be used on any server in the pool. To optimize the cost of the solution, you can purchase fewer activations for the backup servers than for the main ones, which is what we did. At the same time, in the event of a failure, the entire volume of pool activations can be used on those servers.
For the information systems running on x86 / VMware, a solution based on HP ProLiant BL460c Gen9 blade servers and VMware vSphere Enterprise Plus virtualization software was chosen. The architecture of the server virtualization subsystem is shown in the figure below.
Fig. Server Virtualization Subsystem Architecture
To protect against the failure of a single host, VMware High Availability cluster technology was used, which restarts the virtual machines of a failed host on the remaining hosts.
Virtual machine images and data are stored on several storage systems. Business-critical data is duplicated on at least two storage arrays and presented to the hosts through the EMC VPLEX storage virtualizer. This eliminates a single storage system as a point of failure. Failures of dual-controller storage systems are rare, but they do happen: for example, the cache battery of one controller may fail, causing the cache to be disabled and performance to degrade significantly.
Data is transferred over the Fibre Channel storage network using two fabrics for redundancy against logical and physical failures. The fabrics of the main and backup data centers are not merged because of the significant distance between them (more than 600 km).
To protect against a disaster at the main data center, a proven solution based on VMware SRM and replication via dedicated appliances, EMC RecoverPoint, was applied. They duplicate all write operations from the main data center to the backup data center in asynchronous mode. With sufficient channel bandwidth, this replication gives an RPO close to 0. RecoverPoint appliances can compress the traffic between the sites and transmit only unique blocks, which reduces the channel requirements.
In addition, they allow you to roll data volumes back to a specific point in time, which provides protection against logical failures: if a logical failure occurs, the administrator can roll back to the state before the failure.
Many banks are now considering or are already building their own backup data center. However, this may not be enough, since data centers in Moscow may share common infrastructure: both data centers may be powered from the same substation, substations may suffer rolling blackouts, and the optical routes to the data centers may pass through a single point.
The advantage of the EMC VPLEX + EMC RecoverPoint + VMware vSphere HA solution we built is that it provides protection against failures across 3 data centers: two located close to each other and one located relatively far away in case of a disaster. This allows the bank to get synchronous replication with zero data loss in the event of a failure in one data center, as well as protection against global disasters.
In our project we implemented two sites, the main data center and the backup data center, but we placed two sets of equipment in the main data center. It turns out to be like two data centers in one: in normal operation the production load runs on both sets, but in the event of a failure it can run on just one. This satisfies the highest requirements for the availability of the information systems in various failure scenarios.
So, in terms of the data network, we needed to build a fault-tolerant solution spanning three data centers to ensure maximum availability of the IS.
By the beginning of the project, the customer already had two data centers with Cisco network equipment: Nexus switches, including FEX, as well as DMVPN to the branch network. Naturally, this meant keeping the same vendor when upgrading the network.
In general, the architecture in the main data center turned out to be classic:
• Nexus 5672 as the core;
• Nexus 2000 series as access switches;
• Catalyst 2960X for management ports.
In the WAN and Internet segments:
• a pair of ASR 1001-X routers for connecting to operators with L3VPN clouds;
• a pair of ASR 1001-X routers for DMVPN and QoS functions;
• ISR 4431 routers for connecting to the Internet;
• Palo Alto firewalls for communication with the offices;
• Check Point firewalls for connecting to the Internet.
In the backup data center everything is the same, except that the reused Nexus 5548UP switches and ASR 1001 routers were kept. We should also mention the out-of-band (OOB) Internet channels in each data center with separate firewalls. The schemes there are completely standard, so we do not even include them.
The Internet connection turned out to be a more interesting story. The customer had only one PI /24 network, and it was necessary to:
• announce the PI network only from the main data center while it is alive (both to the Internet and inside the data network);
• move the PI network to the backup data center if the entire main data center fails;
• move the PI network in the event of a double failure of one type of equipment in the main data center: core, Internet channels, routers, firewalls or WAN switches;
• not move the PI network in the event of a double failure of similar equipment in the backup data center;
• not move the PI network in the event of a double failure of the channels between the data centers;
• check the availability of the Internet channels against several subnets on the Internet (for example, 8.8.8.0/24).
Thus, the availability of the main data center from the backup data center had to be checked through the internal network and via the Internet simultaneously. We check Internet availability using Cisco IP SLA. Naturally, conditional announcement of the PI network via BGP towards the operators, as well as conditional announcement into the local network, is used in both the main and backup data centers. One possible way to implement such a check is sketched below.
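As an illustration, here is a minimal sketch in Cisco IOS syntax of one possible way to tie the announcement of the PI prefix to an IP SLA probe: the probe checks reachability of an external address (8.8.8.8 from the 8.8.8.0/24 example above), a track object follows the probe, and the PI /24 is injected into BGP through a static route that exists only while the track is up. All addresses, AS numbers and interface names here are hypothetical; the actual configuration in the project may differ.

```
! Probe an external address to verify that the Internet is really reachable
ip sla 10
 icmp-echo 8.8.8.8 source-interface GigabitEthernet0/0/1
 frequency 5
ip sla schedule 10 life forever start-time now
!
! Track object that follows the probe result
track 10 ip sla 10 reachability
!
! The PI /24 (hypothetical 198.51.100.0/24) is anchored to Null0 only while
! the track is up; when the probe fails, the route disappears from the RIB
! and the prefix is no longer announced by the network statement below.
ip route 198.51.100.0 255.255.255.0 Null0 track 10
!
router bgp 64512
 network 198.51.100.0 mask 255.255.255.0
 neighbor 203.0.113.1 remote-as 64513
```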
Below is a logical diagram:
As already noted, the project used Nexus 5672UP switches. While using them, we ran into two bugs, which Cisco plans to turn into features :)
The first concerns copper breakout DAC cables, 40G to 4x10G. In our case these cables were compatible with all the equipment, but the links flapped periodically. The downtime each time was small, but it happened quite often. The cables themselves are shown below:
In the end, after long tests, the copper was replaced with optics, and all the problems disappeared. The optical cables that replaced them are shown below:
It seems that Cisco is so far only planning to correct the documentation, that is, to exclude these cables from the compatibility list. So it is better not to economize and to buy only optical breakout cables.
For the Nexus 7000 series there is a document with the key phrase "passive copper optic cables are not supported on the non-EDC ports". It turns out that any passive copper QSFP cable may not work correctly on the Nexus 5600 series.
The second concerns Microsoft NLB. Everyone remembers the mode in which static ARP and MAC entries are configured on the network, without IGMP. Moreover, if the static MAC entries are not specified, the packets simply flood throughout the VLAN. Well, that last statement no longer holds for the Nexus 5672UP: for the traffic to be forwarded at all, the static MAC entries must be specified. Cisco also plans to update the documentation for this case. A sketch of this configuration is given below.
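For reference, a minimal sketch in Cisco NX-OS syntax of the classic static ARP and static MAC configuration for NLB in multicast (non-IGMP) mode; the VLAN, IP address, multicast MAC and interface are hypothetical. On the Nexus 5672UP the static MAC entry turned out to be mandatory.

```
! SVI of the VLAN where the NLB cluster lives: static ARP entry that maps
! the cluster IP to the cluster multicast MAC
interface Vlan100
  ip arp 10.0.100.10 03bf.0a00.640a

! Static MAC entry pointing at the port-channel behind which the NLB nodes
! sit; without it the Nexus 5672UP does not forward traffic to the cluster MAC
mac address-table static 03bf.0a00.640a vlan 100 interface port-channel20
```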
As part of the upgrade, the connection between the main data center and the DR data center, as well as between the data centers and the branches, had to be switched to encryption with GOST algorithms. L3 connectivity between the main and backup data centers was sufficient for this.
The Cisco network equipment does not need to know anything about the crypto gateways, so we built GRE / mGRE tunnels on top of the encrypted channels. This way the EIGRP dynamic routing protocol keeps working; a sketch of such a tunnel is given below.
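Below is a minimal sketch (Cisco IOS syntax) of a point-to-point GRE tunnel of the kind described above, running over the encrypted channel, with the site prefixes advertised via EIGRP. The addresses, interface names and the EIGRP AS number are hypothetical.

```
! GRE tunnel riding on top of the path that the crypto gateways encrypt
interface Tunnel10
 description GRE to the backup data center over the encrypted channel
 ip address 172.16.10.1 255.255.255.252
 tunnel source GigabitEthernet0/0/0
 tunnel destination 192.0.2.2
!
! EIGRP runs over the tunnel as if it were an ordinary L3 link
router eigrp 100
 network 172.16.10.0 0.0.0.3
 network 10.10.0.0 0.0.255.255
```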
The crypto gateways in both data centers had to build a VPN tunnel between themselves, as well as to the branch offices. S-Terra was chosen as the crypto gateway, one of the reasons being that its configuration is very similar to Cisco's. In principle, this way of working with S-Terra is described on their website, so nothing complicated usually comes up.
As for the network recovery time in case of failures, it, as is well known, depends on the technologies and protocols used, as well as on the redundancy scheme of the equipment and channels.
All network equipment and channels were redundant according to the 1 + 1 scheme. For redundancy at the data link level, the popular LACP and vPC mechanisms with sub-second failure recovery were used; a minimal sketch follows below.
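As an illustration of the L2 redundancy, here is a minimal NX-OS sketch of a vPC: a pair of Nexus switches terminate one LACP port-channel from a downstream device, so the loss of either switch or link is handled below the one-second mark. The domain ID, addresses and interfaces are hypothetical.

```
feature lacp
feature vpc
!
vpc domain 10
  peer-keepalive destination 10.255.255.2 source 10.255.255.1
!
! Peer-link between the two Nexus switches
interface Ethernet1/53-54
  channel-group 1 mode active
interface port-channel1
  switchport mode trunk
  vpc peer-link
!
! Port-channel towards a downstream switch or server; the same vPC number
! is configured on both Nexus peers
interface Ethernet1/10
  switchport mode trunk
  channel-group 20 mode active
interface port-channel20
  switchport mode trunk
  vpc 20
```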
In this project the customer already used EIGRP in the local network and between the branches, and BGP for interaction with the operators. In the local network, EIGRP converges with sub-second timers when a failure is detected. Compared to OSPF, EIGRP achieves this with far less configuration effort: you only need to satisfy one of two conditions for a backup route, either equal-cost multi-path (ECMP) or a feasible successor. With OSPF, this would require tuning many timers, as the sketch below illustrates.
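For comparison, a hedged sketch (Cisco IOS syntax, hypothetical interface): EIGRP needs no timer changes as long as an equal-cost path or a feasible successor already exists in its topology table, whereas getting OSPF to detect a failure in under a second typically means enabling fast hellos on every relevant interface.

```
interface GigabitEthernet0/1
 ! OSPF fast hellos: 4 hellos per second, dead interval of 1 second
 ip ospf dead-interval minimal hello-multiplier 4
```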
In general, hardware or cable failures that go undetected are practically impossible in the local network, so the total convergence time there will be under 1 second. As for the WAN segment and the entire distributed network (communication between the branches and the data centers), the maximum recovery time after a failure will be 5 seconds.
In the case of Internet connectivity, the recovery time is, of course, longer: up to 1 minute. The customer already had its own PI network, and as part of the upgrade BGP began to accept the full routing table. This is needed to protect against certain failures inside the Internet providers: it rules out the situation where the BGP peer is still visible but the connectivity beyond it is broken.
To achieve Internet recovery within 1 minute after an equipment or channel failure, you need to negotiate reduced BGP keepalive and hold timers with the operators. In practice, a keepalive of 3 seconds and a hold time of 10 seconds give good results: flaps never occur at these values, whereas at 2 and 7 seconds, respectively, they are already observed. True, not all operators are ready to support such values by default, but it is possible to come to an agreement. :) A sketch of the timer configuration is given below.
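A minimal sketch (Cisco IOS syntax) of the per-neighbor BGP timers described above; the AS numbers and the neighbor address are hypothetical, and the same values have to be supported on the operator's side.

```
router bgp 64512
 neighbor 203.0.113.1 remote-as 64513
 ! keepalive 3 seconds, hold time 10 seconds for this particular peer
 neighbor 203.0.113.1 timers 3 10
```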
That's all. The customer is satisfied. If you have questions about the solution, we will be happy to answer them.
The authors of the material:
Artem Burdin, Design Engineer of the Computing Systems Department of the Competence Center for Computing Complexes of Technoserv Company
Mikhail Sheronkin, Head of Corporate Networks, Competence Center for Network Technologies, Technoserv Company
Source: https://habr.com/ru/post/332982/