
Crash test for high-availability cloud platform

IT-GRAD

How can you make sure that a cloud provider's infrastructure really has no single point of failure?
Test it in practice!
In this post I will describe how we ran acceptance tests on our new cloud platform.

Background


On September 24 we opened a new public cloud platform in St. Petersburg:
www.it-grad.ru/tsentr_kompetentsii/blog/39
Preliminary test plan for the cloud platform:
habrahabr.ru/post/234213

And here we go ...

Remote Testing


1. Shutting down the FAS8040 controllers one at a time



Expected result: Automatic takeover to the surviving node; all VSM resources remain accessible from ESXi, and access to the datastores is not lost.
Actual result: A successful automatic takeover of one "head" (and then of the second) was observed. The volumes served by the first controller were successfully taken over by the second; notably, the whole procedure, including detection of the failed "head", took only a few tens of seconds.
The takeover detection time is set on the nodes: options cf.takeover.detection.seconds 15
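
For reference, a minimal sketch of the kind of commands one might use to watch and drive such a takeover from the clustered Data ONTAP CLI (the node name here is made up, and exact field names may differ between ONTAP versions):

    # Check the state of the HA pair and the configured takeover detection time
    storage failover show
    storage failover show -fields detection-time

    # Manually take over one node from its partner, then give it back
    storage failover takeover -ofnode cloud-node-01
    storage failover giveback -ofnode cloud-node-01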


2. Disconnecting all Inter-Switch Links between the CN1610 switches


Expected result: When all Inter-Switch Links between the CN1610 switches are disconnected, connectivity between the nodes must not be interrupted.
Actual result: Connectivity between the hosts and the network was not lost; access to the ESXi hosts continued over the second link.
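
A quick way to confirm that the cluster network survives such a test is to check node health and the cluster LIFs from the ONTAP CLI; a minimal sketch, with a made-up node name:

    # All nodes should remain healthy and eligible
    cluster show

    # Cluster LIFs should stay up and on their home ports
    network interface show -role cluster

    # Exercise the remaining cluster-network paths (may require advanced privilege level)
    cluster ping-cluster -node cloud-node-01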


3. Rebooting one of the paired cluster switches, then one of the Nexus switches





Expected result: No disruption to the NetApp cluster.
Actual result: The NetApp controllers remained joined in a cluster through the second CN1610 switch. Duplicated cluster switches and duplicated links to the controllers make it possible to safely survive the failure of a single CN1610 unit.

Expected result: One port on each node must remain reachable; on the ifgrp interfaces of each node one of the 10 GbE interfaces must remain available; all VSM resources must remain accessible from ESXi, and access to the datastores must not be lost.
Actual result: Thanks to the duplicated links aggregated into port channels, rebooting one of the Nexus 5548 switches passed completely unnoticed.
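
A minimal sketch of how the vPC state can be checked on the surviving Nexus 5548 while its peer is rebooting (an illustration only, not the actual configuration of our switches):

    # Peer state, peer-keepalive state and the status of every vPC
    show vpc brief

    # The vPC member port channels should stay up on the surviving switch
    show port-channel summary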




4. Shutting down the vPCs (vPC-1, vPC-2) on the Nexus switches, one at a time


Expected result: This simulates a situation in which one of the NetApp nodes loses its network links. In that case, the second "head" should take over.
Actual result: The e0b and e0c interfaces of the corresponding controller went down, followed by the ifgrp a0a and the VLANs configured on it going into the "down" state. After that, the node went through an ordinary takeover, which we already know from the first test.
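
A minimal sketch of the ONTAP commands one might use to watch the node's ports and interface group during this test (the node name is made up; the port and ifgrp names follow the ones mentioned above):

    # Physical 10 GbE ports that feed the interface group
    network port show -node cloud-node-01 -port e0b,e0c

    # State of the interface group a0a built on top of them
    network port ifgrp show -node cloud-node-01 -ifgrp a0a

    # Watch the HA pair go into takeover
    storage failover show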




5. Disconnecting the Inter-Switch Links between the Cisco Nexus 5548 switches, one at a time



Expected result: Connectivity between the switches is preserved.
Actual result: The Eth1/31 and Eth1/32 interfaces are aggregated into Port Channel 1 (Po1). As the screenshot below shows, when one of the links goes down, Po1 remains active and there is no loss of connectivity between the switches.
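
A minimal sketch of how this test could be driven from one of the Nexus 5548 switches; the interface numbers match the ones above, but treat the snippet as an illustration rather than the actual change we made:

    # Administratively shut down one of the ISL members
    configure terminal
      interface ethernet 1/31
        shutdown
      end

    # Po1 should stay up on the remaining member (Eth1/32)
    show port-channel summary
    show interface port-channel 1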




6. Hard shutdown of the ESXi hosts, one at a time



We powered off one of the production ESXi hosts, which at the moment of shutdown was running test virtual machines with different operating systems (Windows, Linux). The shutdown emulated a crash of a working host. Once the host (and the virtual machines on it) was detected as unavailable, the VMs were re-registered on the second (surviving) host and successfully started there within a few minutes.
Expected result: The virtual machines restart on a neighbouring host.
Actual result: As expected, once VMware HA kicked in, the machines restarted on a neighbouring host within 5-8 minutes.
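
A minimal sketch of how the HA configuration and the placement of the virtual machines could be checked with VMware PowerCLI before and after such a test (the vCenter address and cluster name are made up):

    # Connect to vCenter
    Connect-VIServer -Server vcenter.example.local

    # HA must be enabled on the cluster
    Get-Cluster -Name "SPB-Cloud" | Select-Object Name, HAEnabled, HARestartPriority

    # See which host each VM is registered on and whether it is powered on
    Get-VM | Select-Object Name, VMHost, PowerState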


7. Checking the monitoring

Expected result: Error messages are received.
Actual result: What can I say... We received numerous error and warning e-mails, the ticketing system processed the notifications against its templates, and the Service Desk responded impeccably.
The monitoring system promptly flooded the Service Desk with tickets.


The ITSM system parsed these e-mails against templates and created events. Incidents were then raised automatically on the basis of identical events. Below is one of the incidents that the ITSM system created from the events in the monitoring system.



One of these incidents was assigned to me.




Testing on site, at the equipment itself



1. Disconnecting the power cables (on every piece of equipment)


Nothing new here, of course, unless you count the discovery that one of the power supplies turned out to be faulty.
During the whole test, not a single piece of hardware suffered.
NetApp duly logged the event, both for itself and for the Cluster Interconnect switches:



On the Cluster-Net switch:



Host errors in VMware vSphere:


Note: the Cisco SG200-26 management switch has no power redundancy.
This switch serves the management network (access to the management ports of the storage systems and servers). Losing power on this switch does not cause any downtime for client services. Nor does a failure of the Cisco SG200-26 lead to a loss of monitoring, since infrastructure availability is monitored through the management network, which terminates at the Cisco Nexus 5548 level; the management switch sits logically behind it and is used ONLY for access to the equipment management consoles.
Still, to avoid losing that access path, an Automatic Transfer Switch (APC AP7721), which provides redundant power from two feeds, has already been purchased for it.
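
For reference, a rough sketch of how such power-supply events could be reviewed from the clustered Data ONTAP CLI; exact message names, severities and nodeshell commands vary by version, so treat the filters below as assumptions:

    # Recent high-severity events on every node
    event log show -node * -severity ERROR

    # Hardware environment status from the nodeshell
    system node run -node local -command "environment status"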



2. Disconnecting the network links from the ESXi hosts (Dell R620 / R810), one at a time



The connection between the hosts and the datastores was not lost; access to the ESXi hosts continued over the second link.
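
A minimal sketch of what one might check on the ESXi host itself while the links are being pulled (run locally on the host; which storage command applies depends on whether the datastores are NFS or block):

    # Physical NICs and their link state
    esxcli network nic list

    # NFS datastore mounts, if the datastores are NFS
    esxcli storage nfs list

    # Paths to block devices, if the datastores are iSCSI/FC
    esxcli storage core path list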



That's it. All tests passed successfully and acceptance testing is complete. The cloud hardware is ready for deploying virtual infrastructure for new customers.

P.S.
For a long time after the tests I couldn't shake the feeling of the power and solid build quality of the reliable hardware that I got to touch with my own hands while testing the whole complex for fault tolerance.

Source: https://habr.com/ru/post/241019/

