
Cisco VSS: Fear and Loathing at Work

I wrote this post in a fit of indignation and bewilderment at how network equipment from the world leader in this segment can suddenly and thoroughly ruin life for production processes and for us, the network administrators.
I work in a government organization. The core of our network infrastructure is a VSS pair assembled from two Cisco Catalyst 6509-E switches with VS-S720-10G-3C supervisors running IOS 12.2(33)SXI6 (s72033-adventerprisek9_wan-mz.122-33.SXI6.bin). The infrastructure is fully in production and has to be available essentially 24*7*365, so any maintenance that involves even the slightest interruption of the services we provide must be coordinated in advance and carried out at night, most often as a quarterly night-time maintenance window. I want to tell you about one of those maintenance sessions, and I sincerely hope this story never repeats itself for you.


Each switch in the VSS pair has an identical hardware configuration (supervisor plus line cards); the specific modules involved are named below as the story unfolds.
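For reference, a hedged sketch of how that inventory is normally pulled from the pair (output omitted; on a standalone chassis the command is plain show module, and the exact VSS syntax can vary by release):

! list the supervisors and line cards installed in both chassis of the VSS pair
# show module switch all
! part numbers and serial numbers of every installed component
# show inventory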
Only seemingly routine operations were planned for this maintenance. We were allocated a window from 2 am to 8 am, which means that at 8 in the morning everything had to be up and working. Here is the chronology of that romantic night:

1. Enable jumbo frame support on the Cisco Catalyst 6509-E VSS pair. The commands are taken from the official Cisco guide:



# conf t
! the size value below is an assumed example; 9216 bytes is the platform maximum
(config)# system jumbomtu 9216
(config)# end
# wr mem
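A quick way to sanity-check the result afterwards (a hedged sketch: the interface name is just an example, and routed interfaces may additionally need a per-interface mtu setting):

! see what MTU the uplink port now reports
# show interfaces TenGigabitEthernet1/5/4 | include MTU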


The Cisco website says that some line cards do not support jumbo frames at all, or only support a limited frame size, and lists them. My line cards were not in that list. Fine.

2. Jumbo frame support was also enabled on several more stacks (Catalyst 3750-E, 3750-X) that face the VSS, as sketched below. But this, in my opinion, has little to do with the situation described further on.
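On the 3750 side the equivalent knob is the global system MTU, which takes effect only after a reload (a hedged sketch; 9198 bytes is the 3750 maximum and is shown here as an example):

# conf t
(config)# system mtu jumbo 9198
(config)# end
# wr mem
! the new value is displayed here but is applied only after the stack reloads
# show system mtu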

At this stage everything was fine. The flight was normal. Moving on.

3. Next on the plan was the routine cleaning of the Catalyst 6509-E switches from dust, vacuum-cleaner style. We do this regularly (once a quarter) and expected nothing out of the ordinary. We decided to start with the switch that at that moment held the active virtual switch member role, in order to verify that switchover works correctly; I will call it switch (1). So, switch (1) was powered off. The switchover went correctly: the second switch (2) reported that it was now active, and ping lost only one packet. Next we pulled out and vacuumed the fan tray and both power supplies, then put them back. At this point monitoring reported that the network was healthy and all the hardware was reachable.

Great, I thought, powered switch (1) back on and went to my computer to wait, since a 6509 takes about 7-8 minutes to boot. Ten, fifteen, twenty minutes go by, all the lights on the supervisor are red, the line cards are dark, and the second, active switch (2) still does not see the first one (#sho swi vir red). I powered switch (1) off and on again. Another twenty minutes or so, and the situation repeats. I did not have a laptop with a console cable at hand to see what was actually going on. The network was still "flying on one wing", on switch (2).
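The abbreviated command in parentheses is show switch virtual redundancy; together with its siblings, it is what one keeps polling from the surviving switch in a situation like this (a hedged sketch):

! which chassis is active and which is standby in the virtual switch
# show switch virtual role
! SSO/RPR redundancy state of the peer supervisor
# show switch virtual redundancy
! status of the VSL port-channel and its member links
# show switch virtual link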
The right move here would have been to grab a laptop, connect to the console of switch (1) and see what was going on there. Instead, suspecting some kind of conflict between the nodes, I decided to power off the second switch, connect the laptop to the console of the first one and try to bring it up alone. In the console I saw that switch (1) had dropped into rommon. Without a single error message. Just rommon. I power-cycled it once more: straight into rommon again, and nothing else. It is almost 7 in the morning, the working day starts in less than 2 hours, and the atmosphere is heating up. I decided it was not worth experimenting with booting IOS from rommon, powered switch (1) off and went to turn the second one back on, thinking to myself: fine, we will fly on one wing for the day, and next night I will deal with the problem.

I plugged into the console, powered it on and... something went wrong. In the console I watched the switch apparently finish booting correctly, then frantically start shutting down all of its ports one by one and go into a reboot. This repeated 3 times, and only on the third attempt did it limp through the boot and offer a login prompt. I logged in. The network seemed to be working. I exhaled. But no such luck: in monitoring, several access switches started flapping between up and down, some servers were "blinking" too, various resources in different VLANs were partially unreachable, and DNS stopped working correctly in neighboring networks. I could not understand what was happening; there was no pattern to it. The clock is nearing 8. I sat down to read the logs, and there I found real beauty, a squall of repeating errors like:

%DIAG-SW2_SP-3-MAJOR: Switch 2 Module 5: Online Diagnostics detected a Major Error. Please use 'show diagnostic result' to see test results.
%DIAG-SW2_SP-3-MINOR: Switch 2 Module 2: Online Diagnostics detected a Minor Error. Please use 'show diagnostic result' to see test results.

Here, Module 5 is the SUP-720-10G-3C supervisor, and Module 2 is the WS-X6724-SFP line card.
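As the messages suggest, the details live in the online diagnostics results, pulled per module (a hedged sketch; on a VSS pair the command may additionally take a switch qualifier, depending on the release):

! detailed self-test results for the suspect supervisor and line card
# show diagnostic result module 5 detail
# show diagnostic result module 2 detail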

We immediately sent all the logs, along with show tech-support and show diagnostic output, to the company we have a maintenance contract with; they, in turn, opened a case with Cisco TAC. Three hours later I received the Cisco TAC verdict: BOTH (!!!) supervisors, as well as the line card, are partially faulty and must be replaced. Bingo! After analyzing the logs, the Cisco TAC engineers reported that the supervisors had already been faulty before the reboot, and that during the reboot, because of those faults, they could not pass the self-tests and resume correct operation. When we asked how we could have found out about the failing supervisors beforehand and, knowing that, avoided rebooting them, we got no answer.
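Collecting that bundle is usually just a matter of redirecting the show output to flash instead of the terminal (a hedged sketch; the file names are made up):

! dump the support bundle and the full diagnostics report to the CF card
# show tech-support | redirect disk0:showtech-sw2.txt
# show diagnostic result module all detail | redirect disk0:diag-sw2.txt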

So much for fault tolerance, when two pieces of hardware worth $42,000 apiece (the price is from the autumn Cisco GPL) die simultaneously on a chassis reboot. And that is not counting the dead line card.

An engineer drove out to us with the only replacement supervisor and line card the service company had left at that moment.
In the meantime, having analyzed the situation, we realized that the customers experiencing network problems were the ones connected to that particular line card; they were moved to other ports, and we at least limped through to the end of the working day. After the working day ended, the config, vlan.dat and the new IOS version 12.2(33)SXJ6 were loaded from a flash card onto the replacement SUP-720-10G-3CXL. We decided to start with chassis (1), the one that was currently powered off. We replaced the supervisor, pulled out all the transceivers (just in case) and brought the chassis up: no errors. After that the working chassis (2) was powered off and the transceivers were moved into chassis (1), and the network came back to life on one repaired wing, running on the temporarily provided SUP-720-10G-3CXL.
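Preparing the replacement supervisor mostly came down to copying the files onto its flash and pointing the boot variable at the new image (a hedged sketch; the SXJ6 file name follows the naming pattern of the SXI6 image above and is an assumption):

# conf t
(config)# boot system flash disk0:s72033-adventerprisek9_wan-mz.122-33.SXJ6.bin
(config)# config-register 0x2102
(config)# end
# wr mem
! confirm what the supervisor will boot next time
# show bootvar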

Note that VSS will not come up between two different supervisors, SUP-720-10G-3CXL and SUP-720-10G-3C, complaining with:
02:16:21 MSK: %PFREDUN-SW1_SP-4-PFC_MISMATCH: Active PFC is PFC3CXL and Other PFC PFC3C

So this was only a temporary crutch until two identical SUP-720-10G-3C supervisors could be delivered to us.
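For completeness: as far as I know, Cisco documents a knob that forces the XL supervisor down to PFC3C operation so that a mixed 3C/3CXL pair can form a VSS, but we treated the CXL strictly as a stopgap and did not test this ourselves (a hedged sketch only):

! check which PFC mode the chassis is currently operating in
# show platform hardware pfc mode
! reportedly forces PFC3C operation for a mixed supervisor pair (untested by us)
(config)# platform hardware vsl pfc mode pfc3c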
Since leaving a production network on "one wing" for any length of time is not an option at all, another night window was agreed in the near future to replace the supervisor and the line card in chassis (2). By then, two new SUP-720-10G-3C supervisors and a WS-X6708-10GE line card had been delivered to the site. Following the earlier scenario, the firmware, vlan.dat and a fresh config from the already running chassis (1) were loaded onto the new supervisors. All transceivers were pulled from chassis (2), out of harm's way. We made sure that chassis (2) booted on the new hardware without errors, powered it off, connected the VSL links and powered it back on. The flight was normal: the VSS formed, the transceivers went back in and the network came back to life on both wings. Our joy knew no bounds. HOORAY! The nightmare was finally over.
But the joy did not last long. A few days after this work, in the middle of the working day, the VSL link between the supervisors suddenly went down. What the hell? On brand-new hardware? Luckily the second VSL link between the line cards stayed up (the VSL between the chassis is a port-channel assembled from two ports on each chassis), so Dual Active Detection passed us by. By swapping the 10GBASE-LRM transceivers around and moving them into neighboring ports, we established that the transceiver slot on the supervisor in chassis (1) had died (ACHTUNG!), which the logs confirmed.
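For context, the VSL skeleton on each chassis looks roughly like this; the domain, switch and interface numbers below are made up for illustration (a hedged sketch), with one member port on the supervisor and one on a line card:

switch virtual domain 100
 switch 1
!
interface Port-channel10
 description VSL towards chassis 2
 switch virtual link 1
!
! member links: one 10G port on the supervisor, one on a line card
interface TenGigabitEthernet1/5/4
 channel-group 10 mode on
interface TenGigabitEthernet1/4/1
 channel-group 10 mode on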



Then came the now-familiar procedure of opening a case with Cisco TAC. Briefly, the TAC engineer's answer was: enable diagnostic bootup level complete, reseat the supervisor in the chassis and watch what happens. If the issue is related to a faulty fabric channel on the supervisor module or on one of the line cards, the problem will show up again fairly soon. If it does not reappear within 48 hours, it can be "...termed to be a one time" event, i.e. a one-off.
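The diagnostics level itself is a one-line change, plus a command to confirm it; a minimal sketch:

# conf t
(config)# diagnostic bootup level complete
(config)# end
# wr mem
! confirm the level that will be used at the next boot
# show diagnostic bootup level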
We are now coordinating the next maintenance window to carry out this work; I will post an update here with the results.

I hope this article helps someone out of a difficult situation, or helps them avoid one altogether. I will gladly answer any questions in the comments.
Strong, supervisor-grade nerves to everyone in our, at times breathtaking, profession :)

UPD: Last night, following the Cisco TAC engineer's recommendation above, the supervisor was hot-pulled from the active chassis; one ping was lost and the second chassis became active. The supervisor was reinserted, and the allegedly faulty 10GBASE-LRM slot was checked: it works. We fly on.

Source: https://habr.com/ru/post/211685/

