
JunOS update on EX4500 switches in VirtualChassis - what could go wrong? Part 1

I searched Habr and did not find an article covering this topic in enough depth, hence this post. There will be two parts, since there is a bit too much material for a first article.



Since I'm new here, let me introduce myself: my name is Yura, and I currently work as a network engineer at a service provider. Behind me are teaching at the Cisco Networking Academy, a fairly close acquaintance with Cisco's low-end and mid-range devices, and the last four years with Juniper devices. For a long time I could not gather my thoughts for this post, but in the end the thirst for epistolary recognition and the desire to help colleagues won out.



By way of introduction: the update principles and the reasoning below are valid for most (if not all) Juniper devices. In my case, these are two EX4500 switches operating in a VirtualChassis (virtual chassis, VC). I will not cover the VirtualChassis technology itself, as that would inflate the post to an indecent size. Besides, the topic has already been discussed on Habr here. I will only note that it vaguely resembles Cisco's VSS: several physical devices are combined into one logical device with a common control plane, configuration, protocols, tables, etc. I won't dwell on the advantages and disadvantages, implementation and other internal subtleties. The comparison is only a reference point; the differences between VC and VSS are large, as are those between Cisco IOS and JunOS.

My patient is two physical EX4500 switches combined into the aforementioned VC (that is, one logical device) and located in the network core. Upstream are two independent BGP routers, each connected by one link to its own physical EX4500 (a "member" of the virtual chassis) and each sending a default route that summarizes the Full View table. Downstream is a network of virtualization hosts and physical servers that use the VC as their default gateway. The original JunOS version is 11.1R3.5; uptime since installation is about 800 days.



The update had been needed long before I joined the company (hence the rather ancient version), but either nobody dared, given the importance of the box, or there was no time for it, or it was a case of "it works, don't touch it". That no longer matters. Support for this version was coming to an end and the update had to be done. Since this is the core of a service provider network, and many of its customers are highly sensitive to unavailability, the downtime requirements are quite strict. On top of that, internal traffic (for example, backups) also passes through the core. Looking ahead, I will say that without a bit of luck I would have failed these requirements. On the other hand, without a bit of bad luck this post would not exist.



So, according to information from the manufacturer, the system can be updated in the following ways:



  1. Normal update
  2. Nonstop Software Upgrade (NSSU)
  3. In-service Software Upgrade (ISSU)


Starting from the end: ISSU. The idea is that the OS is first updated on the primary RoutingEngine (RE) while all management and traffic are switched to the standby RE. The engines then swap roles and the backup RE is updated. For this to work, protocols and tables must be synchronized between the engines, which is achieved with Graceful Routing Engine Switchover (GRES) and Nonstop Bridging / Routing (NSB / NSR).



This type of update is suitable for devices with two physical RoutingEngines (the control plane; information about the architecture and RoutingEngines can be found here) and/or devices on which JunOS is virtualized, for example some MX routers or the QFX5100 and EX4600 switches, but only when working in stand-alone mode. That is, in my case, a miss on both counts.



NSSU is suitable for devices with a single physical RE that are combined into a VC or VCF (VirtualChassis Fabric; put simply, the next generation of VC). In this case, the RE of another member device plays the role of the standby RoutingEngine. The procedure updates the system in a virtual chassis by rebooting one member device at a time, which minimizes downtime provided that aggregated links terminated on different VC members are used, and it works with any supported VC/VCF configuration. GRES and NSB/NSR are only recommended here; they help reduce downtime. It sounds great, but with two caveats: the minimum JunOS version is 12.1, and Juniper TAC strongly recommends against using it. The target version in my case is 12.3, the highest recommended by the manufacturer at the time. That version does support NSSU as well as NSB/NSR, which allows changing the master RE in the chassis on the fly without packet loss and updating the system without downtime for users.



In the end, given my circumstances, only one way was left: the "normal" update. "Normal" here means an update with a simultaneous reboot of all VC devices and, accordingly, about 5-15 minutes of network outage. The estimate is taken from the manufacturer's manuals; in my case it took about 4 minutes. The procedure for such an update (without problems) is also covered here. If you are tired of reading and are not afraid of possible problems during the update, I recommend taking a look.
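The selection logic described above can be summarized in a few lines of Python. This is only a rough sketch of the reasoning in this post, not a Juniper tool: the function names and the boolean flags are made up for illustration, and the ISSU condition is deliberately simplified to "dual RE or virtualized JunOS, and not in a VC".

```python
import re
from typing import List, Tuple

# Minimum running JunOS version for NSSU, as stated in the text above.
NSSU_MIN_VERSION = (12, 1)


def parse_junos_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '11.1R3.5'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised JunOS version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def candidate_methods(running_version: str,
                      in_virtual_chassis: bool,
                      dual_re_or_virtualized: bool) -> List[str]:
    """List the update methods that the rules in this post leave open.

    Both flags are hypothetical inputs the engineer fills in by hand.
    """
    methods = ["normal"]  # a plain update with a full reboot is always possible
    if dual_re_or_virtualized and not in_virtual_chassis:
        methods.append("ISSU")
    if in_virtual_chassis and parse_junos_version(running_version) >= NSSU_MIN_VERSION:
        methods.append("NSSU")
    return methods


# The situation from this post: EX4500 pair in a VC, running 11.1R3.5.
print(candidate_methods("11.1R3.5", in_virtual_chassis=True,
                        dual_re_or_virtualized=False))
```

For my case only "normal" remains. Note that the sketch does not encode the TAC recommendation against NSSU; that remains a judgment call.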



Since I spent quite a long time preparing, I will venture to present my notes in full, even though not all of them applied in my situation. They may well be useful to you.



First of all, the manufacturer (and now I as well) recommends checking and, if necessary, updating the JunOS bootloader version and the partitioning of the internal storage. In my particular case this was not necessary, but later I will show why it can matter.



If you update devices from version 10.4R2 or earlier to 10.4R3 or later, you will need to update the platform-specific bootloader in addition to the OS itself. Upgrading the bootloader also reformats the internal storage and creates a second partition for the operating system. As a result, the device will have two partitions with the operating system, one active and one spare, as well as two partitions for other files (logs, home directories, etc.). In which situations does the bootloader need updating? First, if the output of the show chassis firmware command contains no bootloader version after the word "U-Boot" and before the parenthesis:



user@switch> show chassis firmware
Part      Type      Version
FPC 0     uboot     U-Boot (May 19 2010 - 05:03:13)
          loader    FreeBSD/PowerPC U-Boot bootstrap loader 2.2
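If the output has been saved to a file, this check is easy to script. Below is a minimal Python sketch; the function names are my own, and the regular expression only assumes the line layout shown in the listing above (a "uboot" type column followed by "U-Boot", an optional version, and an opening parenthesis).

```python
import re
from typing import List

# Matches the 'uboot' row and captures whatever sits between 'U-Boot' and '('.
UBOOT_LINE = re.compile(r"\buboot\s+U-Boot\s*([^(]*)\(")


def uboot_versions(firmware_output: str) -> List[str]:
    """Return the version field of every 'uboot' row ('' when it is missing)."""
    return [m.group(1).strip() for m in UBOOT_LINE.finditer(firmware_output)]


def loader_upgrade_needed(firmware_output: str) -> bool:
    """True if at least one member shows no U-Boot version before the parenthesis."""
    versions = uboot_versions(firmware_output)
    if not versions:
        raise ValueError("no 'uboot' row found in the output")
    return any(version == "" for version in versions)


# Output copied from the listing above (one lab device, FPC 0).
SAMPLE = """\
Part      Type      Version
FPC 0     uboot     U-Boot (May 19 2010 - 05:03:13)
          loader    FreeBSD/PowerPC U-Boot bootstrap loader 2.2
"""

print(loader_upgrade_needed(SAMPLE))  # True: no version before the parenthesis
```

In a VC the output contains one "uboot" row per member, so the function reports True as soon as any member lacks the version.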


Note that this command, like many of the following ones, displays information for each member device of the VC. Each device is identified by its FPC number (0, 1, 2, etc., depending on the number of devices). Here there is only one device: there were not enough old devices in production, so I took one from the lab.



Second, if the internal storage contains only three partitions (called slices) in the output of the show system storage command:



Command output
user@switch> show system storage

fpc0:
-----------------------------------------------------------
Filesystem       Size   Used  Avail  Capacity  Mounted on
/dev/da0s1a      184M   120M    50M       71%  /
devfs            1.0K   1.0K     0B      100%  /dev
/dev/md0          35M    35M     0B      100%  /packages/mnt/jbase
/dev/md1          18M    18M     0B      100%  /packages/mnt/jcrypto-ex-10.3R1.9
/dev/md2         6.4M   6.4M     0B      100%  /packages/mnt/jdocs-ex-10.3R1.9
/dev/md3         145M   145M     0B      100%  /packages/mnt/jkernel-ex-10.3R1.9
/dev/md4          22M    22M     0B      100%  /packages/mnt/jpfe-ex42x-10.3R1.9
/dev/md5          46M    46M     0B      100%  /packages/mnt/jroute-ex-10.3R1.9
/dev/md6          27M    27M     0B      100%  /packages/mnt/jswitch-ex-10.3R1.9
/dev/md7          21M    21M     0B      100%  /packages/mnt/jweb-ex-10.3R1.9
/dev/md8         126M  10.0K   116M        0%  /tmp
/dev/da0s1f      123M   1.3M   112M        1%  /var
/dev/da0s3d      314M   146K   289M        0%  /var/tmp
/dev/da0s3e       55M    78K    51M        0%  /config
/dev/md9         118M    12M    96M       11%  /var/rundb
procfs           4.0K   4.0K     0B      100%  /proc
/var/jail/etc    123M   1.3M   112M        1%  /packages/mnt/jweb-ex-10.3R1.9/jail/var/etc
/var/jail/run    123M   1.3M   112M        1%  /packages/mnt/jweb-ex-10.3R1.9/jail/var/run
/var/jail/tmp    123M   1.3M   112M        1%  /packages/mnt/jweb-ex-10.3R1.9/jail/var/tmp
/var/tmp         314M   146K   289M        0%  /packages/mnt/jweb-ex-10.3R1.9/jail/var/tmp/uploads
devfs            1.0K   1.0K     0B      100%  /packages/mnt/jweb-ex-10.3R1.9/jail/dev

{master:0}



Here we are interested in the device /dev/da0 and the fact that there are only three slices, visible as /dev/da0s1X and /dev/da0s3X. Where did /dev/da0s2X go? Don't ask, I did not find the answer. Either way, the fourth slice is not here. If it were, it would show up as /dev/da0s4d.
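The same check can be automated against saved output. A small Python sketch under the assumption that filesystem names start each line in the form /dev/da0sNx, as in the output of show system storage; the function names are illustrative only.

```python
import re
from typing import Set

# Slice number N from device names such as /dev/da0s1a or /dev/da0s3d.
SLICE_DEVICE = re.compile(r"^/dev/da0s(\d+)[a-z]\s", re.MULTILINE)


def mounted_slices(storage_output: str) -> Set[int]:
    """Return the set of da0 slice numbers that appear as mounted filesystems."""
    return {int(number) for number in SLICE_DEVICE.findall(storage_output)}


def has_fourth_slice(storage_output: str) -> bool:
    """True if a /dev/da0s4x filesystem is present, i.e. the new layout is in use."""
    return 4 in mounted_slices(storage_output)


# Excerpt of the da0 lines from the lab switch output.
SAMPLE = """\
/dev/da0s1a      184M   120M    50M       71%  /
/dev/da0s1f      123M   1.3M   112M        1%  /var
/dev/da0s3d      314M   146K   289M        0%  /var/tmp
/dev/da0s3e       55M    78K    51M        0%  /config
"""

print(sorted(mounted_slices(SAMPLE)))  # [1, 3]
print(has_fourth_slice(SAMPLE))        # False: old layout, repartitioning needed
```

The sketch only sees mounted filesystems, which is why slice 2 does not appear in the result, just as it does not appear in the command output.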



Another way to check whether the storage has been repartitioned is to issue the show system storage partitions command. If the command returns an error, an update will be required.

If the bootloader needs updating and the storage needs repartitioning (which is recommended whenever there is any doubt), remember that this means formatting and, as a result, the deletion of all data. Accordingly, save all important information first, or better yet the entire contents of the internal flash. The bootloader update process is trivial and mirrors the one for JunOS itself:



user@switch> request system software add /var/tmp/jloader-XXX.tgz
user@switch> request system reboot


It is recommended to update the bootloader and the system at the same time, so that the device only needs to be rebooted once. There is one exception. If you convert a system with the old layout (3 partitions) to the new one (4 partitions) and update JunOS at the same time, the updated version will be written to both boot partitions (primary and backup). This may be undesirable when you do not know how the updated system will behave in production. In that case you will have to update the bootloader and the system separately and reboot the devices twice. In the lab, upgrading the loader plus the system on a single EX4200 took me 13 minutes; newer, faster devices may need less.



Now back to the main topic.



As already mentioned, I did not need to update the bootloader, so I started by backing up everything: all files via SFTP, the text config, and snapshots to the internal storage and to an external USB flash drive:



user@switch> request system snapshot media internal member 0
user@switch> request system snapshot media internal member 0 slice alternate
user@switch> request system snapshot media internal member 1
user@switch> request system snapshot media internal member 1 slice alternate
user@switch> request system snapshot partition media external


Although Juniper recommends certified flash drives, almost any drive with a capacity of 2GB or less, formatted as FAT with an MBR, will do. In the commands above, the first 4 lines create a snapshot of the running OS and configuration in the active and backup (slice alternate) partitions on both devices in the VC. If there are more devices, repeat for each "member N". Instead of member N you can also use the "all-members" keyword, but I prefer to be explicit. The snapshot lets you later roll back to the old system version and/or boot from the backup partition or from the external flash drive if anything goes wrong.
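For a larger chassis it is easy to lose track of which member has already been handled, so the per-member command list can be generated rather than typed. A minimal Python sketch; the helper name and its parameters are my own, and the command strings are exactly the ones used in this post.

```python
from typing import List


def snapshot_commands(member_count: int, include_external: bool = True) -> List[str]:
    """Build the explicit per-member snapshot commands for a VC of the given size."""
    commands = []
    for member in range(member_count):
        # Running system -> primary partition of this member.
        commands.append(f"request system snapshot media internal member {member}")
        # Primary partition -> backup (alternate) partition of this member.
        commands.append(
            f"request system snapshot media internal member {member} slice alternate"
        )
    if include_external:
        # Snapshot to the external USB drive, repartitioning it in the process.
        commands.append("request system snapshot partition media external")
    return commands


for command in snapshot_commands(2):
    print(command)
```

For two members this prints the same five commands, in the same order, that I ran by hand.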



Important! Before taking snapshots, check the currently running system version (show version) and the contents of the primary and backup partitions (show system snapshot media internal) to make sure the versions match. The catch is that the "request system snapshot media internal member N" command copies the currently running version to the primary partition, while the "request system snapshot media internal member N slice alternate" command copies from the primary partition to the backup partition. For some reason, my primary partition held an older version than the backup one. A little pedantry and caution will only help here.



The last command takes a snapshot onto the external flash drive and, among other things, formats it (the partition keyword is required) to match the contents and partition table of the internal flash.



Before upgrading, it is advisable to copy the new system image to the internal storage, unless you are upgrading from an external drive or an FTP/HTTP server. Juniper recommends placing the image in /var/tmp. I would NOT recommend doing so, because the contents of /var are removed during the update, and if something goes wrong (as it did for me) you will have to upload the image again. Needless to say, such a delay is highly undesirable on a critical system. It is better to use a directory created beforehand in the root, and to clean up the storage after the update with "file delete /directory/filename".



Next comes the update itself, followed by a reboot:



user@switch> request system software add /var/tmp/jinstall-XXX.tgz validate
user@switch> request system reboot


The "validate" option checks the compatibility of the device configuration with the new version of the operating system. On my EX4500 devices it runs automatically and does not need to be specified explicitly, but on others it must be set. Better to state it explicitly; it won't hurt.



So why this long post with a pile of explanations and warnings? After 4 minutes of rebooting and updating, I see that the interface LEDs have lit up only on the second device (member1). For me this is not so bad: all LACP links are connected to both devices and (hello, luck!) this particular device is connected to the main Internet uplink, so the customers are back online. The only non-LACP link is connected to member0 and connects the server fleet to the backup server. That is not critical for me, since all backups had been stopped.



At this point a happy administrator finishes the job, slurps the last of the coffee and goes home with peace of mind, smiling at the night city. In my case, this material is only a third of the whole article. If the topic is of interest, I will gladly continue; the most interesting part is yet to come!

Source: https://habr.com/ru/post/320266/


