How we implemented SD-Access, and why it was needed

The main monitoring page.

SD-Access is the implementation of a new approach to building local area networks from Cisco. Network devices are combined into a factory, an overlay is built on top of it, and all this is controlled by a central component - the DNA Center. All this has grown from network monitoring systems, only now the mutated monitoring system not only monitors, but collects detailed telemetry, configures the entire network as a single device, finds problems in it, offers solutions to them, and in addition enforces security policies.

Looking ahead, I will say that the solution is rather cumbersome and at the moment is not trivial in terms of development, but the larger the network and the more important the security, the more profitable it is to move on: it seriously simplifies management and troubleshooting.
')

Prehistory - how did we decide on this?

The customer moved to a new, freshly purchased office from a rented office. The local network was planned to be made according to the traditional scheme: core switches, access switches, plus some familiar monitoring. At that time, we just deployed a stand with SD-Access in our laboratory and managed to feel a little about the solution and get trained with an expert from Cisco’s French office who visited Moscow very well.

After talking with the vendor, both we and the customer decided to build the network in a new way. Here are the benefits:

SD-Access should simplify the operational routine — setting up ports and access rights for connecting users. In the new solution, these settings are made using wizards, and the port parameters are set in an obvious form in relation to the Administrators, Accounting, and Printers groups, and not to the VLAN and IP subnets. It is easier to understand, more difficult to make a mistake. For the customer, this simplification is important because his IT competence center is located in Siberia, and the office we raised is in Moscow. The center is overloaded with complex tasks and works in its time zone, therefore, the more network maintenance tasks the specialists can solve on the spot, the more time the Center has for expert work.
Part of the network troubleshooting tasks that the Center was previously engaged in, thanks to the new architecture, will also be able to be performed by on-site specialists. For this purpose, its wizards and telemetry data and traffic paths in the network are provided in an understandable form. If complex problems arise, you will still have to delve into debags, but small problems are much more often solved “on the spot” with the involvement of a smaller amount of knowledge.
It is important for the customer to ensure security: the ideology of SD-Access implies a clear separation of users and devices into groups and the definition of interaction policies between them, authorization during any client connection to the network and provision of “access rights” throughout the network. The IT division is getting used to planning and maintaining a network in the spirit of this ideology. It will be inconvenient to administer the system in another way, but if you follow the right approaches, administration becomes easier. On the contrary, in the traditional network, configs grew and became more complicated, and it became more difficult to maintain them.
The customer will have to update other offices scattered throughout the country. If SD-Access is implemented there, the strength of the previous two points will only increase.
The launch process for new offices is also simplified thanks to plug-and-play agents in the switches. To run on the motocross with the console, or even to go to the object is not necessary.

We saw the flaws later.

Planning

We figured the top-level design. The planned architecture began to look like this:

Below this is the underlay, built on the usual protocols (the base is IS-IS), but the idea of the solution is such that the subtleties of his work should not interest us. Overlay is performed on LISP and VXLAN. The solution logic assumes the preferred use of 802.1x authentication on access ports. However, the customer intended to use it on a mandatory basis for all initially. You can do without 802.1x and configure the network almost “in the old manner”, then you need to configure the pools of IP addresses manually, and then again with your hands on each port to prescribe which IP pool it belongs to, and make a Copy-Paste, as before In the command line, it does not work out, everything is only through the web. With this approach, the advantages of the solution turn into fat minus. Such a scheme can be used only where it is inevitable, but not on the entire network. The use of access rights is provided by SGT tags.

We ordered equipment and software, but for now everything was going, we began to “land” the design in order to understand what we were going to customize. Here we encountered the first difficulty: if earlier it was necessary to match the IP subnets and the dialing of VLAN numbers with the customer so that it fits into the schemes adopted by him, now we are not interested in this: we needed to understand which groups of users and devices network, how they interact with each other and what network services they use. Unusual for us and for the customer. It was more difficult to get such information. At first glance, it was from such data that you always had to make a start when designing networks, but in practice, the standard set of VLANs was almost always laid out, and then reality was crammed into it during the operation by the callused hands of admins. In the SD-Access paradigm there is no choice: the network is built “for business”.

The deadlines were shrinking, the equipment drove up. It was necessary to customize.

How we implemented it

The process of network implementation differs from the old schemes even more than the planning process. Previously, the engineer connected the devices to each other, set them up one by one, and received one after another working network segments. With SD-Access, the deployment process is as follows:

Interconnect all network switches.
Raise all DNA Center controllers.
Integrate them with ISE (through it all authorization takes place).
Use DNA Center to turn network switches into a factory.
Paint the roles of switches in the factory (Edge Node, Control Node, Border Node).
Configure the DNA Center groups of end devices and network users and virtual networks.
Customize the rules of interaction between them.
Apply device groups and rules to the factory.

This is the first time. Moreover, the DNA Center for primary deployment requires DNS, NTP and access to the Cisco cloud for downloading updates (from a Smart Account). At our implementation, it turned out that the DNA Center loves to update itself during the initial installation: it took about two days to bring all its components to the current versions, although it happened mostly without our participation.

An example of a assembled factory.

When the DNA Center is already working to raise a new office, it’s enough to repeat points 1, 4, 5 and 8. Thanks to the Plug-and-Play Agent, the new switches receive the addresses of the DNA Center via DHCP (optional), take the preliminary configs from there and become visible DNA Center management interface. It remains to paint their roles (Egde / Control / Border), and the new factory is ready. Groups of devices and policies on it can be used old.

Of course, when confronted with such a process for the first time, it is difficult to understand from which side to approach it. In addition, along with the SD-Access paradigm and related products, Cisco has generated so many new terms and definitions that it will enable even an experienced CCIE to feel young again. Here are the main ones:

Scalable Group - groups of devices with similar access rights to network resources: those same "Administrators", "Accounting", "Printers", etc.
Virtual Network is an isolated L2-L3 segment, which includes groups of devices. Essentially, VRF. It is assumed that the connection between such segments will occur through the firewall. It makes sense to split groups on such virtual networks in cases when maximum access control is needed, for example, you can select three different Virtual Networks for a video surveillance system, office staff and his guests.
Access Control Contract - rules for networking between groups.
Control Plane / Edge / Border / Intermediate Node - different types of switches in the DNA Center factory depending on their functions: Edge Node - connecting users; Control Plane Node - tracking client connections, LISP Map-Server and Map-Resolve; Border Node - providing communication with external networks; Intermediate Node - intermediate switches, analogue of Distribution-level in traditional networks.
The role of the device in the network (Device Role) is how the DNA Center sees the role of the device depending on what connections it sees: Access, Distribution, Core, Border Router, or Unknown. This attribute may come in handy, for example, when determining the recommended IOS version: you can assign different recommended versions for access switches and kernel switches if the switch series is the same.

In general, how to learn concepts should be those who introduce, and those who will exploit it. From ignorance, implementers tighten deadlines, and then the admins then drop KPIs. So you can stay without bonuses. But the distrust of the customer’s management towards the chosen solution is a problem for everyone in general.

With the introduction due to the fact that the customer already had to call in to the new office, we went as follows:

Created one group and one virtual network at all in OpenAuth mode without forced authorization, only connection logs.
Admins have connected workstations, printers, etc. to the network, users have moved to a new office and started working.
Next, there was one user who logically should belong to another group.
We set up this group in the DNA Center and the policy of its interaction with other groups.
Moved the user to this new group and enabled for him ClosedAuth with authorization.
Together with the customer’s specialists, they identified the access problems encountered by the user and corrected the contract settings (policies of the interaction of his group with the others).
When they were sure that the user was working without problems, they moved other users who should belong to it to his group and watched what was happening.

Then the items from the 3rd to the 7th needed to be repeated for new groups until all users and devices connected to the network were in their own groups. When operating in OpenAuth mode, the client device attempts authorization. If successful, the port to which it is connected applies the settings corresponding to the group to which this device belongs, and if unsuccessful, it enters the IP Pool pre-configured on the switch port - a kind of rollback to the traditional mode of the local network.

Of course, as with any new product, we spent quite a few hours updating software and identifying bugs. Fortunately, Cisco TAC helped with this promptly. One morning, logging into the DNA Center's web interface, we found that the entire network was down. At the same time, not a single complaint from users: the office works while drinking morning coffee. Rummaged in the logs, and it turned out that there was a problem with SNMP, through which DNA Center receives information about the state of the factory. The network is not visible, but it is. The elimination of part of the OID from the polling helped.

Component version page.

How to exploit it?

DNA Center collects a bunch of useful SNMP, Netflow and Syslog data from the factory and knows how to present it. This is especially useful when solving floating problems like “something yesterday, many telephony fell off, although now it seems normal.” You can climb in the data of the Application Experience and try to understand what was happening. So there is a chance to fix the problem before it “flies” next time. Or prove that the network had nothing to do with it.

Data on the quality of the application.

For many of the problems that DNA Center shows as Alarm, it tells you where to dig.

An example of an OSFP Adjacency crash message with a clue what to do.

It became easier to carry out routine analysis. For example, if necessary, you can quickly track the path of traffic over the network, without climbing the devices one by one. With authorization through ISE, the DNA Center picks up and shows the names of clients, including on the wired network: no need to climb in search of an IP address.

An example of tracking traffic through a network. A red tag on one of the devices says that traffic is blocked on its access control list.

You can quickly see which network segment is covered by the problem (the switches in the DNA Center are broken down into locations, sites, and floors).

The “gamified” indicator of the quality of life of applications on the network as a percentage makes it possible to superficially assess the state of the network and see if it is not getting worse with time.

Indicator of quality of life applications.

As before, Prime Infrastructure also provides software version control on network devices. DNA Center maintains its repository, where images can be uploaded either manually or automatically downloaded from Cisco.com, and then deployed to devices. In this case, you can program and run scripts to verify the correctness of the network before and after the update. The standard guest bill script, for example, includes checking the availability of free space on the flash, the confi-register status, whether the config is saved. Software patching is supported for devices that can do it.

Software repository at the DNA Center.

And, of course, access to the command line of the network glands is still there.

Total

The product is new, new approaches can be implemented, however, carefully. Because of the newness of the code, there are bugs in the work, but Cisco technical support responds promptly, and developers release updates regularly. Due to the novelty of the network management approach, the probability of errors in the early stages of operation is quite high, but administrators gradually get used to it, and errors become less than with the support of a traditional LAN. It is worthwhile to think in advance about how to test and test everything on the part of users, and then apply it to everyone (although with experience you understand that this is useful when implementing any IT solutions, even the most understandable and proven).

What is the use? Automation, acceleration of typical operations, reduction of downtime due to configuration errors, improving the reliability of the network due to the fact that the causes of failure in the network are known instantly. According to Cisco, an IT administrator will save 90 days a year. Separately, security: with the Zero Trust-approach, an epic problem with subsequent entry into the press can be avoided, but for obvious reasons, very few people appreciate this.

Source: https://habr.com/ru/post/457738/

All Articles