There was a time when driving was almost an art. Back in the era of the glorious classics (the "fives" and "sixes"), you had to know how to clean a carburetor, replace a fuel pump, and what a choke was.
There was a time when computers were huge and the word "debug" was used in its most literal sense. When the first PCs started to appear in our homes, it was important to understand what the north and south bridges were, how to install a video card driver, and which value in the registry to change to get a finicky game running.
Today, at least for personal use, and even for business, you come to the store, buy a car, press the "ON" button and start using it.
Yes, there are nuances: this trick won't work with a System p5 595 or a Bagger 293, where technical specialists are still needed. But at the basic level, even for a company with several branch offices, you buy a GAZelle and a few PCs, provide them with Internet access, and you can work.
Some time ago I had an argument with a person from a network far, far away. He had a reasonable question: why can't the creation and configuration of small corporate networks be automated? Why can't he buy one box for each of his branches, press five buttons on each, and get a working, stable network?
The question went even deeper and touched a personal nerve: why do we need so much technical support staff (at companies, providers, vendors)? Is it really impossible to find and fix most problems automatically?
Yes, I know many of the arguments that immediately pop into your head. This formulation of the question initially seemed utopian to me as well. And, logically enough, in the field of communications it still looks impossible today: at most within the scope of a single home router, and even there not everything goes smoothly.
However, the question is not without a rational core, and it occupied my mind for a long time. True, I abstracted away from corporate SOHO and SMB networks and devoted my thoughts to provider networks.
From the point of view of a non-technical person it may seem that automatic configuration is more important and easier to implement than troubleshooting. But it should be obvious to any engineer that it is troubleshooting that lays the yellow brick road: if we don't know what can go wrong, how can we even try to configure anything?

In this opening article, under the cut, I want to share my thoughts on the various obstacles on the way to this goal and the ways to overcome them.
In my opinion, the primary task is to teach the equipment to find problems and their causes automatically. This is what I would like to talk about first today.
The next step is to learn how to fix them. That is, knowing the cause of a problem, in most cases it is quite possible to solve it without human participation.
A degenerate case of fixing problems is configuring from a template. We have an LLD (Low Level Design, a detailed network design) and, based on it, the Monitoring System configures all the equipment, from access switches to high-end routers in the network core: IP addresses, VLANs, routing protocols, QoS policies.
In principle, in one form or another this already exists: template-based autoconfiguration is nothing outlandish. It's just that it usually covers not the entire network but some homogeneous segment, for example the access switches. And there is a rigid binding to the command interface and, accordingly, to the manufacturer.
Right now all of this is implemented bluntly: the script performs no consistency checking, so if there is an error in the design, it ends up on the hardware as well. There are, of course, configuration validators, but that is yet another "manual" step.
The maximum goal is ambitious: automatic generation of the topology, the IP plan, the switching tables and, in fact, configuration of everything and everyone. The maximum human involvement is to approve the design, rack the equipment and run the cables.
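To illustrate the basic idea, here is a minimal sketch of template-based generation from an LLD. The LLD structure is invented for the example, and the output lines are schematic pseudo-CLI rather than any real vendor syntax:

```python
# A minimal sketch: rendering device configuration from an LLD description.
# The LLD dictionary and the pseudo-CLI output are purely illustrative.

lld = {
    "access-sw-01": {
        "vlans": [10, 20],
        "uplink": {"interface": "GE0/24", "ip": "10.0.0.2/30", "ospf_area": 0},
    },
}

def render_config(hostname: str, device: dict) -> str:
    """Produce a flat configuration text for one device from its LLD entry."""
    lines = [f"hostname {hostname}"]
    for vlan in device["vlans"]:
        lines.append(f"vlan {vlan}")
    uplink = device["uplink"]
    lines += [
        f"interface {uplink['interface']}",
        f"  ip address {uplink['ip']}",
        f"  ospf area {uplink['ospf_area']}",
    ]
    return "\n".join(lines)

for host, params in lld.items():
    print(render_config(host, params))
```

Note that this is exactly the "blunt" approach criticized above: nothing here checks that the design itself is consistent.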
Automatic troubleshooting
There are many mechanisms that increase fault tolerance and reduce service loss and downtime when a problem occurs on the network: IGP, VRRP, Graceful Restart, BFD, MPLS TE FRR, and more. But these are all scattered pieces. People try to tie them together with varying success, yet they remain heterogeneous entities.
This is reminiscent of the search for a final theory in science, according to which the four fundamental interactions are of the same nature: a universal, unified theory that explains everything simply and clearly. But so far the picture does not come together.
Here is a nice illustration of a properly configured interaction between protocols:

This network runs IS-IS as its IGP, with MPLS TE on top of it and the FRR function activated.
R1 monitors the state of R2 and of the TE tunnel/LSP via BFD. When a line card on R2 restarts due to a software error, BFD on R1 instantly reports this to every process that needs to know. MPLS TE invokes FRR and traffic is redirected to R3 along a temporary path.
At the same time, thanks to the GR functionality, R1 keeps all of R2's routes, all the corresponding FIB entries, and even the neighbor relationship. Meanwhile R2 comes back to life: the card boots, the interfaces come up. On R1 everything is ready, and in the shortest possible time it can send traffic back through R2. As a result, services return to their former path; everyone is happy, and none of the customers ever felt the programmer's crooked hands.
But can you imagine how much configuration is required to organize such an interaction? How often the configuration is wrong and the failover works out quite differently from what we wanted? How often engineers simply lack the competence to configure such things, so that many corporate networks run with a bare minimum of configured services, and thank God everything still works? And how many problems could be detected at an early stage, before they turn into three hours of downtime and a damaged reputation?
So let us classify the problems in order to understand what approaches they require.
- Real-Time Critical Situations
- Problems that are currently present, but do not affect services
- Hardware problems
- Potential software problems
- Incorrect configuration
1. Critical real-time situations
The first kind: problems that arise due to circumstances, such as a broken cable or a software or hardware malfunction. This is essentially the only type that developers are somehow fighting at the moment, and it is what we looked at above. We already have more than a dozen protocols that monitor the state of links and services and can rebuild the topology based on the actual situation. But the trouble is that, as I have already noted, these are all pieces of different puzzles. Each protocol, each mechanism is configured individually, and it takes remarkable ability to grasp all of this as a whole and to have a basic understanding of how large networks operate.
Well, fine: one way or another we can cope with these problems, the means exist. So what is the problem here, besides the complexity itself? Let me explain. After everything has happened, we either didn't notice anything (50 ms is not always visible to the naked eye), or we dig through tons of logs and alarms trying to reconstruct the causal chain of events. And that, you know, is not easy: the high-level logs may not be enough, while the detailed ones contain a mass of uninformative data, such as a separate record for every LSP that went down, the card reboot process, and so on. And this has to be done not on one piece of hardware but on every device along the traffic path, and often on ones standing off to the side as well. You have to separate the wheat from the chaff: the logs related to this incident from those related to other problems. And it's fine if the network is single-vendor; but if your CE is D-Link, your PE is Huawei, your P is Cisco and your ASBR is Juniper... well, one can only sympathize.
What I am actually driving at: logs are good, logs are beautiful, logs are needed. But they are not in a readable form. Even if your network is set up properly, with NTP configured and a SYSLOG server that lets you view all alarms on all devices in truly chronological order, finding the problem still takes a lot of time.
Moreover, each device knows what happened to it. Returning to the last example, the PE sees the tunnels drop, the VPNs drop, the IGP reconverge. It could inform the Control System in human terms: "At 16:20:12 on January 1, 2013, all tunnels and VPNs in such-and-such direction, through such-and-such interface, went down. In addition, the routing was recalculated. I'm not sure what happened out there, but OSPF told me about the disappearance of the link between devices A and B, and RSVP reported the problem too."
The intermediate P, on which an SFP module burned out, says: "At 16:20:12 on January 1, 2013, the SFP module in my port 1/1/1 was damaged. I checked everything: hardware failure, replacement needed. OSPF and RSVP have notified all neighbors."
Jokes aside, why not develop a standard, some kind of protocol, that would let the device itself perform a minimal analysis and send unambiguous information to the Monitoring System? Having received the data from all devices, assembled and analyzed it, the Control System could then give a very specific message:
"At 16:20:12 the SFP module in port 1/1/1 of device B failed (here a link to the module type, serial number, uptime, average signal level, number of errors on the interface, cause of the failure). This caused the following tunnels (list, with links to tunnel parameters) and VPNs (list, with links to VPN parameters) to go down. At 16:20:12 traffic was switched to the temporary path A-C-B (link to the path parameters: interfaces, MPLS labels, VPNs, etc.). At 16:20:14 a new LSP A-B was built."
2. Problems that are present at the moment, but do not affect services
What kind of problems are these? Errors on interfaces that are still lightly loaded and therefore do not make themselves felt. Flapping of interfaces or routes on backup links, passwords that are too weak, no ACL on the VTY or on an external interface, a large number of broadcast messages, behaviour that looks like an attack (a flood of ARP or DHCP requests), high CPU utilization by some process, missing blackhole routes where route aggregation is configured.
One way or another, many of these situations are already monitored today, and informational messages are written to the logs.
But who tracks them? Such things get no attention until someone is called onto the carpet because thousands of subscribers have lost connectivity. And in automatic mode the equipment does not try to find the cause or correct the situation; at best, export to a SYSLOG server is configured.
It does happen, of course, that certain actions are triggered, such as suppressing broadcast packets when their number exceeds a threshold, or shutting down a port on which flapping is observed. But all of this is treating the symptoms: the equipment does not try to figure out what is causing the behaviour.
What are my thoughts on this situation? First of all, standardization of logs and traps. Global standardization, at the level of committees. All manufacturers must adhere to it strictly, just like the IP standard.
Yes, this is a huge amount of work. Every possible situation and message for every protocol has to be accounted for. But one way or another each vendor already does this individually, inventing its own ways of reporting a problem. So maybe it is better to get together once and agree once and for all? After all, Martini L2VPN was once a private Cisco development too.
Messages could be sent to the Control System in a form like this, for example (a parsing sketch follows the field descriptions below):
"Message_Number.Parent_Message_Number.Device_ID.Date Time/Time_range.Alarm_ID.Optional_parameters"
Message_Number - the sequence number of the failure on the network.
Parent_Message_Number - the number of the parent failure that caused this one.
Device_ID - the unique identifier of the device on the network.
Date - the date of the failure.
Time/Time range - the time the failure occurred, or the period of its duration.
Alarm_ID - the unique identifier of the alarm in the standard.
Optional parameters - additional parameters specific to this alarm.
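To make the format tangible, here is a minimal sketch of how the Control System might parse such a message. The format is hypothetical, so the separator handling (dots between fields, a space between date and time) is an assumption based on the examples further below.

```python
from dataclasses import dataclass

@dataclass
class AlarmMessage:
    message_number: int
    parent_number: int        # 0 means "not caused by anything else"
    device_id: int
    date: str
    time_range: str
    alarm_id: int
    optional: list[str]

def parse_alarm(raw: str) -> AlarmMessage:
    """Parse one dot-separated alarm string of the hypothetical standard format."""
    # A real standard would need escaping rules, since optional parameters
    # may themselves contain the separator character.
    fields = [f.strip() for f in raw.split(".")]
    date, _, time_range = fields[3].partition(" ")  # date and time share one field
    return AlarmMessage(
        message_number=int(fields[0]),
        parent_number=int(fields[1]),
        device_id=int(fields[2]),
        date=date,
        time_range=time_range,
        alarm_id=int(fields[4]),
        optional=fields[5:],
    )

print(parse_alarm("2374698214.0.8422.10/29/2013 09:00:00-10:00:05.65843927456.GE0/0/0"))
```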
Second, the equipment should be able to perform at least a minimal analysis of the situation and of its logs. It should know where the cause is and where the effect is, and send the results of that analysis along with the detailed logs.
For example, if BFD, the IGP and other protocols went down because an interface was physically disconnected, the device should present this as a dependency branch: the port going down caused this and that.
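As a sketch of what such a device-side dependency branch might look like (the event names and the nested structure are invented for illustration, not part of any existing standard):

```python
# Hypothetical device-side dependency branch: the root event is the cause,
# its children are the consequences it triggered.
dependency_branch = {
    "event": "PORT_DOWN",
    "object": "GE0/0/1",
    "children": [
        {"event": "BFD_SESSION_DOWN", "object": "peer 10.0.0.2", "children": []},
        {"event": "ISIS_ADJACENCY_DOWN", "object": "R2", "children": [
            {"event": "ROUTES_WITHDRAWN", "object": "via R2", "children": []},
        ]},
    ],
}

def print_branch(node: dict, depth: int = 0) -> None:
    """Print the cause-and-effect tree in an indented, human-readable form."""
    print("  " * depth + f"{node['event']} ({node['object']})")
    for child in node["children"]:
        print_branch(child, depth + 1)

print_branch(dependency_branch)
```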
Third, the Intelligent Network Monitoring System should present the failure in human-readable form.
Suppose that after such analysis a standardized message was sent to the monitoring server, for example:
"2374698214.0.8422. 10/29/2013 09: 00: 00-10: 00: 01.65843927456.GE0 / 0/0 "
"2374698219.2374698214.8422. 10/29/2013 10: 00: 00-10: 00: 05. 50. 90. R2D2. 70 ”
"2374698220.2374698214.8422. 10/29/2013 10:00:01. 182. GE0 / 0/0. Abnormal Power flow. Power treshold is reached. Abnormal power timer is expired »
The Control System parses these messages into their components:
Failure number 2374698214. Not a consequence of anything. Occurred on the device with ID 8422 on 10/29/2013. Lasted from 09:00:00 to 10:00:05. Universal alarm ID 65843927456. Additional parameter: GE0/0/0.
Failure number 2374698219. Caused by failure 2374698214. Occurred on the device with ID 8422 on 10/29/2013. Lasted from 10:00:00 to 10:00:05. Universal alarm ID 50. Additional parameters: 90, R2D2, 70.
Failure number 2374698220. Caused by failure 2374698214. Occurred on the device with ID 8422 on 10/29/2013 at 10:00:05. Universal alarm ID 182. Additional parameters: GE0/0/0, abnormal power flow, power threshold reached, abnormal power timer expired.
It then turns to the network device database and retrieves the description of the device with number 8422.
Online, or in a local copy of the global alarm database, it finds the description and meaning of alarm 65843927456: an abnormally high power flow. The parameter is the source interface, GE0/0/0.
Alarm 50 is high CPU utilization. Parameters: total load (90), the busiest process (R2D2) and the CPU utilization of that process (70).
Alarm 182 is an interface shutdown. The parameters are the interface number and the reason the interface was shut down.
From all this, the Control System forms a clear and comprehensive message:
"An external device was connected to interface GE0/0/0 of switch C3PO and generated an abnormally high power flow from 09:00:00 to 10:00:05.
Because of this, the R2D2 process utilized 70% of the CPU from 10:00:00 to 10:00:05. The port was shut down at 10:00:05.
Abnormal power flow. Power threshold reached. Abnormal power timer expired."
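Pulling the pieces together, here is a minimal sketch of how the Control System might correlate such messages by their parent number and assemble a human-readable summary. The device inventory and the alarm descriptions are hypothetical stand-ins for the databases mentioned above, and the parsing condenses the earlier sketch.

```python
# Hypothetical lookup tables standing in for the device inventory and the
# global alarm database described in the text.
DEVICES = {8422: "switch C3PO"}
ALARMS = {
    65843927456: "abnormally high power flow on {0}",
    50: "CPU load {0}%, busiest process {1} at {2}%",
    182: "interface {0} shut down ({1})",
}

RAW = [  # the example messages from the text, slightly shortened
    "2374698214.0.8422.10/29/2013 09:00:00-10:00:05.65843927456.GE0/0/0",
    "2374698219.2374698214.8422.10/29/2013 10:00:00-10:00:05.50.90.R2D2.70",
    "2374698220.2374698214.8422.10/29/2013 10:00:05.182.GE0/0/0.Abnormal power flow",
]

def parse(raw: str) -> dict:
    """Condensed version of the parsing sketch shown earlier."""
    f = [x.strip() for x in raw.split(".")]
    return {"id": int(f[0]), "parent": int(f[1]), "device": int(f[2]),
            "when": f[3], "alarm": int(f[4]), "params": f[5:]}

def summarize(raw_messages: list[str]) -> None:
    """Group alarms under their root cause and print a readable summary."""
    msgs = [parse(r) for r in raw_messages]
    for root in (m for m in msgs if m["parent"] == 0):
        device = DEVICES.get(root["device"], f"device {root['device']}")
        print(f"{root['when']} on {device}: "
              f"{ALARMS[root['alarm']].format(*root['params'])}")
        for child in msgs:
            if child["parent"] == root["id"]:
                print(f"  caused: {ALARMS[child['alarm']].format(*child['params'])}")

summarize(RAW)
```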
3. Hardware problems
I will not repeat yet again that nothing lasts forever and nobody is perfect: interface cards fail, memory acquires bad sectors, boards spontaneously reboot, programmers suddenly disappear.
It seems to me that if a problem is a hardware one and can be dealt with at all, the equipment can determine its cause unambiguously: loss of synchronization, a failure of the control or monitoring bus, a failure of the board's power supply.
Some problems are cumulative in nature, others are sudden, but it seems to me that all of them can be traced. Take even a complete outage of a line card: the control card, even with the main power supply gone, should be able to poll the board and identify the problem. If polling is impossible, then either the poller itself is faulty, which is easy to check, or the board needs to be replaced.
Again, the Control System should receive a message about this:
"The line card in slot 4 lost synchronization with the switching fabrics due to damage to the L43F network chip. The card must be replaced." And right next to it, a link to a generated equipment-replacement request template.
4. Potential software problems
This one is simple. Either the vendor maintains a good database of software releases and patches, with descriptions, lists of available features and resolved issues, or it does not. Naturally, if it does not, it needs to.
The Control System simply monitors all updates and, when necessary, downloads and installs them.
5. Incorrect configuration
This is perhaps the most difficult aspect. There is an enormous number of variations. Even plain IP will cause a storm of emotions if you try to implement automatic debugging for it.
Formalizing the configuration rules means creating a universal language of interaction between the Control System and the equipment. You cannot just scrape together on a single server scattered data from Juniper, Cisco, ZTE and D-Link, and you cannot write a parser that adapts itself to data from arbitrary devices.
That is, at the very least, the storage of the configuration and its transfer to the Monitoring System will have to be standardized.
The way I see it: there should be a block describing the capabilities of the system, its type (switch, router, firewall, etc.) and functionality (OSPF, MPLS, BGP), followed by sections of the actual configuration. Such a structure should be supported by any equipment, from an access switch to a VoIP gateway in the IMS core.
Then you could easily find all sorts of inconsistencies: mismatched parameters on facing devices (for example, BFD discriminators, IS-IS levels, BGP neighbors, IP addresses), duplicate Router-IDs, PIM not enabled between two multicast routers, and so on.
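As a sketch of the kind of checks this would enable, here is a tiny example over a vendor-neutral configuration model. The dictionary structure is entirely hypothetical; it merely illustrates a capability block plus configuration sections and two of the checks mentioned above.

```python
# Hypothetical vendor-neutral configuration snapshots: a capability block
# plus configuration sections, as described in the text.
configs = {
    "R1": {
        "capabilities": {"type": "router", "features": ["OSPF", "BFD"]},
        "router_id": "1.1.1.1",
        "bfd_sessions": {"10.0.0.2": {"local_discr": 10, "remote_discr": 20}},
    },
    "R2": {
        "capabilities": {"type": "router", "features": ["OSPF", "BFD"]},
        "router_id": "1.1.1.1",  # duplicate Router-ID, deliberately broken
        "bfd_sessions": {"10.0.0.1": {"local_discr": 20, "remote_discr": 11}},  # mismatch
    },
}

def check_duplicate_router_ids(configs: dict) -> list[str]:
    """Flag two devices that claim the same Router-ID."""
    seen: dict[str, str] = {}
    problems = []
    for host, cfg in configs.items():
        rid = cfg["router_id"]
        if rid in seen:
            problems.append(f"Duplicate Router-ID {rid} on {seen[rid]} and {host}")
        seen[rid] = host
    return problems

def check_bfd_discriminators(configs: dict, a: str, a_peer: str, b: str, b_peer: str) -> list[str]:
    """With static BFD, the local discriminator on one side must equal the remote one on the other."""
    sa = configs[a]["bfd_sessions"][a_peer]
    sb = configs[b]["bfd_sessions"][b_peer]
    if sa["local_discr"] != sb["remote_discr"] or sb["local_discr"] != sa["remote_discr"]:
        return [f"BFD discriminator mismatch between {a} and {b}"]
    return []

print(check_duplicate_router_ids(configs))
print(check_bfd_discriminators(configs, "R1", "10.0.0.2", "R2", "10.0.0.1"))
```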
But, in all honesty, these are already non-trivial things and can only be implemented with properly standardized topologies or a formalized LLD (Low Level Design).
Real-life examples of everything described above I have already given in this article.
Technical support
In my opinion, in this area (as in many others) there is an enormous amount of unnecessary work and waste of human resources.
We are talking about carrier-grade networks; with SOHO and SMB the subtleties are quite different.
Take, for example, the procedure for replacing a faulty card.
Today it looks like this (with some variation between vendors):
1) The card fails, reboots, or starts throwing strange messages. The customer sees errors and alarms in the logs but cannot identify the problem unambiguously.
2) The customer calls the vendor's support hotline and describes the problem in words or fills out a standard form, providing data, logs and files collected with the help of slave labour or entirely on his own.
3) The hotline operator opens a request and assigns it to a group of engineers.
4) The responsible team assigns the request to an engineer.
5) The engineer analyzes the data and eventually sees the same alarms. He connects to the equipment, runs a series of tests, collects information.
6) Often the engineer cannot establish the true cause, and he cannot recommend a replacement on his own authority, so the request is escalated to the next level.
7) Depending on the competence of the higher-level engineers, the request may wander around there for a while, until, by entering certain commands or analyzing logs and diagnostic information according to a certain algorithm, a hardware fault is finally established.
8) The recommendation travels back down the chain to the responsible engineer and then to the customer.
9) Then comes the procedure for confirming the closure of the request and assorted bureaucracy.
10) The customer opens a new replacement request: the form is filled out again, the problem is described again. The call center passes the request to the appropriate department, responsible people are appointed, and only then does the replacement procedure actually begin.
This is a rather pessimistic scenario, but one way or another the whole procedure takes a long time and involves at least four or five people: the customer's engineer, the call-center operator, the team lead, the support engineer, higher-level engineers, the spare-parts department staff.
But in fact there are algorithms for checking the physical parameters of the cards. Yes, there are many of them, but let's be honest: they can be built into the software, or even into the hardware of the cards and chassis.
The equipment itself should carry out this analysis, and in case of a hardware problem the Control System should issue an unambiguous recommendation for replacement (and perhaps file a replacement request on its own, from a template, as sketched below). If no known hardware problem is confirmed, the Control System should offer to open a support request; better yet, it should fill in the template and register the ticket itself, leaving the person only to confirm it.
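A sketch of that last idea, with an invented fault record and invented ticket fields: once self-diagnostics have identified a hardware fault, filling in the replacement request is a purely mechanical step.

```python
from datetime import datetime, timezone

# Hypothetical fault record produced by the board's self-diagnostics.
fault = {
    "device": "P-router-07",
    "slot": 4,
    "component": "line card",
    "serial": "LC4X-00012345",
    "diagnosis": "lost synchronization with the switching fabric, network chip damaged",
}

def build_replacement_ticket(fault: dict) -> dict:
    """Fill a replacement (RMA) request template from a diagnosed hardware fault.

    A human only has to review and confirm the resulting ticket.
    """
    return {
        "opened_at": datetime.now(timezone.utc).isoformat(),
        "device": fault["device"],
        "faulty_part": f"{fault['component']} in slot {fault['slot']}",
        "serial_number": fault["serial"],
        "diagnosis": fault["diagnosis"],
        "requested_action": "replace part",
        "status": "awaiting human confirmation",
    }

print(build_replacement_ticket(fault))
```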
The same applies to many other questions.
I cannot speak for every vendor, but questions often arise about which software versions are currently recommended, which patches should be installed, and what functionality they provide.
I believe the Control System should handle all of this: downloading software and patches, tracking currently known hardware problems, installing patches, updating firmware. I will describe in more detail how I see the work of such a system in one of the following articles.
Questions about configuration, about some service not working? Many of these things are fairly obvious and come down either to incorrect application of a setup guide or to inconsistent configurations on different devices. A support engineer tracks down such a situation easily by entering a few commands. Can't the Control System do the same: analyze the configuration, understand the problem, and even fix it?
Socio-psychological aspects
Yes, many engineers, myself included, have a weighty question: what will we all do if automation can replace us?
Let me hasten to reassure you: we will all become obsolete, like chimney sweeps and the young ladies at the switchboards. In fact, this is an eternal question and an eternal cause for worry. Where did the coachmen go with the advent of cars? Where did the vast staff serving the first computers go with the advent of compact PCs?
The modern world offers us ever more diverse jobs. In the worst case, you can always become a fuel cell for the Matrix.
But operations and technical support staff are not going anywhere: there are plenty of problems that cannot be solved automatically for various reasons (administrative ones, for example). I discussed these and other issues in another article.
Networks have to be designed, cables have to be laid, the Monitoring System has to be monitored, problems have to be solved.
Our lives just need to be made somewhat more reasonable.
A much more important issue is vendor support.
I completely agree with the comments on the article on Nag.Ru: right now nobody needs such a system, with all its standards and super-protocols.
Vendors have their own NMSs, which they sell for big money (huge money, I must say). And if such standards existed, the equipment of one vendor could simply be swapped for another's and no one would notice. Do they need that?
Large operators (and not-so-large ones) often have home-grown systems: configuration validators, autoconfiguration scripts, superficial problem analyzers.
Engineers are often inert and lazy, or, on the contrary, hyperactive, manually sculpting thousands of lines of scripts that will vanish when the next generation of admins formats the hard drive.
Either way, all of this is not it. Not it at all.
After talking with colleagues, I realized that a mistaken understanding of the idea is taking shape: supposedly I want to propose creating some kind of software Monitoring System whose scripts parse logs and configurations and issue a verdict, and which has 33 thousand templates for different vendors and different software versions, someone's proprietary solution created by the will of one enterprising person.
No. I am talking about something far more ambitious: global standardization of the communication between devices. It is not the Control System that should take care to recognize logs from Huawei, Cisco, Extreme, F5 and Juniper; the equipment itself must send logs in a strictly defined format.
And it should not be a bunch of disparate scripts using different protocols (FTP, TFTP, Telnet, SSH) that collects information about configuration, alarms and parameters; it should be a single flexible vendor-independent system.
The other extreme is the SDN paradigm. That is also something different. SDN concentrates not only the monitoring functions: it takes over almost all the tasks of the equipment except the actual data transmission and makes all the decisions about how that data is transferred. No channel to the SDN brain, no network.
What I am talking about is still a flexible network of independent devices, each of them self-sufficient. The Control System simply keeps its finger on the pulse: it knows everything that happens on the network, takes care of problems with minimal human participation, and presents the important information in an accessible form.
P.S. I do not claim to have covered the issue completely; my level of knowledge is clearly not enough to embrace it fully. These are only reflections.
But I am sure that this is the vector of development of network technologies in terms of operation and support. In 50-80 years everything will change: networks will cover not only computers, tablets and phones, everything will be on the network. Total convergence: WiFi, fixed networks, 5G, 6G, telephony, video, Internet, M2M. Things are clearly not moving toward simplification, and more and more manpower and resources will be spent on traditional maintenance.
Most importantly, such standards should arrive on time. Their time has not come yet, but it is time to talk about them.
In the course of writing this article, which was originally planned as a mere note, I came to the conclusion that the topic is too interesting to me, and there will be a whole series of articles devoted to it:
- The Control System: capabilities and principles of operation.
- Protocols for the interaction and exchange of service information between devices and the Intelligent Control System.
- Detection and correction of configuration errors.
- Automation of equipment configuration.