Good day, %username%!
I want to tell an instructive story that happened at work today. I work for a very well-known company that provides, among other things, Internet access services, and my job is to keep the data network running normally. The network is built on the classic Core / Aggregation / Access structure. Roughly half of the access switches are made by D-Link, the other (larger) part by Huawei. Management of all the network hardware is moved into a separate VLAN, through which it is monitored.
And this morning something was clearly wrong. The hardware monitoring and management system started spewing walls of "switch *** offline" / "switch *** online" events. Moreover, these messages came from the network segments where Huawei switches were installed. A quick look at storm control and interface load on the aggregation layer revealed nothing, and neither did the logs. The day promised to be fun...
A call from the network monitoring service did not add any joy: they raised an incident about the loss of access nodes. At the same time, there were no mass complaints from customers about degraded service. I even managed to find a client in the problem segment who responded to ICMP with an encouraging 0.8 milliseconds. Trying to log in to any switch via telnet was akin to torture: the connection either dropped on a timeout, or it took minutes to get a reaction to the login/password prompt and to commands. Having despaired of reading the log of a half-dead switch, and after plenty of suffering, I rebooted it just to clear my conscience. For about 10 seconds after it came back up the switch was alive and cheerfully answered ICMP requests, but then, before my eyes, the ping times climbed to a completely indecent 800-1000 ms and then disappeared altogether.
It then began to dawn on me that the far-from-high-performance CPUs of the switches were clearly loaded with something, apparently at 100%. By running tcpdump on the monitoring server's VLAN interface, I found the reason for the high CPU utilization on the switches: an abnormally large amount of ARP traffic in the management VLAN, several thousand packets per second. The cause was found, but how to find its source? I decided to block the management VLAN on all aggregation ports and then re-enable it port by port until the problem segment turned up.
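For the curious, here is a minimal sketch of how one might spot such an ARP storm from the monitoring server (not the exact commands I used: it assumes Scapy is installed, and "eth0.220" is a hypothetical name for the management-VLAN interface):

# Capture ARP frames on the management VLAN for a few seconds
# and report the packet rate and the busiest senders.
from scapy.all import sniff, ARP

def count_arp(interface="eth0.220", seconds=10):
    packets = sniff(iface=interface, filter="arp", timeout=seconds)
    senders = {}
    for pkt in packets:
        if ARP in pkt:
            senders[pkt[ARP].psrc] = senders.get(pkt[ARP].psrc, 0) + 1
    print(f"{len(packets)} ARP packets in {seconds}s (~{len(packets) / seconds:.0f} pkt/s)")
    for ip, n in sorted(senders.items(), key=lambda kv: -kv[1])[:5]:
        print(f"  {ip}: {n} requests")

if __name__ == "__main__":
    count_arp()

A few thousand packets per second from a single source address stands out immediately in output like this; a plain "tcpdump -i <iface> arp" shows the same picture, just less conveniently counted.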
I had managed to do this on only two aggregation nodes when the whole circus suddenly stopped. It seemed very suspicious, though, that a minute earlier my colleague at the next table had pulled a network patch cord out of the switch port used for accessing and configuring the equipment. I asked him to reconnect his laptop to the network, and after 10 seconds the pings to the switches shot up to ugly values again. The source was found, but this laptop had been used for months to update software and configure network equipment; what could have happened to it?
To begin with, even though an antivirus was already installed, we decided to scan the laptop with standalone malware tools from Dr.Web and Kaspersky Lab. Nothing significant was found. We tried booting into Linux: the network was silent, no flooding. We booted back into Windows, and the effect returned immediately, the VLAN filled with ARP flood. But only yesterday the laptop had been fine! And then, for some reason, I looked into the network card settings... My colleague is not often involved in configuring the hardware and updating its software, so he could not remember the mask and gateway values for the management network, and he made an annoying mistake in the network card configuration: instead of 255.255.224.0 for the subnet mask, he entered 255.255.254.0!
But what struck me even more was that despite the obviously wrong configuration (the gateway ended up outside the network segment because of the incorrectly specified mask), the OS resignedly swallowed this nonsense, turning the laptop into an ARP traffic generator. This is what the IPv4 protocol settings looked like:
IP address 10.220.198.111, subnet mask 255.255.254.0, gateway 10.220.192.1
With this mask the subnet is limited to the addresses 10.220.198.1 - 10.220.199.254, and the gateway 10.220.192.1 lies outside that range. The operating system should not allow a gateway address from another network to be assigned. This is an obvious bug!
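The arithmetic is easy to check; here is a quick illustration with Python's ipaddress module (just to show the ranges, not something I ran at the time):

import ipaddress

wrong = ipaddress.ip_interface("10.220.198.111/255.255.254.0")   # the typo: a /23
right = ipaddress.ip_interface("10.220.198.111/255.255.224.0")   # intended: a /19
gateway = ipaddress.ip_address("10.220.192.1")

print(wrong.network)             # 10.220.198.0/23
print(gateway in wrong.network)  # False - the gateway is off-link
print(right.network)             # 10.220.192.0/19
print(gateway in right.network)  # True  - the gateway is on-link

With the mistyped /23 mask the host considers its network to be 10.220.198.0/23, so 10.220.192.1 is unreachable; with the intended /19 mask the network is 10.220.192.0/19 and the gateway sits inside it.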
I would be grateful if someone would take the trouble to explain the mechanism of the ARP flood in this situation. For my own part, I would like to wish all network specialists to be attentive, and then attentive once more. As they say: measure seven times, cut once.