
Once again about hash collisions in switches

Preamble


We set up an Eltex switch, an MES5248, in test mode and started torturing it every way we could: configuring VLANs, enabling MSTP, and generally using it in every possible way. No visible defects turned up in the lab, so we put it under real live traffic. And then a strange glitch surfaced, as hard to catch as Elusive Joe: from time to time some entries in the ARP table had no value in the port field (yes, the port is shown right in the ARP table, which is convenient). Attempts to catch it by the tail and reproduce it were not particularly successful. Technical support puzzled over it together with the developers and eventually observed that entries in the ARP table live longer than entries in the MAC table, and longer than configured. That is a bug, of course, but not a fatal one. What remained was to verify that entries in the MAC table die a natural death by timeout and are not killed by other entries during a hash collision.

Checking this by hand is, of course, too tedious. No joke: looking through more than 2000 MACs several times in a row to find the holes. So it had to be coded. Writing for one specific switch is a thankless task, so the code ended up working with every switch on the farm. If you are wondering what came of it, read the story.

The story


The result was a program that first clears the MAC table and then, about once a minute, pulls it from the switch and analyzes it. It keeps track of:

1. Unique MACs observed since the start of the test. (Ever)
2. Unique MACs currently present on the switch. (Now)
3. For each MAC, I remember the time when it first appeared in the table. If it appears in the table again after less than the aging time, it is assumed that something knocked it out. (Early deaths)
4. All such MACs are counted. (This is the number before the slash in Early deaths.)
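
The bookkeeping in the list above can be sketched roughly as follows. This is not the original program: the class name `FdbTracker`, its fields, and the decision to restart a MAC's clock every time it reappears are assumptions made for illustration, and a snapshot is assumed to be simply a set of MAC strings taken at a known timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class FdbTracker:
    """Tracks Ever / Now / Early deaths across periodic FDB snapshots."""
    aging_time: float                                 # switch aging time, seconds
    ever: set = field(default_factory=set)            # every MAC seen so far
    now: set = field(default_factory=set)             # MACs in the latest snapshot
    appeared_at: dict = field(default_factory=dict)   # MAC -> time it last (re)appeared
    early_macs: set = field(default_factory=set)      # unique MACs that came back early
    early_events: int = 0                             # total early reappearances

    def update(self, snapshot, timestamp):
        """Feed one snapshot (an iterable of MAC strings) taken at `timestamp`."""
        snapshot = set(snapshot)
        # MACs present now but absent from the previous snapshot.
        for mac in snapshot - self.now:
            seen_before = mac in self.ever
            if seen_before and timestamp - self.appeared_at[mac] < self.aging_time:
                # Gone and back again in less than the aging time:
                # treat it as having been knocked out of the table.
                self.early_macs.add(mac)
                self.early_events += 1
            # Assumption: the clock restarts on every (re)appearance.
            self.appeared_at[mac] = timestamp
        self.ever |= snapshot
        self.now = snapshot
```

With roughly one poll per minute, a MAC that is evicted and relearned between two polls goes unnoticed, so counts obtained this way are a lower bound.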
By the way, I pulled the FDB table over telnet, although it can also be done over SNMP, as has already been described on Habr. In contrast to the UserSide results, along the way I came across switches that returned the FDB over telnet wildly slowly but over SNMP instantly, for example the DXS-3326GSR.
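
One way to make the same parser work across different switches is to ignore the table layout entirely and just pull MAC addresses out of the raw CLI text. The sketch below assumes three common notations (colon-separated, dash-separated, and dotted quads); the function name `extract_macs` and the sample lines are hypothetical and not taken from any particular switch.

```python
import re

# Matches 00:11:22:33:44:55, 00-11-22-33-44-55 and 0011.2233.4455.
MAC_RE = re.compile(
    r"\b(?:[0-9A-Fa-f]{2}([:-])(?:[0-9A-Fa-f]{2}\1){4}[0-9A-Fa-f]{2}"
    r"|[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4}\.[0-9A-Fa-f]{4})\b"
)


def extract_macs(text):
    """Return the set of MACs found in `text`, normalized to 12 lowercase hex digits."""
    return {
        re.sub(r"[^0-9a-f]", "", match.group(0).lower())
        for match in MAC_RE.finditer(text)
    }


# Hypothetical sample output, mixing formats on purpose.
sample = """
1    default   00-1A-2B-3C-4D-5E  5        Dynamic
10   0011.2233.4455   DYNAMIC   Gi1/0/1
20   00:aa:bb:cc:dd:ee   learned   port 7
"""
macs = extract_macs(sample)
```

Normalizing to a single form matters because the same address may be printed differently by different vendors, and the Ever and Now sets should treat it as one MAC.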

Observations are carried out for 20 minutes.

To my great surprise, the switch that started all this turned out to behave remarkably well: up to the aging time, Now did not differ from Ever, and Early deaths stayed empty. Cisco 3750s behave similarly.

The rest of the zoo, however, did show problems. Although most of the tested D-Link switches have a smaller MAC table, the total number of addresses on them is often much smaller as well. In the end I drew a graph of the number of problems versus the total size of the MAC table.



So, the conclusions: different switches from the same vendor behave differently, and the number of problems grows very quickly with the number of MACs. Immediate operational takeaways: reduce the number of VLANs on the 3526 and increase the aging time on all equipment.

Further: apparently, switches do deal with hash collisions, and in different ways. Some throw out old MACs; others probably work around the collision somehow while the table is not full. Exactly how they do it is not clear to me: they either compute the address with a second hash function, or put the new address in the next cell, or simply flood traffic destined for the new MAC, or something else. If vendor representatives are reading this and the information is not secret, please share it in the comments.
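
To show the difference between two of the guesses above, here is a toy simulation. It is purely illustrative and does not describe any vendor's hardware: the hash is a plain modulo standing in for the unknown real function, and the function name `simulate`, the table size, and the MAC count are all made up for the example.

```python
import random


def simulate(macs, table_size, policy):
    """Insert MACs into a toy table; return how many old entries got evicted.

    policy="evict": a colliding MAC overwrites the current occupant.
    policy="probe": a colliding MAC goes to the next free cell (linear
                    probing); an old entry is evicted only if the table is full.
    """
    table = [None] * table_size
    evicted = 0
    for mac in macs:
        slot = mac % table_size          # stand-in for the real hash function
        if policy == "probe":
            for step in range(table_size):
                candidate = (slot + step) % table_size
                if table[candidate] is None or table[candidate] == mac:
                    slot = candidate
                    break
        if table[slot] is not None and table[slot] != mac:
            evicted += 1
        table[slot] = mac
    return evicted


rng = random.Random(1)
macs = rng.sample(range(2**48), 2000)    # 2000 distinct random 48-bit addresses

evict_losses = simulate(macs, 8192, "evict")
probe_losses = simulate(macs, 8192, "probe")
```

Even with the table only about a quarter full, the overwrite policy loses entries to collisions, while the probing policy loses none until the table fills up. That matches the observation that some switches show early deaths long before their MAC table is exhausted.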

The plan is to rework the program so that it collects data and manages the switches via SNMP, detach it from internal dependencies, and publish it. Hence the question to the community: is it worth doing at all?

Another interesting question: does anyone have ideas on how to remotely detect the situation where traffic to new MACs is being broadcast, and how to estimate the amount and/or volume of such traffic?

Source: https://habr.com/ru/post/267965/

