
When the router fails to handle the load

Here is a case from my telecom practice.
We have a Cisco router of the 2600 series (a 2620XM). It has about four dozen subinterfaces. Most of them serve local subscribers in the same building, and several are links to remote sites: the airport, a brick factory, a ski resort, a state farm. "That's old junk," you will say, and you will be right, but that is how things turned out historically. That is beside the point, though.
Some time ago it turned out that the load on it was too high. At first this showed up as lag in the console: you type a command, and the letters appear with a slight delay. Then ping times to the Cisco from the remote sites began to rise periodically. The next symptom was the Internet link occasionally dropping (routing inside the local network still worked flawlessly, with no losses). Meanwhile, the logs showed an alarming picture of very heavy CPU use: the load never fell below 80% and sat at 95-99% most of the time. Eventually pings were being lost even within the same subnet, and the Internet was limping on both legs.


There were two problems: this Cisco was doing NAT, which loads a router heavily, and it had only one interface on board. This connection scheme is called a router on a stick: there is a single link from the switch to the router, and all subnets are routed between subinterfaces on it. As a result, all traffic passes through one port, both local traffic within the building and everything going outside. On top of that, the Cisco also carried voice from one PBX to another.
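To make the bottleneck concrete, here is a rough sketch in Python. The flow rates are hypothetical numbers chosen only for illustration, not measurements from this network. It relies on one fact about a router on a stick: every routed packet enters and leaves through the same physical port, so each direction of that port carries the sum of all routed flows.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    kind: str    # "local" (between two VLANs in the building) or "external"
    mbps: float  # hypothetical average rate, for illustration only


def stick_port_load(flows):
    """Per-direction load on the single router port in a router on a stick.

    Every routed packet comes in and goes out through the same port,
    so each direction carries the sum of all routed flows.
    """
    return sum(f.mbps for f in flows)


def router_load_after_offload(flows):
    """Load that still reaches the router once an L3 switch routes
    local (inter-VLAN) traffic by itself."""
    return sum(f.mbps for f in flows if f.kind == "external")


# Hypothetical traffic mix.
flows = [Flow("local", 30), Flow("local", 20), Flow("external", 25)]
print(stick_port_load(flows))            # 75
print(router_load_after_offload(flows))  # 25
```

With these made-up numbers, two thirds of what the router has to process is traffic that never leaves the building.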
This could not go on. And then came the fine day when the network was down for most of the day. Even local traffic practically stopped flowing. To top it off, this coincided with an important conference call for one of the subscribers (more than half of the calls were dropped because of the high load) and an equally important presentation for another. Emergency measures were needed: if everything repeated the next day, the consequences would be irreparable, because the deadline for submitting financial statements had arrived. Not only Internet access runs over this network, but also the connections to the 1C and Galaxy servers.
The purchase of a new, more powerful router would most likely not have been approved, especially since a 48-port Catalyst and another 2600-series Cisco were sitting on the shelf. Besides, a new router would not have physically arrived in time. The Catalyst 3550 is an L3 switch and copes well with routing. In addition, you can configure L3 interfaces on it, that is, assign IP addresses directly to ports rather than tying them to VLANs.
The second 2600-series Cisco has two ports, which makes it possible not to push all traffic through one hole: connect local subscribers to one port and dedicate the second one to the uplink to the remote border.
The solution is simple:
1) Move NAT to a separate router. (It will deal exclusively with NAT and with routing of public, "white" IPs.)
2) Move all the subinterfaces to the Catalyst (this eases data transfer within the local network and relieves the uplink).
3) The old Cisco remains only for telephony transit.
In the diagram, this can be expressed like this:


Pink shows the route of a packet sent to the Internet; an internal packet addressed to the 172.16.0.0/16 network follows the lilac path. Green is the path of a phone call.
That is, if a packet is addressed to the 172.16.0.0/16 network, the Catalyst routes it according to its own table. The Catalyst's default gateway is the router that does NAT, so a packet not addressed to the 172.16 network is translated to a public IP there and sent over another logical channel straight to the border. Meanwhile, the second router acts as little more than a voice gateway: it receives data from the PBX over E1 and sends it to the other Cisco over Ethernet along with the rest of the local traffic.
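The forwarding decision on the Catalyst can be sketched as a longest-prefix match. This is only an illustration in Python: the /24 subnets and interface names below are hypothetical (the article gives only the 172.16.0.0/16 range), and a real switch does this lookup in hardware.

```python
import ipaddress

# Hypothetical routing table for the Catalyst: (prefix, next hop).
# Only 172.16.0.0/16 and the default route to the NAT router come
# from the article; the /24 split and VLAN names are assumptions.
ROUTES = [
    (ipaddress.ip_network("172.16.10.0/24"), "Vlan10 (connected)"),
    (ipaddress.ip_network("172.16.20.0/24"), "Vlan20 (connected)"),
    (ipaddress.ip_network("0.0.0.0/0"), "NAT router (default gateway)"),
]


def lookup(dst):
    """Return the next hop for dst using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    # Collect every route whose prefix contains the destination...
    matches = [(net, hop) for net, hop in ROUTES if addr in net]
    # ...and pick the most specific one (the longest prefix).
    net, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop


print(lookup("172.16.10.5"))  # Vlan10 (connected)
print(lookup("8.8.8.8"))      # NAT router (default gateway)
```

Local destinations never leave the switch, and everything else falls through to the default route, which is what takes the local traffic off the NAT router.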
The scheme is not very rational in terms of hardware use. But as a temporary measure it will do.
This greatly reduced the load on the equipment: average CPU utilization on the NAT router is 20-30%, and on the Catalyst even less. The problems stopped.
This approach has one major drawback: the Catalyst is still not a router in the full sense of the word. For example, you cannot apply rate-limit to a logical interface on it, and with a physical one everything is complicated. At best you can limit the rate of traffic entering an interface that matches a specific ACL, but who needs that?
So however gracefully we got out of the situation, we still need to buy a proper, powerful router and build a classic scheme.
P.S. If you are interested in the technical details of implementing the new scheme, or have any questions, I will add them to the post.

Source: https://habr.com/ru/post/111990/

