Not a single gap: how we created a wireless network for 3000 devices

Wireless Society by JOSS7

Over the past ten years, Wi-Fi in the Mail.Ru Group offices has gone through several changes of equipment, network design approaches, authorization schemes, administrators and people responsible for keeping it running. The wireless network probably started the way it does at every company: with a few home routers broadcasting some SSID with a static password. For a long time that was enough, but the number of users, the floor area and the number of access points kept growing, and the home D-Links were gradually replaced with Zyxel NWA-3160 units. This was already a relatively advanced solution: one of the access points could act as a controller for the others and provided a single interface for managing the whole network. The NWA-3160 software offered no deeper logic or automation, though, only the ability to configure the access points attached to the controller; user traffic was still processed by each device independently. The next change of equipment was the move to a Cisco AIR-WLC2006-K9 controller plus several Aironet 1030 access points. That was a fully grown-up solution, with "brainless" access points and all traffic handled by the wireless controller. After one more migration, to a pair of AIR-WLC4402-K9 controllers, the network grew to hundreds of Cisco Aironet 1242AG, 1130AG and 1140AG access points.

1. Accumulated problems

The year 2011 came. In a year the company was due to move to a new office, and Wi-Fi was already a sore subject and the most frequent cause of employee complaints to technical support: low connection speed (video buffering on YouTube / VK / Pornhub causes serious stress and obviously gets in the way of work) and dropped connections. Periodic attempts to use Wi-Fi phones failed because roaming did not work. Laptops with built-in Ethernet were becoming rarer (thanks to the MacBook Air and the manufacturers' race for thinner cases), and the vast majority of mobile phones already needed a constant internet connection.

The air was constantly busy, and the old access points could not cope with the load. Users started getting disconnected once 25 or more devices were associated with a single access point, and neither the 802.11n standard nor the 5 GHz band was supported. On top of that, for mobile development needs the office was home to a pile of SOHO routers connected to various network emulators (built on NetEm).

From the point of view of the logical design, little had changed since the move to centralized solutions in 2007-2008: several SSIDs, including a guest one, and several large subnets (/16) into which users landed depending on the wireless network they had authenticated to.

Network security was in bad shape too: for several years PSK was the main mechanism for authorizing users in the trusted Wi-Fi networks. Around a thousand devices sat permanently in the same subnet with no isolation at all, which helped malware spread. Nominal traffic filtering was done with iptables on the *NIX gateway that served as the NAT for the office. Naturally, any granularity in the firewall rules was out of the question.

2. New height

The company's move turned out to be an excellent opportunity to think through and build the office network from scratch. Having daydreamed about the ideal network and analyzed the main complaints, we worked out what we wanted to achieve:

The highest access point performance available on the market, preferably with the option of upgrading to new 802.11 standards without replacing all the equipment;
Fault tolerance. Authorization servers, Wi-Fi controllers, the switches the access points are connected to, firewalls and routers: everything has to be redundant;
The ability to emulate various network conditions (packet loss, delay, bandwidth) on top of the corporate Wi-Fi. A pile of cheap Wi-Fi routers in the office with no centralized control made it impossible to use the air optimally;
Wi-Fi telephony. Mobile work phones are convenient for some departments: technical support, administration and so on;
ITSEC. Identification of connected users. Granular access lists: a connected user should be able to reach only the resources needed for the job, not the entire network. Isolation of user devices from each other;
Working Bonjour and mDNS based services. We have many macOS and iOS users, and Apple services such as AirPlay, AirPrint and Time Machine were not originally designed to work in large segmented networks;
Full wireless coverage of all office premises, from toilets and the gym to elevator halls;
A centralized system for locating users and sources of Wi-Fi interference.

There are several approaches to organizing a wireless network, which differ in how the equipment is managed and where user traffic is processed:

A scattering of autonomous access points. Cheap and cheerful: the administrator-installer places inexpensive home Wi-Fi routers around the premises and, if possible, sets them to different channels. You can even try configuring them to broadcast the same SSID and hope for some kind of roaming. Each device is independent, so to change the configuration you have to touch every access point separately.
Partially centralized solutions. A single point of control for all access points, the Wi-Fi network controller. It is responsible for pushing configuration changes to each access point, so the administrator no longer has to walk around and reconfigure every device by hand. It may also handle centralized user authorization when clients connect to the network. Otherwise the access points do not depend on the controller: they process user traffic themselves and release it into the wired network.
Centralized solutions. The access points are no longer independent devices at all and hand over both control and traffic handling entirely to the network controller. All user traffic is always sent to the controller for processing, and decisions about changing channels, transmit power, the wireless networks being broadcast and user authorization are made solely by the controller. The job of the access points comes down to serving wireless clients and tunneling frames toward the wireless network controller.

We managed to try each of these approaches, and a centralized solution with a single controller suited our new tasks best. With the controller we got a single point where access lists are applied, and we decoupled client roaming between access points from the address space on the wire.

3. The choice of equipment

At that time (the end of 2012) there were only a few vendors that both inspired our confidence and had a product line that met our needs. Besides the obvious Cisco, a solution from Aruba made it to live testing. We tested access points of the 93, 105, 125 and 135 series together with a controller. Everything happened under real conditions, with live users: a network built on these access points was deployed on several floors of the old office. In terms of performance, the access points fully met our needs at the time. The controller software was also quite good: many features that on Cisco would require installing additional servers (MSE / WCS / Prime) and buying licenses were implemented directly on the controller (geolocation, collection and display of extended client statistics, heatmap rendering and real-time display of users on a map). There were also disadvantages:

A stateful firewall that could not be disabled (or rather, could be disabled only together with functionality we needed), with a very modest session limit. In practice, the Wi-Fi network could be brought down from a single laptop just by running a network scan;
The spectrum analyzer in the access points was used only to generate alerts for the administrator. Cisco, at least nominally, could already react to interference on its own (Event Driven RRM);
MFP (Management Frame Protection) was not implemented at all;
Unlike Cisco access points, Aruba ones could not be reflashed and used without a controller.

As a result, we went back to Cisco solutions: the 5508 controller, the top-end AP 3602i for the main office premises, and the AP 1262 where external antennas were needed. The 3600 series access points were interesting at the time because they could be upgraded to 802.11ac Wave 1 by attaching an additional antenna module. Unfortunately, these modules never became compatible with the access points made for Russia with the -R- index, so full 802.11ac support requires replacing the access points with the AP 3702 (and, in the future, the 3802).

There are plenty of step-by-step guides online on the initial setup of Cisco Wi-Fi and on planning (and starting with the eighth version of the software, most of the setup best practices are available directly from the controller's web UI).

I will focus only on the non-obvious and problematic issues that I ran into.

4. Fault tolerance

The Wi-Fi network controller handles all traffic and is a single point of failure. No controller, no network at all. It was the first thing that had to be made redundant. For some time now, Cisco has offered two different solutions for this:

"N + 1". There are several controllers, each with a fully independent control plane, its own configuration, IP addresses and set of installed licenses. The access points know the list of controller addresses and the priority of each (primary, secondary, tertiary and so on), and if the current controller suddenly fails, the access point reboots and tries to connect to the next one on the list. The user is left without connectivity for a minute or two.
"AP SSO". The primary and backup controllers are joined together: they synchronize the configuration and the state of connected users, and use the same IP address for the tunnels to the access points. When the primary controller fails, the IP and MAC addresses the access points were attached to are moved to the backup quickly and automatically (remotely similar to how FHRP protocols work). The access points should not notice the outage either. In an ideal world, users do not feel that anything has broken at all.

The "AP SSO" option looks much more attractive: instant and imperceptible failover, no need for additional licenses, no need to keep the second controller's configuration up to date by hand, and so on. In real life, with software 7.3, which was fresh at the time, things turned out to be less rosy:

Both WLCs (Wi-Fi controllers) must be physically close to each other. A dedicated copper port is used for configuration synchronization and heartbeats. In our case the controllers were in rooms on different floors, and the copper cable length was right at the limit;
Transparent failover for connected users ("Client SSO") appeared only in version 7.6. Before that, users were still disconnected from Wi-Fi, albeit briefly;
A mechanism for deciding how the cluster behaves during a failure that is, to put it mildly, "strange". In short: both controllers ping each other once a second over the copper link and check the availability of the default gateway (again, with ICMP ping).

The last item is where the difficulties began. The essence of the problem: according to the decision table, in any unclear situation the standby controller reboots. Suppose we have the following network diagram:

What happens when the C6509-1 goes down? The active controller loses its uplink and immediately reboots. The standby controller loses contact with the active one and tries to ping the gateway, which will be unavailable for three seconds (with default VRRP timers) until the address "moves" to the C6509-2. After two failed gateway pings within two seconds, the standby WLC also reboots. Twice. Congratulations, we are left without Wi-Fi for the next 20-25 minutes. We saw the same behavior with any first-hop redundancy protocol (FHRP), as well as spontaneous controller reboots when the ICMP rate limit was too strict. The problem can be solved by tuning the FHRP timers so that the address has time to "move" before the standby WLC reboots, by moving the FHRP master role to the router the standby WLC is connected to, or by changing the wiring scheme altogether (for example, using MC-LAG / VPC / VSS-PC toward the controllers).
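The timing race described above can be sketched in a few lines. This is only a toy model under stated assumptions: the standby starts pinging when it loses its peer, sends one ping per second, and gives up after two consecutive failures, as in the text. The function name and parameters are hypothetical and have nothing to do with the actual WLC code.

```python
def standby_reboots(fhrp_failover_s: float,
                    ping_interval_s: float = 1.0,
                    failed_pings_to_reboot: int = 2) -> bool:
    """Return True if the standby controller gives up on the gateway
    before the FHRP virtual address has moved to the surviving router.

    Assumption: the gateway is unreachable from t = 0 (peer lost) until
    t = fhrp_failover_s, and pings are sent at t = interval, 2 * interval, ...
    """
    failed = 0
    t = 0.0
    while failed < failed_pings_to_reboot:
        t += ping_interval_s
        if t >= fhrp_failover_s:
            return False  # the gateway answered before the failure limit was hit
        failed += 1
    return True  # too many failed pings: the standby reboots as well


# Default VRRP timers from the text: the address moves after about 3 seconds.
print(standby_reboots(3.0))   # standby reboots too, Wi-Fi is down
# Tuned FHRP timers: the address moves before the second ping fails.
print(standby_reboots(1.5))
```

Under these assumptions, the fix by timer tuning amounts to getting the failover time below the moment of the second ping.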

In software 8.0+, the problem was solved by making the gateway availability check more sophisticated and by switching from ICMP pings to UDP heartbeats in a proprietary format. In the end we settled on the combination of HSRP and software 8.2, and finally got that failover between controllers that users do not notice.

Fault tolerance is also provided by several RADIUS servers (MS NPS); access points within the same room are connected to different access switches, the access switches have uplinks to two independent network core devices, and so on.

5. Tuning

Finding general recommendations on tuning Wi-Fi performance is not difficult (for example, "Wi-Fi: non-obvious nuances (using a home network as an example)"), so I will not dwell on this. Just briefly, on our specifics.

5.1. Data rates

Imagine that after the basic configuration of the controller, with the first ten access points connected to it on a test floor of a still unfinished building, we plug in a spectrum analyzer and see that more than 40% of the 2.4 GHz airtime is already taken. There is not a living soul around: we are in an empty building with no foreign networks and no home Wi-Fi routers. Beacon transmission takes up half of the airtime. Beacons are always sent at the lowest rate the access points support, and at high density this is especially noticeable. Adding new SSIDs makes the problem worse. With a minimum data rate of 1 Mbps, just 5 SSIDs on 10 access points within earshot of each other load the air to 100% with beacons alone. Disabling all data rates below 12 Mbps (which also cuts off 802.11b) changes the picture drastically.
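A rough back-of-the-envelope check of these numbers is easy to do. The sketch below assumes things the text does not state: a beacon of about 250 bytes, the common beacon interval of 102.4 ms, a 192 µs long DSSS preamble at 1 Mbps and roughly 20 µs of OFDM preamble at 12 Mbps, with all access points on the same channel and no contention overhead. The function name is hypothetical.

```python
def beacon_airtime_fraction(ssids: int, aps: int, rate_mbps: float,
                            preamble_us: float,
                            beacon_bytes: int = 250,
                            beacon_interval_ms: float = 102.4) -> float:
    """Approximate share of airtime consumed by beacons alone.

    Every SSID on every access point is a separate BSSID that sends
    its own beacon once per beacon interval.
    """
    beacons_per_second = ssids * aps * (1000.0 / beacon_interval_ms)
    # Time on air for one beacon: preamble plus payload at the given rate.
    # 1 Mbps is 1 bit per microsecond, so bits / Mbps gives microseconds.
    frame_us = preamble_us + beacon_bytes * 8 / rate_mbps
    return beacons_per_second * frame_us / 1_000_000


# 5 SSIDs on 10 access points, beacons at 1 Mbps (long DSSS preamble).
slow = beacon_airtime_fraction(ssids=5, aps=10, rate_mbps=1, preamble_us=192)
# Same setup with 12 Mbps as the lowest enabled rate (OFDM preamble).
fast = beacon_airtime_fraction(ssids=5, aps=10, rate_mbps=12, preamble_us=20)
print(f"1 Mbps: {slow:.0%} of airtime, 12 Mbps: {fast:.0%} of airtime")
```

With these assumed values the 1 Mbps case comes out slightly above 100% of airtime and the 12 Mbps case below 10%, which matches the order of magnitude we observed.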

5.2. Radius VLAN assignment

Large L2 domains are fun. Especially in a wireless network. Multicast clogs the air, open peer-to-peer connections within a segment let one infected host attack the others, and so on. The obvious solution was to move to 802.1X. Clients were divided into several dozen groups, and a separate VLAN and separate access lists were created for each.

By a firm decision, p2p was prohibited in the trusted SSID. For a WLAN with RADIUS authorization, the WLC lets you combine any number of VLANs into a logical group and hand each user the network segment they need. The user does not have to think about where to connect. In our dreams, the final scheme had just two SSIDs, PSK for guests and WPA2-Enterprise for corporate users, but this dream quickly crashed into harsh reality.
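To make the idea of per-user VLAN assignment more concrete, here is a minimal sketch of what the RADIUS server has to return in its Access-Accept. The three attributes are the standard ones from RFC 3580 for dynamic VLAN assignment; the group names, VLAN numbers and the fallback VLAN are invented for illustration and are not our real configuration.

```python
# Hypothetical mapping of directory groups to VLAN IDs.
GROUP_TO_VLAN = {
    "developers": 110,
    "support": 120,
    "finance": 130,
}
DEFAULT_VLAN = 999  # hypothetical restricted VLAN for users without a group


def vlan_attributes(group: str) -> dict:
    """Build the RADIUS reply attributes that place a user into a VLAN."""
    vlan_id = GROUP_TO_VLAN.get(group, DEFAULT_VLAN)
    return {
        "Tunnel-Type": "VLAN",                    # attribute 64, value 13
        "Tunnel-Medium-Type": "IEEE-802",         # attribute 65, value 6
        "Tunnel-Private-Group-ID": str(vlan_id),  # attribute 81
    }


print(vlan_attributes("support"))
print(vlan_attributes("contractors"))  # unknown group falls back to the default
```

The user always connects to the same SSID, and the segment is chosen entirely on the server side from group membership.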

5.3. 30+ SSID

The need for new WLANs appeared immediately. Some devices did not support 802.1X but still had to sit in controlled segments. Others required p2p, and the rest had particularly specific requirements, such as PBR of traffic through servers, or IPv6-only.

At the same time, the 3602 access points can broadcast no more than sixteen SSIDs (and the 802.11ac modules, which we had hopes for in the future, no more than eight).

But announcing even 16 SSIDs means giving up a very substantial share of airtime.
AP Groups came to the rescue: the ability to broadcast certain networks only from certain access points. In our case, each floor became a separate group with its own set of SSIDs. If desired, the subdivision can be taken further.
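As a data structure, AP Groups boil down to a mapping from a group to a set of SSIDs and a set of access points. The sketch below uses invented group, SSID and access point names; only the limit of sixteen SSIDs per access point comes from the text. It is an illustration of the idea, not controller configuration syntax.

```python
MAX_SSIDS_PER_AP = 16  # limit quoted above for the 3602

# Hypothetical layout: one AP group per floor.
AP_GROUPS = {
    "floor-03": {"ssids": ["corp", "guest", "qa-lossy"],
                 "aps": ["ap-03-01", "ap-03-02"]},
    "floor-04": {"ssids": ["corp", "guest", "voice"],
                 "aps": ["ap-04-01"]},
}


def oversized_groups(groups: dict, limit: int = MAX_SSIDS_PER_AP) -> list:
    """Return the names of groups that try to broadcast too many SSIDs."""
    return [name for name, group in groups.items()
            if len(group["ssids"]) > limit]


def ssids_for_ap(groups: dict, ap_name: str) -> list:
    """Return the SSIDs an access point broadcasts, based on its group."""
    for group in groups.values():
        if ap_name in group["aps"]:
            return list(group["ssids"])
    return []  # access point is not assigned to any group


print(ssids_for_ap(AP_GROUPS, "ap-04-01"))
print(oversized_groups(AP_GROUPS))
```

The total number of WLANs on the controller can then exceed what a single access point can announce, as long as every individual group stays under the limit.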

5.4. Multicast and mDNS

The next problem follows from the previous section: devices that require multicast and mDNS (Apple TV being the most common example). All users are split across VLANs and do not see each other's traffic, and keeping a separate mDNS device in every VLAN is somewhat problematic. On top of that, SVI redundancy on the routers was initially implemented with VRRP, which uses multicast and by default sends the authentication key in clear text.

Connect to Wi-Fi, sniff the traffic, craft a hello packet, become the master. So we add MD5 to VRRP. Now the hello packets are protected to some extent. Protected, and still sent to every client, like the rest of the multicast traffic within the segment. In other words:

Devices that require mDNS do not fully work for us;
Traffic that clients do not need (and not just VRRP hellos) reaches them anyway.

The solution to the second problem seemed to suggest itself: turn off multicast in the wireless network. The first problem was, at that time (before release 7.4), somewhat more complicated. We had to bring up a server in each of the required VLANs that listened for mDNS requests and relayed them between clients and devices. That solution is obviously unreliable and unstable, and it does not fully solve the problem of multicast being present.

Starting with 7.4, Cisco rolled out an mDNS proxy at the controller level. It became possible to send all mDNS requests containing a particular "service string" (for example, _airplay._tcp.local. for Apple TV) only to interfaces with a specific mDNS profile. This can be configured separately on each access point, which makes it possible to relay requests even from VLANs the controller is not physically connected to, by plugging just one access point in there. This functionality works regardless of the global multicast settings, which lets you turn multicast off and safely discard the packets. Which is what we did.
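The filtering idea behind such a proxy can be illustrated with a small sketch. This is not how the controller is implemented; it only shows the decision "which interfaces may see a request for this service string". The interface names and profile contents are hypothetical, while the service strings themselves are the usual ones for AirPlay, AirPlay audio and IPP printing.

```python
# Hypothetical mDNS profiles: interface name -> allowed service strings.
MDNS_PROFILES = {
    "vlan-110-dev": {"_airplay._tcp.local.", "_raop._tcp.local."},
    "vlan-120-support": {"_ipp._tcp.local."},
    "vlan-130-finance": set(),  # no mDNS services allowed here
}


def interfaces_for_query(service: str, source_interface: str,
                         profiles: dict = MDNS_PROFILES) -> list:
    """Return the interfaces a query for `service` may be relayed to."""
    # Normalize to a lower-case, fully qualified name with a trailing dot.
    normalized = service.lower()
    if not normalized.endswith("."):
        normalized += "."
    return sorted(
        name for name, allowed in profiles.items()
        if name != source_interface and normalized in allowed
    )


# A client in the support VLAN looks for an Apple TV.
print(interfaces_for_query("_airplay._tcp.local", "vlan-120-support"))
# An unlisted service is not relayed anywhere.
print(interfaces_for_query("_smb._tcp.local.", "vlan-110-dev"))
```

Everything that does not match a profile is simply dropped, which is why general multicast can stay disabled.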

5.5. Multicast again

We turned off multicast. The network load went down. It would seem that happiness is here. But then one or two clients turn up that still need it. Unfortunately, we could not manage without a kludge here. And this kludge turned out to be FlexConnect, which is not intended for this purpose, and in general ...

FlexConnect is a feature that lets you attach access points located, for example, in a remote office to a controller for centralized management. The main thing for us in this case is the ability to do Local Switching on such access points. It exists so that access points can process traffic on their own (broadcast SSIDs and so on) if connectivity to the controller drops, or if we do not want to push all of an access point's traffic through the controller.

We put a separate access point into a FlexConnect group, create a separate SSID in that group and process all of its traffic locally. On the one hand, this is an obvious misuse of the functionality; on the other, it lets us bring up small unfiltered wireless L2 domains as needed without touching the main infrastructure.

5.6. Rogue AP

Sooner or later you need protection against the evil twin, because BYOD does not let you protect the client from itself. All access points embed in their beacons a frame that marks them as belonging to the controller. When a beacon with an incorrect frame is received, its BSSID is recorded.

At a set interval, every lightweight access point leaves its channel for 50 ms to collect information about interference, noise, and unknown clients and access points. When a rogue AP with an SSID identical to one of the trusted ones is found, a corresponding entry is created in the table of "enemies". After that you can either hunt the device down by hand or suppress it with the controller. In the latter case, the controller tells several access points that are not busy serving data to sniff the twin's traffic and send deauth packets both to the twin on behalf of its clients and to all of its clients on the twin's behalf.

Potentially, this functionality is very interesting and very dangerous at the same time. One configuration mistake, and we take down every Wi-Fi network unknown to the controller within the coverage radius of our access points.
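The classification step can be sketched as follows. This is a simplified illustration under assumptions of my own: the beacon is reduced to three fields, the validity of the controller's mark is a ready-made boolean, and the SSID and BSSID values are invented. The real controller logic is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beacon:
    bssid: str
    ssid: str
    has_valid_controller_tag: bool  # result of checking the controller's mark


TRUSTED_SSIDS = {"corp", "guest"}  # hypothetical names of our own networks


def classify(beacon: Beacon, managed_bssids: set,
             trusted_ssids: set = TRUSTED_SSIDS) -> str:
    """Classify an overheard beacon as 'managed', 'evil-twin' or 'unknown'.

    `managed_bssids` is expected to contain lower-case BSSIDs.
    """
    if beacon.bssid.lower() in managed_bssids and beacon.has_valid_controller_tag:
        return "managed"
    if beacon.ssid in trusted_ssids:
        # Someone else is announcing one of our SSIDs.
        return "evil-twin"
    # A neighbour's network: record it, but do not touch it.
    return "unknown"


managed = {"aa:bb:cc:00:00:01"}
print(classify(Beacon("AA:BB:CC:00:00:01", "corp", True), managed))
print(classify(Beacon("de:ad:be:ef:00:01", "corp", False), managed))
print(classify(Beacon("de:ad:be:ef:00:02", "CoffeeShop", False), managed))
```

Keeping "unknown" separate from "evil-twin" is the important design choice here: only the latter should ever be a candidate for containment.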

6. Conclusion

This article does not claim to be a guide on how to build a wireless network correctly. Rather, these are simply the main problems we ran into while replacing and expanding the office infrastructure.

Today, in the main office alone, we have more than three thousand wireless clients and more than three hundred access points, so some of these solutions may be inapplicable or excessive in other conditions.

P.S. I did not find any mention of WLCCA on Habr. It is a controller configuration analyzer that points out some problems and gives configuration advice. An invite can be requested here. Feed it the show run-config output (215,000 lines in our case) and get back a page with an analysis of everything interesting on the WLC. Enjoy!

Source: https://habr.com/ru/post/312580/
