
Avery's Laws of Wi-Fi Reliability

Router Replacement:

Manufacturer A: 10% broken
Manufacturer B: 10% broken
P (both A and B are broken):
10% × 10% = 1%

Replacing the router (or firmware) almost always solves the problem.
Adding a Wi-Fi extender:
Router A: 90% working
Router B: 90% working
P(both A and B work at the same time):
90% × 90% = 81%

The secondary router almost always makes the situation worse.
All wireless networks, whether LTE or mesh, fail sooner or later, but I'd bet your Wi-Fi network is less reliable than your phone's LTE connection. At the Battlemesh v10 conference, we all sat in a room with dozens of experimental, misconfigured Wi-Fi routers broadcasting open networks that might or might not provide Internet access. What makes a network reliable or unreliable?

After several years of messing with these technologies (surrounded by a bunch of engineers working on other distributed-systems problems, which, as it turns out, have the same constraints), I think I can draw some conclusions. Distributed systems become more reliable when you can get service from one node OR another. They become less reliable when the service depends on one node AND another. The probabilities combine multiplicatively, so the more nodes you have, the faster reliability drops.

For an example unrelated to wireless networks, imagine a web server that talks to a database. If they run on two computers (real or virtual), your web application only works if the web server AND the database server are both working; it breaks as soon as either one fails. In essence, such a system is less reliable than one that needs a web server but no database. Conversely, imagine you build a fault-tolerant setup with two database servers, so that if one fails, you switch to the other. The database works if the primary OR the secondary server is up, and that is much better. But it is still less reliable than not needing a database server at all.
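
To make those numbers concrete, here is a minimal sketch (the 90% figures are just the example values used throughout this article):

    # Reliability of each individual component (example figures).
    p_web = 0.9
    p_db = 0.9

    # AND case: the app needs the web server AND the database to both be up.
    p_app = p_web * p_db
    print(round(p_app, 3))              # 0.81 -- worse than either component alone

    # OR case: two redundant database servers; the app needs the primary OR the secondary.
    p_db_pair = 1 - (1 - p_db) ** 2
    print(round(p_web * p_db_pair, 3))  # 0.891 -- better, but still below the 0.9 of a web server alone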

Let's go back to Wi-Fi. Imagine I have a router from manufacturer A. Wi-Fi routers are usually so-so, so for the sake of example let's assume it is 90% reliable, and for simplicity define that as "it works well for 90% of users, and 10% experience annoying bugs." So 90% of users with a brand A router will be happy and never trade it for anything. The remaining 10% will be unhappy, so they buy a new router from manufacturer B. That one also works well for 90% of users, but its bugs are uncorrelated with A's, so it works for a different 90%. This means 90% of people keep their brand A router and are satisfied, and 90% of the 10% who switched to a brand B router are also satisfied. That's a 99% satisfaction rate! Even though each router is only 90% reliable. That's because everyone gets a choice between router A OR router B, so they keep the one that works well and throw out the other.
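
A quick sanity check of that arithmetic, again using the article's example figures:

    p_a = 0.9   # fraction of users for whom router A works well
    p_b = 0.9   # fraction of users for whom router B works well (bugs uncorrelated with A's)

    # A user is happy if router A works for them, OR, failing that, router B does.
    print(round(p_a + (1 - p_a) * p_b, 3))      # 0.99
    # Equivalent form: 1 minus the chance that both routers are broken for the same user.
    print(round(1 - (1 - p_a) * (1 - p_b), 3))  # 0.99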

This applies equally to firmware (vendor vs openwrt vs tomato) or to program versions (people may not upgrade from v1.0 to v2.0 until v1.0 starts giving them problems). In our project there is a v1 router and a v2 router. The first version worked fine for most users, but not for everyone. When the second version came out, we started giving v2 routers to all new users, as well as to the v1 users who complained about problems. When we plotted the graph of user satisfaction, we saw it jump right after the release of the second version. Great! (Especially great because the v2 router was developed by my group :).) So now upgrade everyone, right?

Actually, not necessarily. The problem is that we biased our statistics: we only upgraded to v2 those v1 users who were having problems. We did not "upgrade" back to v1 the v2 users who were having problems (and of course there were some). Maybe both routers are only 90% reliable; the story above could just as well have played out in reverse. The same phenomenon explains why some people switch from openwrt to tomato and rave about how much more reliable the new firmware is, and vice versa. The same goes for Red Hat vs Debian, or Linux vs FreeBSD, and so on. This "works for me!" phenomenon is well known in the open source world; it's simple probability. You only have an incentive to switch if you are having problems right now.

But the other side of the equation is also true, and it matters for mesh networks. When you install several routers in a mesh chain, you depend on several routers at once, or your network falls apart. Wi-Fi is notorious for this: one router accepts connections but behaves strangely (for example, it doesn't route packets), clients stay attached to it anyway, and nothing works for anyone. The more nodes you add to the chain, the faster the probability of that outcome grows.
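
A minimal sketch of how quickly that probability drops, assuming 90% reliability per node and independent failures:

    p = 0.9  # probability that any single mesh node is working (example figure)

    # AND case: a path through an n-node chain only works if every node works.
    for n in range(1, 6):
        print(n, round(p ** n, 3))
    # 1 0.9
    # 2 0.81
    # 3 0.729
    # 4 0.656
    # 5 0.59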

Of course, LTE base stations have reliability problems too, plenty of them. But they usually aren't arranged in a mesh topology, and each LTE station covers a much larger area, so you depend on fewer nodes at once. Besides, each LTE node is usually "too big to fail": when it breaks, it instantly causes problems for so many people that the phone company rushes to fix it. A single faulty node in a mesh network covers only a small area, so problems show up only when you pass through that spot; in most situations everything is fine. All this adds up to a vague impression that "Wi-Fi mesh networks are buggy, and LTE is reliable," even if your own mesh node works most of the time. It's all a game of statistics.

Solution: the buddy system
Have a friend tell you when you're not okay.

Router A: 90% working
Router B: 90% working
P(either A or B works):
1 - (1 - 0.9) × (1 - 0.9) = 99%

In the past 15 years or so, the theory and practice of distributed systems have come a long way. We now basically know how to turn an AND situation into an OR situation. If you have a RAID5 array and one of the disks fails, you take that disk out of service so you can replace it before another one fails. If you have a NoSQL database spread across 200 nodes, you make sure requests are not sent to the failed nodes, so the other nodes can take over their work. If one of your web servers gets bogged down by bloated Ruby on Rails code, your load balancer redirects traffic to another, less loaded node until the first server recovers.
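
The pattern behind all three examples is the same: send work only to nodes that currently pass a health check, so the service needs some node OR another rather than one specific node. A minimal sketch, with healthy() as a hypothetical placeholder probe:

    import random

    def healthy(node):
        """Hypothetical placeholder: probe `node` and return True if it responds correctly."""
        raise NotImplementedError

    def pick_backend(nodes):
        """Send each request to any node currently passing its health check,
        so a failed node is simply taken out of rotation (OR behavior)."""
        alive = [n for n in nodes if healthy(n)]
        if not alive:
            raise RuntimeError("no healthy backends")
        return random.choice(alive)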

The same should go for Wi-Fi: if your router is acting strangely, it needs to be taken out of service until it gets fixed.

Unfortunately, whether a Wi-Fi router is working properly is harder to measure than whether a database or web server is. A database server can easily test itself: just run a couple of queries and make sure the responses are in order. Since web servers are reachable over the Internet, you can run a single checking service that periodically polls every server and signals that a restart is needed when one stops responding. But by definition, not all mesh nodes are reachable over a direct Wi-Fi link from any one location, so a single checking service won't work.

Here is my suggestion, which we can call the "Wi-Fi buddy system." The analogy: you and your friends go to a bar, you get too drunk, and you start acting like a jerk. Since you are too drunk, you don't necessarily know you are acting like a jerk; it can be hard to tell from the inside. But you know who can tell? Your friends. Usually even if they are drunk too.

Although by definition not all mesh nodes are reachable from any one location, you can also say that by definition every mesh node is reachable from at least one other mesh node. Otherwise it wouldn't be a mesh, and you'd have even bigger problems. This hints at how to fix the situation. From time to time, each mesh node should try to connect through one or more neighboring nodes, posing as an end user, and see whether traffic gets routed or not. If it does, great! We tell that node it's doing fine and should keep it up. If not, too bad! We tell that node it should take itself out of service. (Strictly speaking, the safest way is to send only "you're doing fine" messages after a probe; a failed node may not be able to receive "things are bad" messages. We need something like a watchdog that reboots the node if it hasn't received an "all is well!" message for a certain period of time.)
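
A minimal sketch of what such a node-side loop might look like, assuming a few hypothetical placeholder functions (probe_via_neighbor, send_all_is_well, reboot_self) and arbitrary example timeouts:

    import time

    PROBE_INTERVAL = 60        # seconds between probes of the neighbors (example value)
    ALL_IS_WELL_TIMEOUT = 600  # reboot if no "all is well" heard for this long (example value)

    # Updated by a separate listener (not shown) whenever a neighbor reports
    # that traffic routed through us actually worked.
    last_all_is_well = time.monotonic()

    def probe_via_neighbor(neighbor):
        """Hypothetical placeholder: associate with `neighbor` as if we were an
        end user and try to push traffic through it; True if traffic was routed."""
        raise NotImplementedError

    def send_all_is_well(neighbor):
        """Hypothetical placeholder: tell `neighbor` that, from our side, it works."""
        raise NotImplementedError

    def reboot_self():
        """Hypothetical placeholder: take this node out of service (e.g. reboot)."""
        raise NotImplementedError

    def buddy_loop(neighbors):
        while True:
            # Send only positive signals: a broken neighbor may not be able
            # to receive a "you are broken" message anyway.
            for neighbor in neighbors:
                if probe_via_neighbor(neighbor):
                    send_all_is_well(neighbor)

            # Dead man's switch: if no neighbor has vouched for us recently,
            # assume we are the broken one and take ourselves out of rotation.
            if time.monotonic() - last_all_is_well > ALL_IS_WELL_TIMEOUT:
                reboot_self()

            time.sleep(PROBE_INTERVAL)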

In a fairly dense mesh network, where there are always two or more routes between any given pair of nodes, this turns AND-type behavior into OR-type behavior. Now adding nodes (as long as they can pull themselves out of the network when they have a problem) makes the system more reliable, not less.
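
In the same spirit as the chain calculation above, with k independent routes between a pair of nodes the failure probabilities multiply instead of the success probabilities (again assuming 90% per route):

    p = 0.9  # probability that any single route through the mesh works (example figure)

    # OR case: the pair stays connected if at least one of the k routes works.
    for k in range(1, 4):
        print(k, round(1 - (1 - p) ** k, 4))
    # 1 0.9
    # 2 0.99
    # 3 0.999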

This gives mesh networks an advantage over LTE, because LTE has less redundancy. If a base station fails, a large area loses coverage, and the phone company has to rush to fix it. If a mesh node fails, we route around the problem and fix it later at our leisure.

A little bit of math goes a long way!

Want more?
You can see all my slides (pdf) about consumer Wi-Fi mesh networks (including detailed speaker notes) from the Battlemesh v10 conference in Vienna, or watch my talk on YouTube:


Note
The so-called "laws" are a special case of more general, and therefore more useful, distributed-systems theorems. But this is the Internet, so I picked one special case and named it after myself. Go ahead, try and stop me.

Source: https://habr.com/ru/post/335380/

