
The Internet is a large, dynamic environment in which everything is connected to everything else in one way or another, and the parts can influence each other. This kind of dependence, where a small change in one part of a system can cause a complete change in another, is popularly called the “butterfly effect”. It perfectly illustrates how one “well-dropped boot” can bring down a major service and, along with it, a couple of others... Let's talk about that.
Five years ago, when Wi-Fi in the subway had just appeared...
... it was a phenomenon that divided the life of Muscovites into "before" and "after." At that time, the project was the only one of its kind in the world, and everything about it was just as unique: the network structure, the monetization model, user services, and the approaches to construction and operation.
Almost since the launch of the first Wi-Fi segment in the metro, we have had authorization and our own media portal. We experimented generously with integrating third-party services into the portal, in effect exploring the possibilities of our business model (“what if we sold coffee in the subway, delivered to the exit from the station lobby?!”).
At first, we actively brought in partners from various fields. But almost every launch of a new partner service ended with that service collapsing under load and an emergency rollback of the changes. Few services can withstand thousands of new requests per minute right away, and some are incapable of it in principle because of a non-scalable architecture. This problem forced us to monitor the performance of partner services, on which the user experience directly depends, and to develop mechanisms that reduce this dependence (proxying, caching).
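The idea of reducing dependence on a partner with a proxy and a cache can be sketched roughly as follows. This is only an illustration under stated assumptions, not a description of the production setup: the class name `CachingProxy`, the `fetch` callable (assumed to raise an exception when the partner fails) and the TTL value are all invented for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CachingProxy:
    """Hypothetical wrapper that shields users from a slow or failing partner service."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # call into the partner service (assumed to raise on failure)
        self._ttl = ttl_seconds      # how long a cached answer counts as fresh
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]         # fresh entry: the partner is not contacted at all
        try:
            value = self._fetch(key)
        except Exception:
            # Partner is down or overloaded: serve the stale copy if one exists.
            return cached[1] if cached is not None else None
        self._cache[key] = (now, value)
        return value
```

With a wrapper like this, repeated requests within the TTL never reach the partner, and a partner failure degrades to stale content instead of an error page.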
There was a time when a loud cry of “Five hundred!” in the office set the whole company in motion; now such situations practically never occur.
Screenshot from July 2015: the result of launching a flower delivery service on our subdomain.
But evolution is never quick. Before we built the current system, we had to take our lumps and learn from our own experience through a series of incidents. Moreover, the process does not stop: the deeper we dive into the problem, the more unexpected dependencies we find. Looking back, we understand how important it sometimes is to have an example of how things happen in practice. That is what we want to share.
New iOS dropped traffic by 20%
MaximaTelecom specializes in building networks on transport. The vast majority of subscriber devices on our network are mobile: smartphones and tablets running Android and iOS. Both vendors, Google and Apple, have roadmaps for updating their operating systems, and new versions often change the modules responsible for connecting to Wi-Fi. At best, on the day of an update we see increased traffic because devices download the update over Wi-Fi. But there are also catastrophic cases.
Just last year, Apple released iOS 10.3.1, after which traffic on the network fell by almost 20%. It turned out that the new version “broke” the process of connecting to the network: the captive portal authorization mechanics stopped working and devices could not log in to MT_FREE. We had to release an emergency fix to correct the situation. The problem itself was fixed three minor updates later, after we filed the case in Apple's bug tracker.

The number of hits to the auth.wi-fi.ru authorization page per minute. The graph clearly shows a significant lag behind the figures for the previous period.
The situation is aggravated by the fact that Wi-Fi is a rather old and extremely widespread technology that was never designed to be used on the scale we have in the Moscow metro. So we have to deal with a whole “vinaigrette” of devices, each of which behaves on the network in its own way. Flat metrics such as abstract megabytes or “spherical subscribers in the network” do not work for us. Any service, be it basic Internet access, a media portal or a mobile application, should be viewed in the context of specific devices and/or operating systems, since a problem may concern a specific and fairly narrow group.
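One way to look at a metric per device group rather than as a flat total is sketched below. It is only an illustration: the function name `drop_by_group`, the `(group, hits)` record layout and the 15% threshold are assumptions made for the example, not our actual monitoring.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def drop_by_group(previous: Iterable[Tuple[str, int]],
                  current: Iterable[Tuple[str, int]],
                  threshold: float = 0.15) -> Dict[str, float]:
    """Return groups whose hit count fell by more than `threshold` versus the previous period.

    Each record is a hypothetical (group, hits) pair, e.g. ("iOS 10.3.1", 1200).
    """
    prev_totals: Counter = Counter()
    cur_totals: Counter = Counter()
    for group, hits in previous:
        prev_totals[group] += hits
    for group, hits in current:
        cur_totals[group] += hits

    drops: Dict[str, float] = {}
    for group, before in prev_totals.items():
        if before <= 0:
            continue
        # A group missing from the current period counts as zero hits.
        change = (before - cur_totals[group]) / before
        if change > threshold:
            drops[group] = change
    return drops
```

A breakdown like this points at the affected group directly, whereas the flat total only shows that something, somewhere, went down.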
... and a few dozen of the most exotic options.
This is not a DDoS: a mobile operator's outage caused traffic to jump by almost a third
Two years ago, one of the mobile operators had a major outage. In such cases, users look for an alternative way to stay connected. As for the subway, there were no alternative means of communication on the trains.
Clarification: Even now, only some operators provide service in areas equipped with a radiating cable. But this technology is very limited in capacity and cannot provide a comparable level of service for a significant share of users, not to mention the cost of traffic on capped tariff plans.
At the stations, however, cellular coverage is quite well developed, not to mention the above-ground segments, where Wi-Fi competes with it directly.
We learned about the outage on the mobile operator's network from our dispatch service, which announced that we were under attack. The growth in users and traffic was such that at first we thought we were being DDoSed. We learned the real reason for the increase later, when we discovered that a third of our employees had no cellular service.
That's how it looked for users of our Wi-Fi above ground.
The specifics of our situation are that we run a Wi-Fi network, so it does not matter to us which carrier's SIM card is installed in the user's device.
It is worth noting, though, that the outage also had a partial negative effect on our own service. Some segments of the MT_FREE network, in particular the networks in city buses and commuter trains, use cellular links as backhaul, so an outage on cellular networks degrades service on those segments.
Wi-Fi in the subway without ads? YES!
Advertising is the foundation of free access to the MT_FREE network: thanks to it, the service exists and pays for itself. We have been using AdFox as our main ad server for many years. Interestingly, the ad server itself has not changed significantly during our work with it. One of its peculiarities is the system for collecting impression statistics, which are aggregated in hourly intervals. This causes rhythmic peaks in the service's response time: every hour, exactly on the hour boundary, the ad rotator starts to “misbehave” and think before each answer. It took us quite a while to catch this nuance!
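A periodic pattern of this kind is easier to spot if response times are folded onto the minute of the hour. The sketch below assumes samples of the form (Unix timestamp in seconds, response time in ms) and a time zone with a whole-hour UTC offset; the function name and data layout are illustrative and not taken from our monitoring stack.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_latency_by_minute(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Average response time for each minute of the hour (0..59).

    `samples` holds hypothetical (unix_timestamp_seconds, response_ms) pairs.
    """
    buckets = defaultdict(list)
    for timestamp, response_ms in samples:
        # Whole minutes since the epoch, wrapped to the position within the hour.
        minute_of_hour = (timestamp // 60) % 60
        buckets[minute_of_hour].append(response_ms)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}
```

If minute 0 stands out from the rest of the resulting profile day after day, the slowdown is tied to the hour boundary rather than to load.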
AdFox response time for ad requests. Bursts and dips are clearly visible on the hour boundary.
In fact, we observed the same characteristic hourly “peaks” in the number of impressions in other monitoring tools, for example in Yandex.Metrica. But I want to tell you about a more extreme situation. Last winter, AdFox had a serious outage: the service did not respond for a long time. In our metrics, this showed up as an absence of user authorizations and a sharp drop in the portal's figures. At the same time, the AdFox management interface was unavailable and returned a certificate error.
Illustration of the adfox.ru certificate error.
After running a couple of tests and calling AdFox directly, we learned about the outage, and we had no choice but to let all identified users onto the network without ads.
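The decision to let users in without ads when the ad server does not answer amounts to a fail-open rule. A rough sketch of such a rule follows; `authorize`, the `request_ad` callable, the timeout value and the returned dictionary are all assumptions made for illustration, not our authorization code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def authorize(user_id: str,
              request_ad: Callable[[str], str],
              timeout_seconds: float = 2.0) -> dict:
    """Grant access; attach an ad only if the ad server answers in time (fail-open)."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(request_ad, user_id)
    ad: Optional[str]
    try:
        ad = future.result(timeout=timeout_seconds)
    except Exception:
        # Timeout or ad-server error: fail open and skip the ad.
        ad = None
    finally:
        # Do not make the user wait; the worker thread finishes in the background.
        pool.shutdown(wait=False)
    return {"user": user_id, "access": True, "ad": ad}
```

The trade-off is explicit: a failing ad server costs impressions, but it no longer blocks authorization.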
And here is what the outage looked like in Yandex.Metrica on our portal.
Accelerating download sometimes leads to unexpected results
The perceived quality of our service depends not only on third-party infrastructure, OS updates and outages at mass-market services, but also on the behavior of specific browsers on specific devices. Here we have far more room to influence things, so we are constantly working to improve our products. On average, we publish one update per day. But sometimes a seemingly simple update that should improve the user experience leads to unpredictable consequences.
Since we can influence how services work at the network level (for example, by changing the priority of one type of traffic relative to another), the idea arose to speed up authorization by prioritizing its traffic. We published the changes and watched in amazement as numerous errors appeared and advertising revenue dropped by 20%. Technical tests showed that the scheme worked absolutely correctly from a network point of view. Rolling back the changes, however, confirmed that the cause was the new settings.
Eventually we found that by raising the priority of some scripts over others, we had changed the order in which functions executed while the authorization page loaded in the browser. This greatly affected the user experience. In effect, the authorization scripts began to load and run sooner than the advertising scripts. Because of the dependencies between them, there were situations where one function was waiting for the result of another whose file had not even been downloaded to the device yet.
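At its core this is an ordering problem: a script must not run before the scripts it depends on have loaded. Below is a minimal sketch of checking a load order against declared dependencies, using Python's standard `graphlib` module (Python 3.9+). The script names and the dependency map are invented for the example and do not describe our actual page.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Set, Tuple


def first_violation(load_order: List[str],
                    depends_on: Dict[str, Set[str]]) -> Optional[Tuple[str, str]]:
    """Return (script, missing_dependency) for the first script that would run
    before one of its dependencies has loaded, or None if the order is safe."""
    loaded: Set[str] = set()
    for script in load_order:
        for dependency in sorted(depends_on.get(script, set())):
            if dependency not in loaded:
                return script, dependency
        loaded.add(script)
    return None


def safe_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """One load order that respects every declared dependency."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical dependency map: the authorization script needs the ad script.
deps = {"auth.js": {"ads.js"}, "ads.js": set()}
print(first_violation(["auth.js", "ads.js"], deps))  # order after prioritizing auth
print(safe_order(deps))
```

A check like this runs before publication, so a change in network priorities that reverses the effective load order is caught as a dependency violation rather than as a drop in revenue.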
Social Media vs Media
User behavior on the Internet follows typical patterns. People are used to communicating through instant messengers, looking for content on media portals, and reading news through social networks and news aggregators. It is fairly obvious, but I still want to stress that social networks are an alternative to news sites, and vice versa. When something suddenly happens to one source of information, users' attention is redistributed among the remaining ones, usually the most accessible. In 2017, for example, VKontakte suffered a global failure. From our side, this event looked like a sharp increase in the number of users and time spent on our news portal wi-fi.ru. In effect, users who realized that their favorite social network was down came to us to read the news.
The moment VK went down was marked by a 30% increase in load on the wi-fi.ru portal.
This case illustrates how important it is for mass services to have a safety margin for “digesting” the consequences of an outage at an informational “neighbor”.
Green color - no accidents
The situations described constantly push us to improve monitoring of third-party services in MT_FREE. Here is what our network operations dashboard looks like.
Network operations dashboard in St. Petersburg.
Dashboards consist of a set of “traffic light” indicators: green means OK, red means alarm. The color of the indicators changes over time, which can be either normal behavior or a sign of deviation from the norm. But if you stretch all the indicators out into a line and lay each measurement cycle on the board as a new row, you get a two-dimensional, ever-growing picture that describes the evolution of the network as a whole. This picture can easily be “fed” to standard machine learning algorithms built to recognize graphic patterns (a kind of FindFace, only for sensor patterns).
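The construction of that picture, stretching the indicators into a row and stacking one row per measurement cycle, can be sketched as follows. The indicator names, the two-state encoding and the helper names are assumptions for illustration; the result is a plain list-of-lists “image” that could then be handed to a pattern-recognition model.

```python
from typing import Dict, List, Sequence

GREEN, RED = 0, 1   # assumed encoding: 0 = OK, 1 = alarm


def to_row(indicators: Dict[str, bool], order: Sequence[str]) -> List[int]:
    """Flatten one measurement cycle into a row, using a fixed indicator order.

    `indicators` maps a hypothetical indicator name to True when it is in alarm.
    """
    return [RED if indicators[name] else GREEN for name in order]


def stack_cycles(cycles: Sequence[Dict[str, bool]],
                 order: Sequence[str]) -> List[List[int]]:
    """Stack rows chronologically: one row per cycle, one column per indicator."""
    return [to_row(cycle, order) for cycle in cycles]
```

Keeping the column order fixed matters here: a pattern is only recognizable if each indicator always lands in the same column.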
The color chart of indicators unrolled in time is nothing more than a picture describing the evolution of the network.
Next, self-learning algorithms (AI of a sort) are added, which can automatically classify patterns and identify the causes of deviations or gaps in the data. It all looks simple, but how many telecom operators do you think really use this?
Few, and we are not among them
To be fair, the application of this technology within MaximaTelecom itself is at a rather early stage, largely because it is not clear where the boundary lies between what has to be obtained from outside the network and what can be obtained from within. Our only advantage here is that we began developing the necessary algorithmic base from the very start, within our platform for advertising monetization of the network.
Maxima is, first of all, the operator of a free Wi-Fi access service. Moreover, unlike the fairly large number of “social” Wi-Fi networks, we are a full-fledged commercial telecom operator. In fact, this is our corporate idea: we strive to make communication free and at the same time profitable, and we have already proved that this is possible. Almost no telecom operator in the world can do this (or wants to), and therefore none develops technology for it. This gives us hope that in the future we will be able to bring our technologies to a level where the MT_FREE user experience does not differ from what traditional paid carriers provide, while reliability is higher thanks to a more advanced intelligent management and operations system.
Unfortunately, not all problems can be solved within one company, if only because there are many manufacturers of subscriber and network Wi-Fi equipment, and the level of unification is significantly lower than in cellular networks. We have been solving problems with various devices connecting to the network since launch. The “root of evil” is the absence of any standard; as a result, each manufacturer does things its own way.
International associations exist to solve such industry-wide problems. For example, we are currently leading a project to standardize the user experience of connecting to Wi-Fi networks that use advertising monetization. But that is a topic for another article.
By the way, we are constantly expanding our development team; current vacancies can be found on our career page.