What network speed does your provider actually deliver?

If you administer a distributed network, you probably face, from time to time, the task of estimating the actual bandwidth of the channels (VPN) between offices. Broadly, there are three ways to do this. The first, using Iperf and similar tools, is very simple but unsuitable for monitoring channels on a regular basis. The other two can be used continuously, but they are so complex and expensive that only large and very large companies can afford them. This category includes solutions based on real-time analysis of network traffic (network sniffing) and solutions that mix test traffic into user traffic and evaluate how the test traffic passes. But what should everyone else do: those who need to control channel quality but, for objective reasons, are not the target audience for ultra-expensive solutions? In this article I describe a new technology for managing network quality. It is far simpler and more accessible than solutions based on test traffic injection or network sniffing and does not require hardware probes, yet it lets you manage the quality of a distributed network on a regular basis.

This is a purely pragmatic concept that grew out of tasks arising in practice. The novelty of the technology is that the load testing method (used in Iperf, Chariot, etc.), normally used for one-off measurements of network speed, is integrated into the network monitoring system. Hence the name of the technology: Network Load Monitoring (NMS). In short, the idea of NMS is to run load tests automatically at times when internal network users are not active (these moments are detected automatically). Network throughput is measured at the TCP level. The main field of application of NMS is quality management of VPN connections, for example as part of Service Level Management.

First, a few words about network quality, in particular the quality of communication channels: which methods are most often used to assess it, and what their advantages and disadvantages are. (Hereinafter, the terms "network" and "communication channel" are used as synonyms.) The quality of a network is its fitness for specific purposes. The network works well if it provides the required end-to-end response time, the required voice quality (R-Value, MOS), and so on. In other words, network quality means meeting the availability requirements of IT Services. IT Services, not just channels or equipment. High availability of network equipment, low port utilization, the absence of transmission errors, etc. do not guarantee that the network provides the required availability of IT Services. Low port utilization may be caused by delays on the provider's equipment that lead to packet retransmission after timeouts; packet loss may not be accompanied by errors on the ports of network devices; and so on. This means that, as a rule, a standard monitoring system based on SNMP and WMI cannot measure network quality. Such systems are not intended for this.

In general, when it comes to measuring or assessing the quality of the network, you need to specify what exactly is meant:

Network quality monitoring, i.e. assessing network quality DURING operation. Two technologies are usually used for monitoring: test packet generation and network sniffing.
Network testing (load testing), which is usually performed before operation begins, for example at the commissioning stage or during acceptance tests. Testing usually means applying a strong impact to the network and assessing the network's response to it. Most often the impact is the transfer of large data arrays, and the response is assessed by measuring the time the transfer takes.

Note: In this case, we are only talking about technologies designed to monitor the quality of the network. Therefore, SNMP, WMI, Transaction Simulation, Application Instrumentation and others are not considered.

Test packet generation (mixing test packets into working traffic) is the continuous transmission of network-layer or transport-layer test packets into the network, against the background of running applications, and the measurement of delay, delay variation (jitter), packet loss, etc. Network quality is judged by how the test packets pass; their rate is usually kept low so as not to interfere with applications. From these measurements, integrated indicators of network performance are derived: R-Value, MOS, etc. The IP SLA technology supported by Cisco Systems equipment is based on this method.
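To make the measured quantities concrete, here is a minimal Python sketch that summarizes one series of test packets into average delay, jitter and packet loss. It is not how IP SLA is implemented: the function name and the input format are hypothetical, and the jitter estimator is the smoothed one described in RFC 3550, chosen here only as a well-known example.

```python
from statistics import mean


def summarize_probe_results(sent_count, delays_ms):
    """Summarize one series of test packets (hypothetical helper).

    sent_count: number of test packets that were sent.
    delays_ms:  delays in milliseconds of the packets that arrived,
                in arrival order.
    """
    if sent_count <= 0:
        raise ValueError("sent_count must be positive")

    received = len(delays_ms)
    loss_pct = 100.0 * (sent_count - received) / sent_count

    # Smoothed jitter in the style of RFC 3550:
    # J = J + (|D| - J) / 16, where D is the change in delay
    # between two consecutive packets.
    jitter = 0.0
    for prev, cur in zip(delays_ms, delays_ms[1:]):
        jitter += (abs(cur - prev) - jitter) / 16.0

    return {
        "avg_delay_ms": mean(delays_ms) if delays_ms else None,
        "jitter_ms": jitter,
        "loss_pct": loss_pct,
    }
```

Integrated indicators such as R-Value or MOS are computed from values like these, but the exact models are outside the scope of this sketch.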

Advantages of the test packet generation method:

High measurement accuracy. The transfer of test packets and all measurements are usually performed at the hardware level. For example, IP SLA can measure one-way delay, not just round-trip delay as ICMP and cheap imitations of IP SLA do.
The ability to assess the quality of the network in the absence of user traffic. Even if there are no active users on the network, you will always know whether the network is working well or not.

There are limitations:

Relatively high cost if the method is not supported by channel-forming equipment. If the entire network is built only on the basis of equipment that supports IP SLA, for example, based on Cisco Systems, then everything is fine and hardware probes are not needed. In all other cases, special hardware probes must be installed on the network, which must also be serviced.
Difficulty of assessing the potential of the network. The processor on board a network device usually lacks the performance to fully load a 100 Mbps channel, let alone a 1 Gbps or 10 Gbps one.
Complexity of interpreting the results and, as a consequence, complexity of using them for PRACTICAL management of the quality of the received service (Service Level Management). Suppose you want to evaluate the performance of a Network Service Provider (NSP). If this is a VoIP network, everything is simple, because the threshold values of jitter, delay and packet loss for different types of codecs are well known. But what if the network also carries, for example, 1C:Enterprise or SAP CRM? Do you know which metrics should be measured in that case, and what their threshold values should be to meet the required Service Level Objectives (SLO) and Service Level Targets (SLT)? Of course, they can be determined: build a baseline, run a correlation analysis against the response time of business applications, and so on. But this is difficult and expensive, and in practice very few people do it.

It can be argued that test packet generation is today the de facto standard for managing the QoS (Quality of Service) of IP networks. It is actively used by both large telecommunication companies and large corporate clients. However, in the Enterprise sector this method is gradually losing ground to network sniffing, primarily because it cannot control QoE (Quality of Experience, the quality of business applications as seen by users).

Network sniffing is the capture and analysis of all packets passing through the network, extracting data from the data link layer (errors, load, etc.), the transport layer (delays at the client, in the network and at the server, lost packets, zero window size, etc.) and the application layer (response time, MOS, etc.).

Advantages of network sniffing:

Ability to control the quality of the network with reference to QoE.
Ability to control the network quality (QoS) actually received by each network user.
Highest accuracy.
The ability to not only see but also reproduce the situation on the network that occurred in the past, for example, when a user complained about the slow work of a business application (Retrospective Network Analysis, RNA).

Disadvantages of the network sniffing method are almost the same as in the test packet generation method:

Inability to assess the potential of the network.
The high cost of professional tools. The only exception is the free Wireshark, which can hardly be counted as a professional toolkit.
The complexity of interpreting the results obtained and, as a result, the relative complexity of use for Service Level Management.
An additional limitation: the inability to assess the quality of the network in the absence of user traffic.

Nevertheless, network sniffing is today the de facto standard for assessing network quality inside the data center, as well as for QoE monitoring (Real User Monitoring).

For network testing, simpler methods and tools are usually used. (The exception is testing a network's suitability for voice and video transmission, for example with PESQ (Perceptual Evaluation of Speech Quality), but that is a separate topic.) The main method is to transfer large amounts of data and measure network bandwidth. A data transmitter and a data receiver are installed in the network, and data arrays of known size are transferred between them. The main measured indicator is network throughput (in other words, speed), calculated as the ratio of the amount of data transferred to the time the transfer took. It is assumed that the higher the throughput, the better the network quality. By varying the volume and composition of the transmitted data, the TOS, the protocol and other parameters, you can evaluate the quality of the network for various business applications.
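The calculation described above (amount of data divided by transfer time) can be sketched in a few lines of Python. This is only an illustration, not the code of Iperf or any other tool: the function names, the block size and the one-byte acknowledgement are assumptions, and the transfer runs over the loopback interface so that the example is self-contained. In a real test the transmitter and the receiver would sit at opposite ends of the channel.

```python
import socket
import threading
import time


def _sink(server_sock, expected_bytes):
    """Accept one connection, read expected_bytes, then send a 1-byte ack."""
    conn, _ = server_sock.accept()
    with conn:
        remaining = expected_bytes
        while remaining > 0:
            chunk = conn.recv(min(65536, remaining))
            if not chunk:  # sender closed the connection early
                break
            remaining -= len(chunk)
        conn.sendall(b"\x01")  # lets the sender time the complete delivery


def measure_tcp_throughput_mbps(total_bytes=2_000_000, block_size=65536):
    """Send total_bytes over a loopback TCP connection and return Mbps."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: the OS picks a free port
    server.listen(1)
    port = server.getsockname()[1]

    receiver = threading.Thread(
        target=_sink, args=(server, total_bytes), daemon=True
    )
    receiver.start()

    block = b"\x00" * block_size
    with socket.create_connection(("127.0.0.1", port)) as client:
        start = time.perf_counter()
        sent = 0
        while sent < total_bytes:
            n = min(block_size, total_bytes - sent)
            client.sendall(block[:n])
            sent += n
        client.recv(1)  # wait until the receiver has read everything
        elapsed = time.perf_counter() - start

    receiver.join(timeout=5)
    server.close()
    # throughput = data volume / time, converted from bits/s to Mbps
    return (total_bytes * 8) / elapsed / 1_000_000


if __name__ == "__main__":
    print(f"{measure_tcp_throughput_mbps():.1f} Mbps")
```

Waiting for the acknowledgement matters: without it the timer would stop as soon as the data is handed to the local socket buffer, which overstates throughput for small transfers.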

There are many tools designed for network testing, both free (Iperf, speedtest, etc.) and commercial (FTest, Chariot, etc.). The advantages and disadvantages of most of these tools are the direct opposite of those of monitoring systems (IP SLA, network sniffing). Monitoring systems, as a rule, cannot assess the potential capabilities of a network; testing tools are intended for exactly that. Monitoring systems are usually difficult to use; testing tools, on the contrary, are simple and clear. But the main advantage of testing tools is the SIMPLICITY OF INTERPRETING THE RESULTS.

As mentioned above, determining threshold values for the metrics measured by monitoring systems is often very difficult. When testing a network, you always know for sure which result means good network quality and which means bad. For example, if you test a 100 Mbps channel at the TCP level by transferring large amounts of data, the overhead (preamble, headers, synchronization) will be about 10-12%. Therefore, if the measured throughput is 88-90 Mbps, there are no losses and the network works well. The lower these numbers, the worse the network works. So if the provider says it gave you a 100 Mbps channel and the effective throughput is 88-90 Mbps, the provider is not deceiving you. But if the effective throughput is 60 Mbps, that is grounds for a conversation.
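The rule of thumb above is easy to express in code. The following Python sketch takes the article's 10-12% overhead figure as given (it is the article's estimate, not a protocol constant); the function name and the default threshold are illustrative assumptions.

```python
def assess_channel(nominal_mbps, measured_mbps, max_overhead_pct=12.0):
    """Compare measured TCP throughput with the nominal channel speed.

    Returns (relative_pct, ok), where ok is True when the shortfall
    does not exceed the overhead allowance assumed in the text.
    """
    relative_pct = 100.0 * measured_mbps / nominal_mbps
    ok = relative_pct >= 100.0 - max_overhead_pct
    return relative_pct, ok


if __name__ == "__main__":
    for measured in (89.0, 60.0):
        pct, ok = assess_channel(100.0, measured)
        verdict = "within overhead allowance" if ok else "worth a conversation"
        print(f"{measured} Mbps of 100 Mbps = {pct:.0f}%: {verdict}")
```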

The main limitation of most testing tools (Iperf, Chariot, etc.) is that they cannot be used continuously, in particular while the network is in operation. You can ask users to stop working and measure network bandwidth once, twice or ten times, but you cannot do that all the time. The network exists for work, not for testing. If you test the network while users are working on it, then, first, you will interfere with the users and, second, the test results will not be reliable, because they will be affected by user traffic.

As follows from the above, testing and monitoring complement each other. Testing is carried out when no users are working; monitoring, on the contrary, when users are working. But there is another way. There are always time intervals when users are inactive. If these intervals are detected automatically, the network is tested during them, and the test results are sent to the monitoring system, we solve two important tasks at once:

Ensure high reliability of test results. Usually, conclusions about network quality are drawn from just a handful of measurements; in our case they are drawn from tens of thousands of measurements. This makes it possible to see how network quality changes over time and how it depends on the day of the week, the time of day, etc.
Expand the functionality of the monitoring system. In addition to the metrics that characterize the availability and health of active network equipment and servers, we get metrics that characterize the effective network bandwidth. Comparing equipment health with network bandwidth makes it easier to diagnose hidden defects and network bottlenecks.
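With many measurements collected, the dependence of throughput on the day of the week and the time of day can be seen with a simple grouping. The Python sketch below is only an illustration of that idea: the input format (timestamp plus throughput in Mbps) and the choice of the median as the summary statistic are assumptions, not something prescribed by the technology.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def median_by_weekday_and_hour(samples):
    """Group throughput samples by (weekday, hour) and take the median.

    samples: iterable of (datetime, throughput_mbps) pairs.
    Returns a dict keyed by (weekday, hour), where Monday is weekday 0.
    """
    buckets = defaultdict(list)
    for timestamp, mbps in samples:
        buckets[(timestamp.weekday(), timestamp.hour)].append(mbps)
    return {key: median(values) for key, values in sorted(buckets.items())}
```

The median is used rather than the mean so that a single failed or interrupted measurement does not distort the picture for a whole time slot.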

I called this integration of load testing with the monitoring system Network Load Monitoring.

Network Load Monitoring (NMS). How it works

Thus, the idea of the NMS is as follows:

The periods of time when the network is not loaded with user traffic are automatically determined, and during these periods load testing of the network is automatically performed (measurement of effective throughput at the TCP level).
The network throughput (measured through load testing) is added to the number of metrics monitored by the monitoring system.

NMS can have many uses. But it is best suited for managing the quality of leased communication channels, for example, between the data center and remote offices. Let's see how it works.

Figure 1. Network Load Monitoring. Solution architecture.

To implement NMS, a monitoring system consisting of a Management Console, Probes and Responders is installed in the network. Network testing is performed by a special Test running on the Probe. In addition to load testing, the Probe monitors the operation of network equipment (utilization, errors, etc.) and passes the results to the Console. Multiple Tests can run simultaneously on a Probe: some perform equipment monitoring, others perform network load testing. The parameters of all Tests are set from the Console. For Tests that perform network load testing, these parameters are:

Traffic generation mode
Size of the data block exchanged between the Probe and the Responders
Data transfer direction
Generation schedule
Extra options

In the process of load testing, the following metrics are measured:

No	Metric	Description
1	READ (Mbps, %)	Network throughput when transferring data from the Responder to the Probe. In all cases, the absolute and the relative (to the configured value) throughput are measured simultaneously.
2	WRITE (Mbps, %)	Network throughput when transferring data from the Probe to the Responder.
3	RD-WR (Mbps, %)	Network throughput during simultaneous two-way data transfer between the Probe and the Responder.
4	TOTAL (Mbps, %)	Total network throughput during simultaneous data transfer between the Probe and several Responders. Depending on the direction of data transfer, it can be TOTAL READ, TOTAL WRITE or TOTAL RD-WR.
5	AVERAGE (Mbps, %)	Average network throughput during alternate (one after another) data transfer between the Probe and several Responders. Depending on the direction of data transfer, it can be AVERAGE READ, AVERAGE WRITE or AVERAGE RD-WR.
6	Responder Availability (%)	Availability of the UDP Responders. The Responder availability check can be disabled.
7	TCP Link Availability (%)	TCP channel availability. A TCP channel is considered unavailable when a TCP connection cannot be established although the UDP Responder is available, or when the connection between the Responder and the Probe breaks during data transmission.
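One plausible reading of how the derived metrics in this table relate to raw measurements is sketched below in Python. The function names are hypothetical and the formulas are my interpretation of the table, not the product's documented algorithm: TOTAL sums throughputs measured at the same time, AVERAGE takes the mean of throughputs measured one after another, and every "%" value is the absolute value divided by the configured reference.

```python
def relative_pct(measured_mbps, reference_mbps):
    """Relative throughput: measured value as a percentage of the set value."""
    return 100.0 * measured_mbps / reference_mbps


def total_throughput(simultaneous_mbps):
    """TOTAL: sum of throughputs measured while all Responders
    exchange data with the Probe at the same time."""
    return sum(simultaneous_mbps)


def average_throughput(alternate_mbps):
    """AVERAGE: mean of throughputs measured while the Responders
    exchange data with the Probe one after another."""
    return sum(alternate_mbps) / len(alternate_mbps)


def availability_pct(successful_checks, total_checks):
    """Availability: share of successful checks, in percent."""
    return 100.0 * successful_checks / total_checks
```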

To test the network only when internal users are not working on it, a program called the Traffic Controller is installed on the Probe. Its purpose is to determine continuously whether load testing (traffic generation) is allowed at the moment. The Test starts generating traffic only if the Traffic Controller says “You can”. If the Traffic Controller says “No” during generation, the generation stops immediately. The Traffic Controller relies on the standard functionality of the network monitoring system. In the simplest case, it uses the results of SNMP monitoring of the load on the channel-forming equipment.
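For the simplest case, the Traffic Controller's decision can be derived from two readings of an interface octet counter. The Python sketch below assumes a 32-bit counter such as IF-MIB ifInOctets and an interface speed such as ifSpeed; how the values are polled is left out, the function names are hypothetical, and the 3% threshold is the example value used later in the text.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps around at this value


def port_utilization_pct(octets_before, octets_after, interval_s, if_speed_bps):
    """Port utilization in percent between two counter readings.

    octets_before, octets_after: readings of a 32-bit octet counter.
    interval_s:   seconds between the two readings.
    if_speed_bps: nominal interface speed in bits per second.
    """
    # The modulo handles a single counter wrap between the readings.
    delta_octets = (octets_after - octets_before) % COUNTER32_MODULUS
    return 100.0 * (delta_octets * 8) / (interval_s * if_speed_bps)


def controller_signal(utilization_pct, threshold_pct=3.0):
    """Return the Traffic Controller's signal for the current utilization."""
    return "You can" if utilization_pct < threshold_pct else "No"
```

On fast interfaces a 32-bit counter can wrap more than once between polls, so a real implementation would poll often enough or use 64-bit counters.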

Figure 2. Managed Traffic Generation

Suppose the communication channel under test is connected to port 6 of the router (see Figure 2). The Probe is connected to port 1 and the users to port 4. Suppose the Test must transfer 10 MB of data from the Responder to the Probe every hour from 9:00 to 20:00.
The Test constantly monitors the Traffic Controller's signal and starts generating traffic only if the Traffic Controller says “You can”. This happens only if the utilization of port 4 is below a certain value, for example 3%. If the Traffic Controller says “No” at the moment when traffic generation is due to start, the Test waits for a certain time. If the “You can” signal (utilization dropping below 3%) does not arrive during that time, traffic generation is postponed until the next hour. Once it has started generating traffic, the Test keeps monitoring the Traffic Controller's signal, and if it sees “No” (port 4 utilization above 3%), it immediately stops generating, records the conflict and discards the results of this measurement.
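The behaviour described above (wait for permission, postpone on timeout, abort on conflict) can be sketched as a small Python function. All names and parameters are hypothetical: can_generate stands in for the Traffic Controller's signal and transfer_block for sending one block of test data.

```python
import time


def run_gated_test(can_generate, transfer_block, blocks,
                   max_wait_checks=5, wait_interval_s=0.0):
    """Run one scheduled load test under Traffic Controller supervision.

    can_generate:   callable returning True for "You can", False for "No".
    transfer_block: callable that transfers one block of test data.
    blocks:         number of blocks that make up the whole transfer.

    Returns ("done", n), ("postponed", 0) or ("conflict", n),
    where n is the number of blocks transferred.
    """
    # Phase 1: wait a limited time for permission to start.
    for _ in range(max_wait_checks):
        if can_generate():
            break
        time.sleep(wait_interval_s)
    else:
        return ("postponed", 0)  # try again at the next scheduled start

    # Phase 2: generate traffic, re-checking the signal before every block.
    sent = 0
    for _ in range(blocks):
        if not can_generate():
            # User traffic appeared: stop at once; the caller records the
            # conflict and discards the partial measurement.
            return ("conflict", sent)
        transfer_block()
        sent += 1
    return ("done", sent)
```

Checking the signal before every block keeps the reaction time short: the smaller the block, the sooner the Test yields the channel to users.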

The Traffic Controller works in the background, so it always knows whether traffic generation is allowed at a given moment (even when nobody is asking). The conditions for issuing the “You can” and “No” signals can vary and are not limited to port utilization. They may be, for example, the number of active connections to a database or the number of active users of a business application. The Traffic Controller can be turned on or off for each Test.

Two applications of Network Load Monitoring

Depending on whether the Traffic Controller is on or off, NMS solves two different tasks:

Network quality management, if the Traffic Controller is turned off.
Audit of the quality of services received, if the Traffic Controller is turned on.

Both tasks can be solved simultaneously, since there can be several Probes and several Tests can run at the same time on each Probe. In some Tests the Traffic Controller may be turned on, in others turned off.

Traffic Controller on: audit of the quality of services received

If the Traffic Controller is turned on, NMS is a tool for auditing the quality of the services received. Network throughput measured at times when users were not working unambiguously characterizes the quality of the provider's core network.

Knowing the nominal bandwidth of the channels at the physical level and comparing it with the throughput measured at the TCP level, you can easily determine which channels work well and which work badly. If Ethernet links are tested, their TCP throughput should be no more than 10-12% below their physical speed. Figure 3 shows an example of a report in MS Excel format, from which it is immediately apparent that the Moscow-Perm channel works worse than the others.

Figure 3. Communication bandwidth report (automatically generated)

This solution has another important advantage. It significantly increases the efficiency of diagnosing network failures by other methods, for example network sniffing. To determine the cause of a failure quickly, you need to see how things are and know how they should be. When you analyze a failure that occurred while user traffic was present, you see how things are, but you do not always know how they should be. When you analyze a failure that occurred while only test traffic was on the network, you not only see how things are but always know how they should be. This greatly simplifies diagnosis.

Traffic Controller off: network quality management

If the Traffic Controller is turned off, NMS is a network quality management system. The indicator of network quality is the metric "network throughput at the TCP level". Other metrics (utilization, number of errors, etc.) take on a supporting role: if throughput deteriorates, they help determine what the deterioration may be related to. The main advantage of this solution is the simplicity of managing the quality of a network whose equipment does not support IP SLA. If the network fails, its throughput will inevitably drop, so you will learn about the failure before users start contacting the Service Desk and can quickly determine the cause (if it is local).

The figure below shows a screenshot of the online network monitoring console. The top chart is the estimated network bandwidth. The three bottom charts are estimates of the quality of equipment operation, which can affect network bandwidth. All four scores are tied to a single timeline.

Figure 4. Screenshot of operational monitoring of the quality of the communication channel.

In this case, so as not to create heavy traffic and interfere with users, the Test should be configured for periodic transfer (for example, every 15 minutes) of a small data set, for example 2 MB. To do this, select the Network Monitoring mode in the Test parameters.
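A quick calculation shows how small the load from such a schedule is. The Python sketch below uses the example values from the text (2 MB every 15 minutes) and assumes 1 MB = 1,000,000 bytes; the 100 Mbps channel speed in the usage example is only an illustration.

```python
def average_background_load_mbps(block_mb, period_min):
    """Average load created by sending block_mb megabytes every period_min minutes."""
    bits_per_transfer = block_mb * 1_000_000 * 8  # assumes 1 MB = 10**6 bytes
    return bits_per_transfer / (period_min * 60) / 1_000_000


def share_of_channel_pct(block_mb, period_min, channel_mbps):
    """The same load expressed as a percentage of the channel speed."""
    return 100.0 * average_background_load_mbps(block_mb, period_min) / channel_mbps


if __name__ == "__main__":
    load = average_background_load_mbps(2, 15)
    share = share_of_channel_pct(2, 15, 100)
    print(f"average load: {load:.4f} Mbps, {share:.4f}% of a 100 Mbps channel")
```

Averaged over the period, the test traffic is a tiny fraction of the channel, although each individual transfer is a short burst at close to full speed.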

Conclusion

Today there is much talk about Service Level Management, but there are few cost-effective and efficient tools for continuously evaluating the quality of, for example, leased communication channels. Network Load Monitoring is one such tool. Admittedly, the results obtained with it will not have legal force in a legal dispute. But it would be right to use TCP throughput as a metric (along with availability) written into the SLA.

Source: https://habr.com/ru/post/211380/
