About Amazon Clouds and MPLS Transport

During the development of a single project based on cloud services, Amazon had to face one problem that could not be described in open access - significant delays when accessing Amazon RDS. However, the knowledge of data transmission technologies that I received during my work at one industry research institute helped me to deal with it.

')
So, first the patient’s diagnosis. Developed an IT system with a client application. The system is a server that is hosted in the Amazon cloud as an EC2 service. This server interacts with the MySQL database, which is located in the same region (US-west) on the RDS (Relational Database Service). A mobile application interacts with the server, which is registered through the server and loads some data. During the operation of this application, communication errors are often observed, to which the application displays a popup with the words Connection Error. In this case, users complain about the slow operation of the application. This situation arises among users in the United States and in Russia.

The transfer of the entire system to independent hosting (the most common, cheap) led to a noticeably faster application operation, and the absence of Amazon RDS delays. Search in different forums did not give an answer about the reasons for such behavior.

To collect the temporal characteristics of the process of interaction between the application and the server and the database, a script was written that made a request to the MySQL database 10 times in a row. To eliminate the query caching situation, the SQL_NO_CACHE statement was added to the query. However, the database does not cache such a SELECT query. The query and measurement of its duration is carried out by the following function, which measures the duration of the transaction, taking into account the network delay.

function query_execute ($link) { $time = microtime (true); $res = mysqli_query ($link, «SELECT SQL_NO_CACHE ui.* FROM `user_item_id` ui INNER JOIN `users` u ON u.`user_id`=ui.`user_id` WHERE u.`username`='any@user.com';»); return array («rows» => mysqli_num_rows ($res), «time» => sprintf («%4f», microtime (true) — $time)); }

This script has been uploaded to Russian hosting, and launched. The result was the following (script - Russia, DBMS - Amazon RDS, California).

Start time to aws measure
rows: 10, duration: 2.659313
rows: 10, duration: 0.594934
rows: 10, duration: 0.595982
rows: 10, duration: 0.594558
rows: 10, duration: 0.397052
rows: 10, duration: 0.399988
rows: 10, duration: 0.399615
rows: 10, duration: 0.396856
rows: 10, duration: 0.399138
rows: 10, duration: 0.396113
- Average duration: 0.6833549

For comparison - the same thing, but from another hosting site - in France (script - France, DBMS - Amazon RDS, California).

Start time to aws measure
rows: 14, duration: 1.980444
rows: 14, duration: 0.472865
rows: 14, duration: 0.318233
rows: 14, duration: 0.417172
rows: 14, duration: 0.342588
rows: 14, duration: 0.303614
rows: 14, duration: 0.908241
rows: 14, duration: 1.809397
rows: 14, duration: 0.497458
rows: 14, duration: 0.316923
- Average duration: 0.7366935

This situation was extremely surprised, and forced to delve into the background information about caching SQL-queries. But all the surveys said one thing - the script was written correctly. The situation is interesting. It seems everything is clear, but the very small thing is incomprehensible: a) what is happening, and b) how to at least formulate a request for support?

Good. The next experiment is to run this script locally, on a server in the same Amazon data center in California. Surely the servers are physically nearby, and everything should happen very quickly. We try (script and DBMS - Amazon AWS, California).

Start time to aws measure
rows: 10, duration: 0.024818
rows: 10, duration: 0.009796
rows: 10, duration: 0.006747
rows: 10, duration: 0.005163
rows: 10, duration: 0.007998
rows: 10, duration: 0.006088
rows: 10, duration: 0.009614
rows: 10, duration: 0.007938
rows: 10, duration: 0.008052
rows: 10, duration: 0.007804
- Average duration: 0.0094018

Delays are greatly reduced, but in general the picture is the same - the first request is executed several times longer than the next. And I think that this is the cause of all the problems with the final application. After all, the application communicates with the server information "in one request", which just fall on this very long operation. Accordingly, it is this operation that determines all the "stagnation" of the application.

In order to check what is still getting in the way, the database was migrated from Amazon RDS to a separate hosting (the cheapest). And here a small surprise was waiting - an equal speed of the first and subsequent transactions (the script is Russia, the DBMS is inexpensive hosting in the USA).

Start time to linode measure
rows: 10, duration: 0.018506
rows: 10, duration: 0.017285
rows: 10, duration: 0.011917
rows: 10, duration: 0.011928
rows: 10, duration: 0.027923
rows: 10, duration: 0.011141
rows: 10, duration: 0.072708
rows: 10, duration: 0.011934
rows: 10, duration: 0.007816
rows: 10, duration: 0.008045
- Average duration: 0.0199203

A little distracted - the best situation was when a server itself was installed on this weak, in general, hosting (the script and the DBMS are the same inexpensive hosting in the US).

Start time to linode measure
rows: 10, duration: 0.008159
rows: 10, duration: 0.000344
rows: 10, duration: 0.000317
rows: 10, duration: 0.000309
rows: 10, duration: 0.000269
rows: 10, duration: 0.000282
rows: 10, duration: 0.000260
rows: 10, duration: 0.000263
rows: 10, duration: 0.000303
rows: 10, duration: 0.000297
- Average duration: 0.0010803

Here you can notice a similar situation, but on units of milliseconds, the delays of the OS, and not of the network, already take effect. Therefore, do not pay attention to this.

But back to our situation. For the purity of the experiment, we made another test - we launched a script from this cheap hosting on Amazon RDS. The result is again a long first transaction (the script is hosted by the USA, the DBMS is Amazon RDS, California).

Start time to aws measure
rows: 10, duration: 0.098134
rows: 10, duration: 0.016168
rows: 10, duration: 0.011697
rows: 10, duration: 0.007868
rows: 10, duration: 0.008148
rows: 10, duration: 0.010468
rows: 10, duration: 0.033403
rows: 10, duration: 0.011947
rows: 10, duration: 0.012217
rows: 10, duration: 0.008185
- Average duration: 0.0218235

I did not manage to find any analogs to such oddities, nor to the developers on the Internet. However, in the meantime, looking at these figures, I began to get a vague suspicion that the situation was caused not by the settings of IT systems, but by the peculiarity of data transfer in the IP transport network that connects Amazon data centers.

The following considerations are only my assumptions, in which the timings shown above quite logically fall. However, I don’t have full confidence in this, and maybe the reasons for the big delay are different, and they lie on the water. It would be interesting to listen to alternative opinions.

So, in the telecommunications world, the basis of all communication networks is the transport layer formed by the IP protocol. In general, two methods, routing and switching, are used to transmit IP packets through multiple intermediate routers.

What is routing? Suppose there are four network nodes. From the first to the second, an IP packet was transmitted. How to determine where to send it - to node 3 or 4? Node 2 looks inside its routing table, searches for the destination address (Ethernet port), and forwards the packet in the figure to node 4 (with the address 48.136.23.03).

The principle of IP packet routing

The disadvantages of this method are low packet routing speed for the following reasons: a) millions of IP packets must be parsed and the address extracted from them, b) each packet must be run through the database (routing table) where the IP address corresponds to the Ethernet network interface number.

In large networks, the class of Tier 1 operators requires just incredible processor resources, and still each router introduces a delay.

To avoid this, MPLS (Multiprotocol Label Switching) technology was invented. In simplicity - packet commutation. How does she work? Suppose the transmission of the first packet has begun. On node 1, on top of the IP address, a label has been added to the packet. A label is a 4-byte integer that is easily and quickly processed by processors. Such a packet got to node 2. Node 2 determines from the routing table in the manner described above that it should be sent to node 3. At the same time, a table is created at node 2, in which the corresponding label is assigned to the Eternet interface, looking towards node 4.

IP packet switching principle

Next, node 1 sends the second IP packet, and adds to it the same label that was installed for the first IP packet. When such a packet arrives at node 2, the node will no longer process the IP address and look for it in the routing table. He will immediately remove the label and check if there is an output interface for it.

In fact, by building such switching tables, a “tunnel” is built up at each intermediate node through which further data transfer takes place, without using routing tables.

The advantages of this method are a significant reduction in network latency.

And now the main thing ...

Minus - the first packets are transmitted by the routing method, building up a “tunnel” of tags, and only the next packets are transmitted quickly. It is this picture that we observe above. It is unlikely that a cheap hoster has MPLS equipment. As a result, we see the same duration of the first and subsequent requests. But in Amazon, it seems, the work of the MPLS protocol is just manifested - sending the first request takes 3-5 times longer than subsequent ones.

I think this is precisely the reason for the unstable and long interaction of the application with the database. Although, if the exchange with the database was more intense, then most likely, after 3-4 queries, Amazon would have overtaken cheap hosting. Paradoxically, the technology, designed to significantly speed up the transfer of data, in my particular case leads to unstable operation of the system as a whole.

If we take this version as a working one, the question arises - what about the others? After all, large clients, social networks, do not experience such problems on Amazon. A fair question, the exact answer to which I do not have. However, there are some considerations.

The application I am writing about is now version 1.0. Most likely, and even surely, we now have a non-optimal database query structure. In the future, it will be optimized, and due to this we will get a gain in speed. But now, it seems, one has superimposed on the other.
It is still strange to assume that large social networks are based on Amazon. Most likely, the architecture of such systems is geographically distributed, and information can only asynchronously flow to Amazon data centers. At least, tracing from Moscow to Facebook shows the end point of the Irish server, Twitter - some kind of its own platform, Pinterest - on the Telia network and so on.
Most likely, the static routes are pre-installed between the large nodes of social networks.

In any case, I have no other explanation for the current situation.

And there is a solution. Three weeks ago, the system was migrated to Canadian dedicated servers hosting provider OVH. The problem was solved with this: everything worked much faster and the Connection Error was never seen again.

Source: https://habr.com/ru/post/175023/

All Articles

About Amazon Clouds and MPLS Transport

More articles: