Systems that respond to user requests quickly (within 100 ms) feel smoother and more natural to the user than systems that take a long time to react. The growth of the Internet and the emergence of datacenter-scale computing systems (warehouse-scale computers) have made possible web services that respond quickly while processing terabyte-scale datasets spread across thousands of servers. For example, the Google search engine updates query results as the user types: the system predicts the most likely query from the typed prefix, then performs the search and displays the results within tens of milliseconds. Augmented reality devices that are just beginning to enter the market (Google Glass, for example) will need web services with even faster response times to guarantee natural interaction.
For the creators of such web services, shortening the tail of the latency distribution while the system grows in complexity and the number of users increases is a real challenge. Temporary latency spikes, which go unnoticed in moderately sized systems, can degrade the performance of a large system under heavy load. Just as fault-tolerant computing aims to build a reliable whole out of unreliable parts, large-scale web services should build a predictably responsive whole out of less predictable parts; we call such systems "latency tail-tolerant," or simply "tail-tolerant." Below we describe the most common causes of high latency in large-scale services and techniques that mitigate their impact on overall system performance. In many cases, tail-tolerant systems can make use of resources already deployed for fault tolerance, which keeps the additional overhead low.
Where does variability come from?
Response-time variability, which leads to a long tail in the latency distribution of individual system components, can arise for the following reasons:
Shared resources
A machine can be shared by several different applications contending for shared resources (CPU cores, processor caches, memory, network), and within a single application different requests can contend for resources as well;

Daemons
Background daemons may use only a limited amount of resources on average, but when scheduled they can generate latency spikes of several milliseconds;
Global shared resources
Applications running on different machines can contend for global resources (network switches, shared file systems);
Maintenance activities
Background activity (periodic log compaction in BigTable, garbage collection in garbage-collected languages, and so on) can cause periodic latency spikes;
Queueing
Multiple layers of queueing in intermediate servers and network switches amplify this variability.
Hardware also contributes to variability:
Power limits
Modern CPUs are designed to run above their average power envelope for short periods, and then mitigate overheating by throttling if such activity lasts too long;
Garbage collection
SSDs provide very fast random reads, but the need to periodically garbage-collect large numbers of data blocks can increase read latency by a factor of 100, even with a modest level of write activity;
Power management
Power-saving modes in many types of devices do save a considerable amount of energy, but they add latency whenever a device has to switch from an inactive to an active mode.
Component-level variability amplified by scale
A common technique for reducing latency in large-scale services is to parallelize intermediate operations across many machines, with each intermediate operation co-located with a portion of a large shared dataset. Parallelization works by fanning a request out from a root server to leaf servers and then merging the responses back up the request-distribution tree. These intermediate operations must complete within a strict time budget for the web service to feel responsive.

Latency variability of individual components is amplified at the level of the whole system. Consider, for example, a system in which each server responds within 10 ms in 99% of cases and takes more than 1 second in the remaining 1%. If a user request is handled by just one such server, only one response in 100 will be slow (more than a second). The figure below shows how service-level latency in this scenario depends on this (very low) fraction of slow responses. If a user request must collect responses from 100 such servers, then 63% of user requests (1 - 0.99^100) will take more than a second (the "x" curve on the chart). Even if a one-second response occurs only once per 10,000 requests, a service with 2,000 such servers will make almost every fifth user wait longer than a second (the "o" curve on the chart).
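To make the arithmetic concrete, here is a small Python sketch (not from the original article) that reproduces the two figures above: the probability that a fan-out request is slow when it must wait on all of N servers, each of which is independently slow with probability p.

```python
# Probability that a fan-out request is slow when it must wait for
# all of n_servers, each of which is slow independently with probability p_slow.
def prob_request_is_slow(n_servers: int, p_slow: float) -> float:
    return 1.0 - (1.0 - p_slow) ** n_servers

# 1 slow response per 100 requests, fan-out of 100 servers:
print(prob_request_is_slow(100, 0.01))     # ~0.634 -> 63% of requests are slow
# 1 slow response per 10,000 requests, fan-out of 2,000 servers:
print(prob_request_is_slow(2000, 0.0001))  # ~0.181 -> almost every fifth request
```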

Table 1 shows measurements from a Google service whose structure matches the scenario above: a root server distributes a request through intermediate servers to a very large number of leaf servers. The table shows how the latency distribution depends on the system's fan-out. The 99th percentile latency for a single random request, measured at the root server, is 10 ms. However, the 99th percentile latency for all requests to finish is 140 ms, and the 99th percentile latency for 95% of requests to finish is 70 ms, which means that waiting for the slowest 5% of requests accounts for half of the total 99th-percentile latency. Techniques that target these slow outliers can therefore have a large effect on overall system performance.

Careful resource allocation and thoughtful system design can reduce the impact of the main causes of latency variability at every level and in every component of the system.
Reducing latency variability in individual components
The following engineering solutions can help:
Differentiating service classes and higher-level queuing
Service classes can be used to give scheduling priority to requests that a user is actively waiting on. It also helps to keep low-level queues short so that higher-level policies take effect more quickly; for example, the storage servers in Google's cluster-level file system maintain, for some operations, their own queue of disk requests that bypasses the operating system's queue. These queues let a server handle high-priority interactive requests ahead of long-running batch operations over large chunks of data.
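As an illustration of the service-class idea, here is a minimal Python sketch of a per-server queue that always dispatches interactive requests ahead of batch work; the class names and structure are ours, not Google's actual implementation.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower value = higher service class

class ServiceClassQueue:
    """Tiny request queue that serves interactive requests before batch ones,
    while preserving FIFO order within each class."""
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per class

    def submit(self, service_class: int, request) -> None:
        heapq.heappush(self._heap, (service_class, next(self._seq), request))

    def next_request(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

q = ServiceClassQueue()
q.submit(BATCH, "scan chunk 17")
q.submit(INTERACTIVE, "read row for user query")
print(q.next_request())  # -> "read row for user query"
```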
Reducing head-of-line blocking
High-level services may handle requests of very different "weight." It is therefore sometimes useful to break a "heavy" request into a sequence of "lighter" requests with short execution times so that other requests can be interleaved between them; Google's search engine, for example, uses this technique to prevent a small number of computationally expensive queries from adding significant latency to a large number of cheap ones.
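A rough sketch of the same idea in Python (the quantum size and request format are illustrative): time-slice each request into small pieces and serve them round-robin, so that one expensive request cannot monopolize the server.

```python
from collections import deque

def interleave(requests, quantum=5):
    """Time-slice requests into `quantum`-sized pieces and execute them
    round-robin, so one expensive request cannot block all the others."""
    queue = deque((req_id, deque(units)) for req_id, units in requests)
    order = []
    while queue:
        req_id, units = queue.popleft()
        for _ in range(min(quantum, len(units))):
            order.append((req_id, units.popleft()))
        if units:                      # work remaining: go to the back of the line
            queue.append((req_id, units))
    return order

# One expensive query and two cheap ones; the cheap ones finish early
# instead of waiting behind the expensive one.
schedule = interleave([("expensive", range(20)), ("cheap-1", range(3)),
                       ("cheap-2", range(2))])
print([req for req, _ in schedule][:12])
```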
Managing background activities and synchronized disruption
Background tasks such as log compaction or garbage collection can create significant CPU, disk, or network load. Throttling them, breaking heavy operations into smaller pieces, and triggering them during periods of low overall load can, in combination, significantly reduce their effect on the latency of user requests. For large fan-out systems it is sometimes useful to synchronize background activity across machines: synchronization produces a brief burst of activity on all machines simultaneously, slowing only the requests that arrive during that burst. Without synchronization, background activity is always running somewhere, inflating the tail latency of all requests.
Caching is deliberately not mentioned above. While caching can be very useful, and for some systems is essential, it does not directly address tail latency, except in configurations where the entire working set fits in the cache.
Living with latency variability
The engineering approaches described above are essential for building high-performance interactive services, but the scale and complexity of modern web services make it impossible to eliminate latency variability entirely. Even if ideal behavior could be achieved in isolated systems, in real systems with shared resources performance fluctuations lie outside the application developer's control. Google therefore finds it very useful to develop tail-tolerant techniques that mask or work around particular kinds of latency deviations rather than trying to eliminate them all outright. We divide such techniques into two main classes. The first consists of within-request immediate-response techniques, which operate on a timescale of tens of milliseconds, before longer-acting mechanisms have a chance to take effect. The second consists of cross-request long-term adaptations, which operate across many requests, on a timescale of tens of seconds to minutes, and are intended to smooth out longer-term latency deviations.
Within-request short-term adaptations
Many web services replicate data to provide additional throughput and to maintain availability in the face of equipment failures. This approach is especially effective when most of the workload consists of reads of loosely consistent datasets; a spelling-correction service, for example, updates its model once a day while serving thousands of requests per second, and distributed file systems may keep several replicas of each data chunk, any of which can serve read requests. The techniques below show how replication can be used to reduce latency variability within a single high-level request:
Hedged requests
A simple way to curb latency variability is to issue the same request to multiple replicas and use whichever response arrives first. We call such requests "hedged requests" because the client first sends the request to the replica it considers most appropriate and only then, after a brief delay, sends secondary requests. The client simply discards any responses that arrive after the first one. Although a naive implementation of this technique adds unacceptable extra load, there are many variants that yield a substantial reduction in latency while increasing the load only slightly. For example, the client can defer sending secondary requests until the first request has been outstanding longer than the 95th-percentile latency for that class of requests. This increases the load by only about 5% while substantially shortening the latency tail. The technique works because the source of the delay is usually not the particular request itself but other forms of interference. In a Google benchmark that reads the values of 1,000 keys stored in a BigTable table distributed over 100 servers, sending hedged requests after a 10 ms delay reduces the 99.9th-percentile latency for retrieving all 1,000 keys from 1,800 ms to 74 ms while sending only about 2% more requests. The overhead of hedged requests can be reduced further by giving them lower priority than the primary requests.
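A minimal asyncio sketch of a hedged request, under our own simplifying assumptions: query_replica stands in for a real RPC to one replica, and the hedging delay stands in for the observed 95th-percentile latency.

```python
import asyncio

async def query_replica(replica: str, request: str) -> str:
    """Placeholder for a real RPC to one replica."""
    await asyncio.sleep(0.03)          # pretend network + service time
    return f"{replica}: result for {request!r}"

async def hedged_request(replicas, request, hedge_after_s=0.010):
    """Send to the preferred replica; after `hedge_after_s` (e.g. the observed
    95th-percentile latency) also send to a secondary replica, and return the
    first response that arrives. The slower request is cancelled."""
    primary = asyncio.create_task(query_replica(replicas[0], request))
    try:
        return await asyncio.wait_for(asyncio.shield(primary), hedge_after_s)
    except asyncio.TimeoutError:
        secondary = asyncio.create_task(query_replica(replicas[1], request))
        done, pending = await asyncio.wait(
            {primary, secondary}, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()              # discard the answer that loses the race
        return done.pop().result()

print(asyncio.run(hedged_request(["replica-a", "replica-b"], "key-42")))
```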
Tied requests
Hedged requests have a weakness: there is a window of time during which several servers may execute the same request, doing redundant work. That window can be narrowed by deferring the secondary request until the 95th-percentile expected latency has elapsed, but then only a small fraction of requests benefits. Distributing secondary requests more aggressively, while keeping resource consumption low, requires a way to cancel outstanding requests quickly.

A common source of latency variability is the queueing delay a request experiences before it begins executing. For many services, once a request has been dispatched from the queue and is running, the variability of its completion time drops considerably. Mitzenmacher showed that allowing a client to choose between two servers at enqueue time, based on their queue lengths, improves load balancing exponentially compared with a uniformly random choice. We go further: rather than choosing, the client enqueues the request on several servers at once, while allowing the servers to update the status of the corresponding copies on the other servers. We call requests in which servers perform such cross-server status updates "tied requests." In the simplest form, the client sends the request to two different servers, tagging each copy with the identity of the other server. When one copy begins executing, its server sends a cancellation message to the other server. The corresponding request on the other server, if it is still queued, can then be aborted immediately or have its priority sharply reduced.
There is a brief window, one average network message delay long, during which both servers may start executing the request while their cancellation messages are still in flight. This commonly happens when both queues are empty. It is therefore useful for the client to introduce a small delay, about twice the average network message delay (1 ms or less in modern data centers), between sending the first and the second request.
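The following Python sketch illustrates the tied-request idea under the assumptions above; the server names are invented, and the cancellation "message" is modeled as a direct update of the peer's state rather than a real network message.

```python
import asyncio

class Server:
    """Toy server with a run queue; executing a request cancels its twin."""
    def __init__(self, name: str, peers: dict):
        self.name, self.peers = name, peers
        self.queue = asyncio.Queue()
        self.cancelled = set()          # request ids whose twin already started

    async def enqueue(self, req_id: str, peer_name: str):
        await self.queue.put((req_id, peer_name))

    async def run(self):
        while True:
            req_id, peer_name = await self.queue.get()
            if req_id in self.cancelled:                 # twin ran first: drop it
                continue
            self.peers[peer_name].cancelled.add(req_id)  # "cancellation message"
            await asyncio.sleep(0.01)                    # pretend to do the work
            print(f"{self.name} served {req_id}")

async def main():
    servers = {}
    servers["s1"] = Server("s1", servers)
    servers["s2"] = Server("s2", servers)
    workers = [asyncio.create_task(s.run()) for s in servers.values()]

    # Tied request: enqueue on both servers, pausing ~2x the one-way network
    # delay between the two sends so both copies rarely start simultaneously.
    await servers["s1"].enqueue("req-7", peer_name="s2")
    await asyncio.sleep(0.002)
    await servers["s2"].enqueue("req-7", peer_name="s1")

    await asyncio.sleep(0.05)
    for w in workers:
        w.cancel()

asyncio.run(main())   # prints "s1 served req-7"; s2 drops its queued copy
```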
Google's implementation of this technique in the context of its cluster-level distributed file system is effective at reducing both the median and the tail of the latency distribution. Table 2 shows the time to service read requests in BigTable when the data is not cached in memory and must be read from disk; each piece of data is replicated three times on different servers. The table lists read latencies with and without tied requests for two scenarios. In the first, the cluster running the benchmark was otherwise mostly idle, so latency variability comes mainly from self-interference and regular cluster maintenance activity. In this case, sending tied requests 1 ms apart reduced the median latency by 16% and was especially effective at shortening the tail, cutting the 99.9th percentile by 40%. The second scenario is identical except that the cluster was also running a large sorting job competing with the benchmark for resources. Although overall latencies are somewhat higher because of the extra load, tied requests achieved essentially the same reductions as in the first scenario. Remarkably, the latencies with tied requests while the competing sort job runs are nearly identical to those in the mostly idle cluster without tied requests. Tied requests thus allow workloads to be consolidated onto a single cluster, which yields a substantial reduction in computing costs. In both scenarios in Table 2, the disk-utilization overhead of tied requests was less than 1%, indicating that the cancellation strategy is effective at eliminating redundant disk reads.

An alternative to hedged and tied requests is to probe the queues of remote servers first and then submit the request to the least-loaded one. This can be beneficial, but it is less effective than the techniques above for three reasons: load levels can change between the probe and the request; estimating a request's service time can be difficult given the variability of system and hardware behavior; and clients can create hotspots by all choosing the same (least-loaded) server at the same time.
The Distributed Shortest-Positioning Time First system (make of the name what you will; translator's note) uses yet another variation, in which the initial request is sent to a single server and forwarded to replicas only if that server does not have the response in its cache; it also uses cross-server cancellation messages.
Note that these techniques are not limited to simple replication and also apply to more complex coding schemes (such as Reed-Solomon codes), where the primary request is sent to the machine holding the desired data block and, if no response arrives after a brief wait, requests are sent to enough replicas to reconstruct the desired data from their responses.
Note also that the class of techniques described in this section is effective only when the phenomena causing latency variability do not tend to affect multiple replicas needed by a request at the same time. We expect such events to be largely uncorrelated in large-scale systems.
Cross-request long-term adaptations
We now turn to techniques that reduce latency variability caused by broader phenomena, such as service-time variation and load imbalance. Although many systems try to partition data so that all partitions carry equal load, statically assigning these partitions to machines is rarely sufficient, for two reasons. First, machine performance is neither uniform nor constant over time, for the reasons described above (global resource contention, throttling). Second, static assignment can itself cause load imbalance, since an individual data item can suddenly become popular, concentrating load on the machine that stores it.
Micro-partitions
To combat imbalance, Google's systems create many more partitions than there are machines and then dynamically assign and rebalance these partitions across the available machines. Balancing then amounts to transferring responsibility for a single partition from one machine to another. With an average of 20 partitions per machine, the system can shed or add load in roughly 5% increments, and in one-twentieth of the time that would be needed if partitions were statically mapped to machines one-to-one. The BigTable distributed storage system stores data in tablets, with each machine managing between 20 and 1,000 tablets at a time. Failure recovery is also sped up by micro-partitioning, since each of the many machines still in service simply picks up one of the failed machine's partitions. This use of micro-partitions is similar to the virtual servers described in the work of Stoica and to the virtual-processor-partitioning technique described by DeWitt.
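A toy sketch (invented names, not BigTable's actual mechanism) of why many small partitions make rebalancing cheap: moving responsibility for a single partition shifts only about one-twentieth of a machine's load.

```python
from collections import defaultdict

class PartitionMap:
    """Maps many small partitions onto few machines and rebalances by moving
    responsibility for single partitions, roughly 5% of a machine's load each."""
    def __init__(self, n_partitions: int, machines):
        self.assignment = {p: machines[p % len(machines)]
                           for p in range(n_partitions)}

    def load(self):
        per_machine = defaultdict(int)
        for machine in self.assignment.values():
            per_machine[machine] += 1
        return dict(per_machine)

    def move_one(self, src: str, dst: str):
        """Shift one partition's worth of load (~1/20 of a machine) off src."""
        for part, machine in self.assignment.items():
            if machine == src:
                self.assignment[part] = dst
                return part

pm = PartitionMap(n_partitions=60, machines=["m1", "m2", "m3"])  # 20 per machine
print(pm.load())           # {'m1': 20, 'm2': 20, 'm3': 20}
pm.move_one("m1", "m2")    # rebalance in a ~5% increment
print(pm.load())           # {'m1': 19, 'm2': 21, 'm3': 20}
```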
Selective replication
A refinement of micro-partitioning is the ability to detect, or even predict, which items are likely to cause load imbalance and then create additional replicas of those items. Load-balancing systems can use the extra replicas to spread the load generated by these popular items across several machines without having to move micro-partitions around. Google's search engine uses this approach, keeping several additional copies of popular and important documents in multiple micro-partitions. At various points in the search system's evolution it has also created additional micro-partitions of documents in certain languages and changed the replication rate of those micro-partitions over the course of the day, depending on the mix of incoming queries. That query mix can change abruptly if, for example, a data center in Asia loses power and most queries in the local languages are redirected to data centers in North America.
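One possible sketch of the replication decision (the capacity figure and partition names are made up): give each micro-partition enough extra replicas that no single copy has to absorb more load than one replica can comfortably serve.

```python
def extra_replicas(partition_load, capacity_per_replica=100):
    """Decide how many additional replicas each micro-partition needs so that no
    single copy absorbs more than `capacity_per_replica` load units
    (the numbers are illustrative)."""
    plan = {}
    for partition, load in partition_load.items():
        needed = -(-load // capacity_per_replica)      # ceiling division
        plan[partition] = max(needed - 1, 0)           # replicas beyond the base copy
    return plan

load = {"en-part-17": 90, "en-part-42": 950, "th-part-3": 40}
print(extra_replicas(load))   # {'en-part-17': 0, 'en-part-42': 9, 'th-part-3': 0}
```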
Latency-induced probation
By observing the distribution of response times of the different machines in the system, intermediate servers can detect situations in which overall performance improves if a particularly slow machine is excluded, that is, put on probation. The cause of the slowdown is often temporary and shows up only under load, such as a spike in CPU activity or interference from unrelated network traffic. The system nevertheless keeps sending shadow requests to the machine on probation, collecting response-time statistics so that it can be returned to service once the problem subsides. It is somewhat paradoxical that removing serving capacity from a live system under load can actually improve its latency.
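One way such probation logic might look in Python (the window size and slowness threshold are invented): track recent latencies per server, stop routing live traffic to outliers, and keep recording shadow-request latencies so that a machine can be reinstated.

```python
from collections import deque
from statistics import median

class ProbationTracker:
    """Tracks recent latencies per server; puts outliers on probation and
    brings them back once shadow probes show the slowdown has passed."""
    def __init__(self, window=100, slow_factor=3.0):
        self.window, self.slow_factor = window, slow_factor
        self.samples = {}          # server -> deque of recent latencies (seconds)
        self.on_probation = set()

    def record(self, server: str, latency_s: float):
        self.samples.setdefault(server, deque(maxlen=self.window)).append(latency_s)
        self._reevaluate(server)

    def _reevaluate(self, server: str):
        all_lat = [l for d in self.samples.values() for l in d]
        mine = self.samples[server]
        if len(mine) < 10 or not all_lat:
            return
        if median(mine) > self.slow_factor * median(all_lat):
            self.on_probation.add(server)       # stop sending live traffic
        else:
            self.on_probation.discard(server)   # slowdown passed: reinstate

    def eligible(self, servers):
        return [s for s in servers if s not in self.on_probation]
```

Live requests would go only to eligible() servers, while occasional shadow requests keep feeding record() for machines on probation.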
Large information retrieval systems
In large information retrieval (IR) systems, speed is more than a performance metric; it is a key quality metric, because returning good results quickly is better than returning the best possible results slowly. Two techniques apply to such systems:
Good enough
In a large IR system, once most of the leaf servers have responded to a query, it may be better to return slightly incomplete ("good enough") results to the user in exchange for lower latency. The chance that a particular leaf server, far down the query-distribution tree (see above), holds the best answer to the query may be less than one in a thousand, and those odds shrink further if the most important documents in the corpus are replicated on several leaf servers. Because waiting for the slowest servers can push the response time of the whole system to unacceptable values, Google's IR systems are tuned to respond with good-enough results once a sufficient fraction of the corpus has been searched, while ensuring that such situations remain rare. Good-enough results can also mean skipping nonessential subsystems to improve responsiveness; for example, the results of the ads system and the spelling-correction system are easily skipped for queries that are taking too long.
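A simplified asyncio sketch of a good-enough policy (the deadline, fraction, and simulated latencies are all illustrative): return once a sufficient fraction of leaf servers has answered and a deadline has passed, rather than waiting for the very slowest leaves.

```python
import asyncio
import random

async def query_leaf(leaf_id: int, query: str):
    await asyncio.sleep(random.uniform(0.005, 0.05))   # simulated leaf latency
    return f"leaf-{leaf_id} results"

async def good_enough_search(query: str, n_leaves=50,
                             min_fraction=0.95, deadline_s=0.03):
    """Return once `min_fraction` of leaves have responded and the deadline has
    passed, rather than waiting for the very slowest leaf servers."""
    tasks = [asyncio.create_task(query_leaf(i, query)) for i in range(n_leaves)]
    done, pending = await asyncio.wait(tasks, timeout=deadline_s)
    if len(done) < min_fraction * n_leaves:
        # Not enough answers yet: wait a bit longer for stragglers.
        more, pending = await asyncio.wait(pending, timeout=deadline_s)
        done |= more
    for t in pending:
        t.cancel()                                      # skip the stragglers
    return [t.result() for t in done]

results = asyncio.run(good_enough_search("tail latency"))
print(f"merged {len(results)} of 50 leaf responses")
```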
Canary requests
Another problem that can occur in heavily fanned-out systems is that a particular request exercises a code path that has never been tried before, causing crashes or extremely long delays on thousands of servers at once. To prevent such correlated failures, Google's IR systems use so-called "canary requests": instead of immediately sending a request to thousands of leaf servers, the root server first sends it to one or two leaf servers. The remaining servers are queried only if the root receives a successful response from the canary within a reasonable amount of time. If a server crashes or hangs while the canary request is outstanding, the system flags the request as potentially dangerous and does not send it to the remaining leaf servers. Canary requests thus give the back ends a measure of robustness against hard-to-predict programming errors as well as malicious denial-of-service attacks.
The canary phase adds only a little overall latency, because the system has to wait for just one or two servers, which produces far less variability than waiting for all servers in a large fan-out (compare the rows of Table 1). Despite this slight increase in latency, canary requests are used in virtually all of Google's large fan-out systems.
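An illustrative sketch of the canary flow (the function names and timeout are ours, not Google's): try the query on one or two leaves first, and fan out to the rest only if the canaries respond successfully in time.

```python
import asyncio

async def send_to_leaf(leaf: str, query: str) -> str:
    await asyncio.sleep(0.01)                 # placeholder for the real RPC
    return f"{leaf}: ok"

async def fan_out_with_canary(leaves, query, n_canaries=2, canary_timeout_s=0.1):
    """Send the query to a couple of canary leaves first; fan out to the rest
    only if the canaries respond successfully within the timeout."""
    canaries, rest = leaves[:n_canaries], leaves[n_canaries:]
    try:
        canary_results = await asyncio.wait_for(
            asyncio.gather(*(send_to_leaf(l, query) for l in canaries)),
            timeout=canary_timeout_s)
    except Exception as exc:
        # A crash or hang on the canaries: flag the query as dangerous
        # instead of unleashing it on thousands of servers.
        raise RuntimeError(f"query {query!r} failed canary check") from exc
    rest_results = await asyncio.gather(*(send_to_leaf(l, query) for l in rest))
    return canary_results + rest_results

leaves = [f"leaf-{i}" for i in range(10)]
print(len(asyncio.run(fan_out_with_canary(leaves, "untested query"))))  # 10
```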
Mutations
The techniques described so far are most relevant to operations that do not modify persistent state. Tolerating latency variability for operations that mutate state is somewhat easier, for several reasons. First, the scale of latency-critical mutations in these services is generally small. Second, updates can often be performed off the critical path, after responding to the user. Third, many services can be structured to tolerate inconsistent update models for (inherently more latency-tolerant) mutations. Finally, services that require consistent updates most commonly use quorum-based algorithms (such as Lamport's Paxos); because these algorithms must commit to only three to five replicas, they are tail-tolerant by design.
Hardware trends and their effects
Hardware-level variability is likely to grow in the future, as more aggressive power optimizations become available and as fabrication challenges at deep-submicron scales lead to device-level heterogeneity. Device heterogeneity, combined with ever-increasing system scale, will make tolerating variability through software techniques even more important over time. Fortunately, several emerging hardware trends will make latency-tolerating techniques more effective. For example, higher bisection bandwidth in data center networks and network-interface optimizations that reduce per-message overheads (such as remote direct memory access) lower the cost of tied requests, making it more likely that cancellation messages arrive in time to avoid redundant work. Lower per-message overheads also naturally allow finer-grained requests, which improves multiplexing and helps avoid head-of-line blocking.
Conclusion
Delivering the next generation of compute-intensive yet seamlessly interactive cloud services requires consistently responsive large-scale computing systems that are only now beginning to be built. As systems grow, simply stamping out every source of performance variability will not make them responsive. Fault-tolerance techniques were developed because guaranteeing fault-free operation becomes impossible beyond a certain level of system complexity; likewise, tail-tolerance techniques are being developed for large-scale services because eliminating every source of latency variability is equally impossible. Although techniques that target a specific source of latency variability can be quite useful, the most general tail-tolerant techniques reduce latency regardless of its cause. They let developers keep optimizing the system for the tasks it is designed for, while providing resilience in unforeseen situations. We have described a few of the tail-tolerant techniques that have proven useful in systems built at Google. Their importance will only grow over time, as modern Internet services demand ever more complex data-center-scale computing systems and as the underlying hardware exhibits greater performance variability.