Discussions of scalable systems and NoSQL solutions have become more frequent on Habr lately. This article, written by Amazon's CTO, is in my opinion one of the best introductions to the topic: it shows what problems arise when building scalable systems, what to consider when choosing a toolkit, what the Cassandra authors mean when they talk about Cassandra providing AP and HBase providing CP, and much more.
Introduction
Amazon's cloud computing is built on infrastructure services such as Amazon S3, SimpleDB, and EC2, which make it possible to build scalable computing platforms and applications. The requirements for these services are very strict: they must provide excellent security, scalability, availability, performance, and efficient use of resources, and all of this while serving millions of customers around the world.
Behind these services are huge distributed systems operating on a global scale. This creates additional challenges, because when you process trillions and trillions of requests, events that normally occur with very low probability are now guaranteed to happen, and this must be taken into account when designing and developing the system. We use replication everywhere to provide the required performance and high availability. Although replication brings us closer to these goals, it cannot achieve them completely transparently: there are a number of nuances that users of replicated services will run into.
One such nuance is the type of data consistency the system provides. In particular, many common distributed systems use the eventual consistency model for data replication. When building large scalable systems at Amazon, we used a set of guiding principles and abstractions related to data replication in such systems, focusing on the trade-off between high availability and data consistency. In this article I will look at some of the background that has shaped our approach to building reliable distributed systems that operate on a global scale.
Historical perspective
In an ideal world there would be only one consistency model: once data has been updated, all observers see the update. The first difficulties in achieving this arose in DBMSs in the late 1970s. The best work on the topic is Bruce Lindsay's Notes on Distributed Databases, which lays out the basic principles of database replication and discusses a number of techniques for achieving consistency. Many of these techniques try to achieve distribution transparency, so that from the user's point of view the system looks like a single system rather than many connected ones. Many systems of that time followed the approach that it is better for the whole system to fail than to break transparency.
In the mid-90s, with the growth of Internet systems, this practice was revisited. At that time people began to lean toward the view that availability is the most important property, but they could not decide what to sacrifice in exchange. Eric Brewer, a professor at Berkeley and at that time the head of Inktomi (the company behind a successful search engine that was later acquired by Yahoo), brought all these trade-offs together in a keynote at the PODC conference in 2000. He presented the CAP theorem, which states that of the three properties of systems with distributed data - data consistency, system availability when a node fails, and tolerance to loss of communication between network segments (partition tolerance; hereinafter a network partition means loss of communication between parts of a distributed system, where each part is operational on its own but the parts cannot "see" each other - translator's note) - only two can be achieved at the same time. A more formal proof was published in a paper by Seth Gilbert and Nancy Lynch in 2002.
A system that does not provide partition tolerance can achieve data consistency and availability, often by using a transaction protocol. In that case certain situations are treated as a system failure, for example when a client cannot see some of the nodes. It should be noted that in large scalable systems partitions do occur, so data consistency and availability are not achievable at the same time. This means we have two choices: weaken consistency, which yields a system that remains highly available under network partitions, or focus on consistency, which means the system will be unavailable in certain situations.
Both options require the client developer's attention to what the system is capable of. If the system focuses on consistency, the developer has to keep in mind that the system may be unavailable, for example for writes, and handle that situation so that data is not lost. If the system focuses on availability, it can always accept a write, but in some cases a read will not reflect the result of a recently completed write. The developer has to decide whether the client really needs the very latest changes; in many cases slightly stale data is acceptable.
Strictly speaking, the consistency in ACID transaction systems is a somewhat different kind of consistency guarantee. In ACID, consistency means a guarantee that on completion of a transaction the database is in a consistent state; for example, when transferring money between accounts, the total amount of money across the accounts must not change. In ACID systems this kind of consistency is usually ensured by transactions and by the database's integrity constraints.
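To make this concrete, here is a minimal sketch (my own illustration, not code from the article) of ACID-style consistency using Python's built-in sqlite3 module: the transfer either commits as a whole or rolls back, so the invariant that the total amount of money is unchanged always holds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # starts a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
            # a check standing in for a database integrity constraint
            balance = conn.execute("SELECT balance FROM accounts WHERE id = ?", (src,)).fetchone()[0]
            if balance < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the transaction was rolled back; both balances are unchanged

transfer(conn, "alice", "bob", 30)
print(conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0])  # still 150
```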
Consistency - Client and Server
There are two views on consistency. One is from the point of view of the developer or client: how they observe data updates. The other is from the server side: how updates propagate through the system and what the system can guarantee with respect to them.
Client-side consistency
From the client’s point of view, we have the following components:
A storage system. For now we treat it as a black box, but we assume that inside it is something highly scalable and distributed, built to provide durability and availability.
Process A. A process that writes to the storage system and reads from it.
Processes B and C. Two processes, independent of process A, that also write to and read from the storage system. It does not matter whether they are separate processes or threads within a single process; what matters is that they are independent and must communicate to share information.
Client-side consistency defines how and when observers (in our case processes A, B, and C) see updates made to a data object in the storage system. In the examples below, which illustrate the different types of consistency, process A has made an update to the data object.
Strong consistency. After the update completes, any subsequent access to the data (by process A, B, or C) will return the updated value.
Weak consistency. The system does not guarantee that subsequent accesses to the data will return the updated value; a number of conditions must be met before it is returned. The period between the update and the moment when every observer is guaranteed to see the updated value is called the inconsistency window.
Eventual consistency. A special case of weak consistency. The system guarantees that, if no new updates are made to the data, eventually all accesses will return the last updated value. In the absence of failures, the maximum size of the inconsistency window can be determined from factors such as communication delays, system load, and the number of replicas in the replication scheme. The most popular system implementing eventual consistency is DNS: an updated record is propagated according to the configuration settings and caching intervals, and eventually all clients will see the update.
Eventual consistency has many variations that are important to consider:
Causal consistency. If process A has informed process B that it has updated a data item, then subsequent accesses by process B to that item will return the updated value, and a write by B is guaranteed to supersede the earlier one. Accesses by process C, which has no causal relationship with process A, follow the usual rules of eventual consistency.
Read-your-writes consistency. An important model in which process A, after updating a data item, always sees the updated value on subsequent accesses and never sees an older one. This is a special case of causal consistency.
Session consistency. A practical version of the previous model, in which a process accesses the storage system within a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of some failure, a new session must be created, and the guarantees do not carry over between sessions.
Monotonic read consistency. If a process has seen a particular value of a data item, then subsequent accesses to that item will never return an older value.
Monotonic write consistency. In this variant, the system guarantees that writes by the same process are applied in order. Systems that do not provide this level of consistency are difficult to program against.
Some of these variations can be combined; for example, monotonic reads can be combined with session consistency. From a practical point of view, monotonic reads and read-your-writes are the most desirable properties in an eventually consistent system, but they are not always required. This combination makes application development easier while allowing the storage system to relax consistency and provide high availability. A small simulation of this behavior is sketched below.
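Here is a toy sketch (my own illustration, not code from the article) of an eventually consistent store: a write lands on one replica and the others are brought up to date lazily, so a read from another replica can return stale data until propagation completes; the gap between the write and the background propagation is the inconsistency window.

```python
import random

class EventuallyConsistentStore:
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        # apply to one replica immediately (W = 1), queue the rest for later
        self.replicas[0][key] = value
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))

    def read(self, key):
        # read from a random replica (R = 1): the answer may be stale
        return random.choice(self.replicas).get(key)

    def propagate(self):
        # background anti-entropy: deliver the queued updates
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("x", "new")
print(store.read("x"))   # may print "new" or None: we are inside the inconsistency window
store.propagate()
print(store.read("x"))   # always "new" once all replicas have been updated
```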
Eventual consistency is not some esoteric property of extreme distributed systems. Many modern relational DBMSs that provide primary-backup reliability implement their replication mechanism in two modes, synchronous and asynchronous. In synchronous mode the replica update is part of the transaction. In asynchronous mode the update reaches the backup with some delay, often through log shipping. In the latter case, if the primary fails before the log is delivered, reads from the backup server that is promoted in place of the primary will return stale data. Also, to provide better read scalability, relational DBMSs began to offer reads from the backup server, which is a classic case of eventual consistency guarantees, where the size of the inconsistency window depends on how often the log is shipped.
Server-side consistency
On the server side we need to look more closely at how updates flow through the system in order to understand what a given approach gives the developer. Let us first agree on a few definitions:
N = number of nodes storing copies (replicas) of data;
W = the number of replicas that must confirm the receipt of the update before the update is considered complete;
R = the number of replicas that are contacted when processing a read request.
If W + R > N, then the set of replicas involved in a write and the set involved in a read always intersect, which guarantees strong consistency. In the synchronous primary-backup replication scheme of a relational DBMS, N = 2, W = 2, and R = 1: no matter which replica the client reads from, it will always get current data. With asynchronous replication and reads from the backup server, N = 2, W = 1, and R = 1; in this case R + W = N and data consistency cannot be guaranteed.
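The rule W + R > N can be checked mechanically. Below is a small sketch (mine, not from the article) that classifies a few of the configurations mentioned in this and the following paragraphs.

```python
def consistency_of(n, w, r):
    """Classify an (N, W, R) configuration by the quorum-overlap rule."""
    if w + r > n:
        return "strong (read and write sets always intersect)"
    return "weak/eventual (a read may miss every replica that took the write)"

configs = {
    "sync primary-backup":   (2, 2, 1),
    "async primary-backup":  (2, 1, 1),
    "fault-tolerant quorum": (3, 2, 2),
    "read-optimized":        (10, 10, 1),
    "write-optimized":       (10, 1, 10),
}
for name, (n, w, r) in configs.items():
    print(f"{name:22s} N={n:2d} W={w:2d} R={r:2d} -> {consistency_of(n, w, r)}")
```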
The problem with these configurations is that when a failure makes it impossible to write to W nodes, the write operation has to return an error, reporting that the system is unavailable. For example, with N = 3, W = 3, and only two nodes available, the system must fail the write with an error.
In distributed stores that must provide high performance and high availability, the number of replicas is generally greater than two. A system that focuses only on fault tolerance often uses N = 3 (with W = 2 and R = 2). Systems that must serve a very high read load often use more replicas than fault tolerance alone requires; N can be tens or even hundreds of nodes, with R = 1, so that a query to a single node returns a result. Systems focused on data consistency set W = N for updates, which can reduce the probability that a write succeeds. A common configuration for systems that need fault tolerance but not strong consistency is to write with W = 1, to get the minimum update latency, and then bring the rest of the replicas up to date with a lazy (epidemic) technique.
How N, W, and R are configured depends on the use cases and on the performance requirements of different workloads. With R = 1 and W = N we optimize for fast reads, and with W = 1 and R = N we optimize for very fast writes. Of course, in the latter case there is no guarantee that the data survives failures, and with W < (N + 1) / 2 there is a risk of conflicting writes when the write sets of different operations do not overlap.
Weak (eventual) consistency arises when W + R <= N, that is, when there is a possibility that the read set and the write set do not intersect. If this is a deliberate choice rather than a consequence of fault-tolerance requirements, setting R to anything other than 1 makes little sense. Weak consistency arises in two main cases: the first is replication to many nodes for read scalability, as noted above, and the second involves more complicated data access. In a simple key-value system it is fairly easy to compare versions to determine which value was written last, but in systems that return sets of objects it is harder to decide which set should be considered the latest. Most systems in which W < N have a mechanism that brings the remaining nodes (those outside the W set) up to date in the background. The period before all nodes have been updated is the inconsistency window discussed earlier. If W + R <= N, the system may read data from nodes that have not yet received the update.
Whether the read-your-writes, session, and monotonic consistency models can be implemented generally depends on the client's "stickiness" to the server that runs the distributed protocol on its behalf. If the client reaches the same server every time, read-your-writes and monotonic reads are fairly easy to guarantee. This makes load balancing and fault tolerance somewhat harder to manage, but it is a simple solution. Sticky sessions make this binding explicit and give the client something it can rely on.
Sometimes read-your-writes and monotonic reads are implemented on the client side: by attaching versions to writes, the client can discard any value whose version is lower than the last version it has seen.
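Here is a minimal sketch (my own illustration, with a stub store standing in for a real replicated system) of that client-side technique: the client remembers the highest version it has observed for each key and ignores anything older.

```python
class SimpleVersionedStore:
    """Stand-in for a replicated store; versions are monotonically increasing ints."""
    def __init__(self):
        self.data = {}       # key -> (version, value)
        self.counter = 0

    def write(self, key, value):
        self.counter += 1
        self.data[key] = (self.counter, value)
        return self.counter

    def read(self, key):
        # a real replicated store might return a stale (older-version) pair here
        return self.data.get(key, (0, None))


class VersionedClient:
    def __init__(self, store):
        self.store = store
        self.last_seen = {}  # key -> highest version this client has observed

    def write(self, key, value):
        version = self.store.write(key, value)
        self.last_seen[key] = max(self.last_seen.get(key, 0), version)

    def read(self, key):
        version, value = self.store.read(key)
        if version < self.last_seen.get(key, 0):
            return None      # stale replica answer: older than something we already saw
        self.last_seen[key] = version
        return value


client = VersionedClient(SimpleVersionedStore())
client.write("profile", "v2")
print(client.read("profile"))  # "v2": never older than the client's own write
```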
A network partition occurs when some nodes of the system cannot reach other nodes, but both sets of nodes remain reachable by clients. With the classic quorum mechanism, the partition that contains W nodes can continue to function and accept updates, while the other partition becomes unavailable. The same reasoning applies to reads: since the read set and the write set intersect by definition, the smaller set of nodes becomes unavailable. Partitions are infrequent, but they do occur, both between data centers and inside a data center.
In some applications the unavailability of part of the nodes is unacceptable, and it is important that a client talking to either partition can keep working normally. In that case both partitions assign a new set of storage nodes to the data, and a merge operation reconciles them when the connection between the partitions is restored.
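The classic quorum behavior under a partition, described two paragraphs above, can be summarized in a small sketch (mine, not from the article): the side that still holds at least W nodes keeps accepting writes, the side with at least R nodes keeps serving reads, and the rest must report unavailability.

```python
def partition_capabilities(side_sizes, w, r):
    """For each side of a partition, report whether it can still write or read."""
    return {
        name: {"can_write": size >= w, "can_read": size >= r}
        for name, size in side_sizes.items()
    }

# N = 5 nodes split 3 / 2 by a partition, with W = 3 and R = 3 (W + R > N)
print(partition_capabilities({"side_a": 3, "side_b": 2}, w=3, r=3))
# side_a keeps working; side_b must refuse both reads and writes
```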
Amazon's Dynamo
Amazon's Dynamo is a system that puts all of the parameters discussed above under the control of the application architecture. It is a key-value storage system used by many services of the e-commerce platform and by Amazon Web Services. One of the design goals of Dynamo is to let the owner of a service that uses Dynamo storage instances, often spread across multiple data centers, decide on the trade-off between consistency, durability, availability, and performance.
Summarizing the above
Data inconsistency in large, highly scalable, reliable distributed systems has to be tolerated for two reasons: to improve read and write performance under many concurrent requests, and to handle partitions in which the alternative would be to declare part of the system unavailable even though all nodes are up.
Whether inconsistency is acceptable depends on the client application. In any case, the developer must not forget what kind of consistency the storage system provides and must take it into account when building the application. There are a number of practical refinements of the eventual consistency model, such as session consistency and monotonic reads, that make life easier for developers. Often an application can use eventual consistency without any problems. A popular specific case is a website, where we can speak of consistency as perceived by the user: here the inconsistency window should be smaller than the time we expect the user to take to move to the next page, which allows the update to spread through the system before the next read request.
The purpose of this article is to raise awareness of the complexity of systems that need to operate at global scale and that require careful tuning to provide the performance, availability, and durability their applications require. One of the knobs that designers of distributed systems have to work with is the size of the inconsistency window, during which the system's clients may be exposed to the realities of engineering at large scale.
Comments and suggestions for improving the translation are welcome.