
The book “High-loaded applications. Programming, scaling, support”

In this book you will find the key principles, algorithms, and trade-offs that cannot be avoided when building high-load systems for working with data. The material is illustrated by the internal structure of popular software packages and frameworks. The book has three main parts, devoted primarily to the theoretical aspects of working with distributed systems and databases. The reader is expected to have a basic knowledge of SQL and database principles.

This post covers an excerpt from the section "Knowledge, Truth, and Falsehood."

If you have no experience with distributed systems, the consequences of these problems can be deeply disorienting. A node in the network knows nothing for certain: it can only make guesses based on the messages it receives (or does not receive) over the network. A node can learn the state of another node (what data is stored on it, whether it is working correctly) only by exchanging messages with it. If the remote node does not respond, there is no way to find out its state, because it is impossible to distinguish network problems from problems on the node itself.

Discussions of such systems border on philosophy: what is true and what is false in our system? Can we rely on that knowledge if the mechanisms of perception and measurement are unreliable? Should software systems obey the laws of the physical world, such as cause and effect?

Fortunately, we do not need to go as far as the meaning of life. For a distributed system, we can state the assumptions about its behavior (the system model) and design the system so that it satisfies those assumptions. Algorithms can be shown to work correctly within a particular system model. This means that reliability is achievable even if the underlying system model provides very few guarantees.

However, although it is possible to make a system behave correctly on top of an unreliable system model, doing so is not easy. In the rest of this chapter we discuss the notions of knowledge and truth in distributed systems, which will help us reason about the assumptions and guarantees we need. In Chapter 9 we move on to a number of examples of distributed systems and algorithms that provide specific guarantees under specific assumptions.

Truth is determined by the majority


Imagine a network with an asymmetric fault: a node receives all messages sent to it, but all of its outgoing messages are delayed or dropped entirely. Although it works fine and receives requests from other nodes, those other nodes never receive its responses. As a result, after some timeout, the other nodes declare it dead. The situation turns into something of a nightmare: the half-disconnected node is dragged to the graveyard, kicking and screaming "I'm alive!", but since nobody can hear its screams, the funeral procession marches on.

In a slightly less nightmarish scenario, the half-disconnected node may notice that its messages are not being acknowledged by other nodes and realize that the network must be failing. Nevertheless, the other nodes wrongly declare the half-disconnected node dead, and it can do nothing about it.

As a third scenario, imagine a node that is suspended by a long, comprehensive garbage collection. All of the node's threads are preempted by the garbage collector and paused for a minute, so requests are not processed and responses are not sent. The other nodes wait, retry their requests, lose patience, eventually declare the node dead, and load it onto the hearse. Finally, garbage collection finishes and the node's threads resume as if nothing had happened. The other nodes are surprised to see the supposedly "dead" node rise again in full health and start cheerfully chatting with bystanders. At first, this node does not even realize that an entire minute has passed and that it was declared "dead": from its point of view, hardly any time has passed since the last exchange of messages with the other nodes.

The moral of these stories: a node cannot necessarily trust its own judgment of the situation. A distributed system cannot rely entirely on a single node, because a node may fail at any time, potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms rely on a quorum, that is, on decisions made by a majority of nodes (see the "Quorum reads and writes" subsection of "Writing to the database when one of the nodes fails" in Section 5.4): making a decision requires a certain minimum number of "votes" from several nodes. This condition reduces the dependence on any one particular node.

This includes decisions about declaring nodes dead. If a quorum of nodes declares another node dead, then it is considered dead, even if it is in fact alive and well. Individual nodes must abide by quorum decisions.

Typically, a quorum is an absolute majority, more than half of the nodes (although other kinds of quorums exist). A majority quorum allows the system to keep working when individual nodes fail (with three nodes, one failure can be tolerated; with five, two failures). This method is safe because the system can contain only one majority: two majorities with conflicting decisions cannot exist at the same time. We discuss the use of quorums in more detail when we get to consensus algorithms in Chapter 9.
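As a rough sketch (not code from the book), the majority rule can be expressed in a few lines of Python; the node names and the vote structure below are purely illustrative:

```python
def majority_declares_dead(votes, total_nodes):
    """votes maps a voter node id to True if that voter considers the target node dead."""
    dead_votes = sum(1 for is_dead in votes.values() if is_dead)
    # A strict majority is required: more than half of *all* nodes, not just of the voters.
    return dead_votes > total_nodes // 2

# With five nodes, three "dead" votes form a quorum; two would not.
votes = {"n1": True, "n2": True, "n3": True, "n4": False}  # n5 is unreachable and does not vote
print(majority_declares_dead(votes, total_nodes=5))  # True, since 3 > 5 // 2
```

Because two disjoint strict majorities of the same set of nodes cannot exist, two such quorums can never reach conflicting decisions.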

The leader and the lock


Often, the system requires only one instance of something. For example:

- Only one node is allowed to be the leader of a database partition, to avoid split brain.
- Only one transaction or client is allowed to hold the lock on a particular resource or object, to prevent several of them from writing to it concurrently and corrupting it.
- Only one user is allowed to register a particular username, because a username must uniquely identify a user.

Implementing this in a distributed system requires care: even if a node is convinced that it is "the chosen one" (the leader of the partition, the holder of the lock, the request handler that successfully grabbed the username), that by no means guarantees that a quorum of the other nodes agrees! A node may have been the leader in the past, but if the other nodes have since declared it dead (for example, due to a network interruption or a garbage collection pause), it may well have been demoted and another leader elected.

If a node continues to act as "the chosen one" even though a quorum of the other nodes has declared it dead, it can cause problems in a system that is not carefully designed. Such a node may send messages to other nodes in its self-appointed role, and if other nodes believe it, the system as a whole may start doing something incorrect.

For example, Fig. 8.4 shows a data corruption bug caused by an incorrect lock implementation (and this is not a theoretical bug: HBase once had this problem). Suppose we want to ensure that only one client at a time can access a file in a storage service, because if several clients try to write to it at once, the file will be corrupted. We try to achieve this by requiring a client to obtain a lease from the lock service before accessing the file.

The problem described here is an example of what we discussed in the "Pauses during the execution of processes" subsection of Section 8.3: if the client holding the lease is paused for too long, the lease expires. After that, another client can obtain a lease for the same file and start writing data to it. When the paused client resumes, it believes (incorrectly) that it still holds a valid lease and proceeds to write to the file as well. As a result, the writes of the two clients are interleaved and the file is corrupted.
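The unsafe pattern can be sketched roughly as follows (hypothetical names such as lease, storage, and requests; this is not the book's code): the client checks its lease locally and then writes, but nothing stops a long pause from occurring between the check and the write.

```python
import time

def process_requests(requests, lease, storage, filename):
    """Unsafe: the client checks its lease locally, then writes to shared storage."""
    for request in requests:
        # Renew the lease if fewer than 10 seconds of it remain.
        if lease.expiry_time - time.monotonic() < 10:
            lease = lease.renew()
        if lease.is_valid():
            # DANGER: the process may be paused here (GC, swapping, SIGSTOP) for
            # longer than the lease duration. By the time the write reaches the
            # storage service, the lease may have expired and another client may
            # already be writing to the same file, so the writes get interleaved.
            storage.write(filename, request)
```

Even checking the lease immediately before the write does not help, because the pause (or a network delay) can happen after the check.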

[Figure 8.4. Incorrect implementation of a distributed lock: client 1 believes its lease is still valid, even though it has expired, and corrupts the file in storage]

Fencing tokens


When using a lock or lease to protect access to a resource, such as the file storage in Fig. 8.4, we need to ensure that a node that mistakenly believes itself to be "the chosen one" cannot disrupt the rest of the system. A fairly simple technique for this, shown in Fig. 8.5, is called fencing.

Imagine that every time the lock server grants a lock or lease, it also returns a fencing token: a number that increases every time a lock is granted (the lock service can increment it, for example). In addition, we require that every time a client sends a write request to the storage service, it includes its current fencing token in the request.

In Fig. 8.5, client 1 obtains a lease with token 33, after which it is paused for a long time and the lease expires. Client 2 obtains a lease with token 34 (the number increases monotonically) and then sends a write request to the storage service, including token 34 in the request. Later, client 1 comes back to life and sends a write with token 33 to the storage service. However, the storage service remembers that it has already processed a write with a higher token number (34) and rejects the request with token 33.
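On the storage side, the check can be as simple as remembering the highest token processed so far. The sketch below is a minimal illustration (not the book's code), with an in-memory Python class standing in for the storage service:

```python
class FencedStorage:
    """Toy storage service that rejects writes carrying an outdated fencing token."""

    def __init__(self):
        self._files = {}       # filename -> contents
        self._max_token = {}   # filename -> highest fencing token processed so far

    def write(self, filename, data, token):
        if token < self._max_token.get(filename, token):
            raise PermissionError(
                f"stale fencing token {token}, already processed {self._max_token[filename]}")
        self._max_token[filename] = token
        self._files[filename] = data

storage = FencedStorage()
storage.write("file", b"write from client 2", token=34)      # accepted
try:
    storage.write("file", b"write from client 1", token=33)  # client 1 wakes up too late
except PermissionError as err:
    print("rejected:", err)
```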

When using ZooKeeper as a lock service, the transaction ID zxid or the node version cversion can be used as a fencing token. Since ZooKeeper guarantees that they increase monotonically, they have the required properties.
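As a sketch of how this might look with the kazoo client for ZooKeeper (the server address and the lock path are assumptions, and the write itself is only printed here rather than sent to a real token-checking storage service):

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")   # assumed ZooKeeper address
zk.start()

lock = zk.Lock("/locks/myfile", "client-1")
with lock:
    # The lock recipe creates a child znode under /locks/myfile, and ZooKeeper bumps
    # the parent's child version (cversion) on every such change, so a later lock
    # acquisition observes a larger cversion than an earlier one.
    _, stat = zk.get("/locks/myfile")
    fencing_token = stat.cversion
    print("write to storage with fencing token", fencing_token)
```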

[Figure 8.5. Making access to storage safe by allowing writes only in the order of increasing fencing tokens]

Note that this mechanism requires the resource itself to take an active part in checking tokens, rejecting any write with a token older than one it has already processed; checking the lock status on the clients themselves is not enough. For resources that do not explicitly support fencing tokens, the limitation can sometimes be worked around (for example, with a file storage service, the fencing token could be included in the file name). However, some kind of check is still needed to avoid processing requests that are not protected by the lock.

Checking tokens on the server side may seem like a drawback, but it is arguably a good thing: it would be unwise for a service to assume that its clients always "behave themselves", since clients are often run by people whose priorities are very different from those of the service's owners. It is therefore a good idea for any service to protect itself from unintentionally misbehaving clients.

Byzantine failures


Fencing tokens can detect and block a node that is acting incorrectly unintentionally (for example, because it has not yet noticed that its lease has expired). However, a node that deliberately wants to subvert the system's guarantees can easily do so by sending messages with a fake token.

In this book we assume that nodes are unreliable but "honest": they may be slow or never respond (due to a fault), and their state may be outdated (due to a garbage collection pause or network delays), but if a node responds at all, it "tells the truth" (it follows the rules of the protocol to the best of its knowledge).

The problems of distributed systems become much harder when there is a risk that nodes may "lie" (send arbitrary faulty or corrupted responses), for example if a node claims to have received a certain message when in fact it did not. Such behavior is called a Byzantine fault, and the problem of reaching consensus in such an untrusting environment is known as the Byzantine generals problem.

A system is protected from Byzantine faults if it continues to operate correctly even when some nodes malfunction and do not follow the protocol, or when attackers interfere with the network. This concern is relevant in the following circumstances.

- In aerospace environments, data in a computer's memory or CPU registers can be corrupted by radiation, causing nodes to respond to other nodes in arbitrary, unpredictable ways. Since a failure can be extremely costly, flight control systems must tolerate Byzantine faults.
- In a system with several participating organizations, some participants may try to cheat or defraud the others, so a node cannot simply trust another node's messages. For example, peer-to-peer networks are a way of getting mutually untrusting parties to agree without relying on a central authority.

However, in the kinds of systems we discuss in this book, we can usually safely assume that there are no Byzantine faults. In your data center all nodes are controlled by your organization (so they can hopefully be trusted), and radiation levels are low enough that memory corruption is not a serious problem. Protocols for building systems that tolerate Byzantine faults are quite complicated, and fault-tolerant embedded systems rely on hardware support. In most server-side information systems, Byzantine fault-tolerant solutions are too expensive to be worth using.

At the same time, it does make sense for web applications to expect arbitrary and malicious behavior from clients under end-user control, such as web browsers. This is why input validation, sanitization, and output escaping are so important: for example, to prevent SQL injection and cross-site scripting. However, Byzantine fault-tolerant protocols are usually not used here; instead, the server is simply given the authority to decide whether client behavior is acceptable. In peer-to-peer networks, where there is no such central authority, protection from Byzantine faults is more appropriate.

Software bugs could be regarded as Byzantine faults, but if the same software runs on all nodes, Byzantine fault-tolerance algorithms will not save you. Most of these algorithms require a supermajority of more than two-thirds of the nodes to be working correctly (that is, with four nodes, at most one may be faulty). To use this approach against bugs, you would have to run four independent implementations of the same software and hope that the bug shows up in only one of them.
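For a sense of the arithmetic (an illustration, not from the book): most Byzantine fault-tolerant algorithms need n >= 3f + 1 nodes to tolerate f arbitrarily faulty ones, which is exactly the "at most one faulty node out of four" figure above.

```python
def max_byzantine_faults(n):
    """Largest f such that n >= 3*f + 1, the usual requirement of BFT algorithms."""
    return (n - 1) // 3

for n in (4, 7, 10):
    print(f"{n} nodes tolerate at most {max_byzantine_faults(n)} Byzantine node(s)")
# 4 nodes -> 1, 7 nodes -> 2, 10 nodes -> 3
```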

Similarly, it is tempting to hope that the protocol itself could protect us from vulnerabilities, security compromises, and malicious attacks. Unfortunately, in most systems this is unrealistic. If an attacker manages to gain unauthorized access to one node, it is very likely that they can gain access to the others as well, since the same software is most likely running on all of them. Therefore, traditional mechanisms (authentication, access control, encryption, firewalls, and so on) remain the main defense against attacks.

Weak forms of "lying". Although we assume that nodes are generally "honest", it makes sense to add protection against weak forms of "lying" to the software, for example invalid messages caused by hardware problems, software bugs, or misconfiguration. Such mechanisms cannot be considered full-blown protection against Byzantine faults, because they will not stop a determined adversary, but they are simple and practical steps toward greater reliability. Here are some examples.


System models in practice


Many algorithms have been developed to solve the problems of distributed systems; for example, we will examine solutions to the consensus problem in Chapter 9. To be useful, these algorithms must be able to cope with the various faults of distributed systems that we discussed in this chapter.

Algorithms should depend as little as possible on the hardware and software configuration of the system on which they run. This, in turn, requires formalizing the kinds of faults that can occur. We do this by defining a system model, an abstraction that describes the assumptions made by the algorithm.

As for timing assumptions, three system models are often used.

- Synchronous model. Assumes bounded network delay, bounded process pauses, and bounded clock error. This is not a realistic model for most practical systems.
- Partially synchronous model. The system behaves like a synchronous one most of the time, but sometimes exceeds the bounds on network delay, process pauses, and clock drift. This is a realistic model for many systems.
- Asynchronous model. The algorithm is not allowed to make any timing assumptions; in fact, this model has no clocks at all (and therefore no timeouts).

In addition to timing issues, we have to consider possible node failures. The three most common system models for nodes are the following.

- Crash-stop faults. A node can fail only by crashing: it may suddenly stop responding at any moment and never come back.
- Crash-recovery faults. A node may crash at any moment and perhaps start responding again after some unknown time. Nodes are assumed to have stable storage that survives crashes, while their in-memory state is lost.
- Byzantine (arbitrary) faults. Nodes may do absolutely anything, including trying to deceive other nodes.

The partially synchronous model with crash-recovery faults is usually the best fit for modeling real systems. But how do distributed algorithms cope with it?

Algorithm Correctness


To define the correctness of an algorithm, we can describe its properties. For example, the output of a sorting algorithm has the property that for any two distinct elements of the output list, the element on the left is smaller than the element on the right. This is simply a formal way of describing what it means for a list to be sorted.

Similarly, we can formulate the properties required of a correct distributed algorithm. For example, if we are generating fencing tokens for a lock, we may require the algorithm to have the following properties:

- Uniqueness. No two requests for a fencing token return the same value.
- Monotonic sequence. If request x returned token tx, request y returned token ty, and x completed before y began, then tx < ty.
- Availability. A node that requests a fencing token and does not crash eventually receives a response.

An algorithm is correct in some system model provided that it always satisfies these properties in all situations that, as we assume, can arise in that system model. But does this even make sense? If all nodes crash, or all network delays suddenly stretch to infinity, then no algorithm can accomplish anything.
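As a small illustration (hypothetical trace format, not from the book), the two safety properties above can be checked mechanically on a recorded sequence of granted tokens, whereas availability, being a statement about the future, cannot be refuted by any finite trace:

```python
def check_token_safety(granted_tokens):
    """granted_tokens: token values of non-overlapping requests, in completion order."""
    # Uniqueness: no two requests returned the same token.
    assert len(granted_tokens) == len(set(granted_tokens)), "uniqueness violated"
    # Monotonic sequence: each later request returned a strictly larger token.
    assert all(a < b for a, b in zip(granted_tokens, granted_tokens[1:])), \
        "monotonic sequence violated"

check_token_safety([33, 34, 35])   # passes silently
```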

Functional safety and survivability


To clarify the situation, it is worth distinguishing between two different kinds of properties: functional safety (safety) and survivability (liveness). In the example just given, uniqueness and monotonic increase of values are functional safety properties, while availability is a survivability property.

What is the difference between these two kinds of properties? A telltale sign is that definitions of survivability properties contain the word "eventually" (and yes, you guessed it: eventual consistency is a survivability property).

Functional safety is often informally described as "nothing bad happens", and survivability as "something good eventually happens". However, it is best not to read too much into these informal definitions, because the words "bad" and "good" are subjective. The real definitions of functional safety and survivability are mathematically precise.


The advantage of separating functional safety and survivability properties is that it simplifies working with difficult system models. For distributed algorithms it is common to require that the functional safety properties hold always, in all possible situations of the system model. That is, even if all nodes crash or the entire network fails, the algorithm must still guarantee that it does not return a wrong result (that is, the functional safety properties are preserved).

For survivability properties, however, caveats are allowed: for example, we may say that a request must receive a response only if a majority of the nodes have not crashed and only if the network eventually recovers from an outage. The definition of the partially synchronous model requires that the system eventually return to a synchronous state, that is, any period of network interruption lasts only for a limited time and is then repaired.

Binding system models to the real world


Functional safety and survivability properties, together with system models, are very useful for reasoning about the correctness of a distributed algorithm. However, when an algorithm is implemented in practice, the messy reality takes over, and it becomes clear that the system model is only a simplified abstraction of reality.

For example, algorithms in the crash-recovery model usually assume that data in stable storage survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out due to a hardware error or misconfiguration [91]? What if the server's firmware contains a bug and the server stops "seeing" its hard drives after a reboot, even though they are correctly attached?

Quorum algorithms (see the "Quorum reads and writes" subsection of "Writing to the database when one of the nodes fails" in Section 5.4) rely on a node remembering the data it claims to have stored. If a node can suffer "amnesia" and forget previously saved data, this breaks the quorum condition and hence the correctness of the algorithm. Perhaps a new system model is needed, in which stable storage is assumed to survive crashes in most cases but may sometimes lose data. However, such a model is harder to reason about.

A theoretical description of an algorithm can declare that certain things simply do not happen, and in non-Byzantine systems we do have to make assumptions about which faults can and cannot occur. In practice, however, an implementation may still need code to handle the case where something assumed to be impossible does happen, even if that handling boils down to printf("You are not lucky") and exit(666), that is, letting a human operator clean up the mess. (This, arguably, is the difference between computer science and software engineering.)

This does not mean that theoretical, abstract system models are worthless; quite the opposite. They are extremely useful for distilling the full complexity of a real system down to a manageable set of faults that we can reason about, so that we can understand the problem and try to solve it systematically. We can prove an algorithm correct by showing that its properties always hold in a certain system model.

Proving an algorithm correct does not mean that its implementation in a real system will always behave correctly. But it is a very good first step, because theoretical analysis can reveal problems in an algorithm that may remain hidden in a real system and surface only when the assumptions (for example, about timing) break down due to unusual circumstances. Theoretical analysis and empirical testing are equally important.

» More information about the book can be found on the publisher's website.
» Table of Contents
» Excerpt

For Habrozhiteley, a 20% discount coupon: Programming

Source: https://habr.com/ru/post/352742/

