
Notes on distributed systems for beginners

Hello everyone!

The time has come to tell you about another book that has sparked genuine interest and serious debate among us.

We assumed that when it comes to learning algorithms for distributed systems, brevity is the soul of wit, so studying Wan Fokkink's book “Distributed Algorithms: An Intuitive Approach” is a promising and rewarding endeavor, even though the book is only 248 pages long.


However, to make participation in the poll more interesting, we first invite you below the cut, where you will find a translation of a most interesting article by Jeff Hodges describing the wide variety of problems that come up when developing distributed systems.

I have thought for a long time about the lessons that engineers who operate distributed systems learn on the job.

Much of this knowledge is acquired by getting burned by our own mistakes on real production traffic. Those burns are well remembered, but I would like as many engineers as possible to avoid making such mistakes in the first place.

Young systems engineers should definitely read the “Fallacies of Distributed Computing” and get acquainted with the CAP theorem as part of their self-education. But these are abstract works with no direct practical advice, and it is exactly such advice that a young engineer needs in order to start growing. It is amazing how little context such an engineer has when just starting out.

Below are some of the lessons I learned as a distributed systems engineer that I would like to pass on to a young colleague. Some of them are not obvious, others are surprising, and many are open to debate. The list should help a novice distributed systems engineer get oriented in the field where his or her professional career is beginning. It is not exhaustive, but it is a good start.

The main drawback of this list is that it focuses on technical problems and barely touches on the social issues an engineer may encounter. Since distributed systems require more machines and more capital investment, engineers who work on them end up interacting with many teams and large organizations. Communication is usually the hardest part of any programmer's job, and this is especially true in distributed systems development.

Our background, education, and experience push us toward technical solutions even when a more pleasant and effective social solution exists. Let's try to correct that. People are not as capricious as computers, even if their interface is less standardized.

So, let's begin.

Distributed systems are different because they fail often.


Ask what distinguishes distributed systems from other branches of programming, and a young engineer will most likely mention latency, believing it is what makes distributed computing hard. But that is a mistake. What truly sets distributed systems apart is the high probability of failure and, worse, of partial failure. If a well-formed mutex unlock fails with an error, you can assume the process is unstable and kill it. But the failure of a distributed mutex unlock has to be built into the lock protocol.

A systems engineer who has not dealt with distributed computing might think, “OK, I'll just send the write to both machines,” or “OK, I'll keep retrying the write until it succeeds.” Such engineers have not fully internalized (even if they understand it intellectually) that networked systems fail more often than systems running on a single machine, and that the failures are usually partial rather than total. One write may go through while another does not, and how do we then get a consistent view of the data? Reasoning about such partial failures is much harder than reasoning about total ones.
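To make partial failure concrete, here is a minimal sketch; the replica names, failure rate, and exception type are invented for illustration. A write sent to two machines can succeed on one and fail on the other, leaving the replicas in a disagreement that the protocol itself, not wishful retrying, has to resolve.

```python
# Sketch only: hypothetical replicas; the failure rate and names are illustrative.
import random

class ReplicaUnavailable(Exception):
    pass

def write_to_replica(replica: str, key: str, value: str) -> None:
    # Stand-in for a network call that can fail independently on each replica.
    if random.random() < 0.2:
        raise ReplicaUnavailable(replica)

def replicated_write(key: str, value: str) -> dict:
    # "Just send the record to both machines" -- and record what actually happened.
    outcome = {}
    for replica in ("replica-a", "replica-b"):
        try:
            write_to_replica(replica, key, value)
            outcome[replica] = "ok"
        except ReplicaUnavailable:
            outcome[replica] = "failed"
    return outcome

# A result like {'replica-a': 'ok', 'replica-b': 'failed'} is a partial failure:
# the two machines now disagree, and retrying blindly may or may not fix it.
print(replicated_write("user:42", "hello"))
```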

Switches go down; garbage-collection pauses make leaders “disappear”; a write to a socket seems to succeed but has in fact failed on the other machine; because of one slow disk on one computer, the communication protocol of the whole cluster crawls at a turtle's pace; and so on. Reading from local memory is simply more stable than reading across several switches.
Design with such failures in mind.

Writing fault-tolerant distributed systems costs more than writing reliable single-machine systems


Creating a fault-tolerant distributed solution costs more than a similar program for a single machine, because some kinds of failure occur only when many machines are involved. Virtual machines and cloud technology make distributed systems cheaper, but not as cheap as designing, implementing, and testing on a computer you already have. Some failure conditions are hard to simulate on a single machine. Whether the situation can arise only with data sets far larger than fit on one machine, or only under the network conditions found in a data center, distributed systems usually need real rather than simulated distribution to flush out their bugs. Simulations are, of course, also very useful.

Reliable open source distributed systems are much less common than reliable open source single-machine systems.


The cost of running many machines over the long term is a heavy burden for open source communities. Free software is largely the work of enthusiasts and hobbyists, and they do not have the financial resources to investigate and fix the many problems that arise in a distributed system. Enthusiasts write open source code out of interest, in their spare time, on computers they already have. It is much harder for such a developer to acquire a fleet of machines, spin them up, and keep them running.
Some of this work is picked up by programmers employed at large corporations, but the priorities of those organizations may well not coincide with yours.
Although some members of the open source community are aware of this problem, it has not been solved yet. It is hard.

Coordination is very difficult.


If at all possible, try to avoid coordinating machines. This strategy is often called “horizontal scalability.” The real trump card of horizontal scalability is independence: it means getting data to machines in a way that minimizes communication and coordination between them. Every agreement the machines have to reach with one another makes the service harder to implement. Information travels at a finite speed, and network communication is flakier than you might think. Besides, you probably do not fully understand what “consensus” even means. Here it is useful to study the Two Generals and Byzantine Generals problems (by the way, Paxos really is very hard to implement; this is not just grumpy old programmers trying to show that they are smarter than you).

If you can fit the problem in memory, it is probably trivial.


To a distributed systems engineer, any problem that can be solved on a single machine looks easy. Figuring out how to process data quickly is much harder when several switches sit between you and that data instead of a few pointer dereferences. The good old efficiency tricks documented since the dawn of computer science do not carry over to distributed systems. There is a wealth of literature and documentation on algorithms that run on single machines, precisely because most computation happens on single, uncoordinated computers. There is far less of it for distributed systems. (emphasis ours — ph_piter)

"Brakes" - the hardest problem, debugging which you will ever have to deal with


“It's slow” can mean that one or several of the systems involved in serving a user request are slow. It can mean that one or several links of a processing pipeline distributed across many machines are slow. “It's slow” is a hard case, partly because the problem statement itself gives little hint of where to look. Partial failures, the ones that do not show up on the graphs you usually look at, are hiding in dark corners. And until the degradation becomes blatantly obvious, nobody will allocate enough resources (time, money, tools) to deal with it. Tools such as Dapper and Zipkin appeared to fight exactly this.
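As a rough illustration of where to start, here is a hypothetical per-stage timer (it is not Dapper or Zipkin, and the stage names and sleep times are made up): measuring each link of the pipeline separately at least points at the slow one instead of leaving you to guess.

```python
# Sketch only: a toy per-stage timer, not a real tracing system.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def span(name: str):
    # Accumulate wall-clock time spent in each named stage of a request.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + (time.perf_counter() - start)

def handle_request():
    with span("parse"):
        time.sleep(0.001)
    with span("backend_call"):
        time.sleep(0.05)      # the "slow" link hides here
    with span("render"):
        time.sleep(0.002)

handle_request()
# Sorting stages by time spent points at the slow link.
for name, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {seconds * 1000:.1f} ms")
```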

Implement backpressure throughout the system


Backpressure is the signaling of failure from a serving system back to the requesting system, together with how the requesting system handles those failures, in order to prevent overload of both. Designing for backpressure means bounding resource use during periods of overload and failure. It is one of the cornerstones of building reliable distributed systems.

Common implementations include dropping new messages on the floor (and incrementing a metric) when system resources are already at their limit, and returning errors to users when the system determines that a request cannot be handled within a given amount of time. Timeouts and exponential backoff on connections and on requests to other systems are also advisable.
Without backpressure, the risk of cascading failures or unexpected message loss rises. When one system cannot handle the failures of another, it tends to propagate those failures to the next system that depends on it.
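A minimal sketch of what this can look like; the queue bound, metric, and backoff parameters are invented for the example. The server drops new work explicitly (and counts the drops) once its queue is full, and the client retries with exponential backoff and jitter instead of hammering an overloaded service.

```python
# Sketch only: illustrative queue bound, metric, and backoff parameters.
import queue
import random
import time

MAX_PENDING = 100                            # illustrative resource limit
pending = queue.Queue(maxsize=MAX_PENDING)   # a worker would drain this queue
dropped_requests = 0                         # the metric incremented when shedding load

class Overloaded(Exception):
    pass

def accept(request) -> None:
    # Serving side: refuse new work explicitly instead of queueing without bound.
    global dropped_requests
    try:
        pending.put_nowait(request)
    except queue.Full:
        dropped_requests += 1
        raise Overloaded("shedding load, try again later")

def call_with_backoff(request, attempts: int = 5) -> None:
    # Requesting side: exponential backoff with jitter when the server signals overload.
    for attempt in range(attempts):
        try:
            accept(request)
            return
        except Overloaded:
            time.sleep((2 ** attempt) * 0.05 + random.uniform(0, 0.05))
    raise Overloaded("gave up after retries")
```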

Seek ways to ensure partial availability


Partial availability is the ability to return some results even when parts of the system fail.
Search is a particularly good example here. Search systems trade off the completeness of their results against how long the user has to wait. A typical search system sets a time limit on how long it will search its documents; if that limit runs out before it has looked through all of them, it returns whatever results it has found so far. This makes search easier to scale in the face of occasional slowness, since failures of that kind are treated the same way as being unable to search the entire corpus. The system can return partial results to the user, and its resilience increases.
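A sketch of the time-budget idea; the shard contents and per-shard latencies are invented for illustration. The query fans out until the budget runs out, and whatever was found is returned together with a flag saying whether the result is complete.

```python
# Sketch only: shard contents and per-shard latencies are made up for illustration.
import time

SHARDS = {
    "shard-1": (0.01, ["doc-1", "doc-4"]),
    "shard-2": (0.50, ["doc-7"]),          # a slow shard
    "shard-3": (0.01, ["doc-9"]),
}

def search(query: str, budget_seconds: float = 0.1):
    deadline = time.monotonic() + budget_seconds
    results, complete = [], True
    for shard, (latency, docs) in SHARDS.items():
        remaining = deadline - time.monotonic()
        if remaining <= 0 or latency > remaining:
            complete = False               # budget exhausted: return what we have
            continue
        time.sleep(latency)                # stand-in for the remote call
        results.extend(docs)
    return results, complete

hits, complete = search("example")
print(hits, "(complete)" if complete else "(partial results)")
```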

Or consider private messaging in a web application. If the exchange of private messages is temporarily broken, the ability to upload images to your account should probably still work. But what about allowing partial failure within the messaging system itself? That takes some thought. Users generally cope better with private messaging being temporarily unavailable for them (and perhaps for some other users) than with some of their messages being lost. If the service is overloaded, or one of its machines has failed, it is better to temporarily cut off a small part of the user base from the whole service than to lose the data of a much larger share of customers. Being able to recognize such partial-availability trade-offs is a valuable skill.

Track metrics, or you will not get the job done


Exposing metrics (in particular latency percentiles, counters incremented for certain actions, and their rates of change) is the only way to close the gap between your assumptions about how the system works and reality. Knowing how the system's behavior on day 15 differs from its behavior on day 20 is what separates successful engineering from fruitless guesswork. Of course, metrics are necessary for understanding the system's problems and behavior, but metrics alone are not enough to know what to do next.

Now, about logging. Log files are good to have, but they tend to lie. For example, it is very common for a few error classes to account for the vast majority of messages in a log file while in reality they affect only a tiny fraction of requests. Since logging successful operations is redundant in most cases (that volume of information would simply fill the disk), and since engineers rarely guess correctly which classes of information are worth logging, log files end up full of odd noise. Organize your logging as if the files will be read by someone unfamiliar with the code.

I have dealt with plenty of downtime that happened only because another developer (or I myself) attached too much importance to some oddity spotted in the log without first checking it against the metrics. I (and colleagues) have also managed to identify an entire class of faulty behavior from just a few lines in a log. But note that a) such successes are remembered precisely because they are rare, and b) no Sherlock Holmes can crack the problem when the necessary metrics and experiments are missing.

Use percentiles, not mean values.


Percentiles (50th, 99th, 99.9th, 99.99th) are more accurate and informative than averages in the vast majority of distributed systems. Using a mean assumes that the metric follows a bell curve, but in practice very few of the metrics a developer actually cares about look like that. People often talk about “average latency,” yet I have never encountered a distributed system whose latency distribution resembled a bell curve. If the metric does not follow that curve, averaging it is pointless and leads to a distorted picture and wrong conclusions. To avoid getting bogged down in such problems, work with percentiles. If percentiles are used throughout the system by default, you will get a much more realistic view of it.
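A quick way to see the difference, with synthetic latencies (the numbers are made up; in practice you would feed in measured values): the mean hides the slow tail that the upper percentiles expose.

```python
# Sketch only: synthetic latencies with a slow tail -- nothing like a bell curve.
import random
import statistics

random.seed(1)
latencies_ms = [random.uniform(5, 20) for _ in range(980)] + \
               [random.uniform(200, 2000) for _ in range(20)]

def percentile(samples, p):
    ordered = sorted(samples)
    index = min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1)))
    return ordered[index]

print(f"mean  = {statistics.mean(latencies_ms):7.1f} ms")  # looks comfortably low
print(f"p50   = {percentile(latencies_ms, 50):7.1f} ms")
print(f"p99   = {percentile(latencies_ms, 99):7.1f} ms")   # the tail users actually feel
print(f"p99.9 = {percentile(latencies_ms, 99.9):7.1f} ms")
```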

Learn to estimate your system's capacity


Along the way you will learn how many seconds there are in a day. Knowing how many machines a task requires is what separates a long-lived system from one that has to be replaced three months after going into operation, or worse, has to be redesigned before it is even finished.

Take tweets. How many tweet ids fit in the memory of an ordinary machine? On a typical machine with 24 GB of memory, 4-5 GB go to the OS and another couple of gigabytes to request handling, while a tweet id takes 8 bytes. We do this kind of back-of-the-envelope calculation all the time. Jeff Dean's “Numbers Everyone Should Know” slide is a handy reference here.
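Spelling the arithmetic out, using the figures from the paragraph above (the exact overheads are rough assumptions):

```python
# Back-of-the-envelope sketch with the figures mentioned above; overheads are rough.
total_memory_gb      = 24
os_overhead_gb       = 5     # "4-5 GB for the OS": take the pessimistic end
request_handling_gb  = 2     # "a couple more gigabytes" for serving requests
bytes_per_tweet_id   = 8     # a tweet id is a 64-bit number

usable_bytes  = (total_memory_gb - os_overhead_gb - request_handling_gb) * 1024**3
ids_in_memory = usable_bytes // bytes_per_tweet_id
print(f"{ids_in_memory:,} tweet ids")   # roughly 2.3 billion ids fit in memory
```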

Feature flags are how infrastructure gets rolled out


“Feature flags” are a common tool developers use to roll out new functionality in a system. They are usually associated with A/B testing on the client side, where they are used to show a new design or a new feature to only a portion of the user base. But the same flags are a powerful mechanism for replacing infrastructure.

Suppose you are moving from a single database to a service that hides the details of a new data store. The service should wrap the legacy storage and slowly ramp up writes to the new one. With backfills, comparison checks on reads (another flag), and a slow ramp-up of reads (yet another flag), you will feel much more confident and avoid most disasters. Many projects have had to be buried after suffering a “big outage” or a series of them, with the downtime caused by forced rollbacks due to bugs discovered too late.
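A minimal sketch of such a migration; the flag names and the dict-backed stores are hypothetical, and a real system would persist the flags and report mismatches to metrics rather than printing them.

```python
# Sketch only: hypothetical flags and stores for a storage migration behind a service.
FLAGS = {
    "write_to_new_store": True,      # dual-write while the old store stays authoritative
    "compare_reads": True,           # read both and report mismatches
    "read_from_new_store_pct": 10,   # slowly ramp reads, a slice of traffic at a time
}

def write(key, value, old_store, new_store):
    old_store[key] = value                     # legacy store is still the source of truth
    if FLAGS["write_to_new_store"]:
        new_store[key] = value

def read(key, old_store, new_store):
    old_value = old_store.get(key)
    if FLAGS["compare_reads"]:
        new_value = new_store.get(key)
        if new_value != old_value:
            print(f"mismatch for {key!r}: old={old_value!r} new={new_value!r}")
    if hash(key) % 100 < FLAGS["read_from_new_store_pct"]:
        return new_store.get(key)              # a small, sticky slice reads the new store
    return old_value
```

Each flag can be turned off on its own if a bug surfaces, which is what keeps rollbacks small instead of turning into another outage.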

To an ordinary programmer brought up on OOP, or to a novice who has been trained “properly,” all this flag switching looks like a terrible mess. But working with feature flags means accepting that serving multiple versions of infrastructure and data at once is the norm, not the exception. That is a telling lesson. What works on a single machine can backfire in distributed settings.

Feature flags are best understood as a trade-off: we accept local complexity (in the code of a particular system) in exchange for global simplicity and greater flexibility across the whole system.
Choose id spaces wisely


The id space you choose for your system shapes the contours of the entire system. The more ids are required to get at a specific piece of data, the more options you have for partitioning that data. Conversely, the fewer ids are needed, the easier it is to consume your system's output.

Consider version 1 of the Twitter API. All operations for fetching, creating, and deleting tweets were keyed by a single numeric id per tweet. A tweet id is a plain 64-bit number not tied to any other data. As the number of tweets grows, it becomes clear that building a user's timeline and the timelines of the people who follow that user can be done efficiently if all of a given user's tweets are stored on the same machine.

But the public API requires that a tweet's id alone be enough to address it. To partition tweets by user, a lookup service would have to be built, one that can tell which user owns which id. It can be done if necessary, but the cost is substantial.

An alternative API could have required the user id for every tweet lookup and, initially, simply used the tweet id for storage until the user-partitioned store came online. Another option would have been to include the user id in the tweet id, at the cost of tweet ids no longer being numeric and k-sortable.
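A toy illustration of that trade-off (the 64-bit layout here is invented for the example, not Twitter's actual scheme): packing the user id into the high bits makes routing a simple bit shift, but ids from different users no longer sort by creation order.

```python
# Sketch only: an invented layout -- high 32 bits: user id, low 32 bits: per-user sequence.
SEQ_BITS = 32

def make_id(user_id: int, sequence: int) -> int:
    return (user_id << SEQ_BITS) | (sequence & (2**SEQ_BITS - 1))

def owner_of(tweet_id: int) -> int:
    return tweet_id >> SEQ_BITS          # routing to a user's shard is just a shift

older = make_id(user_id=1234, sequence=1)   # created first
newer = make_id(user_id=7, sequence=2)      # created later, by a different user
print(owner_of(older), owner_of(newer))
print(sorted([older, newer]) == [older, newer])  # False: id order != creation order
```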

Keep track of what information you encode in your ids, explicitly and implicitly. Clients may use the structure of your ids to de-anonymize private data, crawl your system in ways you never anticipated (auto-incrementing ids are a typical sore spot), and do plenty of other things you will not expect.

Exploit data locality


The closer your data processing and caching sit to the persistent storage, the more efficient the processing will be, and the easier it is to keep caching consistent and fast. Networks carry more failures and more latency than pointer dereferences and fread(3).

Of course, data locality means nearness not only in space but also in time. If a group of users issue the same expensive request at nearly the same moment, those requests can perhaps be collapsed into one. If many requests for similar data arrive close together, they can be combined into one larger request. Doing so often reduces communication costs and makes failures easier to handle.
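A minimal sketch of the coalescing idea; the backend loader and the request stream are made up. Near-simultaneous requests for the same keys are deduplicated into a single batch call.

```python
# Sketch only: the backend loader and the request stream are illustrative.
def fetch_from_backend(keys):
    # Stand-in for one large request to the persistent store.
    print(f"one backend call covering {len(keys)} keys")
    return {key: f"value-of-{key}" for key in keys}

def coalesce_and_fetch(requests):
    # Requests arriving close together for the same or nearby data are
    # collapsed into a single deduplicated batch.
    values = fetch_from_backend(sorted(set(requests)))
    return [values[key] for key in requests]

# Four near-simultaneous requests, two of them identical, become one backend call.
print(coalesce_and_fetch(["user:1", "user:2", "user:1", "user:3"]))
```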

Do not write cached data back to persistent storage.


This happens in far more systems than you might think, especially in those designed by people without much distributed systems experience. Many systems you inherit will suffer from this flaw. If the implementers talk about “Russian-doll caching,” chances are good you will run into highly visible bugs. This item could easily have been left off the list were it not for my particular hatred of such errors. Their most common manifestation is user data (nicknames, email addresses, hashed passwords) mysteriously reverting to a previous value.

Computers are capable of more than you expect from them.


There is a lot of misinformation these days about what computers are capable of. It is spread by practitioners, but by relatively inexperienced ones.

At the end of 2012, a low-end web server had 6 or more cores, 24 GB of memory, and more disk space than you could use. A relatively complex CRUD application in a modern language runtime on a single machine can easily handle thousands of requests per second, with latencies measured in hundreds of milliseconds. And that is a deep lower bound. In terms of operational capacity, handling thousands of requests per second on one machine is nothing supernatural in most cases.

Considerably higher performance is not hard to achieve either, especially if you are willing to profile your application and improve it based on the measurements you make.

Use the CAP theorem to critique systems.


You cannot build a system from the CAP theorem; it does not work as a first principle from which a working system can be derived. Its formulation is too general, and the space of possible solutions too wide.

It is, however, very well suited to critiquing the design of distributed systems and to understanding which trade-offs will have to be made. Taking a system design and walking through it against all the constraints the CAP theorem imposes leaves every subsystem better designed. Here is your homework: apply the CAP theorem's constraints to a working implementation of Russian-doll caching.

One final note: of C, A, and P, you can never choose just CA.

Extract services


“Service” here means “a distributed system that contains higher-level logic than a storage system and usually has a request-response API.” Keep an eye out for code changes that would be easier to make if the code lived in a separate service rather than inside your system.

An extracted service provides all the encapsulation benefits usually associated with creating a library. However, extracting a service is preferable to creating a library, because changes can be deployed faster and more easily than if you had to update the library on every client system (unless, of course, the extracted service is itself hard to deploy, in which case updating the client systems may be easier). The ease comes from the fact that a relatively small extracted service contains less code and fewer operational dependencies; it also draws a clear boundary that rules out the “corner cutting” a library permits. Those cut corners always make it harder to migrate the system's internals to new versions.

When several clients are involved, the coordination cost of using a service is also much lower than that of a shared library. Updating a library, even when the API does not change, requires coordinating a deployment across all the client systems. It also requires more cross-team communication than deploying a service does, if different organizations maintain the client systems. Convincing everyone involved of the need to update is surprisingly hard, because their priorities may not match yours.

The canonical example is hiding a storage layer that needs to change. The extracted service exposes an API that is more convenient and has a smaller surface area than the storage layer it fronts. By extracting a service, the client systems do not need to know anything about the complexities of the slow migration to a new storage layout or format. Only the new service has to be checked for the bugs that will inevitably surface with the new layout.

There are many more operational and social issues to consider when working with distributed systems, but that is a topic for another article.

Source: https://habr.com/ru/post/262807/

