How I realized what distributed systems are

Hi, Habr!

Soon we will have an exquisite novelty for top-class developers - “ Reactive design patterns ”.

The author of the book Roland Kun is a star of the first magnitude in the field of distributed systems, one of the Akka developers. Under the cut, we offer the translation of his programmatic article on distributed systems and the actor model posted on the GitHub website.

When people ask me how I like the actor model, I usually answer: “It accurately models the distribution”. This means that it specifically describes what is distributed computing, it does not contain tinsel, and important characteristics are not ignored. In this article I wanted to tell you what I learned about distributed models; I hope, readers will also be interested.
')
Disclaimer: maybe someone has already set out all these important moments back in the 80s. Sorry, there was no time for a detailed study of this aspect, I prefer to study on my own.

## Beginning: Untyped Akka Actors

I participated in the revision of Akka between versions 1.3 and 2.0. During this period, we made some fundamental changes both in the device of the instrument and in the set of guarantees it provides. With all the changes, they used the golden rule: “if it cannot be guaranteed that this feature will certainly work in a distributed context, then we will not do it”. We intuitively understood what is distribution, in the spirit of:

The sender and recipient of the message may be located in strongly remote systems (from the point of view of the delay in communication), so there is not much point in knowing that "some kind of request has already been processed" because
All communication is unreliable: any messages can be lost and delayed, and processes (actors) can fail independently of each other, and this can occur both on one machine and in different parts of the network.

At that time, the Akka development team was deeply rooted in the belief that it was problematic to achieve consistency among all actors, and the need for such coordination should be avoided if possible. It takes a lot of time to achieve consistency and, when it is installed, the elements of the system can already go far in work from the “coordination point”, spoiling the result, unless the processes are programmed taking into account the “signal propagation time”.

How can a user API be arranged, as well as individual capabilities under such strict restrictions? In the actor model, three characteristics are defined and the fourth is implied (we will return to this implied characteristic below):

sending messages
behavior change as messages arrive (i.e. sequential processing)
creation of additional actors

Akka implements all three of these capabilities, but in its own way provides guarantees for the delivery of messages; instead of reliable delivery, optional; we prefer that the user himself decides what level of reliability he needs. That is, we believe that without (redundant) persistence, delivery will not be truly reliable, since a single power failure is enough - and there will be no talk of reliability. However, it is too hard to require long-term storage just to run several local actors. Here there is an important limitation: the one-to-one user API corresponds to the operational semantics, i.e., ActorRef must always act in the same way, it is impossible to provide any substantial reliability by means of configuration alone, since there is no reliability in the ref ! msg ref ! msg not guessed.

Here are some other features offered by Akka.

compulsory parental control (in particular, the lifespan of a child actor is limited to the lifespan of the parent)
life cycle monitoring aka DeathWatch

Going beyond the limits of the actor model is not so easy - it requires achieving a certain consistency within the cluster of nodes where actors are hosted. Only by reaching a consensus on what situation is considered a fatal node failure, one can guarantee that these possibilities will work with the same semantics in all conditions. When I read my newsletter, there is no doubt that one cannot joke with such problems. The question constantly arises: why the nodes are knocked out of the cluster, and why they will not be returned back later (I explain: as soon as the node is declared dead, all its supervision and death watch is canceled, therefore, when resurrecting such a node from the dead, its actors will start to behave oddly .

## First stop: Akka Typed

The interactions between Akka actors are untyped, and this irritated me from the very beginning. Sending a message is mediated by the operator ! in essence, this is a “from general to particular” function — completely unlimited and without any feedback. Lack of feedback is a concession that you have to make when modeling a distributed system, since any exchange of information is expensive. But the limitations associated with typing seem to have been missed by chance, and we observe the same problem, considering how the actor is defined: it is a partial function acting from the general to the particular. Thus, any actor turns into a black box, which may or may not work when you send him a message. Therefore, actors get considerable freedom of action, but it becomes difficult to judge them in isolation from the dynamics of events: it’s as if we are injecting a JavaScript bubble into the type-safe world of Java.

Since version 2.4 of Akka comes with Akka Typed , I’ll clarify my desire for the third time in this way: I would like to limit the types of messages accepted by the actor by allowing the compiler to reject obviously incorrect programs from the threshold. In essence, now the definition of an actor is an integral function describing the path from an input message of a certain type to a particular behavior (limited by the same type). Accordingly, it is now possible and justifiably parameterize the reference to the actor with the same type, rejecting inappropriate input information.

 //  -Scala    dotty type ActorRef[-T] = T => Unit type Behavior[T] = (T | Signal) => Behavior[T] // ,   ,

Such changes are inspired by repeated doping of internal and auxiliary capabilities, but they also allow expressing not just static types of actors: if you include correctly typed ActorRef in messages, then the types that appear in the communication between actors can change over time (at different stages of work).

 case class Authenticate(token: Token, replyTo: ActorRef[AuthResponse]) sealed trait AuthResponse case class AuthSuccess(session: ActorRef[SessionCommand]) extends AuthResponse case class AuthFailure(reason: String) extends AuthResponse

If we simulate the protocol in this way and only provide clients with ActorRef[Authenticate] , then we will not only prevent them from sending messages of a completely wrong type, but also express the session's dependence on successful authentication. After all, without having ActorRef[Sessioncommand] , the compiler will not allow sending such messages.

## Related Topic: Protocols

In the previous example, we use different types of messages depending on which stage of the work is described by the appropriate protocol, and this approach may become overweight in time. Another insidious moment is that it does not statically check how many messages are sent at each stage of work, and the client can not only send the same messages many times, but also roll back to the stage below, saving ActorRef from that stage. Of course, such a scheme will collapse as soon as cycles appear in the protocols that repeatedly use the same type at different points in time.

To cope with these problems, it is necessary to describe multi-stage protocols in the context of their structure. One promising approach is called “ session types ”, but not all questions are answered here. For example, it is still problematic to express the linearity of operations (that is, the impossibility of going back in time and using previously utilized information) in a programming language — whereas for a person such a retrospective seems perfectly logical. One approximation of this kind is the Ichannels library on Scala from Alceste.

## Path to composability

The formulation of the concept of actors consciously implies that there should be a single entry point for all messages. Such an input point can be made on an untyped Akka using context.become(...) , or arrange for each message to be processed on Erlang or Akka Typed. If we compose actors from different-type behaviors (that is, actors work differently depending on the intermediary), then all messages arriving through such an entry point should be demultiplexed - only in this way they will arrive at the correct destination in accordance with the internal logic. This situation is a bit annoying when working with untyped actors, but Akka Typed can turn into a real torment: here you will need cast operations that formulate such behavior that both Authenticate and SessionCommand , but, for example, only the first option is publicly presented. With highly typed logic, strict layout rules are required, apparently this is a universal truth, whether it is a matter of the composition of pure functions or distributed computations.

Alex Prokopets in his presentation at the conference ScalaDays 2016 in Berlin literally opened my eyes, indicating a way out of this dilemma. The very nature of the actor model is such that it requires assigning different versions of the actors (references to them) for different purposes. Of these, it is possible to assemble a larger object that can communicate using different protocols with each of the intermediaries. When creating independent actors, we face the following disadvantage: internal consistency is lost. Actors turn into single inspected islands in a sea of distributed chaos. Therefore, the trick is to virtualize the actor and make several input points in it, each with its own identifier. Alex uses the semantics of stream processing, something like RxJava, but I was immediately attracted to the idea of organizing the internal layout of such composite actors using π-calculus.

The first version of the original session DSL over Akka Typed was quickly created on the basis of a monadic description of a sequential and competitive layout of primitive actions and calculations.

 val server = toBehavior(for { backend ← initialize server ← register(backend) } yield run(server, backend)) private def initialize: Process[ActorRef[BackendCommand]] = { val getBackend = channel[Receptionist.Listing[BackendCommand]](1) actorContext.system.receptionist ! Receptionist.Find(BackendKey)(getBackend.ref) for (listing ← readAndSeal(getBackend)) yield { if (listing.addresses.isEmpty) timer((), 1.second).map(_ ⇒ initialize) else unit(listing.addresses.head) } } ...

The main abstraction is the Process, which ultimately calculates the value of the specified type. For sequential layout, flatMap or map is used, and the fork (process) action is used to create concurrent execution threads. The full resulting process is calculated within the framework of a single actor (the toBehavior function wraps it in a suitable interpreter). It responds to input as it arrives, and this is what it should do: the readAndSeal operation suspends the process until an available message appears on the getBackend channel created as part of the initialize process.
The primitives proposed in the draft version of this library correspond to the actions and compositional possibilities of the π-calculus:

channel creation
sending
getting
sequence
the choice
paralleling

Everything began to improve rapidly, as if in my head a world of perfectly composable reusable behaviors began to emerge.

## Deep (yes, deep) disappointment

My dreams collapsed as soon as I asked the following question. What will happen if an important message - one that opens the path to the next protocol for another element of the Process - is lost for some reason. Ensuring reliable delivery does not eliminate the following problem: other actors can refuse completely “independently” from the first one, and if the recording end of the channel corresponds to ActorRef, then it is logical that the progress of all our work depends on remote systems, because the transparency of the location is a very powerful semantic phenomenon. To fix the problem, we could set an upper limit on the wait time for the receive operation, but in this case we need to wait for the local process to fail. Based on the fact that local processes will also coordinate work through channels, we mean that channels will hang. This is a security issue that is solved with (distributed) garbage collection. (Naturally, the channels may also hang due to a programmer's error; therefore, in any case, it seems reasonable to be safe from a fatal resource drain).

I also realized that, expressing π-calculus on paper, one can isolate and eliminate dead processes in this expression, which, perhaps, will not move anywhere: they are waiting to be sent or received in a channel that is not known to any other process. Such an operation to eliminate dead processes is practically unattainable if the implementation is based on Scala's fuzzy circuit.

But the greatest difficulty is connected with the fact that the defining feature of π-calculus, namely, the ability to exchange channels, is very difficult in practice. On the sender's side, the problem is solved trivially: the channel is simply provided as an ActorRef. However, on the recipient side, it will be necessary to ensure that the sent message is delivered to multiple recipients with only one defining characteristic: at the moment each of them has a link to the channel, and the addressee is ready to receive information. We will need to ensure that, in practice, the message receives one and only one addressee.

It is from such global coordination in Akka that one has to abstain, at least, when working with basic primitives, since it is very costly. It is assumed that the simplest solution should be scaled practically without limitations, in principle - infinite. We strive to get as close as possible to such an ideal. Infinite scalability is not always required, and there are quite real situations in which a consistently coordinated database will suit us, but this does not prevent us from striving for perfection.

## Up the hill

The use of channels could be “corrected” by preventing the recipient from being serializable (that is, not allowing it to be sent over the network) and throwing an exception if the receiving operation is undertaken from the context of an actor that is not suitable for this. Such a solution would be ugly, not only because it strongly contradicts π-calculus, but also because it is not recommended to offer not everywhere certain functions as user APIs, whose function depends on circumstances that are not viewed at the code or type level.

It seems convenient to a person to choose from which channel to read: after all, we communicate this way. We go somewhere and speak with different people in the order required to achieve the goals that we set for ourselves. Actors are forced to deal with the message that comes next, in real life, this would mean "distracted all the time." Unfortunately, the above we have seen that it is precisely the ability to freely choose a channel is the most problematic, so let's consider alternative approaches.

One alternative suggested by Alex, who described the Reactors. The channel API allows streaming and other reactions to be attached. This is better, the possibility is still possible that the link to the channel will be transferred to another reactor, and chaos will reign in the system due to receiving information from the wrong context.

Therefore, I realized how beneficial it is to program the acquisition operation as an implicit API property, one that cannot be easily accessed from user code. This is exactly how such a trap is avoided in the actor model: here we define three options for action that can be taken in response to a message , but the actor is not allowed to actively request incoming messages.

Another alternative is ... actors. Each channel is created based on a behavior that describes how it will respond to incoming messages; in particular, it can change the behavior from message to message. Thus, we continue the dialogue with another actor, creating a channel with a continuation and sending it back to the intermediary. This is how Karl Hewitt and Gal Aga imagined this concept and propagandized it from the very beginning, although they did not consider competition as an integral part of this model.

## Second stop: sub-actors

An improved model can be described as implicit packaging of channel creation operations with process creation operations and with removal of arguments to the read operation — when reading, only one input channel of the process is accessed. With a sequential composition, there is an option with repeated use of the same channel, and with a parallel composition, the results can be transmitted (if necessary) through previously created continuation channels.

The advantage over bare Akka Typed is to get rid of the stencil code needed to create and install continuation operations, as well as to reuse a single actor as a dispatch module. Thus, such fine management is provided without constant exorbitant costs for the creation of actors and interflow communication. It is believed that such a phased description of actor behaviors allows the actors to be reused and linked.

## And it's all?

At the very beginning I mentioned that I understood something about the statement “Actors accurately model distribution”. I just realized how exactly this model works when solving problems: apparently, the actor model in all its details and the semantics of distributed systems are combined almost into one whole. So, in a Wikipedia article on process algebra, it is stated that the π-calculus and the actor model are actually twins, but I don’t think so anymore. In my opinion, π-calculus offers a too large set of properties to allow the creation of infinitely scalable implementations. On this occasion, there have been some new research, in particular, the work of Chris Mejlejon on the expressiveness of asynchronous π-calculus . I also found a very useful FAQ on π-calculus, describing what it should be used for. I believe that the π-calculus tools are interesting for the formal description and verification of the protocols - and, in this case, the expressiveness of this calculus is not fully used - and that based on such a calculus (without adaptation) you cannot write an API for the end user.

On the other hand, I did not find how to fully formalize the actor model in the form of calculus (in the sense that it is not possible to mathematically adequately convert it to other types of calculus and so comply with all equivalence and congruence relations so that all beautiful theorems continue to be fulfilled). It is quite possible that we will need not such formalization, but restrictions on the behavior of actors described in external protocols, raising them to the source level using code generation (either encoding the entire session, as done in Alceste, or generating appropriately related classes of messages depending on what level of security is achievable on a specific end-user API). Or, you need to extract the behaviors of the actors into an abstract tree of behaviors that can be represented as π-calculus processes and analyzed outside the tree. It seemed to me that from the conceptual point of view the most difficult thing is the following: the primitive effect of unreliable sending a message with arbitrary delivery corresponds to a nontrivial π-process, which leads to combinatorial blasts that drastically reduce the number of possibilities; (?) : , .

– - , . – .

Source: https://habr.com/ru/post/344538/

All Articles