
Replicated object. Part 1: Introduction

Preface. This publication is the author's own translation of his article. So if you find an error in the translation, it may well be that the error exists in the original article as well.

Abstract


  1. There is suffering.
  2. There is a cause of suffering.
  3. There is a cessation of suffering.
  4. There is a path leading to deliverance from suffering.

The Four Noble Truths of Buddhism

This article describes an early prototype that introduces the concept of a replicated object, or replob for short. Such an object is a further rethinking of the fight against the complexity that arises when programming distributed systems. Replob eliminates the dependency on a third-party service and implements consistent changes to arbitrary user objects representing the corresponding data and functionality. The idea is based on the expressiveness of the C++ language and an object-oriented approach, which allows complex logic to be used within distributed transactions. This makes it possible to significantly simplify the development of fault-tolerant applications and services. Subsequent articles will explain the developed approach in more detail.

Introduction


WARNING. Almost all the methods mentioned in this article contain dirty memory hacks and abnormal uses of the C++ language. So if you are not tolerant of such perversions, please do not read this article.
At the moment, distributed systems are one of the most interesting topics, attracting a large number of people, including developers and scientists. The popularity is easy to explain: we need to create reliable fault-tolerant systems that provide a safe environment for performing various operations and storing data.

At the same time, maintaining the consistency of a distributed system plays an important role, and strong consistency comes at a considerable price. Today there are a number of solutions that provide the weakest form of consistency: so-called eventual consistency. On the one hand, such solutions have relatively good performance; on the other hand, they cannot be used in the many areas where transactional semantics of operations are required. The fact is that it is much easier to reason about a system using one of the strong consistency levels such as strict consistency or linearizability. Such consistency levels make it much easier to develop a reliable application with safe semantics for executing a sequence of operations.

Overview


As life shows, happiness depends less on external things than most believe.

William Cowper

Developing a distributed system most often relies on specialized services. These services should provide a convenient way to deal with the complexity caused by the asynchronous nature of distributed tasks and by various kinds of failures, including network problems, application crashes, and hardware failures. In a distributed environment such problems must not be regarded as something out of the ordinary; they should be treated as perfectly routine events. Thus the task of creating a reliable and consistent service for solving the problems that arise in distributed systems appears on the scene.

Modern systems use fault-tolerant services such as Zookeeper (mostly) or etcd (under active development). These services rely on distributed consensus algorithms, Zab (Zookeeper) and Raft (etcd), to make operations linearizable. The idea is as follows. In the first step a leader is elected; the elected leader (master) then arranges the messages into a certain sequence, which ensures the necessary level of consistency. Although the Zookeeper documentation claims that Zookeeper implements a primary-backup approach rather than state machine replication, the only essential difference between these approaches is that primary backup is based on the order produced by the replicas, while state machine replication is based on the sequence supplied by the clients. The important point is that both approaches agree on a sequence of deterministic operations using well-developed master-based consensus algorithms.

Discussion of existing approaches


It should always be remembered that we cannot control events, but must adapt to them.

Epictetus

The drawback of a master-based distributed consensus algorithm is obvious: handling the failure of the master takes a certain period of time. The timeout for detecting the master's failure cannot be very small, because that would hurt performance due to the high probability of electing a new master. At the same time, the timeout cannot be very large either, because that significantly increases the delay in handling the master's failure. So we in fact have a trade-off between the message-processing delay and the probability of re-electing the master, which in general depends on network conditions and on the performance of the cluster machines. Moreover, the performance of the consensus algorithm strictly depends on the liveness of the master, and recovering while maintaining data consistency sometimes takes considerable time. Such logic requires at least several network message exchanges plus the recording of incomplete operations, and this process does not guarantee convergence within any bounded period of time even in the absence of crashes, because each participant can declare itself the new master. Thus, for some operations the system may be unavailable for a relatively long period of time:
  1. Chubby: most problems lasted about 15 seconds or less, and 52 outages lasted around 30 seconds.
  2. MongoDB: the time varied, but electing a new master from the replicas could take up to a minute... While the master was being elected, the cluster was unavailable for writes.
  3. Zookeeper: a new leader was elected after about 15 seconds, and writes resumed. However, only clients that had access to one of the nodes [n3 n4 n5] could write; clients connected to the nodes [n1 n2] timed out while trying to reach the leader.

Transactional semantics and nontrivial scenarios


Logic is the art of making mistakes with self-righteousness.

J. U. Krach

Using transactional semantics for nontrivial logic is one of the most difficult problems. Let's assume that we have a reliable store like Zookeeper, and we would like to perform the following sequence of operations:
  1. Load some of the data from the store into process memory.
  2. Apply nontrivial logic to process the data and obtain a result.
  3. Save the result back to the store.

This scenario can be handled using several approaches.

Pessimistic locking


Pessimistic locking is based on explicitly locking the resource in use, similar to the mutex approach in multi-threaded applications. The task above can be solved by the following sequence of operations:
  1. Acquire an exclusive lock for the operation.
  2. Perform the operations described above (load, apply, save).
  3. Release the lock.

The disadvantages of this scheme follow directly from the exclusivity requirement:
  1. Exclusive locking significantly increases the waiting time for lock and unlock operations, which in turn worsens the total operation time.
  2. If the process crashes in the middle of the operations, we can potentially end up with inconsistent data (fortunately, Zookeeper can apply several operations atomically at the unlock stage). Detecting the process crash and subsequently releasing the resource requires additional time, which further increases the total operation time.

I would like to emphasize that systems like Zookeeper have no explicit lock and unlock operations. For pessimistic locking you need to use a special recipe, but it introduces an additional delay for this kind of transaction (see also: Addressing the ZooKeeper Synchronization Inefficiency).

This is where a different way of solving the problem appears on the scene.

Optimistic locking


The optimistic scheme attempts to circumvent the performance problems of the previous approach. The idea is to verify the actual state of the data before committing the operations:
  1. Load the current data state from the store.
  2. Locally apply the nontrivial logic and build a set of write operations.
  3. Atomically check that no other transaction has changed the data, and commit the set of write operations.
  4. If the check failed => repeat the operations, starting from the first step.

All actions in the third step, both the check and the commit, must execute atomically. Such a scheme can be implemented using an incremental version counter: on every successful update we increment the counter by one. The idea is to use a compare-and-swap operation that atomically verifies that the data version has not changed, which means the data itself has remained unchanged.

However, this scheme has drawbacks:
  1. Implementation complexity: the service must implement both the compare-and-swap operation and batch commit of writes, and it must be able to perform the two atomically.
  2. High cost under high contention: with a considerable number of simultaneous updates, the algorithm has to repeat its steps from the very beginning, wasting resources on conflicts caused by frequent data changes.

In addition, for both pessimistic and optimistic locking we must serialize our internal data into the hierarchical key space of the corresponding system (for example, Zookeeper "znodes" or etcd "nodes"). All of these facts complicate the application being developed and make the approach prone to various kinds of errors. Therefore, I would like to go in a completely different direction.

Concept of replicated object


One must approach the complex simply, otherwise we will never understand it.

Jiddu Krishnamurti

Let's take a step back and recall object-oriented programming (OOP). There we have the concept of objects. Each object owns certain data representing its state and contains a set of methods that transform the object from one state to another.

Thus, the idea is to replicate actions (object methods) between the nodes of a cluster instead of replicating the data itself (object state). These actions deterministically change the state of the object and create the illusion that the object itself is replicated. Linearizability ensures that all replicas accept the same sequence of actions and thereby obtain a consistent state of the distributed object. This is very similar to the state machine replication model; the only difference is that I use an ordinary object to represent the state and its methods to represent the events that transform the object. This mapping greatly reduces development complexity and allows using the power of the C++ language, since it supports OOP out of the box without bloating the code.

Properties of the replicated object


My replicated object (or simply replob) has the following properties:
  1. Embedded.
  2. Masterless.
  3. In-memory storage.
  4. Linearizable consistency.
  5. Per-process FIFO guarantee.
  6. Fast local reads.
  7. Concurrent flexible distributed transactions.
  8. Support for independent parallel transactions.
  9. Support for any native data structures.
  10. Tunable CAP.
  11. Graceful degradation of the replica set.
  12. Safety and liveness under various network problems:
    1. Network partitions.
    2. Partial "bridge"-style partitions.
    3. Temporary network instability.
    4. Partial redirection of network packets.


Below I will briefly review each item.
In the multitude of words sin is not lacking, but he who restrains his lips is wise.

Ecclesiastes

Embedded. This is not a standalone service. The functionality works inside the user process, which reduces the latency of operations by reducing the number of network messages between replicas. This approach completely eliminates the external dependency on services like Zookeeper or etcd and uses native interfaces, which seriously simplifies interaction with the replication logic, making it completely transparent to the user.

Masterless. The algorithm has no dedicated master (leader), so all nodes are indistinguishable from one another. This significantly reduces recovery delays after crashes and produces more predictable behavior in most cases.

In-memory storage. The current implementation has no persistence layer; every item is distributed among the replicas inside process memory. The algorithm, however, allows adding persistence for objects.

Linearizable consistency. The replicated object algorithm guarantees linearizability of operations.

Per-process FIFO guarantee. For a given process, all operations complete in the order in which that process scheduled them (FIFO order).

Fast local reads. A special mode allows reading data locally by relaxing the consistency level to sequential consistency. This significantly reduces latency and overall system load.

Concurrent flexible distributed transactions. Inside a transaction you can use a deterministic sequence of operations of any complexity. Such transactions are processed concurrently.

Support for independent parallel transactions. A user may have multiple instances of the consensus implementation to parallelize independent transactions.

Support for any native data structures. A developer can use any standard containers, such as std::vector, std::map, etc., as well as boost::optional, boost::variant, or other data structures that support copy semantics.

Tunable CAP. The user can choose between linearizability and availability when network partitions occur.

Graceful degradation of the replica set. The system maintains consistency even when the number of replicas decreases significantly. For example, the replica set can shrink from five replicas to two, and in some situations even down to one.

Safety and liveness under various network problems. Many different network problems exist (see Aphyr: The network is reliable); the algorithm preserves consistency and performance in those cases.

All of these points will be discussed in detail in subsequent articles.

Examples


You can be invincible if you never enter a battle where victory does not depend on you.

Epictetus

To demonstrate the flexibility and power of the approach, I will consider a fairly simple example.

Example: key-value store


Let's implement a replicated store with the following interface (I omit the std:: and boost:: namespaces for brevity):

struct KV {
    optional<string> get(const string& key) const;
    void set(const string& key, const optional<string>& value);

private:
    unordered_map<string, string> kv_;
};

For simplicity, I chose a symmetric interface: the set method removes the corresponding key if an empty value is passed. For an ordinary object, the implementations might look like this:

optional<string> KV::get(const string& key) const {
    if (kv_.count(key) == 0)
        return {};
    return kv_.at(key);
}

void KV::set(const string& key, const optional<string>& value) {
    if (value)
        kv_[key] = *value;
    else
        kv_.erase(key);
}

Now I would like to turn our ordinary object into a replicated object. For this, I just add:

 DECL_REPLOB(KV, get, set) 

Hint: the implementation of DECL_REPLOB is as follows:

 #define DECL_REPLOB DECL_ADAPTER 


Then I can use the following line of code to replicate data across my replica set:

 replob<KV>().set(string{"hello"}, string{"world!"}); 

After the KV::set call completes, all instances of the KV type in the replica set will contain the specified pair. Note that the instance is referenced simply through the KV type, which means each replica contains its own single instance of this object.

To read data with linearizable consistency, you would write:

 auto world = replob<KV>().get(string{"hello"}); 

But to improve the performance for this read operation, I just write:

 auto localWorld = replobLocal<KV>().get(string{"hello"}); 

That's it!

Transactions


Let's assume that I want to modify the value stored under the given key. The naive way is to write code like this:

auto world = replobLocal<KV>().get(string{"hello"}).value_or("world!");
replob<KV>().set(string{"hello"}, "hello " + world);

The problem is that two successive atomic operations do not add up to one atomic operation (a so-called race condition of the second kind). So we need to put all our actions inside a transaction:

MReplobTransactInstance(KV) {
    auto world = $.get(string{"hello"}).value_or("world!");
    $.set(string{"hello"}, "hello " + world);
};

Then all the specified actions will be applied atomically on all replicas.

Returning results from transactions


Consider the following task: we need to compute the size of the value stored under the given key. Nothing could be easier:

// use the local instance because we do not need to update the object
auto valueLength = MReplobTransactLocalInstance(KV) {
    return $.get(string{"hello"}).value_or("").size();
};

The same approach can be used to modify the object:

auto valueLength = MReplobTransactInstance(KV) {
    auto world = $.get(string{"hello"});
    $.set(string{"another"}, world);
    return world.value_or("").size();
};

All specified operations are applied on replicas atomically.

Transactions over multiple replobs


Let's assume that we have two independent key-value store instances: KV1 and KV2. We can combine operations on the different instances into one transaction using the MReplobTransact modifier:

// the first transaction is distributed:
// it copies the value from KV2 to KV1 for the same key
MReplobTransact {
    $.instance<KV1>().set(
        string{"hello"},
        $.instance<KV2>().get(string{"hello"}));
};

// the second transaction is applied locally:
// it returns the total value size for the same key
auto totalSize = MReplobTransactLocal {
    auto valueSize = [](auto&& val) {
        return val.value_or("").size();
    };
    return valueSize($.instance<KV1>().get(string{"hello"}))
         + valueSize($.instance<KV2>().get(string{"hello"}));
};

Need I mention that all these actions are performed atomically, with the first transaction being distributed and executed on all replicas?

Advanced example


Let's look at iterating through the collection with a user-defined function:

struct KV {
    optional<string> get(const string& key) const;
    void set(const string& key, const optional<string>& value);

    // generic method to iterate through the collection
    template<typename F>
    void forEach(F f) const {
        for (auto&& v: kv_)
            f(v);
    }

private:
    unordered_map<string, string> kv_;
};

Now the task is to return the total string size across all values:

auto valuesSize = MReplobTransactLocalInstance(KV) {
    size_t sz = 0;
    $.forEach([&sz](auto&& v) {
        sz += v.second.size();
    });
    return sz;
};

As you can see, everything is implemented quite straightforwardly.

Further directions


If you know in advance where you want to arrive, then steps in that direction are not an experiment at all.

Jiddu Krishnamurti

Earlier I looked at some simple but, in my opinion, rather interesting examples of using the replicated object approach. Subsequent articles will introduce the underlying ideas and concepts step by step:
  1. The God adapter.
  2. Non-blocking synchronization without locks, or the subjector model.
  3. Homogeneous actor model, or the functor model.
  4. Super-generic serialization.
  5. Behavior modifiers.
  6. IO and coroutines.
  7. Consistency and applicability issues of the CAP theorem.
  8. Phantom, replob, and a masterless consensus algorithm.
  9. Implementation examples:
    1. Atomic failure detector.
    2. Distributed scheduler.

Conclusions


Maturity is the transition from environmental support to self-support.

Frederick Perls

We have introduced an embedded fault-tolerant distributed replicated object that has many unusual properties. These properties can significantly reduce the complexity of creating reliable distributed applications and open up new horizons for applying such an object to a wide range of emerging tasks.

The masterless consensus algorithm makes it possible to handle various failure scenarios in a predictable way without significant time loss. The embedded approach eliminates network delays when reading data. At the same time, the strong consistency model provides a convenient way of interacting with the replicated object and also allows using it with maximal flexibility within distributed transactions.

I want to express my special thanks to Sergey Polovko , Yauheni Akhotnikau and Petr Prokhorenkov for helpful comments and advice.

Questions for self-test


The only difficulty is to ask the right question.

Frederick Perls

  1. How is DECL_REPLOB implemented?
  2. What is the difference between local and non-local operations?
  3. Is it possible to implement a consensus algorithm without a master?
  4. List all the behavior modifiers mentioned in the article.

Bibliography


[1] Documentation: Zookeeper .
[2] Documentation: etcd .
[3] Article: Zab: High-performance Broadcast for Primary-Backup Systems.
[4] Article: In Search of an Understandable Consensus Algorithm (Extended Version).
[5] Zookeeper Documentation: Zab vs. Paxos
[6] Article: The Chubby Lock Service for Loosely-Coupled Distributed Systems.
[7] MongoDB Documentation: How long does replica set failover take?
[8] Aphyr blog: Zookeeper .
[9] Documentation: ZooKeeper Recipes and Solutions: Locks .
[10] Article: Addressing the ZooKeeper Synchronization Inefficiency .
[11] Documentation: Zookeeper znodes .
[12] Documentation: etcd nodes .
[13] Wikipedia: State Machine Replication .
[14] Aphyr blog: The Network Is Reliable .
[15] Article: ZooKeeper: Wait-Free Coordination For Internet-Scale Systems .

Source: https://habr.com/ru/post/267509/

