Kademlia DHT: Basics

Hello!
In this article, as well as, I hope, in the subsequent ones, I want to talk about one of the modern structured peering networks. This material includes my processing of documentation, descriptions and articles found on the topic. As an introduction, a general brief theory of p2p networks, DHT, is presented, and only then follows the main part to which the note is devoted.

1. General theory of p2p

The distribution of data, processor time and other resources among users is forced to seek solutions for the organization of networks of different levels and interaction.
The client-server interface is replaced by p2p (point-to-point) networks in order to provide a greater level of scalability, autonomy, anonymity, and fault tolerance.
Next will be a general classification.

The degree of centralization can be divided into the following types of p2p networks:

Centralized
Hybrid
Fully decentralized

Centralized - Networks that use one or more servers for routing and searching. The infamous Napster is a classic example of a centralized p2p network.
')
Cons of this type of network:

The server represents a network bottleneck. Failure leads to a complete loss of system performance.
Legislative Server Vulnerability
With a significant increase in the network population, the server will undergo a kind of DDoS attack, leading to its failure.

Pros:

The whole process from the search to the receipt of the file goes as short as possible: a search query to the server, the output of the results from the server, the connection to the desired node.
It is quite possible to search not only for exact matches, which is important, as will be shown later.

Hybrid - Networks that contain two types of nodes: general purpose and super nodes (Super Peer). The latter are assigned dynamically depending on certain conditions and allow you to control the routing and indexing of data on the network.

The disadvantages and advantages depend on the degree of “hybridity” - which characteristics it inherits from centralized, and which from decentralized (which will be discussed later).

Decentralized - This type of network implies a complete lack of servers. Thus, the bottleneck is excluded from the network.

Minuses:

Certain routing and search algorithms are needed, sometimes not guaranteeing the reliability of the result.
To be included in such a network, you need to know the coordinates of at least one node; therefore, lists with a certain amount of address data of network participants must be published in publicly available sources (for example, on websites). The process of connection itself implies the bootstrap stage, so such lists are also called bootstrap lists.

Pros:

Excluding the server allows the network to be fault tolerant, even with a rapidly growing / falling number of participants.
A high degree of security from censorship

Decentralized networks can be divided into structured and unstructured. In the first case, the network topology is built according to certain rules, allowing to organize a quick data search, but, alas, only by exact coincidence. Each node, in fact, is responsible for its own data area (how such areas are allocated, their type, the design of routing tables - all this depends on the specific network topology). In unstructured networks, it is not known in advance where to send a request, therefore, in the simplest version, the option of flood requests is used: the node sends the request to its neighbors, its own, etc.). It is important that TTL (Time To Live) is defined for such requests, which allows them to be cut off after a certain number of hops over the network. Obviously, with low TTL, the network will produce false search results (without reaching some sources), and with high TTL, the time for requests, and the amount of traffic may inadmissibly increase. However, unlike structured networks, it is possible to search not only by exact matches, which is important for file-sharing systems. The scalability of unstructured networks, if not absent, is very problematic (the presence of TTL and the increasing time of data retrieval).

When designing the topology and protocols of structured networks, it is considered optimal to perform the following relations:
- The size of the routing table at each node: O (log (n))
- Search complexity: O (log (n))

2. DHT

DHT (Distributed Hash Table) - distributed hash table. This structure is often used for decentralized topologies. However, this is not the only choice (as suggested by the literature, however, it is better to do further independent search here).
For each value (data) on each node, a hash is calculated according to certain rules (for example, using SHA-1), which becomes the key. Also, a node identifier (the same length as the hash, and often the same function) is calculated. Thus, each network node has its own identifier. Keys are posted online. The node is responsible for the keys of the table, which are close to it by a certain metric (i.e. distance), here it means “similarity” of the key to the identifier, if you omit the language of mathematics. Due to this, each node occupies its own area of responsibility when storing keys and related information (coordinates of the node where the value is located). This allows a certain way to organize the search algorithm by exact values (first, calculate the key of the value to search for, for example, the file name, and then search for this key by sending requests to the node responsible for it through intermediaries providing the full path).
DHT fully provides fault tolerance and scalability. In combination with Peer Exchange, for example, DHT allows you to intercept the functions of a Torrent tracker.

3. Kademlia

Metrics

The creators of this topology are Petar Maymounkov and David Mazières. The basis of the protocol and table construction is the determination of the distance between the nodes through the XOR-metric:
d (x, y) = x XOR y. It is important to note that the distance is the result of the XOR operation, interpreted as a number, not the number of different bits (such criteria can also generate a metric in the key / identifier space). Those. with a key length of 4 bits: d (2.5) = 0010 XOR 0101 = 0111 = 7.
The XOR metric satisfies all the axioms of the metric:

d ( x , y ) >= 0 // d ( x , y ) == 0 <=> x == y d ( x , y ) = d ( y , x ) // d ( x , y ) <= d ( x , z ) + d ( z , y ) //
d ( x , y ) >= 0 // d ( x , y ) == 0 <=> x == y d ( x , y ) = d ( y , x ) // d ( x , y ) <= d ( x , z ) + d ( z , y ) //
d ( x , y ) >= 0 // d ( x , y ) == 0 <=> x == y d ( x , y ) = d ( y , x ) // d ( x , y ) <= d ( x , z ) + d ( z , y ) //
d ( x , y ) >= 0 // d ( x , y ) == 0 <=> x == y d ( x , y ) = d ( y , x ) // d ( x , y ) <= d ( x , z ) + d ( z , y ) //

This property is noted in connection with the possibility of formal proof of theses on the size of the routing tables and the complexity of the search. (In fairness, provided in the form of an outline).

Routing table

The algorithm for constructing routing tables is based on calculating the distances between nodes (by applying a metric to their identifiers).
Each i-th column of the table is responsible for storing information about nodes at a distance from 2 ^ (i + 1) to 2 ^ i. Thus, the last column for node 0111 may contain information about any nodes 1xxx, the penultimate one about nodes 00xx, another level closer - about nodes 010x.

Of course, in a real network, the key length is usually used at 128 or 160 bits. Depends on the implementation.

Further, a restriction is introduced on the number of stored nodes at each level, K. Therefore, the columns of the table bounded by such K are called K-buckets (unfortunately, the Russian equivalent sounds quite unintelligible).

If we look at the binary search tree, in the sheets of which the node identifiers lie, it becomes clear that such a K-buckets structure is not random: each node (and by example 110) gains knowledge of at least one participant of each subtree (for 110 highlighted with filled ovals). Thus, each K-bucket reflects the connection of a node with no more than K participants of the subtree at level i.

Protocol

The Kademlia protocol contains 4 types of messages:

Ping
STORE
FIND_VALUE
FIND_NODE

PING is needed to verify the existence of a specific node on the network. For example, it is used when trying to add a node to the K-bucket when it is already full: existing nodes are polled, if there is no answer, the node is inserted. It is worth noting that the older nodes are at the bottom of the K-bucket, this is based on experimental data, indicating that the longer the node remains in the network, the less likely it is to leave. Therefore, they will be interviewed only after the newer ones.

STORE request that allows you to place information on a given node. Before executing a STORE, a node must find the K values closest to the key that it intends to publish, and only then send each STORE with a key and its own coordinates. Storage directly on the K nodes allows you to increase the availability of information.

FIND_VALUE request, which is often repeated to form an iteration, allowing you to find a value by key. Implements searching for a value in a network. Returns either the K nearest nodes, or the value itself, depending on the achievement of the node storing the desired data. The iterations stop either when the value is returned (success), or when the already polled K nodes are returned (there is no value in the network).

The FIND_NODE query used to search for the nearest K to a given identifier (behavior similar to FIND_VALUE, only never returns a value, always nodes!). It is used, for example, when a node joins the network as follows:
- Contact with bootstrap knot
- Sending the FIND_NODE request with its ID to the bs node, further iterative sending
- Update K-buckets (this updates both the K-buckets of the new node, and all those that the request went by (here we can see that the XOR metric is symmetrical))

As you can see, the protocol practically does not cover security issues in its specification, which is reflected in research and work on the attack of Kademlia-based networks.
Of the features, it is worth emphasizing the presence of replication, the lifetime of the value (the need to re-allocate after a certain period of time), parallelism when sending FIND_ * requests in order to reach the necessary nodes in a shorter time (denoted α, in implementations is usually 3). Each time a request passes, an attempt is made to update the K-bucket, which allows keeping the routing tables as optimal as possible.

4. Implementations

First of all, the most famous network based on Kademlia is Kad Network, available at aMule / eMule . Bootstrap uses existing nodes in eDonkey.
Khashmir - Kademlia is a Python implementation used in BitTorrent
Of the now developing and active libraries, I would note
maidsafe-dht - C ++ infrastructure implementation with support for UDT and NAT Traversal .
Mojito - Java library from LimeWire.

5. What's next?

I would like to receive comments on what to go into more detail, because The article is extremely superficial. In order to arouse interest, and not to lay out all the cards at once on the table. I myself plan in the next article about Kademlia to talk about security problems, attacks and their prevention.

6. Materials and links

P2P (ENG wikipedia)
DHT (ENG wikipedia)
Kademlia (ENG wikipedia)
Kademlia: A Design Specification
Kademlia: A Peer to Peer Information System Based on the XOR Metric
Theotokis, Spinellis. A Survey of Peer-to-Peer Content Distribution Technologies
Peer-to-Peer Computing. Principles and Applications. // Springer, 2010

Source: https://habr.com/ru/post/107342/

All Articles