📜 ⬆️ ⬇️

How does the search in Kad Network

Quite often there are complaints that there is no keyword search in Mainline DHT bittorrent. Usually requests to add such a search on the forums BitTorrent Inc. ignored or received the traditional answer that DHT does not allow searching by keywords, but only by individual keys. This is true in principle, but there is such a way out of this regrettable situation as creating an inverted index of keywords for each node.

Actually, the search in Kad Network, the implementation of a distributed hash table based on the rather widely used Kademlia protocol, works. Kad Network is used by programs such as eMule, iMule, aMule and MLDonkey to search for file hashes by keywords and file sources by their hashes.

General description of the search process in the Kad Network.
(for a better understanding, it is useful to understand how Kademlia works)

So, every node on the Kad Network has a 128-bit ID generated randomly each time it logs on to the network. For hashing files and words, an algorithm based on MD4 is used, which gives the distribution of key hashes in the same space as the ID of all nodes. Each node is responsible for storing several types of information. First, each node stores a source map, i.e. information about network members who are willing to share a specific file. A map is a hash of a file as a key and a list of sources as a value. The node source map contains contact information on the hashes of all files whose hash matches or is close to the node ID (XOR-metric is used). So, having a file hash, you can find nodes with IDs close to the file hash (a specific search mechanism roughly corresponds to what is used in Kademlia) and request a list of sources from them for downloading the file. Secondly, if there is no file hash, you can search for it by keyword. To do this, the name of the published file is divided into several consecutive words, and each word is hashed separately. In order to search for these keywords, each node has a second card that contains hashes of words close to the node's ID, as keys and file hashes, and their names as values.
')
Now in more detail how search is carried out specifically in eMule.

When the search process is initiated, the corresponding object is created and entered into the search card. Each search query has a target key (file hash, word hash) that is searched on the Kad Network. Search queries are carried out using two parallel processes.
The first process searches for nodes with IDs close to the key-hash of the desired object. To find the necessary nodes, the searching node takes 50 nodes with the ID closest to the key of the desired object from its routing table and places them on a temporary map. Then the searching node starts to refer to these 50 nodes in order to get the nodes that are closer to the key of the desired object. Only 3 nodes are polled at a time. The seeking node sends a KADEMLIA_REQ message with parameters such as the object key (word hash, file hash or node ID) and the number of requested nodes (i.e., how many nodes the requested node should return as a response). The last parameter depends on the type of search query. If it is a node search, then the parameter is 11, if it is a source search or a word search, then - 2. When a node receives such a message, it responds with a KADEMLIA_RES message with the requested number of contacts (node). When the searching node receives this message, it adds the received contacts to its routing table and to the temporary map of the current search. Then the searching node again polls the three nodes closest to the object's key, but only if their IDs are closer to the key than the ID of the nodes that provided their contact information. This procedure is repeated until there are new nodes in the temporary map whose ID is closer to the key of the object to be searched for. If the nodes do not respond to the KADEMLIA_REQ request, they are deleted.

The second process sends the nodes that responded with the KADEMLIA_RES message, starting from the closest one, to the corresponding search query, depending on what is being searched. This can be SEARCH_REQ for searching sources or files by keywords, PUBLISH_REQ for publishing sources or keywords, etc. In these requests, an additional flag is used, taking values ​​of 0 or 1, to determine what is being requested - sources or hashes with names by keywords. When a node receives such a search request, it must process it (look for the requested information in its keyword map or source map) and respond with a SEARCH_RES message with the selected results. For each type of results there is a limit value at which the search is terminated. For a source search or a keyword search, this limit value is 300 received original objects. Also, the search for sources or files is stopped if 45 seconds have passed since the beginning of the search.

Useful articles:
Petar Maymounkov, David Mazières "Kademlia: A Peer-to-Peer Information System Based on the XOR Metric".
Detailed description of the Kad Network protocol:
David Mysicka "Reverse Engineering of eMule: An Analysis of the Implementation of Kademlia in eMule".
Stefan Schmid, Thomas Locher "When KAD meets BitTorrent - Building a Stronger P2P Network".

Source: https://habr.com/ru/post/187470/


All Articles