
WikiGoogleMetaTracker DNS


One of the main obstacles to creating the truly decentralized Internet that so many people (myself included) long for is the problem of searching, classifying, naming and organizing the sites and files on the network. It is not enough to create an environment for decentralized processing and storage of information; you also have to make it convenient to find the pieces of data you need in that environment.

Otherwise we get what we fought for, and then trip over it. Instead of an abundance of information you get a cesspool. The amount of information on the network is such that the inability to find the data you need is equivalent to its complete absence. The signal drowns in noise. And it would be fine if the noise were white, but it often has a pronounced color, political or commercial.

Today the functions of searching and structuring information on the Internet are distributed among several subsystems. They are very different: some of them appeared at the dawn of the Internet, some are booming right now.

A historical digression from Captain Obvious


The oldest of them is DNS. I often catch myself confusing the concepts of a site's "name" and "address", although the address answers the question "where?" and the name the question "what?". IP address and domain name. DNS is a naming system, not a resource addressing system.
You can look at DNS as a huge but very terse directory of Internet resources. Everything is neatly laid out on hierarchical shelves; each DNS record contains the name of the site, the name and address of its owner (not necessarily real ones) and some service information that associates the site's name with its address. Perhaps this was a great idea at the dawn of the Internet, when there were few websites and they were used almost exclusively by scientists and programmers who, if need be, could find their way around by IP address. Now that sites number in the tens and hundreds of millions, and the average user not only does not remember the IP addresses of favorite sites but does not even know what an IP address is, an impermissibly large amount of power has been concentrated in the hands of DNS administrators. Snip! And the site is simply gone. If a site loses its hosting, then with sufficiently skilled admins almost nobody will notice anything, a few hours of downtime at most. The loss of a domain is a catastrophe on a much larger scale.

If DNS is a very terse directory, then directories like Yahoo contained much more useful information. But this advantage drove them into the grave. When sites began to spring up like mushrooms after rain, keeping the catalog current and accurate became nearly impossible. Catalogs were replaced by search engines. In them, the cumbersome system of sections, categories and subcategories gave way to the search query, and manually curated entries were replaced by an automatically built index.

The first generation of search engines solved the problem of scaling the "Internet filing cabinet" at the cost of degrading its quality. Cunning webmasters quickly mastered techniques for gaming the search engines by manipulating keywords on a site. And it was here that the first signs of decentralization appeared on the scene, in the form of PageRank. The brilliant idea of ranking search results not only (and not so much) by the content of the site itself, but also by user behavior, expressed as links to the site from third-party resources, became the foundation of Google's power.

Notice how similar the way a browser works with DNS is to the way a user works with Google. In both cases there is a line of input that answers the question "What?". This line is sent off for processing to DNS servers or Google's data centers, and they answer the question "Where?" that very "What?" can be found.

But what if we combine the ideal clarity and simplicity of a hierarchical catalog with the scalability and self-organization of the "swarm intelligence" of the user masses? That is how Wikipedia appeared. It does not seem to be designed for searching, but personally I know no more convenient way to get a selection of high-quality links on any topic than to walk through the relevant sections of Wikipedia. One of Wikipedia's basic principles, that articles must not contain original research or information without references to sources, transparently hints that Wikipedia is essentially a directory. It is simply far more universal and describes not only Internet resources but also real-world objects, which masks its similarity to DNS and search engines.

In the world of file sharing, the evolution of approaches to finding and filtering the files you need went a little differently. In the prehistoric era utter chaos reigned. Within their own data center or computer people sometimes maintained some order, in keeping with personal taste or company standards. Overall, it was one global file server. The first attempts to bring order at the scale of the entire Internet ran into the unfortunate fact that many files were protected by copyright, and their owners were in no hurry to facilitate free access to their "treasures"... So Napster died. On its wreckage grew torrent trackers, more resistant to administrative pressure. Fully decentralized file storage and distribution is a much tougher nut to crack for censorship and for embattled copyright holders. However, the "the scooter isn't mine, I just posted the ad" approach does not save you one hundred percent. Trackers are still pressured through the same DNS and through Google (which, although it uses a decentralized approach to computing PageRank, can "correct" its results under sufficient government pressure or the threat of large lawsuits).

A semi-finished search product


So if we want to create a decentralized Internet, it is not enough (and maybe not even that necessary?) to make a P2P version of DNS. We need P2P search, P2P directories and P2P trackers. At first glance this is a very difficult task. But there is one small nuance. Creating a universal, one-and-only true system for searching, classifying and filtering content is not just overwhelming, it is an unsolvable task: nobody would trust such a system. In a decentralized environment there can be no "only true" solutions. We need competition between algorithms, and we need choice, so that users do not feel they are under surveillance. Paradoxical as it may sound, this greatly eases the task.

The implementation details of different search engines can differ quite a bit, but they all use roughly the same source data. Typical search queries, the grammar of the language, the global graph of pages linking to each other, visit statistics: none of this depends on a specific engine. Or, for communities and social networks with a rating or karma system, the social graph and the history of all the upvotes and downvotes. Or, for torrent trackers, the number of downloads, seeders and leechers for each torrent and each participant's upload/download ratio. From this dry raw data you can compute PageRank, citation indexes, ratings, bonuses, recommendations and much more.
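As a minimal sketch of what "computing a metric from the shared raw data" might look like, here is a toy PageRank-style iteration over a hypothetical shared link graph. The graph itself is the raw metadata; the scoring function is just one of many possible consumers (dangling nodes are ignored for brevity).

```python
# Toy PageRank-style scoring over a shared link graph (illustrative data).
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Base probability of a random jump, then redistribute link weight.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical raw data: which resources link to which.
link_graph = {
    "site-a": ["site-b", "site-c"],
    "site-b": ["site-c"],
    "site-c": ["site-a"],
}
print(pagerank(link_graph))
```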

Just as boxing has many versions and formats of competition built on the same basic rules and technique, all those WBAs, WBOs and WBCs, so a decentralized Internet can provide a common basic metadata format on top of which different search, ranking and filtering algorithms compete with each other. And just as boxing holds unification bouts between champions of different versions, the user must be allowed to choose which algorithm processes the publicly available basic metadata. Then it will immediately become clear who is who. There will be healthy competition between the algorithms. All the algorithms will be open, because if the code is executed in a distributed fashion on client machines, obfuscating and hiding it is Sisyphean labor. That means there will be no grounds for suspecting that somebody is counting unfairly. Still in doubt? Go into the settings, switch to another algorithm and check.
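A sketch of what such client-side, user-selectable ranking could look like, assuming every algorithm consumes the same public metadata records and returns a score (all names and fields below are illustrative, not a proposed API):

```python
from typing import Callable

Metafile = dict                      # shared metadata record: tags, counters, votes, ...
Ranker = Callable[[Metafile], float]

def rank_by_downloads(meta: Metafile) -> float:
    return float(meta.get("downloads", 0))

def rank_by_votes(meta: Metafile) -> float:
    votes = meta.get("votes", [])
    return sum(votes) / (len(votes) or 1)

# The user picks one of these in the settings; all of them see the same data.
RANKERS: dict[str, Ranker] = {
    "popularity": rank_by_downloads,
    "community": rank_by_votes,
}

def search(metafiles: list[Metafile], algorithm: str) -> list[Metafile]:
    ranker = RANKERS[algorithm]
    return sorted(metafiles, key=ranker, reverse=True)

corpus = [
    {"name": "ruby-intro", "downloads": 120, "votes": [1, 1, -1]},
    {"name": "php-holywar", "downloads": 300, "votes": [-1, -1, 1]},
]
print([m["name"] for m in search(corpus, "community")])
```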

Such a modular architecture for search, rating or recommendation systems is convenient in another case too. If we use the mass of users' subjective opinions as source data, whether indirectly (counting links, purchases, friends, downloads) or directly (counting pluses and minuses in karma and ratings), we will inevitably run into situations where the system produces a result based on opinions that are completely irrelevant to the matter at hand. For example, I am learning Ruby. Somewhere nearby a Habr user who knows Ruby well is writing an article that would interest me. But he also has a vile temper and bad manners. He barges into a flame war in a topic about PHP, flies into a rage, calls everyone code monkeys and swears profusely. The ending is rather predictable: his article will not be published. Of course he has only himself to blame, but what do I care about his manners? I am interested in Ruby, and the system has blocked the publication of the article because of his bad upbringing and lack of restraint.

In a modular system I would be able to choose which set of source data to feed to the algorithm. For example, with the help of tags I could restrict it to the circle of users whose interests, knowledge and skills are relevant to my query. Today this is only possible by partitioning the Internet with impenetrable bulkheads into small pieces where, by and large, everyone shares the same interests. Our hypothetical ill-mannered Rubyist, for example, would not collect a pile of downvotes inside a narrow community of Ruby programmers. But small communities are less viable and stew in their own juice. With a modular approach I could look at the same material through the eyes of different people while staying in the same familiar environment.
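A minimal sketch of that idea, assuming each vote in the shared metadata carries the voter's declared interest tags (a hypothetical field): the reader's client simply ignores votes from users whose tags do not overlap the query.

```python
# Feed the rating only with votes from users whose interests overlap the reader's tags.
def filtered_score(votes: list[dict], reader_tags: set[str]) -> float:
    relevant = [v["value"] for v in votes
                if reader_tags & set(v.get("voter_tags", []))]
    return sum(relevant)

votes = [
    {"value": -1, "voter_tags": ["php"]},          # downvote from the PHP flame war
    {"value": +1, "voter_tags": ["ruby"]},
    {"value": +1, "voter_tags": ["ruby", "rails"]},
]
print(filtered_score(votes, {"ruby"}))             # -> 2: the PHP-side downvote is ignored
```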

What can a generic metadata format look like?


How do we combine files, people, communities and sites into a single structure? Obviously, we need to climb one level of abstraction higher. What do all these entities have in common? At a minimum, they are resources that live on the Internet. Or at least they are represented on the Internet. Sound familiar? Resources and their representations, as in the REST model? Say the film "The Delta Force" with Chuck Norris and a nasal one-voice dub is sitting somewhere in a file-sharing network. Strictly speaking, a movie, like any REST resource, is an abstract concept; it exists only in the form of its representations. For example, the theatrical, director's and censored cuts. Or the original, the dubbed version and translations into different languages. Or files with different resolutions and bitrates. All of these are representations of the same resource. The resource answers the question "What?", its representation the question "Which one?".

This scheme lets you derive, from basic technical metadata such as the download counts of individual files, the popularity of the resource as a whole: just add up the counters of all the resource's representations. It also lets you rate resources flexibly: for example, I give +1 to the resource and to one of its representations ("great movie!") and -1 to another representation ("the director's cut drags"). Or another example: +1 to the resource ("interesting article") and -1 to a representation ("eye-gouging layout").
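A small sketch of the two-level idea with made-up data: representation counters roll up into the resource's overall popularity, while votes can target either level independently.

```python
# Hypothetical representation-level counters and votes for one resource.
representations = {
    "delta-force-720p.mkv":   {"downloads": 4200, "votes": [+1, +1]},
    "delta-force-dircut.avi": {"downloads": 310,  "votes": [-1]},   # "too long"
}
resource = {"title": "The Delta Force", "votes": [+1, +1, +1]}      # "great movie!"

# Resource popularity is just the sum of its representations' counters.
resource_downloads = sum(r["downloads"] for r in representations.values())
resource_rating = sum(resource["votes"])
print(resource_downloads, resource_rating)
```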

When a new object appears on the network (for definiteness, let it again be a movie), a metadata file with information about that representation is created automatically. If the representation exists in a single copy (we have published something absolutely unique), a second, resource-level metafile is created as well. If other representations of this resource already existed on the network, information about the new representation is recorded in the existing resource metafile.

There is an unbreakable link between an object and its metafile. For example, the metafile stores the object's hash, and the metafile's name is derived from that hash in a standard, publicly known way, say by feeding the hash back into the hash function. Thus, given a completely unidentified file that was downloaded from the Internet long ago and renamed several times, you can compute its hash and retrieve all the metadata about it. Or convince yourself that the file is unique and that another copy exactly like it most likely exists nowhere in the public domain.
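A minimal sketch of that naming rule, assuming SHA-256 as the hash function (the article does not fix a particular one): anyone holding the bytes of a file can compute the name of its metafile and ask the network for it.

```python
import hashlib

def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def metafile_name(file_hash: str) -> str:
    # Re-feed the hash into the hash function to get the metafile's name.
    return hashlib.sha256(file_hash.encode()).hexdigest() + ".meta"

data = b"contents of a file downloaded and renamed several times"
h = content_hash(data)
print("ask the network for:", metafile_name(h))
```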

What should be stored in this metafile? The size, the type, the date of publication, the hash of the file itself, a link to the parent resource metafile... what else? Since we are talking about a distributed network, it would be very useful to store information about the nodes that hold copies or fragments of the file, a download counter, an estimate of the file's availability, the approximate speed at which it can be downloaded and so on, so that every file is its own tracker. We will also, of course, need the history of ratings for the representation and a list of external links (in the case of anonymous publication and covert distribution these two sections will naturally be nearly empty). If the file is not meant for public viewing, an access control list is needed. For a text file, an index or an automatically generated tag cloud would not hurt, plus an additional tag cloud built from people's reactions.
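Put together, a representation-level metafile might look something like the JSON sketch below. Every field name here is an assumption for illustration; the article deliberately leaves the exact format open.

```python
import json

representation_metafile = {
    "hash": "9f86d081...",                 # content hash of the file itself
    "size": 734_003_200,
    "type": "video/x-matroska",
    "published": "2011-02-22",
    "resource": "b2c3d4...meta",           # link to the parent resource metafile
    "peers": ["node-17", "node-42"],       # nodes holding copies or fragments
    "downloads": 4200,
    "availability": 0.97,
    "vote_history": [{"user": "anon-1", "value": +1}],
    "external_links": [],
    "acl": None,                           # public; otherwise an access control list
    "auto_tags": ["action", "1986"],       # generated from the content
    "human_tags": ["chuck norris"],        # built from people's reactions
}
print(json.dumps(representation_metafile, indent=2))
```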

At the resource level, in addition to ratings, overall statistics and links to the underlying metafiles, we have almost pure semantics: tags and ratings common to all representations, plus fields that depend heavily on the resource type: name, title, gender, age, author, genre and so on. The resource level is a bridge between the semantic web, with its ontologies and logical inference, and more technical, elementary things such as ratings, links and hit counters. Traffic on this bridge flows both ways. Knowing a hash, you can obtain all the information about a file by walking the chain up to the semantic level. Having only the most general search query, you can walk down to a specific representation of a file.
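The matching resource-level metafile could then be roughly this (again, field names are illustrative, not a proposed standard): mostly semantics and aggregates, with links down to the representation metafiles.

```python
resource_metafile = {
    "title": "The Delta Force",
    "type": "movie",
    "author": "Menahem Golan",
    "genre": ["action"],
    "tags": ["chuck norris", "1986"],       # common to all representations
    "rating": +3,                           # resource-level votes
    "total_downloads": 4510,                # sum over all representations
    "representations": [                    # links to the lower-level metafiles
        "delta-force-720p.meta",
        "delta-force-dircut.meta",
    ],
}
print(resource_metafile["title"], resource_metafile["total_downloads"])
```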



Since the set of metadata can differ greatly between resource types, there is no point in standardizing the metafile format as a whole. It is enough to agree on standard names for the different kinds of metadata, and a search algorithm will simply ignore metafiles that lack the fields it needs. Much as in dynamically typed programming languages, what matters is not so much an object's type as the set of methods it implements.
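A tiny sketch of that "duck typing" for metafiles: a ranker declares which fields it needs and silently skips records that lack them (field names are, as before, assumptions).

```python
REQUIRED = ("bpm", "downloads")          # fields this particular ranker needs

def rank_by_tempo(metafiles: list[dict]) -> list[dict]:
    usable = [m for m in metafiles if all(k in m for k in REQUIRED)]
    return sorted(usable, key=lambda m: (m["bpm"], m["downloads"]), reverse=True)

corpus = [
    {"title": "track-a", "bpm": 174, "downloads": 900},
    {"title": "track-b", "downloads": 5000},           # no BPM field: simply ignored
]
print([m["title"] for m in rank_by_tempo(corpus)])
```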

Who will fill these metafiles with information? Everyone, a little at a time. It is worth emphasizing that metafiles do not belong to the owner of the resource; the owner can influence their content only in a very limited way. Only the access control lists are fully under the owner's control. Many things, such as statistics, file size, publication date, the peer list, the index or the tag cloud, can be filled in and updated automatically. Others, such as links to parent or child metafiles or manually created tags, are initially filled in by the owner at publication time, but their content may later change under the influence of users. Still others, such as ratings, are closed to the file's owner from the start and are controlled entirely by the community.

If the format of the metafiles is not rigidly specified and anyone with access can write information into them, how do we protect them from accidental or malicious pollution? Where modification of existing fields is concerned, the correctness of the information is guaranteed by the consensus of several network nodes that double-check one another.
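One possible shape of that cross-checking, sketched under assumptions (the quorum rule, majority threshold and node behavior are mine, not the article's): a field update is accepted only if most of the replicas report the same value.

```python
from collections import Counter

def accept_update(reported_values: list[int], quorum: float = 0.5) -> int | None:
    """Return the value a majority of replicas agree on, or None if there is no consensus."""
    value, count = Counter(reported_values).most_common(1)[0]
    return value if count / len(reported_values) > quorum else None

# Three nodes recount a download counter; one is lagging (or lying).
print(accept_update([4211, 4211, 4190]))   # -> 4211: accepted
print(accept_update([4211, 4190, 4050]))   # -> None: recheck before writing
```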

And what if someone comes up with a search or ranking algorithm that needs completely new characteristics of a resource? One could, of course, introduce some authoritative body that approves standard sets of metadata for each resource type. But that is exactly the kind of scheme we are trying to get away from, isn't it? How else can we determine which records in a metafile are useful and in demand, and which are obsolete or were never needed at all?

Let's model the situation. Suppose someone decides that it would be nice to store, in the metafiles of music resources, information about the tempo of the music in beats per minute (BPM). He cannot simply start adding a new field to music metafiles all over the Internet. But he can try to prove that this information is in demand, for example by building a music search system based on BPM. At first he will have to collect and store the metadata himself. If people like the new feature, they will request tempo information more and more often, and it will settle in their caches. Once the tempo information for a given work accumulates a threshold number of requests (which is easy for us to determine: since this information is not yet part of the metafile, it is a resource in its own right, and the network keeps access statistics for it), we include it in the metafile. Every such inclusion is further evidence that the information is in demand, and we can gradually lower the threshold, up to the point where we start requiring the tempo to be specified right at publication time. BPM becomes a de facto standard metafile field.
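As a sketch (the threshold value and function names are arbitrary), the promotion rule boils down to this: an experimental attribute lives outside the metafile until it has been requested often enough, then it is written in.

```python
PROMOTION_THRESHOLD = 1000          # requests needed before the field is included

def maybe_promote(metafile: dict, field: str, value, request_count: int) -> dict:
    """Write an experimental field into the metafile once demand crosses the threshold."""
    if field not in metafile and request_count >= PROMOTION_THRESHOLD:
        metafile[field] = value
    return metafile

track = {"title": "track-a", "downloads": 900}
track = maybe_promote(track, "bpm", 174, request_count=1337)
print(track)   # {'title': 'track-a', 'downloads': 900, 'bpm': 174}
```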

Unneeded metadata can be removed in the same way: if nobody has requested it for a very long time, it is time to start cleaning up the metafiles.

Thus most of the non-specific information can be collected and stored independently of any particular search engine or reputation system. On the one hand, we free the creators of search and recommendation algorithms from a heap of preliminary work collecting and storing data; on the other hand, we do not burden ourselves with creating and debugging those algorithms, limiting ourselves to mechanically filling the metafiles with raw data.

From files to sites and people


Each site is a composite resource containing a collection of nested resources. Its rating or reputation should depend on the aggregate indicators of those subordinate resources. The very concept of a site erodes considerably in such a scheme: decentralization, after all!

But for now websites remain the main way of structuring the Internet, and little will change in the foreseeable future. And for a site the question of its name is very acute. While the name of a file somewhere on the network bothers nobody, a simple and memorable site name is very important. The existing domain name system works, if barely. What can be offered in its place?

An entry with the site's name in the resource-level metafile? Quite possibly. After all, what is the main problem with DNS? That a name server record carries practically no useful information. From a site's name alone it is impossible to tell whether the name matches the content, how popular the site is, or who owns it. In our metafiles all of that is present. So it will be impossible to mislead anyone by assigning the name "microsoft.com" to your own site: none of the clones will be able to come anywhere near the reputation of the original. And if one can, it means the original has problems that simple measures will not cure.
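A rough sketch of how a client could resolve such a name through metafiles instead of DNS, under assumptions of mine (a "reputation" aggregate and a dominance margin): among all resources claiming a name, pick the one whose reputation dwarfs the rest, and report ambiguity otherwise.

```python
def resolve(name: str, metafiles: list[dict], margin: float = 10.0):
    """Return the dominant claimant of a name, or the full list if no one dominates."""
    claimants = [m for m in metafiles if m.get("name") == name]
    if not claimants:
        return None
    claimants.sort(key=lambda m: m["reputation"], reverse=True)
    if len(claimants) > 1 and claimants[0]["reputation"] < margin * claimants[1]["reputation"]:
        return claimants            # ambiguous: let the user (or context) decide
    return claimants[0]

sites = [
    {"name": "microsoft.com", "reputation": 9_000_000, "address": "resource-1.meta"},
    {"name": "microsoft.com", "reputation": 12, "address": "clone.meta"},   # impostor
]
print(resolve("microsoft.com", sites)["address"])   # -> resource-1.meta
```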

Cybersquatting likewise loses all meaning. A lively, actively developed resource, or the official site of a well-known company, will very quickly overtake any squatter on every indicator.

At the same time, homonymous names, or local sites with the same name, like "taxi", can coexist quite peacefully if the search engine explicitly informs the user about ambiguous results and asks for a clarification of the query (or resolves it itself, guided by context, for example geography).

What about generic names like "beer" or "auto"? Here the principle is the same as with domains: first come, first served. Only the clock starts not when the metafile is created but when the site actually starts being filled and promoted. And unlike the situation with domains, where the first clever lucky soul can lock in a name forever and play dog in the manger, any more active and intelligent webmaster will be able to pull the traffic over to himself.

And of course, it will be impossible to take away, block or transfer a well-promoted name to another site by any administrative means. The descendants of WikiLeaks will be pleased.

In general, the situation with site names closely resembles the situation with people's names. Nobody will stop you from changing your passport and becoming Chuck Norris. But that will not teach you to throw a roundhouse kick. Most likely, the real Chuck Norris will not even hear about your fuss with the papers at the passport office.

By the way, the erasure of the boundary between DNS and search engines has already begun. Chrome has no separate search box and address bar, and for many people, me included, this is terribly convenient. If I do not remember a site's exact name, I type something close to it into the same line and, as a rule, quickly end up where I need to be. This means the old DNS system and the new metadata format can coexist peacefully: an officially registered domain name simply becomes one more field in the site's metafile.

Semantic web?


While inventing and describing the metafile format, I could not shake the feeling that I was reinventing the semantic web. To convince myself (and you along the way) of the opposite, let me consider the similarities and differences.

Both have metadata. But in the semantic web someone has to enter this data by hand. What is the problem, you might ask? Look, all of Wikipedia was written by hand, for free and anonymously! And yet no. Writing or editing an encyclopedia article is difficult, creative work, which means the process itself is enjoyable and is its own reward for the author. Working with metadata, essentially filling out forms, is a boring, dreary chore. That is one of the reasons "web 3.0" is developing so poorly.

Our metafiles, however, consist almost entirely of information generated automatically from the content of the resource or from users' reactions to it. If you take care with the design, so that the few metadata actions that do have to be performed manually take no more than one or two clicks, or one or two words in a small input field, without switching context and without visual noise (no redirects to a separate page with a form, no window popping up in the middle of the screen!), then people will work with the system. On Habr, for example, most people actively click the plus and minus buttons. If, besides the abstract "+1" and "-1", one could also click on tags in a small cloud, that would not weigh the interface down, and it would let users refine the semantics of a resource. Something like this:



The second difference is that the semantic web tries to put the links between entities in order, build an ontology and sort everything onto shelves. This is titanic work, and it is utterly unnatural. Chaos, constant mistakes, inconsistency and ambiguity are part of human nature. We cope well in conditions that are unthinkable for a computer, and we create resources on the Internet without any regard for their place in a semantic network. Therefore there is far more value in metadata adapted to storing associative links between resources, emotional reactions to them and collected statistics than in purely rational structural information. That said, nobody prevents us from including in the metafiles a description of semantics in the form in which it exists today and continuing to build the semantic web. Maybe someday it will take off.

The web already stores huge amounts of metadata. But it forms a landscape that is not very comfortable to live in, with high mountains and deep chasms. The highest peaks of today's Internet only look like unshakable summits. In reality they are extremely flexible and pliant, avoiding confrontation with local laws and the loss of markets. And those who refuse to bend risk being trampled by governments and competitors. And then we suddenly discover that information that was easily available yesterday has disappeared somewhere.

List of sources and interesting links


Collective Intelligence

One of the approaches to creating a distributed DNS

Attempts to store metadata in a peer-to-peer network:
Magnet links,
Shareaza,
Metalink,
Magnet-manifest (Magma) collections.

Distributed search engines:
Majestic-12,
Allisk,
Grub.

Source: https://habr.com/ru/post/114029/

