MilkyWeb - Graph of Everything

In this article I want to share my thoughts on how to solve some fundamental problems of the modern Internet. I will describe a model that, in my opinion, can help organize knowledge on the Internet better, and show my attempt to implement it.

Intro

Social networks and search engines try to organize as much information as possible about the world and, in particular, about the user.
In computer science, ontologies, or their simplified version, graphs, are the basis for describing any subject area. With their help, any knowledge base can be described for a computer in the most unambiguous way.

Many specialized graphs already exist on the web: Facebook, LinkedIn, Foursquare, etc.
As you know, Google is expanding its Knowledge Graph and actively using it in its search engine.
The problem is that the world has an infinite number of subject areas, and the customary way to create a new graph is to create a new social network.

The MilkyWeb project (MW), which I want to present, is an attempt to create a universal tool for describing any subject area (creating any graph) in one place.
In other words, it is an attempt to create a universal ~~social network~~ knowledge base of everything in the world.

The project has not yet left the alpha stage, so the interface leaves much to be desired; please bear with it. The site was laid out and tested only in Chrome. I decided not to spend time on cross-browser support, so I apologize to users of other browsers.

Ideology

The ideology of the project is based on the ontology model, a mathematical representation of knowledge.
It rests on three pillars: concepts, individuals and predicates.

Concepts are abstract notions of the surrounding world. Roughly speaking, these are generalized (collective) names of the things and phenomena that surround us.
Many concepts form a hierarchy: for example, the concept “Programmer” is derived from the concept “Human”, which in turn is derived from “Organism”.
An analogy with programming can be drawn: concepts in an ontology are like classes in OOP.
Concepts are of two types: abstract concepts and sets (or ancestors). The concept “friendship” is abstract, while “car” is the name of a set of real objects.

Individuals are objects that surround us in the real world. Each individual is a realization of at least one ancestor concept. In OOP terms, individuals are instances of classes.
For example, the object “Albert Einstein” is an individual of the concept “Scientist”. Inheritance is naturally supported: since a “Scientist” is a “Human”, “Albert Einstein” is also a “Human”.
When a new user creates an account in MW, this in fact creates a new individual of the “Human” concept in the ontology.
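
A minimal Python sketch of this analogy. The class names `Concept` and `Individual` and the `is_a` helper are illustrative assumptions and are not taken from the MilkyWeb code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Concept:
    """A generalized name for a kind of thing; may derive from a parent concept."""
    name: str
    parent: Optional["Concept"] = None

    def ancestors(self):
        """Yield this concept and every concept it is derived from."""
        node = self
        while node is not None:
            yield node
            node = node.parent


@dataclass(frozen=True)
class Individual:
    """A concrete object that realizes a concept (hypothetical model)."""
    name: str
    concept: Concept

    def is_a(self, concept: Concept) -> bool:
        """True if the individual realizes `concept` directly or by inheritance."""
        return concept in self.concept.ancestors()


organism = Concept("Organism")
human = Concept("Human", parent=organism)
scientist = Concept("Scientist", parent=human)
programmer = Concept("Programmer", parent=human)

einstein = Individual("Albert Einstein", scientist)
```

Here `einstein.is_a(human)` holds because “Scientist” derives from “Human”, while `einstein.is_a(programmer)` does not.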

In terms of graphs, concepts and individuals are vertices of the graph, while the edges (or arcs) are predicates.

Predicates are properties that connect the vertices of the graph.
A simple example of a predicate, as many will have guessed, is the friendship relation in Facebook or VK.

The link shown in the figure above is called a triplet, since it involves three components: the subject “Richard Feynman” (a vertex of the graph), the predicate “Born” (an arc of the graph) and the object “New-York” (a vertex of the graph).
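
A triplet maps naturally onto a three-field record. The sketch below is only an illustration: the `Triplet` tuple and the `TripletStore` class are hypothetical names, and a real store would use identifiers rather than plain page names.

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    subject: str    # vertex the arc starts from
    predicate: str  # label of the arc
    obj: str        # vertex the arc points to


class TripletStore:
    """Keeps triplets and indexes them by subject for simple lookups."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, predicate, obj):
        triplet = Triplet(subject, predicate, obj)
        self._by_subject[subject].append(triplet)
        return triplet

    def objects(self, subject, predicate):
        """Return every object linked to `subject` by `predicate`."""
        found = self._by_subject.get(subject, [])
        return [t.obj for t in found if t.predicate == predicate]


store = TripletStore()
store.add("Richard Feynman", "Born", "New-York")
```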

In fact, the whole task of the MilkyWeb project is to let the user create a page for any object of the surrounding world (a concept or an individual) and link it to other pages in a semantically correct way (using a predicate).

Each predicate is created in conjunction with one or more concepts.
For example, the properties "friend" or "mother" are available only to individuals of the concept "person", and the predicate "CEO" can link a "person" and a "company".
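
One way to picture this binding is a table that records, for each predicate, which concepts it may connect. This is a sketch under stated assumptions: the `PARENT` and `PREDICATES` tables, the function names, and the direction of "CEO" (from person to company) are all hypothetical.

```python
# Hypothetical concept hierarchy: child -> parent.
PARENT = {"programmer": "person", "person": None, "company": None}

# Hypothetical predicate configuration: name -> (subject concept, object concept).
PREDICATES = {
    "friend": ("person", "person"),
    "mother": ("person", "person"),
    "CEO": ("person", "company"),
}


def derives_from(concept, ancestor):
    """True if `concept` equals `ancestor` or inherits from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def can_link(subject_concept, predicate, object_concept):
    """Check that a predicate may connect vertices of the given concepts."""
    if predicate not in PREDICATES:
        return False
    subject_side, object_side = PREDICATES[predicate]
    return (derives_from(subject_concept, subject_side)
            and derives_from(object_concept, object_side))
```

With these tables, `can_link("programmer", "friend", "person")` is allowed through inheritance, while `can_link("company", "friend", "person")` is rejected.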

Predicates can also be literal. A literal predicate points not to a vertex of the graph but to a value. Each literal has a type, for example a string, an integer, a date or geographical coordinates (currently only URL literals are supported).

Concepts and predicates are the skeleton of any ontology, that is, the template on which the entire graph is built, so at the moment these entities can be created only by the site administration. This process includes not only creating the entities themselves but also configuring them, which the user does not see.
For example, each predicate has a limit on the maximum number of triplets. With the predicate "mom" an individual can have only one triplet, and with the predicate "friend" many.
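
Such a limit can be enforced when a triplet is added. The sketch below is an assumption about how this might look; the `MAX_TRIPLETS` table, the `Graph` class and `CardinalityError` are hypothetical names, and predicates missing from the table are treated as unlimited.

```python
from collections import Counter

# Hypothetical per-predicate limits; None means "no limit".
MAX_TRIPLETS = {"mom": 1, "friend": None}


class CardinalityError(ValueError):
    """Raised when a subject already has the maximum number of triplets."""


class Graph:
    def __init__(self):
        self.triplets = []
        self._counts = Counter()  # (subject, predicate) -> number of triplets

    def add(self, subject, predicate, obj):
        limit = MAX_TRIPLETS.get(predicate)
        if limit is not None and self._counts[(subject, predicate)] >= limit:
            raise CardinalityError(
                f"{subject!r} already has {limit} {predicate!r} triplet(s)"
            )
        self.triplets.append((subject, predicate, obj))
        self._counts[(subject, predicate)] += 1
```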

Example

As I said, the administration creates the ontology skeleton, and users fill it in.
Here is an example of filling in a subject area based on the concept "film".

The administrator creates the concept "film" and a set of necessary predicates such as "cast", "director", "producer", "premiere", "country", "favorite film" and "watching".

The user FOO creates a page (individual) “Pirates of the Caribbean” based on the concept “film” and begins to “describe” it.
Using the “cast” predicate, he indicates that Johnny Depp and Keira Knightley starred in the film.
He then links the page to the producer, the director and the country.
With the literal predicate "premiere" the user indicates that the film premiered on June 28, 2003.
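
Written out as data, the page described above is just a handful of triplets. This is an illustrative sketch: the tuple layout and the `describe` helper are assumptions, and the producer, director and country links are omitted because the text does not name them.

```python
from datetime import date

# Each triplet is (subject, predicate, object); the object is either another
# page name (a vertex) or a literal value such as a date.
film = "Pirates of the Caribbean"
triplets = [
    (film, "cast", "Johnny Depp"),
    (film, "cast", "Keira Knightley"),
    (film, "premiere", date(2003, 6, 28)),  # literal predicate: points to a value
]


def describe(subject, triplets):
    """Group a page's triplets by predicate, e.g. to render the page."""
    page = {}
    for s, predicate, obj in triplets:
        if s == subject:
            page.setdefault(predicate, []).append(obj)
    return page


page = describe(film, triplets)
```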

Okay, the basic information about the movie has been entered, but what next? FOO can then indicate that “Pirates of the Caribbean” is his “favorite film”.
Meanwhile the user GOO, a friend of FOO, was sitting bored at his monitor and saw the triplet just created by FOO in his feed. He took it as a call to action and decided to ~~download the movie from torrents~~ buy a DVD of the film and watch it right away! As he started watching, he created a “watching” triplet with “Pirates of the Caribbean”, thereby telling the whole world about a small part of his life with one click!

I deliberately chose the subject area "films" as an example. Facebook engineers are working on structuring such moments from people's lives. Read more: www.wired.com/business/2012/11/mike-vernal-facebook

I would also like to note that the predicate “favorite film” and the “like” button on a movie page on the IMDB website are not the same thing. The semantics of likes are very blurred and do not allow us to say unambiguously what the user had in mind when he “liked” a page.

Such a structure greatly simplifies the description of a particular subject area. While Facebook has a fixed set of templates for creating pages, in the system described above templates can be created on the fly. If at some point we decide to introduce a new subject area into the social network, we simply need to create a set of concepts and predicates characteristic of that area.

At the moment, all created pages support only English (keep this in mind when searching). There are plans to add a localization mechanism for other languages.

Data sharing and Big Data problems

I did not find an established Russian expression that means sharing in the sense of “sharing information” or “disseminating information”, so the term in the heading was left untranslated.

Recently it has become customary to use the term Big Data to describe areas characterized by rapid growth in the amount of information. The term implies a problem by definition: data is generated so fast that the most valuable information can get lost in the general flow. To keep information from being lost, it has to be structured and classified.

As practice has shown, building a news feed from posts “from friends” is not the best option. More precisely, this method is good for getting information about the people around you, but not about things that interest you in general.
As a result, the VKontakte news feed is littered with cats and quotes from “great” people. You can try subscribing to thematic public pages, but this does not guarantee that all newly generated information that might interest the user reaches his feed.
Facebook piles crutch upon crutch to deliver only the most relevant information to the user's feed. To some extent this is enough, but the feed-building algorithm is based on user actions (analysis of likes, comments, etc.), so it is not universal either.

In my opinion, Twitter and Hacker News have come closest to the model “came in, found out everything relevant, left”.
Therefore, from the very beginning I tried to make the mechanics of information dissemination in MilkyWeb something between Twitter and Hacker News. That is, the user enters the site and receives all the information from the last period X that might interest him,
not only from the pages he is subscribed to (as in Twitter, FB, VK), but also from thematic streams (as in HN).

In MW, you can share text (up to 2000 characters), links and videos (YouTube). There are no photos yet: storing them is expensive.

How can a user share information and who will receive this information?

A user can:

post messages to his own page;
In this case, the message will reach the users who follow the sender.
Note that if user A has created at least one triplet with user B, then A is considered to be subscribed to B.
post messages to another user's page;
Obviously, the message will reach only the addressee.
post messages to the pages of individuals that are not users;
The message will reach everyone who is subscribed to that entity.
post messages to thematic streams.
Thematic streams are tied to concepts. That is, you can post a message to the "Programming" page. In this case, the message will reach everyone who is subscribed to "Programming", as well as all users who "inherit" the concept "Programmer".
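
The four delivery rules above can be condensed into one function. This is a sketch under assumptions, not the site's implementation: the `recipients` function, its parameters and the `kind` labels are hypothetical, and for concept pages it simply assumes that users who are individuals of the concept receive the message alongside subscribers.

```python
def recipients(sender, target, kind, subscribers, members=None):
    """Return the set of users who should receive a message.

    sender      -- user posting the message
    target      -- page the message is posted to
    kind        -- "user", "individual" or "concept" (type of the target page)
    subscribers -- dict: page -> set of users subscribed to it
    members     -- dict: concept -> set of users who are individuals of it
    """
    members = members or {}
    if kind == "user":
        if target == sender:
            # Own page: everyone who follows the sender.
            return set(subscribers.get(sender, set()))
        # Someone else's page: only the addressee.
        return {target}
    if kind == "individual":
        # Non-user page: everyone subscribed to that entity.
        return set(subscribers.get(target, set()))
    if kind == "concept":
        # Thematic stream: subscribers plus users who belong to the concept.
        return set(subscribers.get(target, set())) | set(members.get(target, set()))
    raise ValueError(f"unknown page kind: {kind!r}")


# Hypothetical sample data.
subscribers = {"A": {"B", "C"}, "Moscow": {"C"}, "Programmer": {"D"}}
members = {"Programmer": {"A", "B"}}
```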

The last two ways are where the attempt to solve the Big Data problem comes in. The basic idea is:
User X has information that is thematically related to a particular area of real life. He does not think about where to post it, but simply throws it into the general thematic stream for that subject area. The task of the system is then to pick the most valuable data out of the general flow based on the actions of other users (for example, rating or reposting).

Work on this mechanism is still underway. There is no content ranking system yet, but it will be implemented in the near future, and there are ideas on how to use all this to make a user's news feed more relevant than in other networks. It is the model described in the previous chapter that makes it possible to distinguish concepts semantically unambiguously and to classify information correctly.

Naturally, this approach can generate waves of spam. At the moment, the site does not allow posting more than one message per 20 seconds. In the future I will solve this problem in a smarter way. For now the task is to test the viability of the mechanics and simply identify the possible critical points.
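
A limit of this kind takes only a few lines. The sketch below assumes the limit is applied per user, which the text does not state; the `PostRateLimiter` class and its parameters are hypothetical.

```python
import time


class PostRateLimiter:
    """Allows at most one post per `interval` seconds for each user."""

    def __init__(self, interval=20.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock      # injectable so the limiter can be tested
        self._last_post = {}    # user -> time of the last accepted post

    def allow(self, user):
        """Return True and record the post if the user may post now."""
        now = self.clock()
        last = self._last_post.get(user)
        if last is not None and now - last < self.interval:
            return False
        self._last_post[user] = now
        return True
```

A rejected attempt does not reset the timer, so a user who keeps retrying is still allowed to post 20 seconds after the last accepted message.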

As the reader has probably guessed, such a system has great potential for distributing targeted content. You can build complex queries to select your target audience. For example, send a message to everyone who is a “Programmer” and “lives in” “Moscow”; or to those who "bought" an "iPhone" and "bought" an "iPad"; or to everyone who drives a Mercedes.
Maybe someday this will become a way to monetize the project, but for now the mission of MilkyWeb is different. I will talk about it in the next chapter.
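
The audience selection described above amounts to finding every subject whose triplets satisfy all of the given conditions. A minimal sketch follows; the `select_audience` function and the sample data are hypothetical, and concept membership is simplified into an "is a" predicate.

```python
def select_audience(triplets, conditions):
    """Return subjects that have a matching triplet for every condition.

    triplets   -- iterable of (subject, predicate, object)
    conditions -- iterable of (predicate, object) pairs, all of which must hold
    """
    facts = {}
    for subject, predicate, obj in triplets:
        facts.setdefault(subject, set()).add((predicate, obj))
    required = set(conditions)
    return {subject for subject, known in facts.items() if required <= known}


# Hypothetical sample data.
triplets = [
    ("FOO", "is a", "Programmer"),
    ("FOO", "lives in", "Moscow"),
    ("GOO", "is a", "Programmer"),
    ("BAR", "bought", "iPhone"),
    ("BAR", "bought", "iPad"),
]
```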

Semantic web

The Semantic Web (SW) is a web space in which human-generated content is understandable to a computer.
This can be achieved by adding metadata to a web document (for example, HTML). Metadata is widely used on the web and plays an important role in search, data structuring, etc.
But for a search engine to “understand” the content of a page, the page must be accompanied by a separate document with a computer-friendly description (in the form of a graph) of the part of the world that the page is about.

The specification requires that such meta-documents be written in RDF format. The problem is that someone has to create these files so that they can be attached to the HTML document.
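
To give a feel for what such a file contains, here is a sketch that writes triplets in N-Triples, the simplest of the W3C's RDF syntaxes. Which syntax MilkyWeb actually emits is not stated, and the helper names and the example.org addresses are placeholders for illustration.

```python
def iri(value):
    """Wrap an address in angle brackets, as N-Triples requires for IRIs."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = (value.replace("\\", "\\\\")
                    .replace('"', '\\"')
                    .replace("\n", "\\n")
                    .replace("\r", "\\r"))
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


doc = to_ntriples([
    (iri("http://example.org/Richard_Feynman"),
     iri("http://example.org/born"),
     iri("http://example.org/New-York")),
])
```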

This is exactly the problem I took up two years ago for my thesis. The goal was to build a convenient, interactive tool for creating RDF descriptions: a centralized metadata repository where they would accumulate without being duplicated.

Over time, I deviated a little from this direction in favor of the social aspect. But it is already possible to get the RDF description of an entity at milkyweb.net/rdf/{c|p|i}/entity_id. For example, the RDF documents of the individual “Moscow” and the concept “Human” are located at milkyweb.net/rdf/i/10460 and milkyweb.net/rdf/c/10000 respectively (user information is naturally not public).
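
The address scheme is simple enough to generate programmatically. In the sketch below, the `rdf_url` helper is hypothetical; reading `c`, `p` and `i` as concept, predicate and individual is inferred from the two examples, and the `http://` scheme and numeric ids are assumptions.

```python
# Entity kinds in the address scheme. The mapping to one-letter codes is an
# assumption inferred from the example addresses for "Moscow" and "Human".
KIND_CODES = {"concept": "c", "predicate": "p", "individual": "i"}


def rdf_url(kind, entity_id, host="milkyweb.net"):
    """Build the address of an entity's RDF description."""
    if kind not in KIND_CODES:
        raise ValueError(f"unknown entity kind: {kind!r}")
    return f"http://{host}/rdf/{KIND_CODES[kind]}/{int(entity_id)}"
```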

That is, all that remains for a webmaster is to attach a link to the relevant object to a web page of his site. In the future, a search engine will fetch the document at the specified URL and will be able to classify the content on the page, increasing the relevance of search results. Or it will be possible to observe in real time the appearance of content about a particular entity across the whole Internet. Cool, isn't it? :)

For specialists in this field, I note that integration with existing vocabularies is planned.

Of course, I am greatly simplifying everything. One social network is not enough to popularize the Semantic Web. Most likely, special frameworks for web developers are needed that automate the process of tagging content with metadata. But I believe that sooner or later such a mechanism will work, and the first step in this direction is the creation of a global knowledge base of the Internet.

Problems

The biggest problems I encountered lie in the ideology of the project and in the terminology of ontologies as such.

All the W3C specifications of Semantic Web technologies (RDF, OWL) claim that concepts, individuals and predicates are enough to describe web ontologies, and I believed this for a while.

In the Russian Wikipedia you can find the following description of the concept “Concept”:

Concepts are abstract groups, collections or sets of objects. They may include instances, other classes, or a combination of both.

And then there is an example with a small digression:

The concept "people" and the nested concept "person". Whether "person" is a nested concept or an instance (individual) depends on the ontology.

This remark, inconspicuous at first glance (in italics), is a fundamental problem for philosophers of all times.

If we start building a global ontology according to these “classical rules”, the entire structure will immediately collapse, as I have seen for myself.
Initially, I believed that the concepts in my network are abstract notions that may or may not have individuals,
and that individuals, in turn, are real objects that can be touched and that “realize” certain concepts.

But suppose we have the concept “Phone”. Now we need to create an iPhone page. But what is “iPhone”: a concept or an individual? Suppose it is an individual. At some point the user FOO decides to create a personal page for the particular “iPhone” device in his pocket. Why? It doesn't matter; maybe he wants to sell it. The important thing is that if “iPhone” is an individual, a page for a specific device cannot be created, because we have limited the level of abstraction and the system ceases to be complete.
Okay, suppose “iPhone” is a concept. But we decided at the start that concepts are fundamental notions that cannot come and go with time. That is, we are not going to create a separate concept in the hierarchy for every new product created by mankind.
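
The dead end can be reproduced in a toy model. The sketch below assumes a strict two-level scheme in which only concepts can be realized; the `Concept` and `Individual` classes are hypothetical and serve only to illustrate the argument.

```python
class Concept:
    """An abstract notion that individuals can realize."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


class Individual:
    """A real object; in a two-level model it can only realize a concept."""

    def __init__(self, name, concept):
        if not isinstance(concept, Concept):
            raise TypeError("an individual can only be created from a concept")
        self.name = name
        self.concept = concept


phone = Concept("Phone")
iphone = Individual("iPhone", phone)  # first option: "iPhone" is an individual

try:
    # FOO's own device would have to be an individual of an individual.
    Individual("FOO's iPhone", iphone)
    device_page_created = True
except TypeError:
    device_page_created = False
```

Making “iPhone” a `Concept` instead would let the device page be created, but at the cost of adding a new concept for every product, which is exactly the second objection above.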

Therefore, the very idea that the world consists of concepts and individuals is true only within a predetermined framework, and such an approach cannot be used to create a global ontology.

There are many such pitfalls, and I think a universal way of describing the world can be created only by trial and error.

Outro

I do not expect a quick return from the project; as I said earlier, at the moment it is an experiment.
Many questions and problems remain open. Perhaps a global graph cannot exist at all. Or perhaps the proposed approach is simply unsuitable for creating it.
The goal of my work is to find practical ways to solve the fundamental problems of the global web.

All that I described above is just the tip of the iceberg of my ideas and plans. If the topic proves relevant, I will try to continue this series of articles.

I will be grateful for any feedback! You can write in the comments, in a private message or through the form on the site (please report bugs and vulnerabilities there too).
If anyone finds the ideas outlined here interesting and wants to take part in developing the project, I am open to cooperation (the core of the site is Java + MySQL).

By development I mean not only programming but also filling the knowledge base. So far about 1000 entities in different subject areas have been created in the network, which is, of course, very few. If you have not found the page of your city, country, favorite band, movie, etc., try creating such a page and share your user experience.

PS: Those who requested an invitation, do not be surprised if it does not arrive immediately. The SMTP server is our bottleneck. You can write to me in a private message: kin.

Thank you for your attention!

Source: https://habr.com/ru/post/163639/
