How the world of semantic markup works

I work on the semantic web team at Yandex. We build products based on semantic markup, develop our own vocabulary extensions, and take part in the development of the Schema.org standard.

The world of semantic markup is not simple, and at first glance it does not always seem logical. To make life easier for those who want to understand it, we decided to write about what kinds of markup exist, what they give you, and how to implement them.

By micro-markup (or semantic markup) we mean marking up a page with additional tags and tag attributes that tell search engines what is written on the page.

Semantic markup consists of a vocabulary (often called a dictionary) and a syntax.

A vocabulary is a kind of “language”: a set of classes and their properties used to indicate the meaning of the content on a page. For example, the vocabulary defines which term denotes a name: “name”, “title” or “n”.

A syntax is a way of using that language, i.e. the vocabulary. It determines which tags and attributes are used to express entities and their properties, for example, on web pages.
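To make the distinction concrete, here is a minimal Python sketch that writes the same two facts with one vocabulary (the Schema.org terms name and jobTitle) in two different syntaxes, microdata and RDFa Lite. The helper names as_microdata and as_rdfa are hypothetical and exist only for this illustration; the sample values are borrowed from the examples later in the article.

```python
from html import escape


def as_microdata(item_type, properties):
    """Render properties with the microdata syntax (itemscope/itemtype/itemprop)."""
    spans = "".join(
        f'<span itemprop="{escape(name)}">{escape(value)}</span>'
        for name, value in properties.items()
    )
    return f'<div itemscope itemtype="http://schema.org/{escape(item_type)}">{spans}</div>'


def as_rdfa(item_type, properties):
    """Render the same properties with the RDFa Lite syntax (vocab/typeof/property)."""
    spans = "".join(
        f'<span property="{escape(name)}">{escape(value)}</span>'
        for name, value in properties.items()
    )
    return f'<div vocab="http://schema.org/" typeof="{escape(item_type)}">{spans}</div>'


# The vocabulary terms stay the same; only the attributes that carry them change.
person = {"name": "Yuri Gagarin", "jobTitle": "pilot-cosmonaut"}
microdata_html = as_microdata("Person", person)
rdfa_html = as_rdfa("Person", person)
print(microdata_html)
print(rdfa_html)
```

Both outputs describe the same Person with the same property names, which is exactly the split between vocabulary and syntax described above.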

Semantic markup developed in stages, and at various times different initiative groups took up the concept. The result is a hodgepodge of vocabularies and syntaxes: there are quite a lot of them, and it is far from easy to sort them all out at first.

In this article we will look at the most common vocabularies:

Open Graph;
Schema.org;
Microformats;
and a few other vocabularies: FOAF, Dublin Core, Data Vocabulary and Good Relations.

Open Graph is a vocabulary that Facebook developed so that any site can become part of the social network and look good in it. OG markup is what produces rich link previews for a site.

Schema.org is a vocabulary that the largest search engines develop jointly so that webmasters do not have to add separate markup for each search engine. Schema.org markup allows sites to get special snippets in search results.

Microformats were developed by a community of enthusiasts who wanted to build their own standard using basic HTML elements. People often confuse microformats with micro-markup, so let us note right away that they are not the same thing. Microformats are one of the semantic markup vocabularies, just like Schema.org, Open Graph or FOAF. The only difference is that microformats are a combined standard of syntax and vocabulary, whereas micro-markup, as we said above, is a collective term for ways of enriching a page with semantic data.

For each vocabulary we describe the idea behind it, how it developed, and the entities and properties it covers, and give a small markup example. In the following articles we will write about syntaxes, products and ways of implementing semantic markup.

The most common vocabularies on the Internet

Open Graph

Open Graph (OG) is the most common and the simplest vocabulary. Today it is used mostly to make links shared from sites look rich, attractive and understandable. With OG markup, links are displayed this way on all popular social networks.

Open Graph markup is also actively used by Facebook applications: it lets them publish users' actions in an application to the users' pages.

Thanks to OG, you can watch a video, read a short description of an article, and quickly grasp what your friends have shared while scrolling through endless news feeds. Besides Facebook, Open Graph is also recognized by VKontakte, Google+, Twitter, LinkedIn, Pinterest and others.

The vocabulary itself is fairly easy to use. Four properties are enough to get started:

og:title - the name of the object.
og:type - the type of the object, for example, “video.movie”. Depending on the type, other properties can be specified.
og:image - the URL of an image that represents the object.
og:url - the canonical URL of the object, which will be used as its permanent ID.

For example, the Open Graph markup for describing a person looks like this:

<html prefix="og: http://ogp.me/ns# profile: http://ogp.me/ns/profile#">
  <head>
    <meta property="og:title" content="Yuri Gagarin" />
    <meta property="og:type" content="profile" />
    <meta property="og:url" content="http://example.com/" />
    <meta property="og:image" content="http://example.com/" />
    <meta property="profile:first_name" content="Yuri" />
    <meta property="profile:last_name" content="Gagarin" />
    <meta property="profile:gender" content="male" />
    ...
  </head>
  ...
</html>

Here the robot recognizes that the page is about a man named Yuri Gagarin and that it links to his photo. The og:url property gives the canonical URL of the page.
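How a robot might read such markup can be sketched in a few lines of Python. This is only an illustration built on the standard html.parser module, not the way any particular social network or search engine does it; the names OpenGraphParser, read_open_graph and missing_required are hypothetical, and the sample page deliberately omits og:image to show the check for the four required properties.

```python
from html.parser import HTMLParser

# The four properties the article lists as the minimum for Open Graph.
REQUIRED_OG = ("og:title", "og:type", "og:image", "og:url")


class OpenGraphParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("property")
        if name and ":" in name:
            self.properties[name] = attributes.get("content") or ""


def read_open_graph(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.properties


def missing_required(properties):
    """Return the required properties that are absent or empty."""
    return [name for name in REQUIRED_OG if not properties.get(name)]


# Sample page (assumed values); og:image is left out on purpose.
SAMPLE_PAGE = """
<html prefix="og: http://ogp.me/ns# profile: http://ogp.me/ns/profile#">
<head>
<meta property="og:title" content="Yuri Gagarin" />
<meta property="og:type" content="profile" />
<meta property="og:url" content="http://example.com/" />
<meta property="profile:first_name" content="Yuri" />
</head>
</html>
"""

og = read_open_graph(SAMPLE_PAGE)
print(og["og:type"])         # profile
print(missing_required(og))  # ['og:image']
```

A real consumer would also have to deal with repeated properties (for example, several og:image values), which this sketch simply overwrites.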

Besides profile, the og:type property can hold various other entity types (each with its own properties):

Music (subtypes music.song, music.album, music.playlist, music.radio_station) - for songs you can specify the duration, album and artist; for albums, the songs, artists and release date.
Video (video.movie, video.episode, video.tv_show, video.other) - films can have actors and their roles, directors, screenwriters and a duration.
No vertical (article, book, profile, website) - the types that do not fit the categories above. An article can specify tags, an author and a publication date. Profiles have a gender, last name and first name.

If a page has no such markup, Facebook will still try to build a preview when a link to it is published. As a rule, though, the result is far less successful: the site logo may appear instead of the article's picture, and the title may be replaced with the name of some site section and accompanied by text about the company's history, which does not reflect the essence of the article (and is unlikely to please the user).

In addition, OG markup is also recognized by search engines, which in some cases even supplement it.

Schema.org

Schema.org is a vocabulary launched by search engines in 2011. It is supported by Yandex, Google, Bing and Yahoo!

Like the others, Schema.org provides collections of classes that describe various entities and their properties. But while OG and Microformats.org have dozens of such classes, Schema.org already has several hundred. Every class has its place in a tree hierarchy.

This is a living and flexible vocabulary. New entities are actively discussed before they are added: members of the initiative group meet weekly to discuss the implementation, extension and use of schemas.

The most general entity type is Thing, which has subtypes. Let us look at a few of them:

Action - describes an action performed by a specific agent (a person or an organization). An action can additionally specify the place, the object and the instruments with which it was carried out. Like any action, it can have a result, participants and a period of time during which it was performed. Yandex.Islands and the Gmail Actions project are being developed on the basis of this class.
CreativeWork - describes creative works. Videos, pictures, recipes, diets: all of these can be described with this type. For any creative work you can specify the author, genre and a short description, as well as reviews, audience, references and much more.
Event - like any event, it can have a venue, date, participants, speakers, etc.
Product - anything that is sold and bought. Some paid services (such as a haircut) can also be described with the Product type. A product can have a rating, brand, color, audience, price, model, etc.
Person - as the Schema.org documentation says, this can be a person who is alive, dead or fictional, as well as “undead” (apparently the authors needed to describe zombies and found no more suitable type). A person can have contact details, information about work, family, relationships, and more.

The process of creating and introducing new types is fascinating and sometimes takes unexpected turns. One recent discussion showed that it is far from easy to reconcile the schemas being introduced with both the Russian mentality and international notions of what is proper.

From our experience: it took almost a year to introduce 7 new fields into the schema.org/PeopleAudience type, because the doubts of politically correct Europeans and Americans knew no bounds: “How can you specify a maximum age for the target audience? The fact that a man is over 30 does not mean he is not interested in books for little girls!” Fine, the proposed maxAge and minAge fields became suggestedMaxAge and suggestedMinAge. Gender turned out to be just as difficult. We could not convince anyone that gender may be stated outright, since that is not politically correct. So gender became suggestedGender.

This is how every property and every type gets introduced: slowly and painstakingly. After all, besides covering its area of use as fully as possible and being international, the vocabulary must reflect the interests of all participants and be unambiguous across different countries and cultures. Even so, introducing a new property or type is always easier than deleting or changing one, because on deletion something has to be done about those who have already implemented these fields or types.

Schema.org also lets users and webmasters propose extensions to the vocabulary.

There is a public English-language mailing list, public-vocabs@w3.org, created for discussing general questions, proposals and bug reports; you can also write there with a markup question if you cannot get something to work. There is an extension mechanism, and since May 2011 lists published on external resources can be used as values of various properties.

So if you want to take part in the development of semantic markup, and of the Schema.org vocabulary in particular, you have that opportunity ;)

An example of Schema.org markup for the Person type:

<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Yuri Gagarin</span>
  <img src="gagarin.jpg" itemprop="image"/>
  <span itemprop="jobTitle">pilot-cosmonaut</span>
  <span itemprop="colleague">Valentina Tereshkova</span>
  <link itemprop="nationality" href="http://ru.wikipedia.org/wiki/">
  <time itemprop="birthDate" datetime="1934-03-09">9 March 1934</time>
  <span itemprop="memberOf">-  </span>
  <span itemprop="knows"> </span>
  <time itemprop="deathDate" datetime="1968-03-27">27 March 1968</time>
  <span itemprop="award">  </span>
  <a href="http://ru.wikipedia.org/wiki/,__" itemprop="sameAs">  </a>
  <a href="http://example.com/" itemprop="url">  </a>
</div>

From this markup the search engine learns that a man named Yuri Gagarin is a pilot-cosmonaut and a colleague of Valentina Tereshkova. Much other data is given as well: his award, nationality, date of death, acquaintances and so on. Some of these properties can only be expressed with the Schema.org vocabulary. Two links are marked up with the “sameAs” and “url” properties: the first points to a page with authoritative information about the person, the second to a personal website.
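The way property values are read from microdata can be sketched in Python as follows. This is a simplified illustration, not a full microdata parser: it assumes a single flat item without nested itemscope elements or nested tags of the same name inside a property, and the class name MicrodataParser and the shortened sample are hypothetical. For some elements the value comes from an attribute (src, href, datetime, content) rather than from the visible text, which is why birthDate below is read in machine-readable form.

```python
from html.parser import HTMLParser

# Elements whose property value is taken from an attribute, not from the text.
VALUE_ATTRIBUTES = {
    "img": "src",
    "a": "href",
    "link": "href",
    "time": "datetime",
    "meta": "content",
}


class MicrodataParser(HTMLParser):
    """Reads the type and properties of one flat microdata item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._open = None  # (tag, property) of the element whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "itemscope" in attributes:
            self.item_type = attributes.get("itemtype")
        name = attributes.get("itemprop")
        if not name:
            return
        value_attribute = VALUE_ATTRIBUTES.get(tag)
        if value_attribute:
            self.properties[name] = attributes.get(value_attribute) or ""
        else:
            self._open = (tag, name)
            self._text = []

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[0] == tag:
            self.properties[self._open[1]] = "".join(self._text).strip()
            self._open = None


# Shortened sample with assumed values.
SAMPLE_ITEM = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Yuri Gagarin</span>
  <img src="gagarin.jpg" itemprop="image"/>
  <span itemprop="jobTitle">pilot-cosmonaut</span>
  <time itemprop="birthDate" datetime="1934-03-09">9 March 1934</time>
</div>
"""

microdata = MicrodataParser()
microdata.feed(SAMPLE_ITEM)
print(microdata.item_type)
print(microdata.properties)
```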

I would like to stress once again that Schema.org is a search engine initiative, and the development of the vocabulary will depend on the products that search engines build for sites. So you should not treat this vocabulary as an attempt to bring everything that exists in the world into a single ontology. Everything that exists on the Internet, perhaps, but only if search engines need it.

And search engines are certainly interested in building a large number of products for sites on top of Schema.org, including Russian-language sites.

The full description of the vocabulary is available on the official site. There is also an unofficial and incomplete Russian translation of the standard.

Microformats.org

Microformats.org (Microformats) is an open standard created in 2007 by a community of enthusiasts. The community wanted to create a standard for semantic markup of sites using already existing technologies. Six years ago this was a clear advantage, since the standard was easier to implement, but today adding microformats markup is no easier, and in some cases harder, than using other vocabularies. Compared with OG and Schema.org, it is used less and less.

At the moment there are about 10 widely used microformat specifications for several subject areas. Some of them are finished, but most are still drafts. There are microformats for publishing information about organizations, products, reviews, events and many other entities. Each entity has its own properties.

New microformats are developed in the open, on a dedicated microformats wiki. Because the authors of each microformat try to reach agreement and a compromise with everyone, the process takes a very long time and sometimes never finishes. As a result, the approved microformats can be counted on one's fingers, while quite a few remain drafts.

Currently, search engines support the following microformats:

hCard - a format for marking contact information (addresses, phones, etc.);
hRecipe - a format for describing recipes;
hReview - a format for marking up reviews;
hProduct - a format for marking up products.

Using them makes it possible to show special snippets in search results.

One of the most popular microformats is hCard. It is a universal format for describing people and organizations and contains basic information about both.

With hCard, you can specify properties such as:

n - name;
bday - date of birth;
geo - geographical location;
tz - time zone;
uid - reference to an identical entity;
photo - image;
adr - address;
org - the name of the organization.

These are some of the approved properties; many more are still under discussion. Here is how hCard is used to mark up a description of a person:

<div class="vcard">
  <img class="photo" src="http://example.com/gagarin.jpg" />
  <strong class="fn">Yuri Gagarin</strong>
  <span class="title">Pilot-cosmonaut</span> at <span class="org">Air Force of the USSR</span>
  <a class="url" href="http://example.com/"> .</a>
  <div class="bday">
    <span class="value-title" title="1934-03-09">9 March 1934</span>
  </div>
  <span class="note">The first man in space</span>
</div>

Here the search engine understands that the markup describes an organization or a person named Yuri Gagarin, a cosmonaut who served in the Air Force of the USSR. His date of birth is given, and there is a note: “The first man in space”. The url property points to the home page of the object being described.
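Unlike the previous vocabularies, hCard carries its property names in ordinary class attributes, so a consumer has to look at class names rather than at dedicated attributes. Below is a rough Python sketch of that idea, covering only a handful of properties and the value-title pattern used for bday in the example above. The names HCardParser and read_hcard are hypothetical, the sample is shortened, and the real parsing rules of the hCard specification are considerably more detailed.

```python
from html.parser import HTMLParser

# Only a small, assumed subset of hCard properties is handled here.
HCARD_PROPERTIES = {"fn", "title", "org", "note", "bday", "url", "photo"}
VOID_TAGS = {"img", "br", "hr", "link", "meta", "input"}


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.card = {}
        self._stack = []     # hCard property (or None) for each open element
        self._fixed = set()  # properties whose value came from an attribute

    def _current(self):
        # Innermost open element that carries an hCard property, if any.
        return next((name for name in reversed(self._stack) if name), None)

    def _fix(self, name, value):
        self.card[name] = value or ""
        self._fixed.add(name)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        name = next((c for c in classes if c in HCARD_PROPERTIES), None)
        if name == "photo":
            self._fix(name, attributes.get("src"))
        elif name == "url":
            self._fix(name, attributes.get("href"))
        elif "value-title" in classes and self._current():
            # value-title: the machine-readable value lives in the title attribute.
            self._fix(self._current(), attributes.get("title"))
        if tag not in VOID_TAGS:
            self._stack.append(name)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        name = self._current()
        if name and name not in self._fixed:
            self.card[name] = self.card.get(name, "") + data


def read_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    return {name: value.strip() for name, value in parser.card.items()}


SAMPLE_CARD = """
<div class="vcard">
  <img class="photo" src="http://example.com/gagarin.jpg" />
  <strong class="fn">Yuri Gagarin</strong>
  <span class="org">Air Force of the USSR</span>
  <div class="bday">
    <span class="value-title" title="1934-03-09">9 March 1934</span>
  </div>
</div>
"""

card = read_hcard(SAMPLE_CARD)
print(card)
```

Because the property names share the class attribute with styling classes, nothing in the markup itself says which vocabulary is in use; that is the combined syntax-plus-vocabulary design the article refers to.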

In 2013 a new initiative was announced, microformats 2, which changes the class names and simplifies the use of properties.

Microformats used to be quite common, but today, especially against the background of other fast-growing vocabularies, they look ~~meaningless and merciless~~ obsolete. In addition, microformats are limited by their own design: they are a combined standard of syntax and vocabulary, so other vocabularies cannot be used with them. (The next article will be about syntaxes.)

We have reviewed the most common and most developed vocabularies. But there are also quite a few small, highly specialized vocabularies that were likewise created to solve the problem of data exchange. I will describe the most interesting of them.

Other vocabularies

FOAF

The FOAF vocabulary (an acronym for Friend of a Friend) specializes in relationships between people, their interactions and associations.

It contains classes such as Agent, Organization, Group and Person. They can have various properties that describe people or groups in real life. There are the usual ones (age, gender, surname, birthday) as well as properties that are:

linked to social networks: skypeID, yahooChatID, jabberID;
specific: for example, knows, which says that people are acquainted with each other, or myersBriggs, which records the result of the Myers-Briggs personality test (yes, we too have only just learned what that is).

Markup example:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
  <foaf:Person>
    <foaf:name>Jimmy Wales</foaf:name>
    <foaf:mbox rdf:resource="mailto:jwales@bomis.com" />
    <foaf:homepage rdf:resource="http://www.jimmywales.com/" />
    <foaf:nick>Jimbo</foaf:nick>
    <foaf:depiction rdf:resource="http://www.jimmywales.com/aus_img_small.jpg" />
    <foaf:interest>
      <rdf:Description rdf:about="http://www.wikimedia.org" rdfs:label="Wikipedia" />
    </foaf:interest>
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Angela Beesley</foaf:name> <!-- Wikimedia Board of Trustees -->
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
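As a rough illustration of how the knows relationship can be read from such a document, here is a small Python sketch based on the standard xml.etree.ElementTree module. It only understands RDF/XML written in exactly this nested shape (the same graph can be serialized in other ways, which a real RDF parser would handle); the function name describe_people is hypothetical and the sample is a shortened copy of the example above.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
}
# ElementTree addresses namespaced attributes as "{namespace}name".
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"

FOAF_DOCUMENT = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Jimmy Wales</foaf:name>
    <foaf:nick>Jimbo</foaf:nick>
    <foaf:homepage rdf:resource="http://www.jimmywales.com/" />
    <foaf:knows>
      <foaf:Person>
        <foaf:name>Angela Beesley</foaf:name>
      </foaf:Person>
    </foaf:knows>
  </foaf:Person>
</rdf:RDF>
"""


def describe_people(rdf_xml):
    """Return name, homepage and acquaintances for each top-level foaf:Person."""
    root = ET.fromstring(rdf_xml.strip())
    people = []
    for person in root.findall("foaf:Person", NS):
        homepage = person.find("foaf:homepage", NS)
        people.append({
            "name": person.findtext("foaf:name", namespaces=NS),
            "homepage": None if homepage is None else homepage.get(RDF_RESOURCE),
            "knows": [
                friend.findtext("foaf:name", namespaces=NS)
                for friend in person.findall("foaf:knows/foaf:Person", NS)
            ],
        })
    return people


people = describe_people(FOAF_DOCUMENT)
print(people)
```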

Yandex Blog Search uses this vocabulary. An extension has been added to it that helps describe users' blogs more precisely (this extension is used mainly on the RuNet).

Data Vocabulary

The Data Vocabulary was developed by Google. It is no longer evolving, since all development has moved smoothly into Schema.org.

Previously it supported types such as Person, Organization, Breadcrumb, Review, Product and Address, which can be said to have become the prototypes of Schema.org classes.

Dublin Core

The Dublin Core vocabulary is used for digital libraries and documents. It emerged on the initiative of a group of library and museum specialists.

Dublin Core appeared in 1995 with a basic set of 15 elements, such as Title, Creator, Subject, Description, Publisher, Rights, etc. By now it includes many more classes and properties.

In Russia it has even been covered, since 2011, by the state standard GOST R 7.0.10-2010 (ISO 15836:2003), “National standard of the Russian Federation. System of standards on information, librarianship and publishing. The Dublin Core metadata element set”.

Dublin Core markup example:

<HTML>
  <HEAD>
    <TITLE>Song of the Open Road</TITLE>
    <META NAME="DC.Title" CONTENT="Song of the Open Road">
    <META NAME="DC.Creator" CONTENT="Nash, Ogden">
    <META NAME="DC.Type" CONTENT="text">
    <META NAME="DC.Date" CONTENT="1939">
    <META NAME="DC.Format" CONTENT="text/html">
    <META NAME="DC.Identifier" CONTENT="http://www.poetry.com/nash/open.html">
  </HEAD>
  <BODY><PRE>
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
</PRE></BODY>
</HTML>

Good Relations

The Good Relations vocabulary has been used since 2008 as a standard for describing e-commerce offerings. Its creators expected that such markup would give goods and services a structured representation in search engines.

Using the vocabulary, you can specify dedicated properties for:

a company - contact details, location, logo;
a store - address, opening hours, telephone;
an individual product - product category, short description, code, payment and delivery methods, as well as business functions for services (repair, installation, rental, etc.).

Good Relations covers the following areas of online commerce: books, cars, classified ads, concert tickets, consumer electronics, guided tours and outdoor events, and others.

On the RuNet this vocabulary is hardly used, but it can be found on some major foreign sites (Volkswagen UK, Strobelight-Shop, lux-case.se). Among search engines, GR markup is recognized by Google.

An example of markup using Good Relations:

<div typeof="gr:Offering" about="#offer">
  <div property="gr:name">HTML for Idiots - Used Copy, $9.99</div>
  <link rel="gr:hasBusinessFunction" resource="http://purl.org/goodrelations/v1#Sell" />
  <div rel="gr:hasPriceSpecification">
    <div typeof="gr:UnitPriceSpecification">Price:
      <span property="gr:hasCurrency" content="USD">$</span>
      <span property="gr:hasCurrencyValue" datatype="xsd:float">9.99</span>
      <div property="gr:validThrough" datatype="xsd:dateTime" content="2012-11-30T23:59:59Z"></div>
    </div>
  </div>
</div>

Good Relations has been integrated into Schema.org since November 2012. The vocabulary also has its own validator.

Yandex extensions for vocabularies

To get all the data it needs from sites, Yandex develops its own extensions for some vocabularies.

For example, they are needed to mark up:

interactive answers in Yandex.Islands (to describe forms and buttons);
dictionary entries (terms and scientific articles);
organization ratings;
target audience.

In the following posts we want to talk in detail about other aspects of semantic markup, for example, syntaxes, products and implementation examples. If you are interested in any other topics, tell us in the comments.

Source: https://habr.com/ru/post/211638/
