
How and why we built our semantic markup validator

We recently wrote about our semantic markup validator. Today we want to explain why and how it was built, what difficulties came up during development, and how we dealt with them. One reason, of course, is that we wanted to spare robots from running into webmasters' mistakes. But that was not our only motivation.

How robots react to errors in semantic markup

Slowly but surely, semantic markup is gaining popularity. The term "semantic web" was first introduced a little more than ten years ago, in May 2001. The first mention of the RDFa format appeared in 2004, and microformats began to develop at about the same time. The schema.org standard was launched in June 2011. Today semantic markup is supported by Yandex and the other leading global search engines.
However, webmasters often find that HTML validators report a lot of errors. The reason is that most of them do not support semantic markup. For example, validator.w3.org supports HTML5 only in test mode and treats microdata as an error.



There are many tools for checking semantic markup. Some are universal (for example, http://validator.nu/), while others check one specific type of markup. There are RDFa validators (checkrdfa, w3.org/RDF/Validator/, rdfabout.com and others), the OpenGraph validator from Facebook, and microformat validators (for example, the hCard validator). Search engines that use markup offer their own validators.

When Yandex began to use microformats and schema.org in its services, it turned out that each markup consumer uses the markup in its own way and has its own set of extensions. A validator provided by the consumer therefore makes webmasters' lives much easier. In addition, we warn not only about violations of the standard, but also about what needs to change in the markup so that we can use the data in our services.

Developing a universal semantic markup validator


The validator now handles all popular types of semantic markup (microformats, microdata, RDFa) and popular vocabularies (schema.org, OpenGraph). These formats differ both in how they are embedded in the page and in their internal structure.

Microformats are the hardest to parse. Apparently, their designers paid more attention to how easily the markup can be embedded in a page than to how easily the data can be extracted.

Microformats are embedded as CSS classes, so to find them on a page you need to understand their structure. They can be nested in each other, and one microformat can be a field of another. Some microformats are independent in some cases and fields of other formats in others: hCard, for example, can be used both on its own and as a field. Others never occur on their own (for example, the adr field in hCard).

Our validator contains a small framework that understands the parsing logic of the different microformats. Thanks to it, we can add individual new fields or even whole types of markup separately.
Here is a code fragment that parses and validates the hRecipe microformat:

    public class HRecipe extends Microformat {
        final private static HRecipe instance = new HRecipe("hrecipe", true);

        protected HRecipe(@NotNull final String name, final boolean root) {
            super(name, root);
        }

        /**
         * Property declarations.
         * Property kinds: MFPropertySingular, MFPropertyPlural, MFPropertyConcatenated.
         */
        static {
            instance.addProperty(new MFPropertySingular("fn", TextProperty.getInstance()));
            instance.addProperty(
                    new MFPropertyPlural("ingredient", Ingredient.getInstance(), TextProperty.getInstance()));
            instance.addProperty(new MFPropertySingular("yield", TextProperty.getInstance()));
            instance.addProperty(
                    new MFPropertySingular("instructions", Instructions.getInstance(), TextProperty.getInstance()));
            ...
            instance.addProperty(new MFPropertySingular("result-photo", URIProperty.getInstance()));
            instance.addProperty(new MFPropertySingular("summary", TextProperty.getInstance()));
        }
        ...
    }

When parsing microformats, we first mark the nodes of the tree that can act as roots. Next we find all the elements that can be used as fields (adr, fn, and so on). After that, nested cards are cut out of the format, and parsing of the format itself begins. If some fields are themselves microformats, they are parsed recursively. The hCard microformat shows how this works:

    <div class="vcard">
      <!-- fn, org and url are all set on a single element -->
      <a class="fn org url" href="http://www.commerce.net/">CommerceNet</a>
      <!-- adr is a field that has its own subfields: street-address, locality, etc. -->
      <div class="adr">
        <span class="type">Work</span>:
        <div class="street-address">169 University Avenue</div>
        <span class="locality">Palo Alto</span>,
        <abbr class="region" title="California">CA</abbr>
        <span class="postal-code">94301</span>
        <div class="country-name">USA</div>
      </div>
      <!-- tel has a type subfield -->
      <div class="tel"><span class="type">Work</span> +1-650-289-4040</div>
      <div class="tel"><span class="type">Fax</span> +1-650-289-4041</div>
      <div>Email: <span class="email">info@commerce.net</span></div>
      <!-- the agent field is itself an hCard -->
      <div class="agent vcard">Contact person:
        <a class="fn n email" href="mailto:johndow@commerce.net">John Dow</a>
      </div>
      <!-- a nested hCard that is not a field of the outer card -->
      <div class="vcard">See also:
        <a class="fn org url" href="http://www.yellowpages.com/ATT">AT&amp;T</a>
      </div>
    </div>
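The steps described above (find the roots, collect the fields, recurse into fields that are themselves microformats, keep unrelated nested cards separate) can be sketched roughly as follows. This is only an illustration under our own assumptions: the Node and Card classes, the trimmed FORMATS table and all method names are hypothetical and are not taken from the validator's real framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: every name here is hypothetical, not the validator's real API. */
final class MicroformatSketch {

    /** Minimal stand-in for a DOM element: its CSS classes, its text, its children. */
    static final class Node {
        final Set<String> classes;
        final String text;
        final List<Node> children;

        Node(String classAttr, String text, Node... children) {
            this.classes = new LinkedHashSet<>(Arrays.asList(classAttr.trim().split("\\s+")));
            this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** A parsed card: the format name plus every value found for each field. */
    static final class Card {
        final String type;
        final Map<String, List<Object>> fields = new LinkedHashMap<>();

        Card(String type) {
            this.type = type;
        }
    }

    /** Assumed, heavily trimmed schema: format name -> names of the fields it accepts. */
    static final Map<String, List<String>> FORMATS = Map.of(
            "vcard", List.of("fn", "org", "url", "adr", "tel", "email", "agent"),
            "adr", List.of("street-address", "locality", "region", "postal-code", "country-name"));

    /** Formats that may start a card on their own; adr only ever appears as a field. */
    static final Set<String> ROOTS = Set.of("vcard");

    /** Returns the root cards under the node, plus nested cards that are not fields. */
    static List<Card> parseDocument(Node node) {
        List<Card> cards = new ArrayList<>();
        findRoots(node, cards);
        return cards;
    }

    private static void findRoots(Node node, List<Card> cards) {
        String format = formatOf(node);
        if (format != null && ROOTS.contains(format)) {
            Card card = new Card(format);
            cards.add(card);
            fill(node, card, cards);
        } else {
            for (Node child : node.children) findRoots(child, cards);
        }
    }

    private static void fill(Node root, Card card, List<Card> detached) {
        for (Node child : root.children) collect(child, card, detached);
    }

    private static void collect(Node node, Card card, List<Card> detached) {
        String nestedFormat = formatOf(node);
        Object value = node.text;
        if (nestedFormat != null) {          // the node is itself a microformat: recurse
            Card nested = new Card(nestedFormat);
            fill(node, nested, detached);
            value = nested;
        }
        boolean usedAsField = false;
        for (String field : FORMATS.get(card.type)) {
            if (node.classes.contains(field)) {
                card.fields.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
                usedAsField = true;
            }
        }
        if (nestedFormat == null) {
            fill(node, card, detached);      // plain element: keep looking deeper
        } else if (!usedAsField && ROOTS.contains(nestedFormat)) {
            detached.add((Card) value);      // independent card, not a field of this one
        }
    }

    private static String formatOf(Node node) {
        for (String cssClass : node.classes) {
            if (FORMATS.containsKey(cssClass)) return cssClass;
        }
        return null;
    }

    /** A cut-down version of the hCard example above. */
    static Node sample() {
        return new Node("vcard", "",
                new Node("fn org url", "CommerceNet"),
                new Node("adr", "",
                        new Node("street-address", "169 University Avenue"),
                        new Node("locality", "Palo Alto")),
                new Node("agent vcard", "", new Node("fn n email", "John Dow")),
                new Node("vcard", "", new Node("fn org url", "AT&T")));
    }

    public static void main(String[] args) {
        for (Card card : parseDocument(sample())) {
            System.out.println(card.type + " " + card.fields.keySet());
        }
    }
}
```

Because recursion stops at every nested format, the fn of the agent card and the fn of the "See also" card never leak into the outer card. That separation is the point of cutting nested cards out before parsing.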


All microformats differ in their internal structure and set of fields. An interesting case is hResume (or rather its draft: like many other microformats, it has no final version yet). In this format, the experience field is defined simultaneously as an hCalendar (to indicate the time interval) and an hCard (to describe the organization where the person worked). As a result, some fields belong to two cards at once. When parsing other microformats, we assigned each field to exactly one card, which let us separate microformats nested in each other to any depth. To parse resumes without breaking the parser's logic, we added an artificial data type to the framework: an extended hCalendar that holds all the fields related to work experience. So we parse this microformat not quite according to the specification, but correctly.

    <div class="hresume">
      <div class="position first experience vevent vcard summary-current" style="display:block">
        <span class="n fn" id="name">
          <span class="full-name"><span class="given-name">Peter</span> <span class="family-name">Savelyev</span></span>
        </span>
        <div class="postitle">
          <h4><strong>
            <a class="company-profile" href="/company/10718?trk=pro_other_cmpy"><span class="org summary">Yandex</span></a>
          </strong></h4>
        </div>
        <p class="period">
          <abbr class="dtstart" title="2011-07-01">2011</abbr>
          <abbr class="dtstamp" title="2012-12-11"></abbr>
          <span class="duration"><span class="value-title" title="P1Y6M"> </span>(1 year 6 months)</span>
        </p>
        <p class="description current-position">structured information extraction using semantic technologies</p>
      </div>
    </div>

The parser will handle this example like this:

    hresume
      experience
        vevent
          vcard
            fn = Peter Savelyev
            n
              family-name = Savelyev
              given-name = Peter
            org = Yandex
          dtstamp = 2012-12-11
          dtstart = 2011-07-01
          duration = P1Y6M
          summary = Yandex
          description = structured information extraction using semantic technologies

Many microformats are already widely used although they have no finished specification. At the moment only hCalendar, hCard and a few rel microformats are final. The rest are drafts at varying stages of readiness: some are almost finished (for example, hRecipe), others are far from complete and leave a number of points undefined (for example, hListing). Many microformats are defined only by examples, so in disputed cases it can be hard to make a well-grounded decision.

hListing is a microformat for describing goods and services. There is still no formal description of its action field, which specifies the type of service (sale, purchase), so webmasters mark it up as they see fit. For example:

    <div class="hlisting"> ... <span class="offer rent"></span> ... </div>

    <div class='hlisting offer-rent'> ... </div>

We have postponed support for this format until a more complete version of the specification is released.

The schema.org vocabulary is not always convenient to work with either. In the real world, markup works differently from how it could work under ideal conditions.

For example, when a property is placed on a link, only the URL counts as its value. However, the link text itself may contain useful information. So our parser now stores not only the URL, as the specification requires, but also the text:

    <a onclick="(new Image()).src='/rg/title-overview/director-1/images/b.gif?link=%2Fname%2Fnm0000487%2F';" href="/name/nm0000487/" itemprop="director">Ang Lee</a>

If we parse this example strictly by the rules, the “director” field will contain only the link "/name/nm0000487/", and in most cases that is not enough. So our parser extracts the data as follows:

    director
      href = /name/nm0000487/
      text = Ang Lee
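A minimal sketch of this idea, assuming a single well-formed anchor with double-quoted attributes: it keeps both the href and the visible text of a link that carries itemprop. The class and method names are hypothetical, and the regular expressions stand in for a real HTML parser only to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; regex "parsing" is only good enough for this one-tag demo. */
final class LinkPropertySketch {

    /** The value kept for a link property: the target plus the visible text. */
    static final class LinkValue {
        final String property;
        final String href;
        final String text;

        LinkValue(String property, String href, String text) {
            this.property = property;
            this.href = href;
            this.text = text;
        }
    }

    // Group 1: the raw attribute list of the <a> tag, group 2: its inner HTML.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b([^>]*)>(.*?)</a>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // Group 1: attribute name, group 2: double-quoted attribute value.
    private static final Pattern ATTRIBUTE =
            Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    /** Returns the property carried by the first link in the snippet, if there is one. */
    static Optional<LinkValue> extractLink(String html) {
        Matcher anchor = ANCHOR.matcher(html);
        if (!anchor.find()) {
            return Optional.empty();
        }
        Map<String, String> attributes = new HashMap<>();
        Matcher attribute = ATTRIBUTE.matcher(anchor.group(1));
        while (attribute.find()) {
            attributes.put(attribute.group(1).toLowerCase(), attribute.group(2));
        }
        String property = attributes.get("itemprop");
        String href = attributes.get("href");
        if (property == null || href == null) {
            return Optional.empty();
        }
        // The specification would stop at href; the text is the extra piece we keep.
        String text = anchor.group(2).replaceAll("<[^>]+>", "").trim();
        return Optional.of(new LinkValue(property, href, text));
    }

    public static void main(String[] args) {
        String html = "<a href=\"/name/nm0000487/\" itemprop=\"director\">Ang Lee</a>";
        extractLink(html).ifPresent(link ->
                System.out.println(link.property + " href = " + link.href + " text = " + link.text));
    }
}
```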

The most common problem in validating and using any semantic markup is webmasters' mistakes. Pages do not always contain correct markup, sometimes it is embedded in the page incorrectly, and so on. The validator always warns about an error (or simply about the fact that there is not enough data for Yandex services to use). We try to make the warnings as clear and specific as possible.

At the same time, we try to get the most out of the data even when it contains errors. So sometimes we warn about an error while our parser still successfully extracts the incorrectly marked-up data.
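The "warn, but still extract" approach can be illustrated with a small sketch. It assumes that fields arrive as (name, value) pairs in page order and that some fields are declared as occurring only once. The class, the method and the warning texts are all hypothetical, and the real validator's rules are certainly richer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of lenient extraction; not the validator's real code. */
final class LenientParseSketch {

    /** Extracted data together with the warnings raised while extracting it. */
    static final class Result {
        final Map<String, List<String>> data = new LinkedHashMap<>();
        final List<String> warnings = new ArrayList<>();
    }

    /**
     * @param found    (field, value) pairs in the order they appear on the page
     * @param singular fields that may occur only once
     * @param required fields that must be present
     */
    static Result extract(List<String[]> found, Set<String> singular, Set<String> required) {
        Result result = new Result();
        for (String[] pair : found) {
            String field = pair[0];
            String value = pair[1].trim();
            if (value.isEmpty()) {
                result.warnings.add("empty value for '" + field + "' skipped");
                continue;
            }
            List<String> values = result.data.computeIfAbsent(field, k -> new ArrayList<>());
            if (singular.contains(field) && !values.isEmpty()) {
                // Report the problem, but keep the first value instead of rejecting the card.
                result.warnings.add("'" + field + "' may occur only once; extra value ignored");
                continue;
            }
            values.add(value);
        }
        for (String field : required) {
            if (!result.data.containsKey(field)) {
                result.warnings.add("required field '" + field + "' is missing");
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> found = List.of(
                new String[] {"fn", "Pancakes"},
                new String[] {"fn", "Crepes"},
                new String[] {"ingredient", "flour"});
        Result result = extract(found, Set.of("fn"), Set.of("fn", "ingredient"));
        System.out.println(result.data);
        System.out.println(result.warnings);
    }
}
```

In this sketch a repeated singular field produces a warning instead of failing the whole card, so the remaining data stays usable.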

RDFa, by contrast, caused no problems, with one nuance. RDFa allows entities to refer to each other, which makes loops possible. So when the value of one entity's property is a reference to another entity, we treat it as a link and do not substitute the referenced entity's description. This does not contradict the official specification and prevents possible problems.
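To show why keeping references as links avoids trouble with loops, here is a rough sketch. The entity ids, property names and classes are made up for the illustration and do not come from the validator. Each property value is either a literal or a reference, and the renderer prints a reference as a pointer instead of expanding it, so two entities that refer to each other are still printed in finite time.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: references stay links, so cyclic graphs cannot cause endless expansion. */
final class RdfaLinkSketch {

    /** A property value: either literal text or a reference to another entity by id. */
    static final class Value {
        final String literal; // null when the value is a reference
        final String refId;   // null when the value is a literal

        private Value(String literal, String refId) {
            this.literal = literal;
            this.refId = refId;
        }

        static Value ofLiteral(String text) {
            return new Value(text, null);
        }

        static Value ofRef(String entityId) {
            return new Value(null, entityId);
        }

        boolean isRef() {
            return refId != null;
        }
    }

    /** Prints every entity once; a referenced entity is never inlined into its referrer. */
    static String render(Map<String, Map<String, Value>> entities) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Map<String, Value>> entity : entities.entrySet()) {
            out.append(entity.getKey()).append('\n');
            for (Map.Entry<String, Value> property : entity.getValue().entrySet()) {
                Value value = property.getValue();
                out.append("  ").append(property.getKey());
                if (value.isRef()) {
                    out.append(" -> ").append(value.refId);
                } else {
                    out.append(" = ").append(value.literal);
                }
                out.append('\n');
            }
        }
        return out.toString();
    }

    /** Two made-up entities that refer to each other, forming a loop. */
    static Map<String, Map<String, Value>> sample() {
        Map<String, Value> alice = new LinkedHashMap<>();
        alice.put("name", Value.ofLiteral("Alice"));
        alice.put("knows", Value.ofRef("#bob"));
        Map<String, Value> bob = new LinkedHashMap<>();
        bob.put("name", Value.ofLiteral("Bob"));
        bob.put("knows", Value.ofRef("#alice"));
        Map<String, Map<String, Value>> entities = new LinkedHashMap<>();
        entities.put("#alice", alice);
        entities.put("#bob", bob);
        return entities;
    }

    public static void main(String[] args) {
        System.out.print(render(sample()));
    }
}
```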

Thanks to its various vocabularies, RDFa can be used to mark up a wide variety of data. Our validator already understands some of these vocabularies; others are still in our plans. One of the most popular vocabularies today is OpenGraph. It even has its own prefix, “og”, in the validation results.

Results


Our validator now supports all popular types of semantic markup, even those not yet used in Yandex services. We want to make life easier for webmasters who are interested in the semantic web and help them deal with errors. To that end we have written, among other things, a help section on the validator, an introduction to schema.org, and a guide to using microformats.

Some statistics


More than half of the requests to the validator contain schema.org markup. The validator finds errors in approximately 9% of them.

About 20% of requests contain the hCard microformat, and errors are found in about 10% of those.

About 4% of requests contain RDFa, and about 10% of them contain errors.

OpenGraph accounts for only about 1.5% of requests, and it is almost error-free: the validator reports incorrect markup in only 1% of cases. Still, do not be surprised if you see warnings when checking code that contains OpenGraph. So far Yandex accepts only og:video, and the validator says so, although it can check other types of this markup as well.

Another 2% or so of requests are checks of the examples on the validator's main page.

We are not going to stop there: we plan to make the validator's interface even simpler and more intuitive, for example by visually separating errors and warnings that relate to the standard from those that relate to Yandex services. And, of course, we plan to use the data obtained from semantic markup in our services as much as possible. In the near future we will keep sharing useful information about semantic markup: how to embed it more conveniently and where it is best used. If there is something specific you want to know, write about it in the comments, and we will try to answer your questions in upcoming posts.

Source: https://habr.com/ru/post/165727/

