Parsing microformats

Microformats are a way to embed specific semantic data in the HTML we use today. The first question an XML guru would ask is: “Why use HTML when XML lets you create the same semantics?” I will not list all the reasons why XML would be a better or worse choice for encoding data, or why HTML was chosen as the basis for microformats. This article focuses on how the basic parsing rules work and how they differ from XML.

HTML Contact Information



One of the most popular and well-established microformats is hCard, a representation of vCard in HTML (the name is short for “HTML vCard”). You can read more about hCard on the microformats wiki. A vCard contains basic information about a person or organization. The format is widely used in address book applications as a way to back up and exchange contact information. By Internet standards it is an old format: its specification, RFC 2426, dates from 1998. XML was not yet in common use, so the syntax is plain text with a few delimiters and begin and end markers. Take my own information as an example:

  BEGIN:VCARD
  FN:Brian Suda
  N:Suda;Brian;;;
  URL:http://suda.co.uk
  END:VCARD

This vCard file contains the BEGIN:VCARD and END:VCARD lines, which act as a container and tell the parser where a card's data starts and stops. A single file can hold several vCards, so this convention neatly groups the data into separate cards. FN stands for “formatted name”, the name as it should be displayed. N is the structured name, in which the family name, given name, additional (middle) names, prefixes and suffixes are encoded, separated by semicolons. Finally, URL is the address of a website associated with the contact.
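
To make the line-based structure concrete, here is a minimal Python sketch of reading such a file. It is only an illustration: the function name `parse_vcards` is made up for this example, and the sketch ignores parts of the vCard format that are not covered here, such as folded lines, property parameters and escaped characters.

```python
def parse_vcards(text):
    """Split vCard text into a list of {property: value} dicts."""
    cards = []
    current = None  # the card being collected, or None outside BEGIN/END
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split on the first colon only, so URL values keep their "://".
        name, _, value = line.partition(":")
        name = name.upper()
        if name == "BEGIN" and value.upper() == "VCARD":
            current = {}
        elif name == "END" and value.upper() == "VCARD":
            if current is not None:
                cards.append(current)
            current = None
        elif current is not None:
            # N is structured: family;given;additional;prefixes;suffixes
            current[name] = value.split(";") if name == "N" else value
    return cards


SAMPLE = """BEGIN:VCARD
FN:Brian Suda
N:Suda;Brian;;;
URL:http://suda.co.uk
END:VCARD
"""

cards = parse_vcards(SAMPLE)
print(cards[0]["FN"])
print(cards[0]["N"])
```

The BEGIN and END lines only switch collection on and off; everything between them is treated as a property of the current card.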

If we had to encode all this in XML, the result would probably look something like this:

  <vcard>
      <fn>Brian Suda</fn>
      <n>
          <given-name>Brian</given-name>
          <family-name>Suda</family-name>
      </n>
      <url>http://suda.co.uk</url>
  </vcard>


Let's take a look at how we can mark up the same data in HTML using microformats, which use the rel, rev, and class attributes to encode semantics. The class attribute is used in much the same way that element names are used in XML. So the previous XML example looks like this in HTML:

  <div class="vcard">
      <div class="fn">Brian Suda</div>
      <div class="n">
          <div class="given-name">Brian</div>
          <div class="family-name">Suda</div>
      </div>
      <div class="url">http://suda.co.uk</div>
  </div>


If this were all microformats could do, they would not be very interesting. But microformats also let you take advantage of the semantics of existing HTML elements to indicate where the encoded data can be found. In this example every element is a <div>, but that is not required. This makes extracting data from HTML a little harder for the parser, but easier for the document's author. Microformats do not force authors to change their existing HTML structure or publishing style. After all, there are orders of magnitude more people writing HTML than writing parsers, so why not make life simpler for the authors?

When I look at the previous XML example, I don't like seeing “Brian Suda” twice, once in FN and again in N. In HTML this is not a problem: we can combine the two properties by giving the class attribute space-separated values. A little-known fact is that the class, rev and rel attributes can each hold a space-separated list of values. If we combine FN and N, we get something like this:

  <div class="n fn">
      <div class="given-name"> Brian </div>
      <div class="family-name"> Suda </div>
  </div>


Now the N property still has its child properties, and FN holds the same value as before. Remember that HTML collapses whitespace, so the FN value is still “Brian Suda”, even though it is split across two elements with whitespace inside and between the divs.
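
The whitespace rule can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. The helper name `text_value` is an assumption made for this illustration, and the snippet is assumed to be well-formed markup.

```python
import xml.etree.ElementTree as ET


def text_value(element):
    """Join all descendant text, then collapse each run of whitespace."""
    return " ".join("".join(element.itertext()).split())


SNIPPET = """<div class="n fn">
    <div class="given-name"> Brian </div>
    <div class="family-name"> Suda </div>
</div>"""

root = ET.fromstring(SNIPPET)
fn = text_value(root)
print(fn)
```

Here the line breaks and indentation between the two inner divs collapse to a single space, so the combined element yields the same FN value as a single element containing “Brian Suda”.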

That is how properties with the same value can be combined. The next thing that bothers me in the XML example is the way the URL is presented; it does not look natural. XML is about data, but HTML is shown to people in a browser. Conveniently, HTML has the <a> element, whose href attribute takes a URL and whose text content provides something friendlier for humans to read. We can keep refining our HTML example by changing the <div> element to an <a>:

  <a class="n fn url" href="http://suda.co.uk">
      <span class="given-name"> Brian </span>
      <span class="family-name"> Suda </span>
  </a>


After switching to the <a> element, we need to replace the child divs with spans, because an <a> element can contain only inline-level children. Microformats do not force authors to use particular elements, but it is recommended to use the most semantically appropriate element in each case. For URLs that is <a>, so the parsing rules change slightly (we will discuss this a little later).

The final hCard microformat in HTML will look like the following:

  <div class="vcard">
      <a class="n fn url" href="http://suda.co.uk">
          <span class="given-name"> Brian </span>
          <span class="family-name"> Suda </span>
      </a>
  </div>


In my opinion, this is much more intuitive, simple and compact than the XML example at the beginning. People already publish blogrolls and links this way, every browser recognizes and styles this markup, and it is easy to include in feeds.

Parsing with XSLT



Microformats are designed to work with HTML 4 and later. The drawback of using XSLT is that the document must be well-formed, which HTML 4 does not require. In HTML 4, tags such as <br>, <img> and <hr> can be used without closing tags. If you extracted microformats with another technology, such as regular expressions or the DOM, this would be a different matter, but with XSLT we first need to clean up the HTML. There are two simple ways to do this: Tidy, or a function such as HTMLlib or loadHTML; either one will load the HTML document and convert it into a form that XSLT can process.

Now that we have a well-formed HTML document, we can start extracting the microformats. What follows is very rough XSLT, far from perfect, but it should be enough to get started. For more information, read the parsing page on the microformats.org wiki, or use the existing XSLT templates that do most of the hard work of extracting the data (they are available on hg.microformats.org).

All hCard data is contained within an element that has the class “vcard”. In our example, this is a <div>, but it can be any element, so we start with this:

  //*[@class="vcard"]


This XPath expression searches the tree for any element whose class is “vcard”. At first glance it looks as if this will find every hCard, but the problem is that the class value may be a space-separated list of values. An element with class="vcard myStyle" will therefore not be selected by this XPath expression. To fix this, we use the contains function:

  //*[contains(@class, "vcard")]


This is better: now we find any element whose class attribute contains “vcard”. In class="vcard myStyle", “vcard” is found successfully, but there is another problem. The contains function is not safe because it is a substring search. So class="my-vcard" will match contains() just as class="vcard" does, even though my-vcard is not the class name that signals an hCard microformat. It is a false positive. To fix this, we need a small trick: pad the class value with a space on each side, and then search for the name we want, itself wrapped in spaces. It sounds complicated, but in practice it is not.

  //*[contains(concat(" ", @class, " "), " vcard ")]


With the padding, the class “my-vcard” becomes “ my-vcard ”, which does not contain the substring “ vcard ” (with a space on each side), so the substring problem is solved. Likewise, the class “vcard myStyle” becomes “ vcard myStyle ”, which does contain “ vcard ”, so space-separated values are found by this technique as well.
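
The same padding trick can be written outside XPath. The Python sketch below is only an illustration with made-up function names: `has_class_padded` mirrors the concat/contains expression literally, while `has_class_tokens` splits the attribute into tokens, which also copes with tabs or line breaks between class names.

```python
def has_class_padded(class_attr, name):
    """Mirror of contains(concat(' ', @class, ' '), ' name ')."""
    return f" {name} " in f" {class_attr} "


def has_class_tokens(class_attr, name):
    """Split on any whitespace and compare whole tokens."""
    return name in class_attr.split()


for value in ("vcard", "vcard myStyle", "my-vcard"):
    naive = "vcard" in value  # plain substring search, like contains()
    print(value, naive, has_class_padded(value, "vcard"))
```

For “my-vcard” the plain substring search reports a match while the padded version does not, which is exactly the false positive described above.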

Now that we know how to find the data, let's walk through each hCard with XSLT and output it as a vCard. By now it is easy to see how XSLT makes it simple to convert the HTML data into almost any format: other HTML, XML, RDF, a text vCard, CSV, SPARQL, JSON, or whatever else your heart desires.

The for-each instruction finds every hCard on the page and creates a vCard for each one. While creating a vCard, it applies the templates that look for properties inside the hCard, such as FN, N and URL.

  <xsl:for-each select="//*[contains(concat(' ', @class, ' '), ' vcard ')]">
      <!-- &#10; is a line feed, so each vCard property ends its own line -->
      <xsl:text>BEGIN:VCARD&#10;</xsl:text>
      <xsl:apply-templates/>
      <xsl:text>END:VCARD&#10;</xsl:text>
  </xsl:for-each>


FN is a simple template that retrieves the value of an element with a class containing FN.

  <xsl:template match="//*[contains(concat(' ', @class, ' '), ' fn ')]">
      <!-- normalize-space collapses whitespace between nested elements -->
      <xsl:text>FN:</xsl:text><xsl:value-of select="normalize-space(.)"/><xsl:text>&#10;</xsl:text>
  </xsl:template>


The N template is a bit more complicated. First it has to find an element whose class contains n. Then it looks for descendant elements holding the sub-properties of N, such as the given name and family name, and outputs their values.

  <xsl:template match="//*[contains(concat(' ', @class, ' '), ' n ')]">
      <xsl:text>N:</xsl:text>
      <!-- the leading .// keeps the search inside the current N element -->
      <xsl:value-of select="normalize-space(.//*[contains(concat(' ', @class, ' '), ' family-name ')])"/>
      <xsl:text>;</xsl:text>
      <xsl:value-of select="normalize-space(.//*[contains(concat(' ', @class, ' '), ' given-name ')])"/>
      <xsl:text>;;;&#10;</xsl:text>
  </xsl:template>


The URL template uses the choose element to decide where the most semantic value for the URL is located. It checks whether the element with the class “url” is an <a> element. If so, the address is taken from href; otherwise the element's text content is used.

  <xsl:template match="//*[contains(concat(' ', @class, ' '), ' url ')]">
      <xsl:text>URL:</xsl:text>
      <xsl:choose>
          <xsl:when test="local-name() = 'a'">
              <xsl:value-of select="@href"/>
          </xsl:when>
          <xsl:otherwise>
              <xsl:value-of select="normalize-space(.)"/>
          </xsl:otherwise>
      </xsl:choose>
      <xsl:text>&#10;</xsl:text>
  </xsl:template>


The <a> element, like many others, carries semantics of its own. In the original HTML example the URL was encoded with a <div>; in that case the text content would be extracted, and the URL value would be the same. This is just one of many ways in which microformats differ from XML: parsing microformat data depends on both the property and the HTML element in which it was encoded.
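
For readers who prefer to see the whole flow in one place, the following Python sketch walks through the same steps with the standard library's `xml.etree.ElementTree`. It is not the XSLT from this article, only an illustration under stated assumptions: the input is already well-formed and has no XML namespace, and all function names are made up for the example. Each property is checked independently, so an element with several class names, such as class="n fn url", contributes to every property it names.

```python
import xml.etree.ElementTree as ET


def has_class(el, name):
    """True if `name` is one of the space-separated class values."""
    return name in el.get("class", "").split()


def text_value(el):
    """Text content with whitespace collapsed."""
    return " ".join("".join(el.itertext()).split())


def first(el, name):
    """Text of the first descendant of `el` with class `name`, or ''."""
    for child in el.iter():
        if child is not el and has_class(child, name):
            return text_value(child)
    return ""


def hcard_to_vcard(card):
    lines = ["BEGIN:VCARD"]
    for el in card.iter():
        if has_class(el, "fn"):
            lines.append("FN:" + text_value(el))
        if has_class(el, "n"):
            lines.append("N:" + first(el, "family-name") + ";"
                         + first(el, "given-name") + ";;;")
        if has_class(el, "url"):
            # An <a> carries the URL in href; other elements use their text.
            value = el.get("href", "") if el.tag == "a" else text_value(el)
            lines.append("URL:" + value)
    lines.append("END:VCARD")
    return "\n".join(lines)


def extract_vcards(xhtml):
    root = ET.fromstring(xhtml)
    return [hcard_to_vcard(el) for el in root.iter() if has_class(el, "vcard")]


PAGE = """<div>
  <div class="vcard">
    <a class="n fn url" href="http://suda.co.uk">
      <span class="given-name"> Brian </span>
      <span class="family-name"> Suda </span>
    </a>
  </div>
</div>"""

vcards = extract_vcards(PAGE)
print(vcards[0])
```

If the URL were marked up as a <div class="url"> instead, the same function would fall back to the element's text, which matches the rule described above.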

This has been a very brief overview of extracting data from microformats. There are further rules that depend on the type of vCard property and on the HTML element it is built on. For more information, visit the microformats wiki or my PDF book, Using Microformats; you can also email me or subscribe to the microformats mailing list if you have questions.

Source: https://habr.com/ru/post/31301/

