
Extracting information from a URL, in the style of Slack and Twitter

Many people who use Slack or Twitter have seen the previews that appear when a link is posted.

How does it work, and how can you build it yourself?

For those who need it right now: I wrote a ready-made server that returns oEmbed-like information for a URL. A working instance can be found here.

Now, the places where useful information can be extracted:

1. oEmbed


Everyone knows about <title> and the description meta tag, so we will not dwell on them.
There is a format called oEmbed. Many large sites expose an oEmbed endpoint. For example:


oEmbed gives a much more complete summary of a page than you can get by parsing its HTML, so the first priority when processing a URL is to look for oEmbed links.

Discovery is described in section 4 on oembed.com:

The page's head must contain at least one link of the following type:

<link rel="alternate" type="application/json+oembed" href="http://flickr.com/services/oembed?url=http%3A%2F%2Fflickr.com%2Fphotos%2Fbees%2F2362225867%2F&format=json" title="Bacon Lollys oEmbed Profile" />
<link rel="alternate" type="text/xml+oembed" href="http://flickr.com/services/oembed?url=http%3A%2F%2Fflickr.com%2Fphotos%2Fbees%2F2362225867%2F&format=xml" title="Bacon Lollys oEmbed Profile" />

As you can see, the link's type attribute tells you the endpoint format: XML or JSON. So if we find an oEmbed link while parsing the HTML, we can breathe out and take the information we need from the oEmbed endpoint. oEmbed parsing for Go is implemented in my library.
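The discovery step above can be sketched roughly as follows. This is not the code from the author's library: it is a minimal illustration that uses only the Go standard library, with encoding/xml in non-strict mode standing in for a real HTML parser (which would be more robust on real pages). The names OembedLink, Oembed, findOembedLinks and decodeOembed, as well as the example.com URLs and the sample response, are assumptions made up for this example. Only a handful of the response fields listed on oembed.com are decoded.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
	"strings"
)

// Hypothetical sample page and endpoint response, for illustration only.
const samplePage = `<html><head>
<link rel="alternate" type="application/json+oembed" href="http://example.com/oembed?url=http%3A%2F%2Fexample.com%2Fpost&amp;format=json" title="Example" />
<link rel="alternate" type="text/xml+oembed" href="http://example.com/oembed?url=http%3A%2F%2Fexample.com%2Fpost&amp;format=xml" title="Example">
</head><body></body></html>`

const sampleResponse = `{"type":"photo","title":"Example photo","author_name":"bees","provider_name":"Example","width":240,"height":160}`

// OembedLink is one discovery link found in the page.
type OembedLink struct {
	Format string // "json" or "xml"
	Href   string
}

// Oembed holds a few of the fields an oEmbed endpoint may return.
type Oembed struct {
	Type         string `json:"type"`
	Title        string `json:"title"`
	AuthorName   string `json:"author_name"`
	ProviderName string `json:"provider_name"`
	Width        int    `json:"width"`
	Height       int    `json:"height"`
}

// findOembedLinks scans the markup for <link rel="alternate"> tags whose
// type marks them as oEmbed endpoints.
func findOembedLinks(page string) []OembedLink {
	dec := xml.NewDecoder(strings.NewReader(page))
	dec.Strict = false // tolerate HTML that is not well-formed XML
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var links []OembedLink
	for {
		tok, err := dec.Token()
		if err != nil {
			return links // end of input or markup the decoder cannot handle
		}
		el, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(el.Name.Local, "link") {
			continue
		}
		var rel, typ, href string
		for _, a := range el.Attr {
			switch strings.ToLower(a.Name.Local) {
			case "rel":
				rel = a.Value
			case "type":
				typ = a.Value
			case "href":
				href = a.Value
			}
		}
		if !strings.EqualFold(rel, "alternate") || href == "" {
			continue
		}
		switch strings.ToLower(typ) {
		case "application/json+oembed":
			links = append(links, OembedLink{Format: "json", Href: href})
		case "text/xml+oembed":
			links = append(links, OembedLink{Format: "xml", Href: href})
		}
	}
}

// decodeOembed parses a JSON oEmbed response body.
func decodeOembed(body string) (Oembed, error) {
	var o Oembed
	err := json.Unmarshal([]byte(body), &o)
	return o, err
}

func main() {
	for _, l := range findOembedLinks(samplePage) {
		fmt.Println(l.Format, l.Href)
	}
	o, err := decodeOembed(sampleResponse)
	fmt.Println(o.Type, o.Title, err)
}
```

In a real service the Href of the first suitable link would be fetched over HTTP and its body passed to the decoder; that step is left out here so the sketch runs without network access.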

2. Open Graph


This is additional page metadata that Google+, Facebook and others use to embed page content in their feeds. Read more here. This markup is used on a huge number of sites, including Habr. For example, look at the source code of this post and search for 'og:'.

Open Graph parsing for Go is implemented in my library (with the most complete functionality compared with similar libraries).
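As a rough illustration of what collecting 'og:' tags involves (this is not the author's library, which covers far more of the format), the sketch below gathers the content of every meta tag whose property starts with "og:". It uses only the standard library, again with encoding/xml in non-strict mode as a stand-in for a proper HTML parser. The function name parseOpenGraph and the sample page are assumptions made for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Hypothetical sample page, for illustration only.
const samplePage = `<html><head>
<title>Example post</title>
<meta property="og:title" content="Example post" />
<meta property="og:type" content="article">
<meta property="og:image" content="http://example.com/cover.png">
<meta name="description" content="Not an Open Graph tag">
</head><body><p>Hello</p></body></html>`

// parseOpenGraph returns the first value seen for each og:* property.
func parseOpenGraph(page string) map[string]string {
	dec := xml.NewDecoder(strings.NewReader(page))
	dec.Strict = false // tolerate HTML that is not well-formed XML
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	og := make(map[string]string)
	for {
		tok, err := dec.Token()
		if err != nil {
			return og // end of input or markup the decoder cannot handle
		}
		el, ok := tok.(xml.StartElement)
		if !ok || !strings.EqualFold(el.Name.Local, "meta") {
			continue
		}
		var property, content string
		for _, a := range el.Attr {
			switch strings.ToLower(a.Name.Local) {
			case "property":
				property = a.Value
			case "content":
				content = a.Value
			}
		}
		if strings.HasPrefix(property, "og:") {
			if _, seen := og[property]; !seen {
				og[property] = content
			}
		}
	}
}

func main() {
	og := parseOpenGraph(samplePage)
	fmt.Println(og["og:title"], og["og:type"], og["og:image"])
}
```

Keeping only the first value per property is a simplification: Open Graph allows repeated and structured properties (several og:image entries, for example), which a fuller parser would keep as lists.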

3. Collecting information bit by bit from what is available


If the page has neither oEmbed nor Open Graph data, we make do with whatever else is available:


To extract the content, I use github.com/dyatlov/go-readability. This is a fork of the original go-readability with support for whitelisted attributes added (this is needed to extract images correctly).

This is implemented in github.com/dyatlov/go-htmlinfo.

4. Generating oEmbed for non-HTML resources


Links can point not only to pages but also to images, videos, archives and so on. Naturally, no information can be extracted from markup for such links, so you have to generate it yourself.

Go has http.DetectContentType, which determines the content type of the resource at a given address from its first bytes (it looks at no more than the first 512). Depending on the content type, you can then take further steps. For images, I decode only the image header to get the dimensions, which I then return in the response. All this is implemented in the appropriate library.
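A minimal sketch of this idea, assuming the first bytes of the resource have already been downloaded into a byte slice. It is not the author's library: Info, describe and samplePNG are names invented for this example, and the PNG is generated in memory so the sketch runs without network access.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/color"
	_ "image/gif"  // register the GIF decoder
	_ "image/jpeg" // register the JPEG decoder
	"image/png"
	"net/http"
	"strings"
)

// Info is what we can say about a non-HTML resource.
type Info struct {
	ContentType string
	Width       int
	Height      int
}

// describe sniffs the content type from the leading bytes and, for images,
// reads the dimensions from the image header without decoding the pixels.
func describe(head []byte) (Info, error) {
	info := Info{ContentType: http.DetectContentType(head)}
	if !strings.HasPrefix(info.ContentType, "image/") {
		return info, nil
	}
	cfg, _, err := image.DecodeConfig(bytes.NewReader(head))
	if err != nil {
		// Sniffed as an image, but no decoder is registered for the format
		// or the header is incomplete.
		return info, err
	}
	info.Width, info.Height = cfg.Width, cfg.Height
	return info, nil
}

// samplePNG builds a small PNG in memory to use as test input.
func samplePNG(w, h int) []byte {
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	img.Set(0, 0, color.White)
	var buf bytes.Buffer
	if err := png.Encode(&buf, img); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	info, err := describe(samplePNG(3, 2))
	fmt.Println(info.ContentType, info.Width, info.Height, err)

	info, err = describe([]byte("just some text"))
	fmt.Println(info.ContentType, err)
}
```

Because image.DecodeConfig only needs the header, a limited prefix of the file is usually enough for common formats, which fits well with downloading only part of the resource.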

Defending ourselves


Tasks (apart from the obvious ones):

1. Follow redirects without falling into an infinite loop. For example, bit.ly/1cWYIdC should resolve to Habr. Solution
2. Protect against attacks on local resources (see the attack on Pocket). Solution
3. Load only a limited amount of data (if a link points to a Linux ISO image, there is no need to download all of it). Solution
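The three tasks above could be approached along these lines. This is a sketch under stated assumptions rather than the author's solution: the limits (5 redirects, 512 KiB) are arbitrary example values, and isPublicIP, newSafeClient and fetchHead are names invented here. The address check runs in the dialer's Control hook, so it sees the IP address actually being connected to after DNS resolution, on every redirect hop.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"syscall"
	"time"
)

const (
	maxRedirects = 5          // assumed limit, tune as needed
	maxBodyBytes = 512 * 1024 // assumed limit, tune as needed
)

// isPublicIP reports whether a textual IP address looks safe to connect to.
// Anything that does not parse as an IP is rejected.
func isPublicIP(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	return !(ip.IsLoopback() || ip.IsPrivate() || ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() || ip.IsMulticast())
}

// newSafeClient builds an HTTP client that caps redirects and refuses to
// connect to local or private addresses.
func newSafeClient() *http.Client {
	dialer := &net.Dialer{
		Timeout: 5 * time.Second,
		// Control runs just before the connection is made; address is
		// already a resolved "ip:port" pair at this point.
		Control: func(network, address string, _ syscall.RawConn) error {
			host, _, err := net.SplitHostPort(address)
			if err != nil {
				return err
			}
			if !isPublicIP(host) {
				return fmt.Errorf("refusing to connect to %s", address)
			}
			return nil
		},
	}
	return &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= maxRedirects {
				return errors.New("too many redirects")
			}
			return nil
		},
	}
}

// fetchHead downloads at most maxBodyBytes of the response body.
func fetchHead(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes))
}

func main() {
	client := newSafeClient()
	_, err := fetchHead(client, "http://127.0.0.1:80/")
	fmt.Println("local request error:", err)
}
```

Go's default client already gives up after 10 consecutive redirects, so the explicit CheckRedirect mainly makes the limit visible and adjustable. The sketch relies on net.IP.IsPrivate and io.ReadAll, which need a reasonably recent Go release.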

Conclusion


I apologize for the large number of links to my own repositories and for the messy description. I tried to make the code understandable and to split it into logical modules that can be reused independently. I hope it is useful to someone.

The source code of the finished server is here. The server itself is running here.

What you can use it for: automatically loading previews for URLs in comments, preview popups for links in posts, and other things related to displaying linked content.

Source: https://habr.com/ru/post/269055/

