Many people who use Slack or Twitter have seen link previews like this:
How does it work, and how can you build it yourself?
For those in a hurry: I wrote a ready-made server that returns oembed-like information for a URL. A working instance can be found
here .
And now, the places from which useful information can be extracted:
1. Oembed
Everyone knows about <title> and <meta name="description">, so we will not dwell on them.
There is a format called
oEmbed . Many large sites expose an oEmbed endpoint. For example:
oEmbed data gives a much more complete summary of a page than you can get by parsing its HTML, so the first priority when processing a URL is to look for oEmbed links.
Discovery is described in
section 4 of
oembed.com :
The <head> of the page must contain at least one link of the following form:
<link rel="alternate" type="application/json+oembed" href="http://flickr.com/services/oembed?url=http%3A%2F%2Fflickr.com%2Fphotos%2Fbees%2F2362225867%2F&format=json" title="Bacon Lollys oEmbed Profile" />
<link rel="alternate" type="text/xml+oembed" href="http://flickr.com/services/oembed?url=http%3A%2F%2Fflickr.com%2Fphotos%2Fbees%2F2362225867%2F&format=xml" title="Bacon Lollys oEmbed Profile" />
As you can see, the type attribute tells us the format of the endpoint: XML or JSON. So if we find an oEmbed link while parsing the HTML, we can breathe easy and take the information we need from the oEmbed endpoint. oEmbed parsing for Go is implemented in
my library .
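As a rough illustration of the discovery step, here is a minimal sketch that pulls oEmbed links out of a page. It is not the code from the library mentioned above: the names findOembedLinks and OembedLink are made up for this example, and regular expressions are used only to keep the sketch within the standard library (a real implementation would more likely use a proper HTML parser).

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// Assumption: regular expressions are good enough for a sketch.
// Real-world HTML is better handled by an HTML parser.
var (
	linkTagRe = regexp.MustCompile(`(?is)<link\b[^>]*>`)
	attrRe    = regexp.MustCompile(`(?is)([a-z][a-z0-9_-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')`)
)

// OembedLink is a hypothetical type describing one discovered endpoint.
type OembedLink struct {
	Format string // "json" or "xml"
	Href   string // endpoint URL taken from the href attribute
}

// findOembedLinks returns every <link rel="alternate"> whose type is one
// of the two oEmbed discovery types, in document order.
func findOembedLinks(page string) []OembedLink {
	var links []OembedLink
	for _, tag := range linkTagRe.FindAllString(page, -1) {
		attrs := map[string]string{}
		for _, m := range attrRe.FindAllStringSubmatch(tag, -1) {
			val := m[2] // double-quoted value
			if val == "" {
				val = m[3] // single-quoted value
			}
			attrs[strings.ToLower(m[1])] = html.UnescapeString(val)
		}
		if !strings.EqualFold(attrs["rel"], "alternate") {
			continue
		}
		switch strings.ToLower(attrs["type"]) {
		case "application/json+oembed":
			links = append(links, OembedLink{Format: "json", Href: attrs["href"]})
		case "text/xml+oembed":
			links = append(links, OembedLink{Format: "xml", Href: attrs["href"]})
		}
	}
	return links
}

func main() {
	page := `<head><link rel="alternate" type="application/json+oembed" href="http://flickr.com/services/oembed?url=http%3A%2F%2Fflickr.com%2Fphotos%2Fbees%2F2362225867%2F&amp;format=json" title="Bacon Lollys oEmbed Profile" /></head>`
	for _, l := range findOembedLinks(page) {
		fmt.Println(l.Format, l.Href)
	}
}
```

Once a link is found, the endpoint in Href can be requested and its JSON or XML response decoded.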
2. Open Graph
This is additional page metadata that Google+, Facebook and others use to embed page content in their feeds. Read more
here . This markup is used on a huge number of sites, including Habr: look at the source of this post and search for 'og:'.
Open Graph parsing for Go is implemented in
my library (the most complete functionality compared with the alternatives).
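To show what "searching for 'og:'" amounts to in code, here is a minimal sketch that collects Open Graph properties from a page. It is only an illustration, not the library's API: parseOpenGraph is a hypothetical name, only the first value of each property is kept, and regular expressions stand in for a real HTML parser.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// Assumption: regular expressions are good enough for a sketch.
var (
	metaTagRe = regexp.MustCompile(`(?is)<meta\b[^>]*>`)
	attrRe    = regexp.MustCompile(`(?is)([a-z][a-z0-9_-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')`)
)

// parseOpenGraph (hypothetical name) returns a map from property name,
// such as "og:title", to the content of the first matching <meta> tag.
func parseOpenGraph(page string) map[string]string {
	og := map[string]string{}
	for _, tag := range metaTagRe.FindAllString(page, -1) {
		attrs := map[string]string{}
		for _, m := range attrRe.FindAllStringSubmatch(tag, -1) {
			val := m[2] // double-quoted value
			if val == "" {
				val = m[3] // single-quoted value
			}
			attrs[strings.ToLower(m[1])] = html.UnescapeString(val)
		}
		prop := attrs["property"]
		if !strings.HasPrefix(prop, "og:") {
			continue
		}
		if _, seen := og[prop]; !seen {
			og[prop] = attrs["content"]
		}
	}
	return og
}

func main() {
	page := `<head>
<meta property="og:title" content="Example title" />
<meta property="og:image" content="http://example.com/cover.png" />
</head>`
	og := parseOpenGraph(page)
	fmt.Println(og["og:title"])
	fmt.Println(og["og:image"])
}
```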
3. Collecting information bit by bit from what is available
If the page has neither oEmbed nor Open Graph markup, we make do with the data that is there:
- <title> for the title
- <meta name="description"> for the description
- <link rel="image_src"> for the image
- if there is no image, we take the first image from the main content
- if there is no description, we take part of the main content text as the description
To extract the main content, I use
github.com/dyatlov/go-readability , a
fork of the original go-readability with support for whitelisted attributes added (this is needed to extract images correctly).
All of this is implemented in
github.com/dyatlov/go-htmlinfo .
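The fallback order described above can be sketched roughly as follows. This is not the go-htmlinfo API: the types Extracted and Preview, the functions buildPreview and excerpt, and the length limit are all assumptions made for the example, and the input is presumed to have been extracted from the page already.

```go
package main

import (
	"fmt"
	"strings"
)

// Extracted holds raw values already pulled from the page (hypothetical type).
type Extracted struct {
	Title           string   // text of <title>
	MetaDescription string   // content of <meta name="description">
	ImageSrc        string   // href of <link rel="image_src">
	ContentImages   []string // image URLs found in the main content
	ContentText     string   // plain text of the main content
}

// Preview is the information returned to the caller (hypothetical type).
type Preview struct {
	Title, Description, Image string
}

// excerpt collapses whitespace and cuts the text to at most maxRunes runes,
// appending "..." when something was cut off.
func excerpt(text string, maxRunes int) string {
	text = strings.Join(strings.Fields(text), " ")
	runes := []rune(text)
	if len(runes) <= maxRunes {
		return text
	}
	return strings.TrimSpace(string(runes[:maxRunes])) + "..."
}

// buildPreview applies the fallbacks: explicit metadata first, then
// whatever can be taken from the main content.
func buildPreview(e Extracted, maxDescRunes int) Preview {
	p := Preview{Title: e.Title, Description: e.MetaDescription, Image: e.ImageSrc}
	if p.Image == "" && len(e.ContentImages) > 0 {
		p.Image = e.ContentImages[0]
	}
	if p.Description == "" {
		p.Description = excerpt(e.ContentText, maxDescRunes)
	}
	return p
}

func main() {
	p := buildPreview(Extracted{
		Title:         "Example page",
		ContentImages: []string{"http://example.com/first.png"},
		ContentText:   "Some long article text that will be shortened.",
	}, 20)
	fmt.Printf("%+v\n", p)
}
```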
4. Generation of oembed for non-html resources
Links can point not only to pages but also to images, videos, archives and so on. Such links naturally carry no metadata, so you have to generate it yourself.
Go has a function called
http.DetectContentType . It determines the type of the content at the given address from its first bytes (it considers at most the first 512). What happens next depends on the content type. For images, I decode only the image header to get the dimensions, which I then return in the response. All of this is implemented in the
appropriate library .
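A minimal sketch of this step might look like the following. The MediaInfo type and the describe and samplePNG functions are hypothetical names for this example, not the library's API; the standard-library calls used are http.DetectContentType and image.DecodeConfig, which reads only the image header.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	_ "image/gif"  // register the GIF decoder
	_ "image/jpeg" // register the JPEG decoder
	"image/png"    // registers the PNG decoder and is used by samplePNG
	"net/http"
	"strings"
)

// MediaInfo is a hypothetical result type for non-HTML resources.
type MediaInfo struct {
	ContentType   string
	Width, Height int // filled in only for images that could be decoded
}

// describe sniffs the content type from the beginning of the data and,
// for images, decodes just the header to obtain the dimensions.
func describe(head []byte) MediaInfo {
	info := MediaInfo{ContentType: http.DetectContentType(head)}
	if strings.HasPrefix(info.ContentType, "image/") {
		if cfg, _, err := image.DecodeConfig(bytes.NewReader(head)); err == nil {
			info.Width, info.Height = cfg.Width, cfg.Height
		}
	}
	return info
}

// samplePNG builds a tiny 4x3 PNG in memory so the example needs no network.
func samplePNG() []byte {
	var buf bytes.Buffer
	if err := png.Encode(&buf, image.NewRGBA(image.Rect(0, 0, 4, 3))); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	fmt.Printf("%+v\n", describe(samplePNG()))
	fmt.Printf("%+v\n", describe([]byte("just some text")))
}
```

In a real server, head would be the beginning of the response body rather than an in-memory image.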
Defending ourselves
Tasks (besides the obvious ones):
1. Follow redirects without ending up in an infinite loop. For example,
bit.ly/1cWYIdC should resolve to Habr.
2. Protect against attacks on local resources (see the
attack on Pocket ).
3. Download only a limited amount of data (if the link points to a Linux ISO image, there is no need to download all of it).
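One possible way to cover all three tasks with the standard library is sketched below. It is not necessarily how the server described here does it: the names newSafeClient, fetchLimited, safeControl and isPublicIP, as well as the limits and timeouts, are assumptions chosen for the example. The address check runs in the dialer's Control hook, that is, after DNS resolution, so it sees the IP address that is actually about to be connected to.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"syscall"
	"time"
)

const (
	maxRedirects = 10        // assumed limit on the redirect chain
	maxBodyBytes = 512 << 10 // assumed cap on downloaded bytes (512 KiB)
)

// isPublicIP rejects loopback, private, link-local, multicast and
// unspecified addresses.
func isPublicIP(ip net.IP) bool {
	return !(ip.IsLoopback() || ip.IsPrivate() || ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || ip.IsMulticast())
}

// safeControl is called by the dialer with the resolved "ip:port" right
// before connecting; returning an error aborts the connection.
func safeControl(network, address string, _ syscall.RawConn) error {
	host, _, err := net.SplitHostPort(address)
	if err != nil {
		return err
	}
	ip := net.ParseIP(host)
	if ip == nil || !isPublicIP(ip) {
		return fmt.Errorf("refusing to connect to %s", address)
	}
	return nil
}

// newSafeClient builds a client that limits redirects and refuses to
// connect to non-public addresses, including after a redirect.
func newSafeClient() *http.Client {
	dialer := &net.Dialer{Timeout: 5 * time.Second, Control: safeControl}
	return &http.Client{
		Timeout:   15 * time.Second,
		Transport: &http.Transport{DialContext: dialer.DialContext},
		CheckRedirect: func(req *http.Request, via []*http.Request) error {
			if len(via) >= maxRedirects {
				return errors.New("too many redirects")
			}
			return nil
		},
	}
}

// fetchLimited reads at most maxBodyBytes of the response body.
func fetchLimited(c *http.Client, url string) ([]byte, error) {
	resp, err := c.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(io.LimitReader(resp.Body, maxBodyBytes))
}

func main() {
	// No network access is needed to see the address check at work.
	fmt.Println(safeControl("tcp", "127.0.0.1:80", nil)) // loopback is refused
	fmt.Println(safeControl("tcp", "8.8.8.8:80", nil))   // public address is allowed
}
```

Because the check happens at connection time rather than on the URL string, a redirect to an internal address is refused as well.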
Conclusion
I apologize for the large number of links to my own repositories and for the messy write-up. I tried to make the code understandable and to split it into logical modules that can be reused independently. I hope someone finds it useful.
The source code of the finished server is
here , and a running instance is
here .
What it can be used for: automatically loading previews for URLs in comments, preview popups for links in posts, and other things related to displaying linked content.