
Since the
Great Chinese Roskomnadzor hasn't yet learned much about blocking the Internet, I still want to tell you about something off the beaten path from my work: a reimplementation of a Readability-like algorithm using Node.js and a paper from the
Beijing Institute of Technology .
What is it all about?
Readability is a radical continuation of AdBlock's idea of removing unnecessary elements from websites. Where AdBlock tries to demolish only the things most useless to the user (mostly ads), Readability also removes scripts, styles, navigation, and everything else that is unnecessary. Pages like this used to be called a "printable version", although in fact the text is meant to be read (hence the name Readability).
A lyrical digression about parsers
The main characteristic of a parser for websites, or for other poorly structured formats, is the amount of knowledge it has about how the format is used in particular cases in the wild.
A degenerate case of having all the knowledge is a parser for a single site. That is, if we want to steal articles from Habrahabr, for example, to print them at night on an inkjet printer and sacrifice them to Satan, we can look at the existing layout and easily determine that the post title is
h1.title
.
A program written this way will almost never be wrong; but for every site other than Habrahabr, you will have to write a new program.
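For instance, here is a minimal sketch of such a single-site parser. Its only "knowledge" is the hardcoded fact about where Habrahabr keeps the post title; the naive regex is the point, not a flaw:
// A minimal sketch of a single-site parser: all it knows is that
// Habrahabr puts the post title in h1.title. It knows one layout
// and nothing else, so it breaks on any other site.
function habrTitle(html) {
    var match = /<h1 class="title">([\s\S]*?)<\/h1>/.exec(html);
    return match ? match[1] : null;
}

console.log(habrTitle('<h1 class="title">Post title</h1>')); // "Post title"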
The degenerate ideal case: the parser does not even know what format the data it received is in. An example of such a program is
strings
(it exists in most non-toy operating systems).
If you apply
strings
to some unreadable file, you can get a list of everything that looks like text inside it. For example, the command
strings `which ls`
will print a bunch of formatting strings from inside the
ls
binary, as well as its usage help:
%e %b %T %Y
%e %b %R
usage: ls [-ABCFGHLOPRSTUWabcdefghiklmnopqrstuwx1] [file ...]
The less knowledge, the more universal the parser.
What is already there
The source code of the first version of Readability is
published and is a chilling tangle of regular expressions. That in itself is not bad, but the special cases are simply awful. I would like an algorithm with far less knowledge of the popular sites on the Internet (see the "lyrical digression" above).
The current version of Readability is closed source and loaded with extras of varying usefulness. There is an
API .
There is a fork of the first version of Readability by Apple (the Reader feature in the Safari browser). The source code is not exactly open, but you can look at it; there are even more regular expressions and special cases there (for example, there is a variable named
isWordPressSite
).
The problems with the original script: it is hard to modify, and the heuristics are ad hoc. It mostly works, but requires non-trivial manual fine-tuning. The Apple version also has an unclear license.
What to write
A website parser with minimal knowledge of the markup. The input: a single page of a site, or a fragment of a page. The output: a textual representation of the input.
An important criterion is universality: the program should work both on the client and on the server. Therefore, we do not tie ourselves to existing DOM implementations, but build our own data structure (it also works faster than a full-fledged DOM, because we keep only the data we actually need); a sketch of such a node follows below.
For the same reason, the program cannot download pages from the Internet by itself, store results to disk, have a user interface, or do cross-stitch.
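As a rough sketch of what such a node might look like (the field names here are my assumptions; the real structure in the sources may differ):
// A hypothetical tree node: far lighter than a DOM node, it stores
// only what the scoring pass (described below) needs.
function Node(name, parent) {
    this.name = name;       // tag name
    this.parent = parent;   // parent Node, or null for the root
    this.children = [];     // child Nodes and plain strings (text)
    this.chars = 0;         // metrics filled in by the scoring pass
    this.hyperchars = 0;
    this.tags = 0;
    this.score = 0;
}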
The life and adventures of the algorithm
A search engine turned up several papers on algorithmizing the process described above. I liked
this Chinese PDF the most.
My formulas turned out a little different, so I will briefly describe my version of the Chinese algorithm.
For each tag in the document:
- We compute a score. Here chars is the amount of text (in characters) inside the tag, hyperchars is the amount of text inside links, and tags is the number of nested tags (all three metrics are recursive).
- We compute the sum of scores: the sum of the scores of the tag's first-generation children (that is, not recursively).
- We find the tag with the maximum sum. With high probability, this is the container of the main text. Or the longest comment. Either way, there are letters inside it, and that's cool.
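Here is a sketch of that pass over the node structure from above. The concrete formula, score = (chars - hyperchars) / tags, is my assumption modeled on the text-density idea from the cited paper; the formula in the actual sources differs in detail:
// Recursively fill in chars / hyperchars / tags and a score per node.
// The score formula is an assumed density-style placeholder.
function computeMetrics(node, inLink) {
    node.chars = 0;
    node.hyperchars = 0;
    node.tags = 0;
    node.children.forEach(function (child) {
        if (typeof child === 'string') {            // text node
            node.chars += child.length;
            if (inLink) node.hyperchars += child.length;
        } else {                                    // element node
            computeMetrics(child, inLink || child.name === 'a');
            node.chars += child.chars;
            node.hyperchars += child.hyperchars;
            node.tags += 1 + child.tags;
        }
    });
    node.score = (node.chars - node.hyperchars) / (node.tags || 1);
}

// Sum the scores of first-generation children (non-recursive) and
// pick the node with the maximum sum anywhere in the tree.
function findBestContainer(node) {
    node.sum = node.children.reduce(function (acc, child) {
        return typeof child === 'string' ? acc : acc + child.score;
    }, 0);
    var best = node;
    node.children.forEach(function (child) {
        if (typeof child === 'string') return;
        var candidate = findBestContainer(child);
        if (candidate.sum > best.sum) best = candidate;
    });
    return best;
}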
Room for heroic effort
Further optimizations. I will describe a few cases, but in general this is the most interesting topic; feel free to chat about it in the comments.
Garbage in the main text. All sorts of pseudo-bloggers like to shove numerous social buttons, Twitter widgets, and other unnecessary things right into the body of a post. Such buttons have a score (see above) that tends to zero, and on that basis they can be demolished.
Just in case, I also check that the parent's score increases after the garbage is removed; if it does not (or grows insignificantly), I do not delete anything, in case there is something meaningful in there.
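A sketch of that check, reusing the node shape and score from above (the thresholds are hypothetical, not taken from the sources):
var SCORE_EPSILON = 1;   // hypothetical: "score tends to zero"
var MIN_GAIN = 1.05;     // hypothetical: "has grown significantly"

// Parent's score recomputed as if the child were removed.
function scoreWithout(parent, child) {
    var chars = parent.chars - child.chars;
    var hyperchars = parent.hyperchars - child.hyperchars;
    var tags = parent.tags - (1 + child.tags);
    return (chars - hyperchars) / (tags || 1);
}

function prune(parent) {
    parent.children = parent.children.filter(function (child) {
        if (typeof child === 'string' || child.score > SCORE_EPSILON) {
            return true; // keep text and anything that scores well
        }
        // Demolish the child only if the parent clearly benefits.
        return scoreWithout(parent, child) < parent.score * MIN_GAIN;
    });
}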
HTML. The algorithm does not use any knowledge about document structure; such knowledge could now be added to improve (or speed up) the program. Say, pessimize
<footer>
and
<nav>
in advance, or annotate elements that are invisible in the browser and skip them entirely. There is real room for activity here; I have not implemented any of it yet.
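As a sketch of this unimplemented idea (the weights are invented purely for illustration):
// Pessimize known-boring tags in advance by scaling their score down.
var TAG_WEIGHTS = {
    footer: 0.2,  // invented weights, for illustration only
    nav: 0.1,
    aside: 0.3
};

function weightedScore(node) {
    var weight = TAG_WEIGHTS.hasOwnProperty(node.name)
        ? TAG_WEIGHTS[node.name]
        : 1;
    return node.score * weight;
}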
Text signals. If there are commas, periods, and other punctuation marks in the text, it is most likely connected prose (as opposed to navigation, for example). Readability had this kind of heuristic.
Note that punctuation marks differ between languages: the Chinese comma ("，", Unicode U+FF0C) is a different character from "," (ASCII 44).
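A sketch of such a language-aware heuristic (the exact character set, and how the count would feed into the score, are my assumptions):
// Count both ASCII and CJK punctuation so Chinese text is not penalized.
var PUNCTUATION = /[.,;:!?\u3002\uFF0C\uFF1B\uFF1A\uFF01\uFF1F]/g;

function punctuationCount(text) {
    var matches = text.match(PUNCTUATION);
    return matches ? matches.length : 0;
}

console.log(punctuationCount('Hello, world.')); // 2
console.log(punctuationCount('你好，世界。'));    // 2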
What came of it, and how to use it
I unimaginatively called the result readability2 and
published it on npm .
Briefly about tests
Testing a thing like this is imperative to avoid regressions (and automatically testing programs is cool in general).
A certain problem arises here: a readability test is a saved page from someone else's site, plus the "reference" text extracted from it. I do not quite understand how to distribute this in a way that keeps rights holders from trying to destroy me for illegally copying sites and texts.
If someone knows the correct answer, please write it in the comments. For now the tests live in a closed repository, but they yearn to be free.
The sources, without tests:
GitHub
Usage example
For illustrative purposes, I wrote a page,
demo.html , in which two lines of text sit among a lot of navigation.
The text is titled "Name". The content part:
The whole neighborhood quietly watched the miracle of God:
Pop Ignatius tilibonkal his church cause.
(By the way, I renounce property rights to this literary work, thereby transferring it to the public domain. Now anyone can distribute and use the full text without restrictions.)
This is what the program's output should look like. If the output is different, then everything is broken.
And here is the
source of demo.js , with comments. The parser used is
sax by Isaac Z. Schlueter .
Documentation, a.k.a. the API
Constructor:
var reader = new Readability
Takes no arguments.
SAX interface:
reader.onopentag(tagName)       // <…>
reader.onattribute(name, value) // …=…
reader.ontext(text)             // text
reader.onclosetag(tagName)      // </…>
All arguments are strings.
To get the result:
var res = reader.compute(),
    text = reader.clean(res)
The output:
res.heading
is the title of the article, and
text
is the main text without formatting.
Instead of
reader.clean
you can write your own formatter; then you will get, say, simple markup instead of plain text.
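Putting it all together, here is a hedged end-to-end sketch of wiring sax to the reader (the require form and the HTML sample are mine; see demo.js for the real example):
var sax = require('sax');
var Readability = require('readability2'); // assumed require form

var reader = new Readability();
var parser = sax.parser(false, { lowercase: true });

// Forward sax events to the reader's SAX interface.
parser.onopentag = function (node) { reader.onopentag(node.name); };
parser.onattribute = function (attr) { reader.onattribute(attr.name, attr.value); };
parser.ontext = function (text) { reader.ontext(text); };
parser.onclosetag = function (tagName) { reader.onclosetag(tagName); };

parser.write('<html><h1>Name</h1><p>Two lines of text…</p></html>').close();

var res = reader.compute();
console.log(res.heading);       // the title of the article
console.log(reader.clean(res)); // the main text without formatting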
Conclusion

The program works. It is still a bit scary to use, because there are only about 20 tests so far, but I am working on that. There will be updates. Patches are welcome, except for silly ones.
GitHub . The license is MIT (I forgot to upload it to the repository).
An important note: the picture on the left has nothing to do with the post. So if it does not load and you do not see any picture on the left, do not be upset.
Better yet, write in the comments what you think about all this.