
GrabDuck: How We Prepare Bookmarked Articles

Welcome, dear reader. Not so long ago a new parser / article extractor went live on our GrabDuck service. Between ourselves we call it GrabDuck Article Extractor 2.0, or GAE 2.0 for short. Why such a big deal? The changes and improvements had piled up to the point where we had to throw out the old parser we had lived with for the last year and a half and write a new one from scratch. And yes, for us this is a big and important change. What we disliked, and what we ended up building, is described under the cut.



So, we had lived for a long time with the old parser, adopted from outside as a fork of an open-source project. Yes, it was good and tried to do its job at a full hundred percent (somewhere in one of our first articles we gave a link to it; if you are interested, have a look). And for those whose requirements do not exceed the average, we still recommend it: it will do the job just fine.


But over time we ran into more and more limitations. As you know, websites accumulate all sorts of things; we still come across the dreadful legacy of the 2000s, when there were no real standards. This is where our library kept failing, and we had to climb deeper and deeper into someone else's code and patch it for our needs. Eventually the main complaint became this: the library was a Swiss Army knife. Good, but it did everything itself: it downloaded the document by URL, followed redirects, knew how to expand various link shorteners, tried to detect the encoding even when it was not explicitly specified, parsed the document, identified the images, and even tried to figure out the date the article was published. Not a library but a fairy tale... true, only until something needed to be fixed or slightly changed. Then we faced a choice every time: either edit the library code directly, or run the already-processed document through a second pass of our own, which naturally duplicated some of the original parser's logic in places. And the decision was not easy: the creator of this open-source library was clearly not a fan of testing, so everything worked smoothly only as long as you did not touch it or make significant changes to the code.

And consider that the process of parsing and extracting articles, for those who have never dealt with it, relies entirely on statistics rather than on clear-cut criteria. It was enough to shift the weights of the underlying statistical model slightly, and we immediately risked that some classes of sites would simply stop being processed correctly. After all, there is no common format: the whole article is just one big piece of HTML with, somewhere inside, the few paragraphs of text we need so badly. So over time we ended up with our own parallel world, where an already processed and seemingly finished article was run once more, this time through our own half-baked parser.
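To give a feel for what such statistical extraction looks like, here is a minimal sketch in Java (using the jsoup HTML parser; the weights are invented for illustration and have nothing to do with any real model): score every candidate block by text length and link density, and keep the winner.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class NaiveContentScorer {

    // Toy weights: shift either of these a little, and whole classes
    // of sites can silently lose their content blocks.
    static final double TEXT_WEIGHT = 1.0;
    static final double LINK_PENALTY = 50.0;

    static double score(Element el) {
        int textLen = el.text().length();
        int linkTextLen = 0;
        for (Element a : el.select("a")) {
            linkTextLen += a.text().length();
        }
        double linkDensity = textLen == 0 ? 1.0 : (double) linkTextLen / textLen;
        // Long text is good; navigation-heavy blocks (high link density) are bad.
        return TEXT_WEIGHT * textLen - LINK_PENALTY * linkDensity * textLen;
    }

    public static String extract(String html) {
        Document doc = Jsoup.parse(html);
        Element best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Element candidate : doc.select("p, div, article")) {
            double s = score(candidate);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best == null ? "" : best.text();
    }
}
```

Raise LINK_PENALTY a touch, and articles that are naturally full of inline links suddenly stop extracting: exactly the kind of regression we kept running into.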

How this open-source library behaved in multi-threaded mode was a separate and very sad song. On large imports, when the count ran into tens of thousands of bookmarks queued for processing at the same time, everything in our kingdom simply ground to a halt.

And that was lesson number one for us: when building your system, use independent components, bricks, and assemble what you need from them. If something goes wrong, or a new interesting project appears that does the job better, you can always switch the old part off and try the new one without breaking the system and without risking that everything crumbles at some stage. And believe our experience: if something in the world of computing can go wrong, it will go wrong, almost immediately.

So in the end we decided: enough, it is time to take matters into our own hands and write something of our own, tailored to our requirements and with a quality that satisfies us. And so a new component, GAE 2.0, appeared on our architecture diagram.
First of all, we wanted to build it as a collection of independent components. Some steps needed parallel processing on the principle of "the more, the better"; some could get by with a single thread; and in some places we wanted to speed things up by parallelizing, but with hard limits on the number of simultaneously processed elements.

As a result, a kind of pipeline emerged, in which with each step a bookmark turns into a full-fledged article filled with data meaningful to the user.
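To illustrate the idea (a minimal sketch, not our production code; the queue wiring and pool sizes are invented), such a pipeline can be assembled from independent stages, each with its own concurrency limit:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

public class Pipeline {

    // Each stage gets its own bounded thread pool, so any step can be
    // tuned, swapped out, or switched off without touching the others.
    static <I, O> BlockingQueue<O> stage(int threads,
                                         BlockingQueue<I> in,
                                         Function<I, O> work) {
        BlockingQueue<O> out = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    while (!Thread.currentThread().isInterrupted()) {
                        I item = in.take();          // wait for the next bookmark
                        out.put(work.apply(item));   // hand the result downstream
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // allow clean shutdown
                }
            });
        }
        return out;
    }

    // Wiring example: many fetcher threads, fewer extractor threads.
    // BlockingQueue<String> urls  = new LinkedBlockingQueue<>();
    // BlockingQueue<String> pages = stage(32, urls, fetcher::fetch);
    // BlockingQueue<String> done  = stage(8, pages, extractor::extract);
}
```

A slow or misbehaving stage can then be re-tuned or replaced in isolation, which is exactly the "bricks" lesson from above.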



So, what steps does it take to turn a link into a full-fledged article that can be shown to the user?

After some deliberation, the areas of responsibility came out as follows. First, the URL Fetcher itself. It is responsible for directly downloading the article behind the given URL. It must understand all kinds of redirects, and be able to work over SSL and with shortened links. It also needs to be parallelized, because merely waiting for a server's response eats up ages of computer time, and something had to be done about that. But the strategy of "the more, the better" does not fit here either: bombard the same site with requests and we will simply get banned. So we needed some kind of optimum that keeps both sides happy.
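Roughly, the idea looks like this (a sketch on top of the standard Java 11 HttpClient; the per-host limit of two concurrent requests is an invented number, in reality it is a tunable compromise):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

public class UrlFetcher {

    private final HttpClient client = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL) // also resolves link shorteners
            .connectTimeout(Duration.ofSeconds(10))      // HTTPS works out of the box
            .build();

    // At most N in-flight requests per host, so we do not get banned.
    private static final int MAX_PER_HOST = 2;
    private final Map<String, Semaphore> perHost = new ConcurrentHashMap<>();

    public String fetch(String url) throws Exception {
        URI uri = URI.create(url);
        Semaphore slot = perHost.computeIfAbsent(uri.getHost(),
                h -> new Semaphore(MAX_PER_HOST));
        slot.acquire();
        try {
            HttpRequest req = HttpRequest.newBuilder(uri)
                    .timeout(Duration.ofSeconds(30))
                    .GET()
                    .build();
            HttpResponse<String> resp =
                    client.send(req, HttpResponse.BodyHandlers.ofString());
            // resp.uri() would give the final URL after all redirects
            return resp.body();
        } finally {
            slot.release();
        }
    }
}
```

Many such fetchers can run in parallel across different hosts, while the semaphore keeps any single site from being flooded.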

The fetched result must then be checked for errors. It may also turn out to be a duplicate of an article that already exists on GrabDuck, in which case it is enough to simply link the new user to that existing article.
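The exact duplicate criteria are beyond the scope of this post, but the general idea can be sketched with a fingerprint of the extracted content (here SHA-256 over the text; a canonical URL would work as an additional key, and in production the lookup lives in the database rather than in memory):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class DuplicateChecker {

    // Fingerprint -> id of the article already stored on the service.
    private final Map<String, Long> fingerprintToArticleId = new ConcurrentHashMap<>();

    static String fingerprint(String extractedText) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-256")
                .digest(extractedText.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(hash);
    }

    /** Returns the existing article's id, or null if this content is new. */
    Long findOrRegister(String extractedText, long newArticleId) throws Exception {
        return fingerprintToArticleId.putIfAbsent(fingerprint(extractedText),
                                                  newArticleId);
    }
}
```

If findOrRegister returns a non-null id, we skip the heavy processing entirely and just attach the user to the existing article.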

And only after that is it time to extract the data and prepare the final article that we will show to our users. What does that include? Obtaining the meta-information: headings, images, computing the tags and the language of the document. Of course we need the content itself for full-text search, and we also need to generate a text snippet briefly representing the document.
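As a sketch (again using jsoup; language detection and tag computation are left out, and the 200-character snippet limit is an arbitrary example), the metadata step might look like this:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MetaExtractor {

    record ArticleMeta(String title, String image, String snippet) {}

    static ArticleMeta extract(String html, String extractedText) {
        Document doc = Jsoup.parse(html);

        // Prefer Open Graph metadata, fall back to the plain HTML tags.
        String title = doc.select("meta[property=og:title]").attr("content");
        if (title.isEmpty()) title = doc.title();

        String image = doc.select("meta[property=og:image]").attr("content");

        // Snippet: the first ~200 characters of the extracted content.
        String snippet = extractedText.length() <= 200
                ? extractedText
                : extractedText.substring(0, 200) + "...";

        return new ArticleMeta(title, image, snippet);
    }
}
```

The extracted full text itself goes to the search index, while the title, image, and snippet are what the user sees in the bookmark list.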

After that, the document is ready for use and available for search on GrabDuck.

So, the new parser is ready and, hooray, all new bookmarks will now go through it and we will finally get what we wanted! But the big question a reader may have is: what happens to the existing bookmarks? After all, they were ALREADY processed and ALREADY saved in the system! Do they remain untouched? And our answer is: no! First of all, the user always has the option to select a bookmark and force a refresh. To do this, simply pick the appropriate item from the context menu. It looks something like this.



Or just wait a bit. One of GrabDuck's great features is that we periodically recheck all bookmarks: is everything OK, are the websites still alive, have new comments appeared on the page, and so on. So sooner or later your bookmarks will be refreshed and passed through GAE 2.0.
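Conceptually, the rechecking is just a scheduled job that re-feeds stale bookmarks into the GAE 2.0 pipeline (a sketch; the real schedule and batching are not something we describe here):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BookmarkRefresher {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Periodically pick the stalest batch of bookmarks and push it
    // back through the pipeline, just like a freshly added URL.
    void start(Runnable refreshOldestBatch) {
        scheduler.scheduleAtFixedRate(refreshOldestBatch, 0, 24, TimeUnit.HOURS);
    }
}
```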

That is all we wanted to tell you today. Leave your comments, and see you soon.

Source: https://habr.com/ru/post/309410/

