
A little bit on the development of web archives

A web archive is a system that periodically saves a site (or part of a site) in its original form. Most often this is done for posterity, so that later generations can browse around, shed a tear, and feel nostalgic.

The basic requirement for a web archive sounds simple and all-encompassing.

The offline version of the site must be fully functional. All original images, Flash animation, embedded video, scripts, and so on should work in it. Ideally, it should not differ from the original at all.

To us developers, the phrase “fully functional offline version” sounds very, very suspicious. You could even say it sounds seditious. After all, there is no such thing as a modern website without scripts, and scripts always introduce uncertainty into behavior. But, as one character used to say: “Don’t rush to conclusions, or the conclusions will pounce on you.”



Background



Honestly, there is more material in open sources than you could ever read. You can start with the Wikipedia article on web archiving. Unfortunately, it says little about implementation and focuses mostly on organizational and legal problems.

For those interested, I recommend reading it. For everyone else, here are a few terms so that we share a common vocabulary.

Web archive. For example, the most important archive of the Internet is archive.org. It is intimidating in its sheer volume and in how hard it is to use.

Web crawler: a program that walks through the pages of a site by following links. There are plenty of them nowadays. Perhaps the most famous robot, whose visits are always so welcome, is Googlebot. For our POC we used ABot.

Building the entire system requires storage, interfaces, and so on. Unfortunately, all of that will not fit into one article, so here I will cover only the most difficult part: the algorithm for crawling the site and storing the data.

Solution approach



How to solve the archiving problem is, I think, obvious. Sites are made for users. What does a user do? Opens a page, takes in the information they need, and follows a link to the next page.

Let's try to create such a virtual user, a robot, and automate the task a bit.

The “user story” (hello, agile) of the robot’s work looks like this:
the robot goes from page to page by following links (just like a user). After each transition it saves the page and compiles a list of links that can be reached from it. Links that have already been visited are ignored; new ones are saved to be visited next, and so on.


It looks very brief and very abstract. The first step in designing is always abstract, which is why it is the first. Now I will detail it.

Detailing number 1


First, you need to decide what the basic “indivisible” entity of our data model will be. Let it be called a "Resource". I define it like this:

A resource is any content that we can download by a link.


That is, its main properties are a link (URI) and the content that the server returns for it. For completeness, the resource description needs to be supplemented with metadata (type, link, last modification time, etc.). Note also that a resource may itself contain links to other resources.
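
As a rough illustration, here is what such a resource description might look like. This is just a sketch in Python; the field names (original_uri, local_path, and so on) are my own assumptions, not something fixed by the article.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Resource:
    """Anything we can download by a link: a page, an image, a CSS file, a video..."""
    original_uri: str                    # the online link (URI) where the resource lives
    content_type: str = ""               # metadata: MIME type reported by the server
    last_modified: Optional[str] = None  # metadata: Last-Modified header, if any
    content: Optional[bytes] = None      # the body returned by the server
    local_path: Optional[str] = None     # the offline link, filled in once the resource is saved
    links: List[str] = field(default_factory=list)  # links to other resources found in the content
```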

Based on this concept, I define the general algorithm of the crawler.

  1. Preparation: Put the entry point URI in the processing queue
  2. Main loop: Select link from queue
  3. Download the resource at the specified link
  4. Do something useful with the resource
  5. Find out which other resources the resource refers to
  6. Put them in the queue
  7. Go to the beginning of the cycle
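
A minimal sketch of this loop in Python. The download, process, and extract_links functions are placeholders to be supplied by the caller; they are not part of the article's design.

```python
from collections import deque

def crawl(entry_point_uri, download, process, extract_links):
    """Generic crawl loop: download/process/extract_links are supplied by the caller."""
    queue = deque([entry_point_uri])          # preparation: the entry point goes in first
    seen = {entry_point_uri}                  # links that are already queued or processed

    while queue:                              # main loop
        uri = queue.popleft()                 # select a link from the queue
        resource = download(uri)              # download the resource at that link
        process(uri, resource)                # do something useful with it (save, transform, ...)
        for link in extract_links(uri, resource):  # find the resources it refers to
            if link not in seen:              # already-seen links are ignored
                seen.add(link)
                queue.append(link)            # new ones go into the queue, and so on
```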


Overall, this looks logical. Let's detail it further.

Detailing number 2


Step zero. Preparation.


So, the robot is at the beginning of the process: all it has is a link to the site's entry point, the so-called index page. At this step the robot creates a queue and puts the entry-point link into it.

Abstractly speaking, the queue is the robot's source of tasks. At this point the queue contains just this single element.



(Note: for small sites the processing queue can be kept in memory, but for large sites it is better to store it in a database, in case the process stops somewhere in the middle.)
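
For illustration, here is a very rough sketch of such a database-backed queue on top of SQLite. The table layout is purely my own, just to show the idea of a queue that survives a restart.

```python
import sqlite3

class PersistentQueue:
    """A crude task queue stored in SQLite so that a crawl can be resumed."""

    def __init__(self, path="crawl_queue.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tasks ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " uri TEXT UNIQUE,"
            " done INTEGER DEFAULT 0)")
        self.db.commit()

    def put(self, uri):
        # UNIQUE + OR IGNORE keeps already-queued links from being added twice
        self.db.execute("INSERT OR IGNORE INTO tasks (uri) VALUES (?)", (uri,))
        self.db.commit()

    def get(self):
        row = self.db.execute(
            "SELECT id, uri FROM tasks WHERE done = 0 ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return None
        task_id, uri = row
        # a real implementation would mark the task done only after it is processed
        self.db.execute("UPDATE tasks SET done = 1 WHERE id = ?", (task_id,))
        self.db.commit()
        return uri
```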

Step one. Content analysis.


Select a resource from the queue for processing. (At the first iteration, this is the entry point). Download the page and find out what resources it refers to.

Structure - Resource Description

Here, in general, everything is simple. The page lives at the site's address; the robot downloads it and scans the HTML content for links. (For the kinds of links, see "Types of resources" below.) The example shows several of them: robots.txt (which our robot ignores :), an “About Us” link pointing to about.html, links to CSS and JavaScript files, and a link to a YouTube video.
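
As a sketch of this analysis step, the standard library is enough to pull links out of the HTML. A real crawler handles far more tags, attributes, and edge cases; the tag list here is an assumption for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href/src links from a few common tags and resolves them against the page URI."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("a", "link") and attrs.get("href"):
            self.links.append(urljoin(self.base_uri, attrs["href"]))
        elif tag in ("script", "img", "iframe") and attrs.get("src"):
            self.links.append(urljoin(self.base_uri, attrs["src"]))

# usage:
# extractor = LinkExtractor("http://example.com/index.html")
# extractor.feed(html_text)
# extractor.links  ->  about.html, the CSS and JS files, the YouTube iframe, ...
```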

Step two


Filter out unnecessary resources. To do this, the robot must provide a very flexible configuration interface (most existing crawlers do). For example, the robot should be able to filter files by type, extension, size, and modification time. For outgoing links, the nesting depth also needs to be checked. And obviously, if the resource behind a link has already been processed, it should not be touched again.
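
A sketch of such a filter as a single predicate. The specific limits, the extension list, and the `seen` set are assumptions chosen for illustration.

```python
from urllib.parse import urlparse

ALLOWED_EXTENSIONS = {".html", ".css", ".js", ".png", ".jpg", ".gif"}
MAX_DEPTH = 5                             # nesting depth limit for outgoing links
MAX_SIZE = 50 * 1024 * 1024               # size limit, e.g. 50 MB

def should_process(uri, depth, size, seen):
    """Decide whether a discovered link deserves a place in the queue."""
    if uri in seen:                       # already processed or already queued
        return False
    if depth > MAX_DEPTH:                 # too deep in the link graph
        return False
    path = urlparse(uri).path
    ext = "." + path.rsplit(".", 1)[-1].lower() if "." in path else ".html"
    if ext not in ALLOWED_EXTENSIONS:     # filter by type/extension
        return False
    if size > MAX_SIZE:                   # filter by size
        return False
    return True
```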

For the remaining, needed resources, create a description structure and put it into the queue. It is important to note that at this stage the structures are not completely filled in: they contain only the original (online) links (just like the entry point at step zero).

Referred resources

Important: at this stage, the content of the index page still contains the original links, which means it cannot yet serve as an offline version. To finish the processing, the links must be replaced so that they point to the saved offline versions of the resources. With a queue this is not difficult to implement: the task of updating the index page's links is placed at the end of the queue. This guarantees that, by the time this task starts executing, all the referred resources have already been processed.
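
In queue terms, the ordering trick might look like this sketch: the download tasks for the children go in first, and the link-update task for the parent page goes in after them (the task tuples are my own notation).

```python
from collections import deque

queue = deque()
page_links = ["about.html", "style.css", "script.js"]   # links found on the index page (example)

for link in page_links:
    queue.append(("download", link))             # children first
queue.append(("update_links", "index.html"))     # the parent's link rewrite goes in last

# Because the queue is FIFO, by the time ("update_links", "index.html") reaches the
# front, every ("download", ...) task enqueued before it has already been handled.
```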



Step three


In essence, this is the same as the first step, only applied to each referred resource. (In other words, the actual implementation of the algorithm is simpler than the description; the loop iterations are unrolled here to make the picture clearer.) At this step the robot retrieves the next task from the queue (the download tasks added at the previous step).

It then determines which transformations the resource must undergo for offline use. In this example, all resources are simply downloaded, except for the “embedded” video: it is downloaded from YouTube in a special way and saved locally as an AVI file.

After that, local (offline) links for the referred resources are generated.

Local references
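
One possible way to generate such local links, sketched in Python. The directory layout and the hash prefix are my own choices, not the article's scheme.

```python
import hashlib
import os
from urllib.parse import urlparse

def local_path_for(uri, root="archive"):
    """Map an online URI to a deterministic offline file path."""
    parsed = urlparse(uri)
    name = os.path.basename(parsed.path) or "index.html"
    # a short hash keeps two different URIs with the same file name from colliding
    digest = hashlib.sha1(uri.encode("utf-8")).hexdigest()[:8]
    return os.path.join(root, parsed.netloc, f"{digest}_{name}")

# local_path_for("http://example.com/css/style.css")
# -> "archive/example.com/<hash>_style.css"
```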

Important: just as in the first step, for each of these resources we must identify its outgoing links and put them, in turn, into the queue.



(In this example, the CSS file refers to image.png).

Step four


Once those resources (and, of course, image.png) have been taken off the queue, the next task is to update the links on the index page. Here, for HTML pages, the structure itself may also have to change. For example, the offline version of the video may need to be “embedded” through some kind of local player.

Local references
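
A crude sketch of such a rewrite. A real implementation would work on the parsed DOM rather than on raw text, and the player substitution for embedded video is only hinted at in a comment.

```python
def rewrite_links(html, link_map):
    """Replace every known online link with its saved offline counterpart.

    link_map maps original URIs to local paths and is filled in as resources are processed.
    """
    for original, local in link_map.items():
        html = html.replace(original, local)
    # for embedded video, the <iframe> could additionally be swapped for a local
    # <video src="..."> tag pointing at the downloaded file
    return html
```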

Step five
Go to the first step and continue until the queue is empty.

Scalability



The queue-based algorithm has one drawback: resources are processed sequentially, and therefore not as fast as a modern server could handle them.

Therefore, it is necessary to provide for parallel processing. There are two options for parallelism: at the micro level (processing several resources of a single site in parallel) and at the macro level (archiving several sites in parallel).


The micro level raises locking issues. If you recall, the "update links" task is placed at the end of the queue precisely for consistency: we expect that by the time this task starts, all related resources have already been processed and have received local links. With parallel processing, this condition is violated, and a synchronization point has to be introduced. A simple option: run the download tasks asynchronously and, when a link-update task comes up, wait until all active download tasks have completed (a so-called barrier). A harder option is to express the dependencies between tasks explicitly, through semaphores. There may be other options; I have not analyzed this deeply.
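
A minimal sketch of the simple option with asyncio, where gathering the active download tasks plays the role of the barrier. The task structure here is assumed for illustration, not taken from the article.

```python
import asyncio

async def download(uri):
    """Placeholder for the real asynchronous download."""
    await asyncio.sleep(0.1)              # pretend network I/O
    return b"..."

def update_links(page_uri):
    print(f"rewriting links in {page_uri}")

async def archive_page(page_uri, child_links):
    # micro-level parallelism: download the page's resources concurrently
    download_tasks = [asyncio.create_task(download(link)) for link in child_links]

    # the barrier: the link-update step must not start until every
    # active download task has finished
    await asyncio.gather(*download_tasks)

    update_links(page_uri)                # now it is safe to rewrite the page's links

# asyncio.run(archive_page("index.html", ["style.css", "script.js", "image.png"]))
```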

Types of resources



Obviously, it is impossible to foresee every kind of resource such a crawler will have to deal with. But we will try, at least so that we know what difficulties we will face.



Dynamic pages



At this point you may ask me: “But still, what have we decided about dynamic pages, scripts, and so on?” The answer: it turns out that not all pages are equally “dynamic”. I define the following levels of page “dynamism”:



The greater the level of dynamism, the more difficult it is to save the content offline.

Source: https://habr.com/ru/post/185816/

