Suppose we face the following task: collecting information about advertisements from various sites across different categories. Later this information will be used to monitor and analyze the market and to send notifications about events on it. In essence, it amounts to building a mini search engine.
In fact, the system consists of four subsystems:
1. A launcher service for the plugins that collect and extract information
2. Temporary data storage
3. Data index
4. Applications that work with the extracted data, for example a report generator
Let's look at each subsystem in turn.
Overview of Subsystems
1. Plugin launcher service
This is a service that launches plugins from its library according to a schedule. When an individual plugin fails, the service takes predefined actions (logging, notifying the developer by e-mail, restarting the plugin, disabling it, and so on).
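The article does not show the launcher itself, so here is only a minimal sketch of what such a scheduling loop might look like; IPlugin, PluginRunner and the failure threshold are assumptions of this example, not names from the real system.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical plugin contract: the launcher only needs a name and a Run method.
public interface IPlugin
{
    string Name { get; }
    void Run();
}

public class PluginRunner
{
    private readonly Dictionary<string, int> _failures = new Dictionary<string, int>();
    private const int MaxFailures = 3;   // assumed threshold before disabling a plugin

    public void RunScheduled(IEnumerable<IPlugin> duePlugins)
    {
        foreach (var plugin in duePlugins)
        {
            try
            {
                plugin.Run();
                _failures[plugin.Name] = 0;                        // reset the counter on success
            }
            catch (Exception ex)
            {
                Console.Error.WriteLine($"{plugin.Name}: {ex.Message}");   // logging
                _failures[plugin.Name] = _failures.TryGetValue(plugin.Name, out var n) ? n + 1 : 1;
                if (_failures[plugin.Name] >= MaxFailures)
                {
                    // After repeated failures: notify the developer and disable the plugin.
                    NotifyDeveloper(plugin.Name, ex);
                    Disable(plugin.Name);
                }
            }
        }
    }

    private void NotifyDeveloper(string name, Exception ex) { /* e-mail, messenger, etc. */ }
    private void Disable(string name) { /* remove from the schedule until fixed */ }
}
```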
2. Temporary data storage
This auxiliary system provides storage of and access to the pages downloaded from the Internet. A plain SQL database is not an option here: pages arrive from the Internet in continuous streams, there are many of them and they appear very quickly, and they are constantly being accessed (read, modified, deleted), also from several threads and at high frequency. NoSQL databases were tried as well, but they could not withstand the load either. As a result, a hybrid storage was built: metadata about the records is kept in an SQL database, while the HTML content itself lives in large files. Periodically a special plugin "cuts out" deleted pages from those files (compaction).
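To illustrate the hybrid approach, here is a minimal sketch: the HTML body is appended to a large segment file, and only its coordinates (offset and length) plus metadata would go into the SQL table. PageStore and the table layout in the comment are hypothetical, not the author's actual implementation.

```csharp
using System.IO;
using System.Text;

public class PageStore
{
    private readonly string _segmentPath;

    public PageStore(string segmentPath)
    {
        _segmentPath = segmentPath;
    }

    // Returns the coordinates that would be written to the SQL row, e.g.
    // INSERT INTO Pages (Url, Segment, Offset, Length, DownloadedAt) VALUES (...)
    public (long Offset, int Length) Append(string html)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(html);
        using (var fs = new FileStream(_segmentPath, FileMode.Append, FileAccess.Write))
        {
            long offset = fs.Position;          // append position = current end of the file
            fs.Write(bytes, 0, bytes.Length);
            return (offset, bytes.Length);
        }
    }

    // Reads a page back by its stored coordinates.
    public string Read(long offset, int length)
    {
        using (var fs = new FileStream(_segmentPath, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(offset, SeekOrigin.Begin);
            byte[] buffer = new byte[length];
            int read = fs.Read(buffer, 0, length);
            return Encoding.UTF8.GetString(buffer, 0, read);
        }
    }
}
```

Deleted pages simply remain in the segment file until the compaction plugin rewrites it.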
3. Data Index
An index is required for quick access to the data. Suppose we want to view the number of ads in a particular category for a selected date. Running such a query against the database and the temporary storage every time is very expensive in terms of server resources, and some of the data would be redundant. Therefore an index is built that contains only the ad IDs and their categories for each day.
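To make this concrete, here is a sketch of the index row and of the query it is meant to answer; IndexEntry and the in-memory list stand in for the real index table and are assumptions of this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One index row: which ad was seen in which category on which day.
public class IndexEntry
{
    public DateTime Day { get; set; }
    public string Category { get; set; }
    public long AdId { get; set; }
}

public static class IndexQueries
{
    // "How many ads were there in each category on the selected date?"
    public static Dictionary<string, int> AdsPerCategory(IEnumerable<IndexEntry> index, DateTime day)
    {
        return index
            .Where(e => e.Day == day.Date)
            .GroupBy(e => e.Category)
            .ToDictionary(g => g.Key, g => g.Select(e => e.AdId).Distinct().Count());
    }
}
```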
4. Applications for working with extracted data
This part is straightforward: these are the applications the end user works with. Thanks to the built indexes, the collected data can now be browsed quickly and conveniently. It can be a web application where the user selects report templates and filter criteria and receives the data as tables or charts. It can also be a daily e-mail to managers highlighting the ad categories with increased activity.
Data production pipeline
The data acquisition process can be imagined as a pipeline.
1. A set of web spiders that download search-result pages from the source sites. These are pages a regular search bot usually does not reach (with the exception of SEO pages, but the set of ads on those is likely to be incomplete and to contain only the most popular categories). A separate plugin with its own request logic has to be written for each site: sometimes it is a GET request with parameters in the query string, sometimes a POST has to be sent, or parameters even have to be passed via cookies. The HTTP analyzer built into any browser helps in writing such a plugin. The task of the plugin that downloads search results is to cover all ad categories on the site, visit every results page, and save the content to the temporary storage. To avoid looping while moving through the paging (the sequence of result pages), it is recommended to compare each page's content with the previous one, as in the sketch below.
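Here is a minimal sketch of paging through one category, assuming a simple GET interface; the URL pattern and the IPageStore interface are assumptions made for this example.

```csharp
using System;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

// Hypothetical handle to the temporary storage.
public interface IPageStore
{
    void Save(string url, string html);
}

public static class SearchCrawler
{
    public static async Task CrawlCategoryAsync(HttpClient http, string baseUrl, string category, IPageStore store)
    {
        string previousHash = null;
        for (int page = 1; ; page++)
        {
            // Here it is a GET with querystring parameters; other sites need POST
            // bodies or cookies, which the browser's HTTP analyzer will reveal.
            string url = $"{baseUrl}?cat={Uri.EscapeDataString(category)}&page={page}";
            string html = await http.GetStringAsync(url);

            // Stop when the site starts returning the same page again (paging loop).
            string hash;
            using (var sha = SHA1.Create())
                hash = Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(html)));
            if (hash == previousHash) break;
            previousHash = hash;

            store.Save(url, html);                     // into the temporary storage
            await Task.Delay(TimeSpan.FromSeconds(1)); // be gentle with the source site
        }
    }
}
```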
2. The next stage of the pipeline is a plugin that extracts, from the loaded pages, links to the pages with the detailed description of the advertisement (product, service, depending on the site's specialization). Here you can either try to analyze all links and pick out the necessary ones, or take the simple route and describe such a link explicitly for each site. The easiest way is to use a site-specific XPath expression to extract the links. The plugin extracts the links, saves them to the database, and marks the result pages as processed. Additionally, the plugin can check whether the page is an error message or whether the search returned no results. A minimal extraction sketch is shown after the next paragraph.
If you are wondering why the pages with search results need to be stored at all, here is the reason. There are very large sites whose download takes a long time, and plugins are often forced to restart because of errors. Errors will happen; they may be related to the operation of the site, the performance of your own server, or network failures. In such a case it is logical to resume the download from the place where the plugin was interrupted. And if something changes on the site so that the links to the detail pages can no longer be extracted, it is enough to change the XPath or the extraction algorithm and restart the plugin: it will quickly process all the stored pages without downloading them again.
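A minimal sketch of the link-extraction step with HtmlAgilityPack (the library the article mentions later for parsing); the XPath string is only an example, each site gets its own expression.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // linkXPath is the per-site expression, e.g. "//a[contains(@class,'ad-title')]".
    public static List<string> ExtractDetailLinks(string html, string linkXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(linkXPath); // null when nothing matches
        if (nodes == null) return links;                     // error page or empty search result

        foreach (var node in nodes)
        {
            string href = node.GetAttributeValue("href", null);
            if (!string.IsNullOrEmpty(href))
                links.Add(href);
        }
        return links;
    }
}
```

An empty result here is also a useful signal: it is exactly the case where the page may turn out to be an error message or a "nothing found" page.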
3. Next comes a "universal" plugin. It receives the list of links to the detail pages, downloads them, and saves the pages to the temporary storage. In general, all pages (both search results and details) have an expiration period, after which they are considered stale and are deleted; they are processed from the oldest to the newest, and after processing they are marked as processed. Links whose download failed are stamped with the time of the attempt so that the download can be retried later; the cause of the failure may be temporary unavailability of the server or of the proxy. Downloading from some sites may require a proxy. And do not forget about the load on the source site's server: requests should be sent no more than once per second. A sketch of such a downloader follows.
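The sketch below follows those assumptions: one request per second, failed links stamped with the attempt time, an optional proxy. DownloadTask is an illustrative name, and store is the same hypothetical IPageStore as in the search-results sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public class DownloadTask
{
    public string Url { get; set; }
    public DateTime? LastAttempt { get; set; }   // set on failure, used to retry later
    public bool Done { get; set; }
}

public static class DetailDownloader
{
    public static async Task RunAsync(IEnumerable<DownloadTask> tasks, IPageStore store, string proxyAddress = null)
    {
        var handler = new HttpClientHandler();
        if (proxyAddress != null)
            handler.Proxy = new WebProxy(proxyAddress);   // some sites are only reachable via a proxy

        using (var http = new HttpClient(handler))
        {
            foreach (var task in tasks)
            {
                try
                {
                    string html = await http.GetStringAsync(task.Url);
                    store.Save(task.Url, html);
                    task.Done = true;
                }
                catch (HttpRequestException)
                {
                    // Server or proxy may be temporarily unavailable; remember when we tried.
                    task.LastAttempt = DateTime.UtcNow;
                }
                await Task.Delay(TimeSpan.FromSeconds(1)); // no more than one request per second
            }
        }
    }
}
```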
4. Once we have the page with the ad details, all that remains is to extract the necessary information from the HTML. For this you will again have to write a separate plugin for each site, or come up with a smarter universal one. In the simplest case you can once again use an XPath expression for each field of the document. A library such as HtmlAgilityPack is suitable for parsing the HTML. The extracted information must be put into a universal format (for example, XML) and passed further down the pipeline, as sketched below.
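A sketch of the field-extraction step with HtmlAgilityPack and an XPath per field; the field names and expressions are placeholders that would differ for every site.

```csharp
using System.Collections.Generic;
using System.Xml.Linq;
using HtmlAgilityPack;

public static class AdParser
{
    // fieldXPaths maps a field name (e.g. "title", "price") to its XPath for this site.
    public static XElement Parse(string html, IDictionary<string, string> fieldXPaths)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var record = new XElement("ad");
        foreach (var field in fieldXPaths)
        {
            var node = doc.DocumentNode.SelectSingleNode(field.Value);
            record.Add(new XElement(field.Key, node?.InnerText.Trim() ?? ""));
        }
        return record;   // the universal XML record passed down the pipeline
    }
}

// Usage, with made-up expressions:
//   Parse(html, new Dictionary<string, string> {
//       ["title"] = "//h1[@class='title']",
//       ["price"] = "//span[@itemprop='price']"
//   });
```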
5. Having the set of XML files with all the data, you can handle them according to the tasks at hand, for example build the corresponding index that the end-user applications will then query.
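To close the loop, here is a sketch of how the XML records could be folded into the daily index from the earlier example; the file layout and the element names ("date", "category", "id") are assumptions, not the format used by the real system.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Linq;

public static class IndexBuilder
{
    // Yields one (day, category, adId) row per XML record; these rows would then
    // be bulk-inserted into the index table that the reporting applications query.
    public static IEnumerable<(DateTime Day, string Category, long AdId)> Build(string xmlDirectory)
    {
        foreach (string path in Directory.EnumerateFiles(xmlDirectory, "*.xml"))
        {
            XElement ad = XElement.Load(path);
            yield return (
                DateTime.Parse((string)ad.Element("date")).Date,
                (string)ad.Element("category"),
                (long)ad.Element("id"));
        }
    }
}
```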
Conclusion
I have very briefly introduced you to a real-life system that is used for several site verticals. In fact, while building this system, which has been evolving for four years, we ran into plenty of pitfalls. If the community finds this topic interesting, I will write about it in more detail.