Disclaimer: this post may come across as part self-promotion, part fluff and nonsense, but most likely it is simply a write-up of the information and experience accumulated over two years of work in scraping, for myself and for anyone who is interested.
I am not chasing karma, I have enough of it.
Under the cut is a short post about the current market of crawlers / parsers, with a classification and its quirks.
Subject
We are talking about "spiders", i.e. programs that collect information from the network. Spiders come in different flavors: most crawl the web, some work with torrents, some with Fido / ed2k and other curiosities. The essence is the same: deliver the information the customer needs in a form convenient for them.
Unfortunately, S. Shulga (gatekeeper) greatly overestimated this industry: information mining is a popular business, yet AI technology sees little use in it, and automated advisers are still far off. Spiders generally fall into several categories, distinguished by the complexity of the methods involved.
Classification
Simple crawlers
Cheap, simple scripts, usually in PHP. The task is to walk through a site sequentially and save the prices, attributes and photos to a database, possibly with some processing on top. You can look up the cost of such projects on freelance boards; it is usually laughable. Mostly one-off projects. They get banned by IP or by request rate.
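For illustration, here is a minimal sketch of what such a one-off crawler boils down to. The URL, XPath expressions and table schema are made up; a real script would of course be tailored to the target site.

```python
# Minimal one-off crawler: walk category pages sequentially, pull out
# product names and prices, dump them into SQLite.
import sqlite3
import time

import requests
from lxml import html

BASE = "https://example-shop.test"                    # hypothetical source
CATEGORIES = ["/catalog/page/%d" % i for i in range(1, 51)]

db = sqlite3.connect("prices.db")
db.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price TEXT, url TEXT)")

for path in CATEGORIES:
    page = requests.get(BASE + path, timeout=30)
    tree = html.fromstring(page.content)
    for product in tree.xpath("//div[@class='product']"):        # assumed markup
        name = product.xpath("string(.//h2)")
        price = product.xpath("string(.//span[@class='price'])")
        link = product.xpath("string(.//a/@href)")
        db.execute("INSERT INTO items VALUES (?, ?, ?)", (name, price, link))
    db.commit()
    time.sleep(1)   # naive politeness; exactly the pattern that gets banned by request rate
```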
Group Crawlers
I built a project like this for cenugids.lv. Here many (50+) crawlers share the same code base; more precisely, it is one crawler with adapters for multiple sources (for cenugids.lv those were stores). This approach is mainly used to collect information from similar sources (forums, shops).
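A rough sketch of that layout, under my own assumptions rather than the actual cenugids.lv code: the crawl loop is shared, and each store only supplies its start URLs and selectors.

```python
# One crawler, many sources: shared crawl loop, per-store configuration.
import requests
from lxml import html


class Source:
    """Per-store configuration: where to start and how to read a product block."""
    name = ""
    start_urls = []
    item_xpath = ""
    fields = {}            # field name -> relative XPath

    def parse_item(self, node):
        return {key: node.xpath("string(%s)" % xp) for key, xp in self.fields.items()}


class ShopA(Source):       # illustrative store adapter
    name = "shop_a"
    start_urls = ["https://shop-a.test/catalog"]
    item_xpath = "//li[@class='item']"
    fields = {"title": ".//h3", "price": ".//em[@class='price']"}


def crawl(source):
    for url in source.start_urls:
        tree = html.fromstring(requests.get(url, timeout=30).content)
        for node in tree.xpath(source.item_xpath):
            yield source.parse_item(node)


for item in crawl(ShopA()):
    print(item)
```

Adding a new store then means adding one more small adapter class, not another crawler.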
Behavioral Crawlers
These disguise the bot as a human. The customer usually asks for a particular behavior strategy: for example, collect information only at lunchtime, two pages per minute, 3-4 days out of the working week. The spec may also include breaks for a "vacation" and updates of the "browser version" in step with real releases.
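A sketch of such a schedule, using the example figures from the text; the deterministic "skip some weekdays" trick and the User-Agent list are my own assumptions about how one might implement it.

```python
# Crawl only around lunchtime on ~3-4 working days a week, roughly two pages
# a minute, with a rotating User-Agent.
import random
import time
from datetime import datetime

import requests

USER_AGENTS = [   # updated by hand "in step with real releases"
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/123.0 Safari/537.36",
]

def working_today(now):
    # deterministic per-day coin flip so the bot works ~3-4 days a week
    rng = random.Random(now.strftime("%Y-%m-%d"))
    return now.isoweekday() <= 5 and rng.random() < 0.75

def allowed_now():
    now = datetime.now()
    return working_today(now) and 12 <= now.hour < 14   # "lunchtime"

def run(urls):
    session = requests.Session()
    for url in urls:
        while not allowed_now():
            time.sleep(600)                              # wait for the next lunch window
        session.headers["User-Agent"] = random.choice(USER_AGENTS)
        session.get(url, timeout=30)
        time.sleep(random.uniform(25, 35))               # roughly two pages per minute
```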
Caching Crawlers
Technically the most cumbersome solution, used to scrape something the size of eBay. It usually consists of several parts: one extracts from the source the places worth visiting (for a store, these are the categories and their pages). This pass runs quite rarely, because that information hardly changes. Then, at random intervals, the spider walks through the "interesting places" and collects links to the data itself (for example, products). Those links are in turn processed with random delays and written to the database.
This process is not periodic; it runs continuously. In parallel, old links are re-checked: say, every 5 minutes we pick 10 cached products from the database and verify that they are still alive and whether the price or attributes have changed.
With this, technically the most cumbersome solution, the customer gets not a snapshot of the source at some moment, but more or less up-to-date information from the crawler's own database, naturally with the date of the last update.
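A sketch of just the refresh part of such a caching crawler: every few minutes take the ten least recently checked items and re-verify them. The schema, the markup and the price XPath are assumptions for illustration.

```python
# Re-check cached items: are they still alive, has the price changed?
import random
import sqlite3
import time

import requests
from lxml import html

db = sqlite3.connect("cache.db")
db.execute("""CREATE TABLE IF NOT EXISTS items
              (url TEXT PRIMARY KEY, price TEXT, alive INTEGER, checked_at REAL)""")

def check_item(url):
    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        return 0, None                                        # item is gone
    tree = html.fromstring(resp.content)
    return 1, tree.xpath("string(//span[@class='price'])")    # assumed markup

def refresh_forever():
    while True:
        rows = db.execute(
            "SELECT url FROM items ORDER BY checked_at LIMIT 10").fetchall()
        for (url,) in rows:
            alive, price = check_item(url)
            db.execute("UPDATE items SET alive=?, price=?, checked_at=? WHERE url=?",
                       (alive, price, time.time(), url))
            db.commit()
            time.sleep(random.uniform(5, 20))   # random delays between requests
        time.sleep(300)                         # roughly every 5 minutes
```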
Problems and methods
Detection
It is fairly easy (at least by looking at the statistics) to tell that your site is being siphoned off: a number of requests equal to the number of pages, what could be more conspicuous? This is usually dealt with by using a caching crawler and a sensible crawl schedule. Naturally, you must not stand out against the target site's normal traffic.
IP Ban
The simplest thing you run into at the start of the war with the site's admin. The first way out is to use proxies. The downside is that you have to maintain the infrastructure: keep the proxy list up to date, hand it over to the customer, and make sure the whole thing does not collapse in one moment. For one-off orders, of course, this is not worth it, although it only took me about a week to build such an infrastructure with its interfaces.
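A minimal sketch of the rotation idea with requests; the proxy addresses are placeholders, and a real setup would also track which proxies have died and replace them.

```python
# Rotate requests across a proxy list, skipping proxies that stop responding.
import itertools

import requests

PROXIES = [
    "http://10.0.0.1:3128",    # placeholder addresses; a real list needs
    "http://10.0.0.2:3128",    # constant maintenance, as noted above
    "http://10.0.0.3:3128",
]

_rotation = itertools.cycle(PROXIES)

def fetch(url, retries=3):
    for _ in range(retries):
        proxy = next(_rotation)
        try:
            return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)
        except requests.RequestException:
            continue   # dead proxy; flag it for replacement in a real setup
    raise RuntimeError("all proxies failed for %s" % url)
```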
The second option is Tor: an excellent anonymizing overlay network with a convenient interface where you can specify the desired country and exit node. Speed, with caching solutions, is not really an issue. It holds up well: to this day one target site keeps banning all the exit nodes, its iptables rules already number over 9000 (9873 at the time of writing), and it still has not helped...
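Routing a crawler through a local Tor client is straightforward via its SOCKS port (9050 by default); the requests[socks] extra is required, and the country / exit-node choice mentioned above is configured on the Tor side (ExitNodes in torrc), not in this code.

```python
# Send a request through a locally running Tor client.
import requests

TOR_PROXY = {
    "http": "socks5h://127.0.0.1:9050",    # socks5h: resolve DNS through Tor as well
    "https": "socks5h://127.0.0.1:9050",
}

resp = requests.get("https://example.org/", proxies=TOR_PROXY, timeout=60)
print(resp.status_code)
```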
Registration / Authorization
A trivial problem that gets solved with experience: log in, save the cookies, go back in, parse. Captchas are broken just as routinely.
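The basic routine looks roughly like this with a requests session; the login URL and form field names are assumptions.

```python
# Log in once, let the session carry the cookies, then parse protected pages.
import requests
from lxml import html

session = requests.Session()
session.post("https://example-shop.test/login",          # hypothetical endpoint
             data={"username": "user", "password": "secret"},
             timeout=30)

# the session now sends the auth cookies automatically
page = session.get("https://example-shop.test/members/prices", timeout=30)
tree = html.fromstring(page.content)
print(tree.xpath("//title/text()"))
```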
Running off into infinity
A parser can go off the rails if the site somehow generates an infinite number of links. For example, a session parameter like osCsid (the osCommerce session ID) or PHPSESSID appended every time can make the crawler treat an already visited link as new. I have seen stores that generated pseudo-random links on every refresh (so that, for search engines, one product lived at 50+ different URLs). Finally, bugs in the source can also produce an unlimited number of links (for example, a store that showed a "next" link plus the 5 pages ahead of the current one, even somewhere around blank page 7000+).
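The usual defence is to normalize every URL before deciding whether it has been seen: strip session parameters, drop fragments, sort the rest. A sketch, with the parameter list as an assumption to be extended per site:

```python
# Normalize URLs so session IDs and parameter order don't create "new" links.
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

SESSION_PARAMS = {"oscsid", "phpsessid", "sid"}   # extend as new ones show up

def normalize(url):
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k.lower() not in SESSION_PARAMS]
    return urlunparse(parts._replace(query=urlencode(sorted(query)), fragment=""))

seen = set()

def is_new(url):
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True

print(normalize("http://shop.test/item.php?id=5&PHPSESSID=abc123"))
# -> http://shop.test/item.php?id=5
```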
Encodings
Oddly enough, the biggest headache is encodings. cp1251? HTML entities? FIVE kinds of "spaces" in the Unicode table? And what if the customer wants XML, and a single bad character kills simplexml?
I am too lazy to list every encoding pitfall here. Let me just say that in my crawler, encoding handling makes up almost half of the data post-processing.
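A sketch of the kind of cleanup that ends up in that post-processing step: try the likely encodings, unescape HTML entities, collapse exotic Unicode spaces, and drop control characters that strict XML consumers choke on. The encoding fallback order and the regexes are my own assumptions.

```python
# Encoding cleanup: decode, unescape, tame Unicode spaces, strip XML-illegal chars.
import html
import re
import unicodedata

# various Unicode "spaces": NBSP, thin space, ideographic space, etc.
SPACE_RE = re.compile(r"[\u00a0\u2000-\u200b\u202f\u3000]+")
# control characters that are illegal in XML 1.0
XML_BAD_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def to_text(raw: bytes) -> str:
    for enc in ("utf-8", "cp1251"):      # assumed fallback order
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

def clean(raw: bytes) -> str:
    text = html.unescape(to_text(raw))
    text = unicodedata.normalize("NFC", text)
    text = SPACE_RE.sub(" ", text)
    return XML_BAD_RE.sub("", text).strip()

print(clean("Цена:&nbsp;100 EUR".encode("cp1251")))   # -> "Цена: 100 EUR"
```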
Platform
People love PHP. Typically it is PHP + simplexml, or PHP + DOM / XPath. XPath is indispensable in general, but PHP setups have two big drawbacks: they eat memory and they crash. 512 megabytes per crawler is normal when using mbstring, not to mention core dumps from merely trying to add one more tag to the XML. With small sites this goes unnoticed, but when 50+ megabytes are pulled from the source in one go... So, by and large, the serious players are moving away from PHP.
My choice is Python. Besides the same XPath, it has libraries for ed2k, Kazaa, torrents and any database you like, plus excellent string handling, speed, stability and OOP. On top of that, being able to embed a mini-server for handing the data to the client means nothing extra has to be installed on the server and the whole thing stays inconspicuous: for example, if the client has not picked up the output within 15 minutes after midnight, we kick it.
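A sketch of that mini-server idea using only the standard library; the path, port and payload are illustrative, and the "kick it after the pickup window" logic would sit on top of this.

```python
# Serve the collected data straight from the crawler process, nothing extra installed.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RESULTS = [{"name": "item", "price": "9.99"}]   # filled in by the crawler

class ResultHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/export":                # hypothetical endpoint
            self.send_error(404)
            return
        body = json.dumps(RESULTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), ResultHandler).serve_forever()
```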
Conclusion
If anyone is interested, I can cover in a separate article how captchas get broken, how User-Agent based protection is bypassed, how server output is analyzed, and how non-web sources are parsed. Got questions? Welcome to the comments or to PM!