Do not protect the site from scraping: resistance is useless

Over the past decade, I have implemented many projects related to aggregation and content analysis. Often, aggregation involves the removal of data from third-party sites, that is, scraping. Although I try to avoid this term. He turned into a label with which many delusions are connected. The main misconception is that you can block web scraping with X, Y, Z.

tl; dr; It is impossible.

From a business perspective

Last week, I met with a senior executive from the industry in which I am developing my GO2CINEMA business. Without a doubt, this is one of the smartest and most knowledgeable people in the film industry.
')
The GO2CINEMA business model is based on aggregating, from various sources, information about the schedule of sessions, free seats and ticket prices, as well as fulfilling requests to purchase tickets on these websites on behalf of the user.

I consulted with this person about finding investments. He offered his help and asked to prepare an analysis of all the ways to block my current business, including scraping content (from a technical and legal point of view). I prepared the necessary documents and shared with him before our meeting. His reaction was something like this:

Yes, thorough research. But still there are ways to block you. * grins *

No, man, no such ways.

Real users are no different from bots

People far from IT, there is such an idealized idea of programming as in computer games of the 80s - you put on a virtual reality helmet and immerse yourself in the Network. In fact, all information and all interactions are zeros and ones. There is nothing human here. There is no difference between data entered by a computer or person.

Inspecting Web Traffic

I will say in a simple way. As long as visitors can access content on your site, the bot can get the same access. All technological solutions against scraping will prevent real users as much as bots.

If you do not believe, let us consider all the technical ways that a website can try to block a bot using the example of my business.

Technical countermeasures

Although the technical specialist some of these theses seem stupid, but investors really expressed all these fears - and I had to respond. So really bear with me.

User Agent Blocking

Each HTTP request contains HTTP headers, including the user agent , the HTTP client identifier. Therefore, the cinema can identify bots using information from HTTP headers - and block them.

Solution: fake HTTP headers to simulate real users.

Example: GO2CINEMA bots use HTTP headers that mimic real user sessions (for example, in the Google Chrome browser). HTTP headers are randomized between scraping sessions.

Conclusion: it is impossible to block GO2CINEMA bots using HTTP metadata sent by the client, such as HTTP headers, without blocking real users.

IP blocking

The cinema may try to determine and block the IP addresses of the GO2CINEMA bots.

Solution: “fake” IP addresses (using a proxy).

Mass identification

Example:

GO2CINEMA uses a combination of request scheduling and IP rotation to avoid identifiable patterns of bot behavior. Here are some of the precautions:

Randomization of IP addresses.
Allocation of IP addresses that are geographically as close as possible to the cinema.
Saving the dedicated IP address during the scraping session.
The proxy pool changes every 24 hours.

It is worth noting that in the current installation there is one drawback: IP addresses (proxies) are registered to various data centers, and not to home addresses, like real people. Theoretically, the cinema can get a list of subnets of all data centers in the UK - and block them. This will successfully block bots in the current settings. But:

It will cost. For example, such services are provided by MaxMind (a database with IP addresses of anonymizers, proxies, and VPN, price not disclosed) and Blocked ($ 12,000 per year).
This can lead to blocking real users.

Netflix is an example of a provider that blocks the IP addresses of known VPNs and proxies.

If the theaters start blocking the IP addresses of the data centers, you will have to use the home users' IPs via home address proxies like Luminati . This approach has two drawbacks:

Cost (our current traffic will cost £ 1000 per month).
Reliability. The performance and speed of proxy home addresses is difficult to predict.

Some cinemas have already tried to block the IP addresses of our bot. The source told us that cinema X thinks (or at least thought) that it successfully blocked our IP addresses. But it is not. Activity bot GO2CINEMA did not stop. It seems that cinema X blocked someone else who collected the same data.

It is important to emphasize that it is theoretically possible to distinguish HTTP requests from people and bots by surfing patterns (see the “Invisible Captcha” section). But it will be very difficult to identify HTTP requests from the GO2CINEMA bots (for the reasons given in the section “Block User Agent”).

Conclusion: it is extremely difficult to block GO2CINEMA bots by the black list of IP addresses, because 1) it is extremely difficult to identify bots and 2) we have access to a large number of IP addresses of data centers and home users.

Blocking by IP will not prevent our bots from continuing to scraping cinema sites.

Use captcha

The cinema can add captcha to restrict access to certain sections of the site (for example, displaying occupied seats in the auditorium) or restricting certain actions (for example, completing a payment transaction).

Solution: API for captcha solution.

Fully automated public Turing test for distinguishing between computers and people (CAPTCHA)

Captcha will only add inconvenience to ordinary users. All captcha methods (including for Google's reCAPTCHA) are easily managed with third-party services like 2Captcha . In these services, real people solve the tasks assigned to our bot. The cost of services is minimal (for example, 2 pounds for 1000 tasks).

Conclusion: adding a captcha will not prevent our bots from continuing scraping cinemas.

Invisible Captcha

Cinemas can use behavior-based identification and blocking mechanisms (the so-called “invisible captcha”).

Invisible captcha uses a combination of different variables to assess the likelihood that a particular client's interactions are automated. There is no single recipe for how to implement it. Different providers use different parameters for profiling users. This service is provided by some CDNs (for example, Cloudflare) and traditional captcha providers, like Google reCAPTCHA.

According to the former head of Google’s click-click recognition department, Schumann Gosemajumder, this opportunity "creates a new type of problem that very advanced bots can still get around, but it generates much less difficulty for a real person."

Conclusion: identification by behavior profiles will not prevent our bots from continuing to scraping cinemas, this is just another problem that needs to be circumvented.

Email Verification

, , .

: «».

, — Kidmograph

:

. GO2CINEMA go2cinema.mail . (, john1@go2cinema.mail). , , GO2CINEMA.

:

, .

go2cinema.mail, - :

.
, (, Mailinator).
(Yahoo, Gmail . .).
.

: .

.

: .

(, , ), (, Twillio) .

, (, 1 ).

, , :

SMS .
, .
( , ).

, .

: .

BIN

(BIN).

: (, Barclays).

:

GO2CINEMA . MasterCard Entropay. Entropay , . . BIN 522093. , BIN.

BIN . . MasterCard 5.10.1:

5.10.1
- , . , .

MoviePass :

« MasterCard, AMC , MasterCard. , MasterCard». —

, . - , , .

: BIN .

.

, . . .

, -, :

.
API.
, .
.

: .

API API

( ) , GO2CINEMA, API API.

: API, API.
: API?
: API .
: ?
: .
: API .

, , API .

: API API , API.

. .

, , 1) 2) .

98%

, , .

? :

, .

. ?

: ( , ).

, .

, , (. Reddit). .

X IP- ( « IP-»), , - . — «, », .

, - Excel . API — , -.

, , API. .

Source: https://habr.com/ru/post/353348/

All Articles