There have already been several articles on Habr about the difficulties of obtaining access to the list of prohibited sites, keeping it up to date, and actually using it. This article is a logical continuation of the criticism already voiced by others (including in the comments). Let me say right away that I am not an employee of any ISP.
So, suppose you are going to provide your customers with Internet access, or, more simply, to become an ISP. To earn customer loyalty, you have decided to buy a sophisticated DPI system, block forbidden information by URL, and not block anything extra. No filtering by domain or IP, only by URL! All the legal, bureaucratic, ethical, and monetary issues are settled; only the technical ones remain. All that is left is to take a ready-made automatic downloader of the list of prohibited sites and set up automatic loading of this list into the DPI system in a format it understands, i.e. to write a converter script. Here I have to disappoint you: a working converter cannot be written. It cannot be written until Roskomnadzor gets moving, changes the data format, and fixes the obvious errors in the existing entries of the list.
Let's start with the format in which Roskomnadzor provides the list of prohibited information. It is XML conforming to the XSD schema published as part of the Memo to the carrier. Or, to put it more simply, XML containing a sequence of blocks like the following:
<content id="105" includeTime="2012-11-11T15:39:37"> <decision date="2012-11-04" number="2/1/16402" org=""/> <url>http://go-****.com/workshop/</url> <domain>go-****.com</domain> <ip>62.75.***.***</ip> </content>
According to the schema, a <content> block may contain from zero to unbounded <url> tags, followed by zero or one <domain> tag, and then from one to unbounded <ip> tags. We are obviously interested in the <url> tags. So it would seem that you just need to block access to the listed URLs.
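For orientation, extracting those fields from the dump takes only a few lines. A minimal sketch, assuming the dump has already been downloaded to a local file (the name dump.xml is made up here; the real dump is fetched from the Roskomnadzor web service):

    import xml.etree.ElementTree as ET

    # Walk the <content> blocks and pull out the fields the schema allows.
    tree = ET.parse("dump.xml")
    for content in tree.getroot().iter("content"):
        urls = [u.text for u in content.findall("url")]    # 0..unbounded
        domain = content.findtext("domain")                # 0 or 1
        ips = [i.text for i in content.findall("ip")]      # 1..unbounded
        print(content.get("id"), urls, domain, ips)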
Now let's look at how similar tasks were solved before, and what clever words have been said on the subject.
Access to URLs on servers has to be closed not only to people but also to robots. For that, the /robots.txt file with its documented syntax is used. No less important, the semantics is described in detail in the same document, i.e. the exact rules for interpreting each record when deciding whether a robot may visit a given URL:
    The matching process compares every octet in the path portion of
    the URL and the path from the record. If a %xx encoded octet is
    encountered it is unencoded prior to comparison, unless it is the
    "/" character, which has special meaning in a path. The match
    evaluates positively if and only if the end of the path from the
    record is reached before a difference in octets is encountered.
That is, for a record to apply, the path in the robots.txt record must be a prefix of the path in the URL. There are simply no corresponding rules on the Roskomnadzor website, and that, from my point of view, is a bug. Maybe we should try to write these rules for them? It also seems that filtering by prefix, rather than by exact URL match, is more appropriate when blocking content from people: Roskomnadzor is not going to list every single URL of a site that should be blocked entirely!
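Purely for illustration, the draft's matching rule fits into a few lines of Python. This is my own sketch of those semantics, not anyone's official matcher:

    from urllib.parse import unquote

    def robots_style_match(record_path: str, url_path: str) -> bool:
        # Decode %xx octets before comparison, but keep an encoded "/"
        # ("%2F") intact, as the draft requires; the double-encoding
        # trick below makes unquote() leave it alone.
        def normalize(path: str) -> str:
            return unquote(path.replace("%2F", "%252F").replace("%2f", "%252f"))
        return normalize(url_path).startswith(normalize(record_path))

    # robots_style_match("/workshop/", "/workshop/%70age")  -> True
    # robots_style_match("/workshop/", "/work")             -> False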
Another extension of the robots.txt standard is Clean-param. It indicates which GET parameters are insignificant, i.e. should not be taken into account when comparing URLs. The notion of parameter significance matters here: it would clearly be bad if a user could bypass a block simply by inserting unblock_me=1& after the question mark in the URL. For blocking content from people, though, it would be more correct to speak of significant parameters, and to accept that the order of those parameters genuinely does not matter.
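For reference, a Clean-param record looks roughly like this (the parameter name and path are invented for the example; the record tells robots to ignore the sid parameter on the given path when comparing URLs):

    User-agent: Yandex
    Clean-param: sid /forum/showthread.php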
Putting all this together, the following speculative but plausible interpretation of the meaning of URLs in the registry of banned sites emerges (a code sketch follows the list):
- Before applying the following steps, both the URL in the registry and the URL requested by the user must be normalized by decoding %xx-encoded octets, except for the "/" character.
- If the URL in the registry does not contain the "?" character, then, for the block to apply, the registry URL must be a prefix of the URL the user requested.
- If the URL in the registry does contain the "?" character, then, for the block to apply, the registry URL and the user's URL must match character by character up to and including the first "?", and the set of GET parameters in the registry URL must be a subset of the set of GET parameters in the user's URL. GET parameters are taken to be separated by the "&" character.
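Here is that interpretation made executable, a minimal Python sketch of my own guesswork; the open questions about case and encoding are deliberately left untouched:

    from urllib.parse import unquote

    def decode_except_slash(s: str) -> str:
        # Step 1: decode %xx octets, leaving an encoded "/" intact.
        return unquote(s.replace("%2F", "%252F").replace("%2f", "%252f"))

    def registry_entry_matches(registry_url: str, user_url: str) -> bool:
        reg = decode_except_slash(registry_url)
        usr = decode_except_slash(user_url)
        if "?" not in reg:
            # Step 2: plain prefix match.
            return usr.startswith(reg)
        # Step 3: everything up to and including the first "?" must match...
        reg_base, _, reg_query = reg.partition("?")
        usr_base, sep, usr_query = usr.partition("?")
        if not sep or usr_base != reg_base:
            return False
        # ...and the registry's GET parameters must be a subset of the user's.
        return set(reg_query.split("&")) <= set(usr_query.split("&"))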
But even this scheme is incomplete. Should the comparison be case-sensitive? And what would case-insensitivity even mean, given that all we have is bytes in an unknown encoding?
And most importantly, the existing data in the registry does not fit this scheme. Just look at records like these:
    <url>http://*******tube.ru/index.php</url>
    <url>http://********.kiev.ua/index.php</url>
    <url>http://***forum.org/index.php?s=3a95f6da301a36067be68329be6f88a8&showforum=8</url>
    <url>http://****lib.net/b/27415/read#t16</url>
In the first two cases I, as a mind reader, can tell that the intent was obviously to block the site entirely; instead, only URLs whose path starts with /index.php get blocked. In the third, the parameter s looks like an insignificant session identifier. In the fourth, there is a fragment identifier (the part after "#"), which the browser does not even send to the server. In short, the source data is too dirty for the scheme to work.
And even if it did work, I would not use this scheme. While writing it down, I made too many attempts to guess something, and the whole thing strongly resembles an attempt to write yet another clone of libastral. When interpreting laws, such guesswork is unacceptable.
So if someone tries to sell you a URL-filtering system based on official data from zapret-info.gov.ru, do not believe them: it is definitely a scam. Until Roskomnadzor makes its data truly machine-readable and unambiguously interpretable, such solutions simply cannot work. As of today, the <url> tag in the XML dump of the registry is only suitable as background information for checking the validity of a site's inclusion in the registry. What is really needed is not URLs but filtering rules.
Now let's talk about what Roskomnadzor could do to correct the situation, in which people speak of URL filtering based on its lists as possible in principle, while in reality no such possibility exists.
The simplest (and, in my opinion, the most correct) way is to leave everything as it is, but to publicly admit that the information in the registry of prohibited sites is not intended for, and may be unsuitable for, filtering by URL. The <url> tag should stay: as already mentioned, it is useful for checking blocking decisions and thus keeps the process transparent.
A more complicated way is to write down the rules for interpreting the contents of the <url> tag (as I tried to do above) and bring the existing contents of the database into line with them.
Another way is to rework the XML structure. Instead of a single <url> tag, introduce a structure that can store a prefix, required GET parameters, and possibly other information describing a group of related URLs. Such a structure could then be turned into a regular expression for "acl aclname url_regex" in Squid, or into a glob for "match protocol http url" in Cisco NBAR.
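As an illustration, here is a tiny hypothetical converter from such a structure to a Squid ACL pattern. The structure's fields are invented, since no such XML exists yet:

    import re

    def to_squid_url_regex(prefix: str, required_params: list[str]) -> str:
        # Squid compiles POSIX regular expressions, which have no
        # lookaheads, so arbitrary parameter order would need one
        # alternative per permutation; for brevity this sketch handles
        # at most one required parameter.
        pattern = "^" + re.escape(prefix)
        if len(required_params) > 1:
            raise NotImplementedError("emit one alternative per permutation")
        if required_params:
            pattern += r"\?(.*&)?" + re.escape(required_params[0]) + "(&.*)?$"
        return pattern

    # to_squid_url_regex("http://example.com/forum/index.php", ["showforum=8"])
    # -> ^http://example\.com/forum/index\.php\?(.*&)?showforum=8(&.*)?$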
And here is why I consider the first way the most correct. Roskomnadzor appears to have implemented roughly this workflow: a complaint arrives with the URL of illegal content, moderators check it, and the same URL is entered into the database. In this process there is no step for converting the complained-about URL into machine-readable filtering rules. And if you try to add this (essentially manual) step, you need to find people capable of performing it. You also need to explain to these people that a machine cannot read minds, and to find ones psychologically stable enough to put themselves in the machine's place and check whether a rule really works as intended. A difficult task for the HR department! Whereas if the <url> tag is left purely informational, and this is stated explicitly, providers will lose the urge to spend money on a sophisticated DPI system that in fact cannot be used for its intended purpose.