By one definition, parsing is the syntactic analysis of information. To anyone not involved in collecting and processing data for Internet projects, that means little, and the definition itself only hints at the enormous amount of work that hundreds of millions of people and tens of millions of robots (virtual, yet no less real) around the world perform every minute. Yet for an ordinary person the task is utterly mundane: comparing ticket prices online, choosing the right electronics across store websites, watching prices and promotions in the handy mobile app of the nearest hypermarket. None of us doing that would ever think of calling ourselves a parser.

Nevertheless, parsing as a business exists, works, and is, of course, the subject of lively debate on many levels: ethical, legal, technological, financial, and beyond.
This article does not take a definite position, give advice, or reveal secrets. We simply review some of the opinions found in the most interesting comments on one particular article about parsing on Habr (50k views and over 400 comments!), interpreting them through the lens of our own experience with web-parsing tasks. In other words, we spent a lot of time gathering and classifying the most interesting reader comments... worldly wisdom, so to speak :)
So, about parsing:
"The case of technology." Fantastic proxies and where they live.
The idea of parsing is as natural (it is always interesting to see how things are going at the "neighbors'") as its basic implementation is simple. If you want to know something, ask; but if you want the actual values of a large data array (be it product prices, descriptions, quantities available for order, or hot discounts), you will have to "ask" a lot and often. Clearly nobody would seriously try to collect this data by hand (short of hiring a large brigade of hardworking fellows, which hardly counts as humane), so the straightforward solutions get built head-on: pick the target site, configure the browser, assemble the bots, "knock" on the site for the indicators of interest, carefully write the answers down in a "notebook" in a convenient format, analyze the collected data, repeat.
Here are some approaches to the "technique of parsing" from our readers and from us (illustrative code sketches follow the list):
- "A farm of Selenium - and forward!" (meaning headless browsers plus BeautifulSoup-style tooling, such as Selenium or Splinter). One reader wrote a small site on a Docker Swarm cluster for his wife (an importer) to monitor merchant sites for violations of her recommended-retail-price (RRP) policy. According to the author, everything runs stably and the parsing economics work out: "the whole cost is 4 nodes at $3 each." True, the proud author has only about a thousand products and a dozen sites to parse, no more :)
- "We launch Chromium and everything is OK: we get one product every 4-5 seconds, and that suits us...". Naturally, no admin is pleased by a spike in server load. The site exists to provide information to everyone interested, but "there are many of you and only one of me," so the especially zealous are the first to get ignored. No matter: Chromium comes to the rescue. If the browser knocks on the site in "we're only asking" mode, it can get by without the queue. Indeed, in the general mass of parsing tasks, plain HTML pages cover about 90% of cases, and in the "especially serious" ones (when sites actively defend themselves and ask for captchas, like Yandex.Market), Chromium handles it.
- "Clean proxy do-it-yourself from LTE routers / modems." There are quite working ways to configure clean proxies that are suitable for parsing search engines: a 3G / 4G modem farm or the purchase of a “white” proxy instead of a set of random “dirty” proxy servers. It is important here what programming language is used for such industrial parsing - 300 sites per day (and the correct answer is .Net! :). In fact, the Internet is full of sites with open proxy lists, 50% of which are completely working, and from these sites it is not so difficult to parse proxy lists, so that with their help you can parse other sites:) ... Well, we do that.
- Another case in favor of Selenium: "I do parsing myself (not in the Runet, though; I pick up orders on everyone's beloved upwork.com, where it is usually called scraping - a more fitting term, IMHO). My ratio is somewhat different, around 75 to 25. But overall, yes: when it is tedious or hard, nobody has yet outdone Selenium :) Still, out of the several hundred sites I have worked with, it never came to image recognition to extract the target data. Usually, if the data is not in the HTML, it gets pulled in as some JSON (an example is shown below)."
- "Python Handlers". And another reader case: “In the past, I used Python / Scrapy / Splash for 180+ sites of different sizes per day from prisma.fi and verkkokauppa.com to some trivia with 3-5 products. At the end of last year, they hired such a server from Hetzner (https://www.hetzner.com/dedicated-rootserver/ax60-ssd) with Ubuntu Server on board. Most of the computing resources are still idle.
- "WebDriver is our all." Being engaged in automation in general (where already parsing falls), as reliable as possible (QA tasks). A good workstation, a dozen different browsers in parallel - the output is a very wicked-fast thresher.
“Gentlemen's dialing” of the parser - 4 virtual machines, unlimited traffic, 4 processors on each, 8 GB of memory, Windows Server ... So far enough, for each new batch of conditionally 50 sites - we need our own virtual machine. But highly dependent on the sites themselves. Also in Visual Studio there is System.Net, which actually uses Internet Explorer installed in Windows. It works too.
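
To make the first recipe concrete, here is a minimal sketch of the headless-browser approach (Selenium driving Chromium). The URL and CSS selectors are hypothetical placeholders, not taken from any reader's setup:

```python
# Minimal headless-Chromium price check via Selenium (Selenium 4).
# The URL and selectors below are made-up placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")          # run Chromium without a window
options.add_argument("--window-size=1920,1080")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://shop.example/catalog/item-123")   # placeholder URL
    name = driver.find_element(By.CSS_SELECTOR, "h1.product-title").text
    price = driver.find_element(By.CSS_SELECTOR, "span.price").text
    print(name, price)
finally:
    driver.quit()
```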
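
The proxy-list recipe can look roughly like this: walk a list of (mostly half-dead) public proxies until one answers. The addresses below are fake examples from the reserved TEST-NET ranges:

```python
# Rotate through a list of public HTTP proxies until a request succeeds.
# Roughly half of any open proxy list is dead, so failures are the norm.
import requests

PROXIES = [
    "203.0.113.10:3128",    # placeholder addresses (TEST-NET range)
    "198.51.100.22:8080",
]

def fetch_via_proxy(url: str) -> str | None:
    for proxy in PROXIES:
        try:
            resp = requests.get(
                url,
                proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
                timeout=10,
            )
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            continue                    # dead or banned proxy: try the next
    return None                         # the whole list failed

html = fetch_via_proxy("https://example.com/")
```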
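
And the observation that missing data "gets pulled in as some JSON" often reduces to finding an embedded JSON blob (JSON-LD, for instance) instead of fighting the markup. A sketch with invented HTML:

```python
# Pull structured data out of an embedded JSON-LD <script> block
# instead of scraping the rendered markup. The HTML here is invented.
import json
from bs4 import BeautifulSoup

html = """
<html><head>
<script type="application/ld+json">
{"@type": "Product", "name": "Widget", "offers": {"price": "19.99"}}
</script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")
blob = soup.find("script", type="application/ld+json")
data = json.loads(blob.string)
print(data["name"], data["offers"]["price"])   # -> Widget 19.99
```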
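
Finally, the Python/Scrapy setup from the reader above, stripped to a bare spider (Splash rendering omitted for brevity; the domain and selectors are placeholders):

```python
# Bare-bones Scrapy spider: walk a catalog, yield name/price, follow paging.
# Run with: scrapy runspider price_spider.py -o prices.json
import scrapy

class PriceSpider(scrapy.Spider):
    name = "prices"
    start_urls = ["https://example.com/catalog"]      # placeholder

    def parse(self, response):
        for product in response.css("div.product"):   # assumed markup
            yield {
                "name": product.css("h2::text").get(),
                "price": product.css(".price::text").get(),
            }
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```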
"How do you protect yourself from parsing sensibly? No way - they will crawl you anyway"
Speaking of our own business, ideas for building a business on parsing are pitched to us constantly:
- Parsing Yandex search results, as many SEO services do. "There is more demand for it, more money. True, they mostly sell a whole SEO-analytics system around it." But we do not parse search results: nobody asked us to, and after a mere hundred requests a captcha appears; you need clean proxies, and those are hard to get or expensive - not so profitable... Parsing the big players is far from easy, and readers confirm it (we ourselves do NOT parse Google or Yandex). In their experience, Yandex, Google and similar large corporations maintain a database of data-center subnets (it is kept up to date, and the big players get fingerprinted and banned). So a proxy network raised on IP addresses issued to data centers flies straight into a ban, complete with captchas and other quirks. What remains are the illegal options: buying proxies from botnet owners and similar "dirt," in which case you do get a real user's IP. Even then, for such corporations you badly need "lived-in" cookies, ones that have already traveled with you around sites that can track you (visitor counters, for example). But how do they distinguish parsers from NAT'd users in residential districts at all? The proverbial 100 requests are nothing.
- Protection against parsing. Setting aside the "great and terrible," let us focus on ourselves, "mere mortals." If there are people who parse, there must be people who try to stop them. Playing against real people is more interesting: an element of rivalry appears, each side trying to outwit the other. And since nobody intends to collect information by hand, the game becomes who can build a bot most resembling a live person, and who can recognize those bots most efficiently while still serving real users' requests - a site is meant to help the business, and that is our starting point. Staying within the frame of business efficiency, one cannot ignore the rational allocation of resources and the cost-effectiveness of measures both for parsing and for countering it:
- You cannot protect yourself from parsing (except from "students"), but you can raise its cost threshold, in both time and money. As a result, the data we protect (a few sections of the site) is often easier not to parse at all but to go and buy as a ready-made database, just as we buy them. Tables of known parser IP addresses float around the net; showing that list a captcha at the door is not a problem. Likewise, generating randomized ids and class names, as mail.ru does, is not a problem either and costs next to nothing. Google's new captcha is very accurate at telling robot from human; if in doubt, cut the user off and ask for a captcha - simple. Finally, nobody has cancelled the honeypot trap for catching bots (see the sketch after this list). And the classics: swap letters in the text, apply masks, and so on.
- And here we will object to ourselves: perhaps none of this helps on its own, but together it complicates life enough to make parsing impractical, and none of these techniques costs much. Then again, all of them are well understood and circumvented, so in practice there is no protection: dynamic proxies, captcha-recognition services staffed by Indians, and Selenium with a well-scripted sequence of actions will get through. All you achieve is making the parser's development more expensive. That may scare someone off, but unless the target site is a page-and-a-half directory of the local Horns and Hooves office, the extra cost scares few.
- Protection always comes down to modeling the typical behavior of real visitors, plus systems that reliably identify "white" bots (Yandex, Google, etc.). To pass for a real visitor you need a set of typical navigation maps, and a simple proxy pool will not get you past that. Such a system is not 100% protection, but it solves the problem: from viewing statistics you can tell when the entire site has been scanned, and only parsers or search engines do that. Search engines, however, obey robots.txt; parsers do not.
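
The honeypot mentioned in the list above is easy to picture: a link invisible to humans but present in the HTML, so only bots that blindly follow every href will hit it. A minimal sketch using Flask (the framework choice and routes are our own illustration, not anything from the comments):

```python
# Honeypot trap: humans never see the hidden link, naive crawlers follow it.
from flask import Flask, request

app = Flask(__name__)
suspected_bots = set()

@app.route("/")
def index():
    # The link is hidden with CSS, so real visitors cannot click it.
    return '<a href="/trap" style="display:none">do not follow</a> ...page...'

@app.route("/trap")
def trap():
    suspected_bots.add(request.remote_addr)   # flag the IP for later captchas
    return "", 204
```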
"Oy vey. If everyone did everything properly... I think there would be ten times more unemployed. There will be enough to last our lifetime."
"Do I live cleanly? Yes, but in vain"
- On the moral and ethical plane there is an important point that touches both the technical and the legal side of parsing: the file robots.txt, laconic in its simplicity and symbolic in its very name, which our readers (and we ourselves) interpret differently:
- Your activity as the bot's "driver" is exactly as "ethical" as your bot is compliant with the robots.txt of the visited site: not by assumptions like "product pages are never closed," but by literally applying the allow and disallow masks to the requested URLs (a standard-library sketch follows this list). If robots.txt is missing, interpret that in your favor; if it is present and you violate it, you are clearly abusing the site. Of course robots.txt does not have the force of law, but if it does come to lawyers, it is not a fact they will simply pass it by.
- Even though you cannot negotiate with robots, sometimes that is easier than with people: shops put up "no photography" signs, which is illegal. And unethical. "It's just a tradition. robots.txt is a technical device; it says nothing about ethics. If you want to state that you do not want to be parsed, make a section like account.habr.com/info/agreement. I do not know whether such a restriction would be legally binding, but at the very least you can state your wishes there in human language (or mention robots.txt), and then one can talk about ethics." Our lawyers parry: "No way will such a restriction be legally binding."
- We think about parsing and about the further use of the information in the same breath. "robots.txt is not so much about parsing as about further publication (in search results, for example). If you want nobody to receive the data, restrict the circle of those who can see it. If you have no curtains on your windows, you should not walk around naked. Deliberately staring into windows may be ugly, but without any curtains, what complaints can you have?"
- Parsing itself is ethically neutral; it is the use of the obtained information that may be unethical. "Purely in terms of ethics, everyone has the right to receive public information that is not private or restricted and is not protected by law. Prices are exactly that: public information. So are descriptions, although descriptions may be covered by copyright and then cannot be republished without permission. But no ethics are violated even if I parse competitor sites and build my own public site showing price dynamics and comparisons. That is even ethical, since it provides socially useful information."
- "You may collect by hand, but you may not parse with a robot." Any "evil" can be justified with enough diligence and skill, and parsing all the more so, especially since there are living examples of it being used correctly in every sense. Quoting our reader: "I did parsing a long time ago, and I was always asked to do it in a legally and morally correct way. Several times intermediaries asked me to parse a wholesaler (in order to sell his goods); the wholesaler himself did not mind but had no intention of investing in an API (or could not for technical reasons). Once an intermediary of a Chinese store asked for an integration, but the store's API was so dumb and limited that part of the information had to be obtained by parsing. Once the author and owner of a site and forum wanted to migrate off a free platform that held the database hostage. I also integrated a literary contest site with its forum, so that adding a new story automatically created a forum topic (for technical reasons it could not be done any other way)."
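
For the "literal" reading of robots.txt advocated in the first item above, the Python standard library is enough; here is a small sketch (the URL and user-agent string are placeholders):

```python
# Check a URL against robots.txt allow/disallow rules before fetching it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                 # fetch and parse the file

url = "https://example.com/catalog/item-123"
if rp.can_fetch("MyParserBot/1.0", url):
    print("robots.txt permits fetching", url)
else:
    print("disallowed: an 'ethical' parser skips this URL")
```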
"Did anyone call a lawyer? You may quote, you may not parse"
Wherever you stand on what the source of power is, money or truth, one thing is clear: where money starts to flow, truth becomes ever harder to find. Leaving outside the scope of this article the debate about whether "money signs" can buy anything and everything, including the law itself and its representatives, let us consider some of the legal aspects raised in the comments:
- "From peeping to theft - one step." Even if everything that is not forbidden is allowed, then, our readers believe, “to pry into the keyhole is at least ugly, and if the client then also gives away the falsity for his own, then this is direct theft. Of course, it is clear that in business everyone is doing it. But in a decent society, it is still customary to keep silent about it. ”However, to parse for someone and to give a spouse for their, as they say, two big differences:“ You confuse soft and cold. We really provide a parsing service. But it is exactly the same way to blame manufacturers, for example, weapons for killing with its help. We do business, and in business there is one rule - legal or not. My point of view ... If customers come to us and are willing to pay a lot to get data - is that really bad ... "
- "Made an application for the media site - nailed for the complaint." Forbes site, parsing, app on google play - what could go wrong? “I once decided to make an application for the Forbes website. To get articles from the site - made the parsing pages. I set everything up automatically and made an application for Android. I posted the application in the market. A year later, a lawyer contacted me and demanded to remove the application, because I am violating copyrights. Did not argue. It's a shame that Forbes itself has no application for their own articles from the site. There is only a site. And their website is slow, it takes a long time to load and is hung with ads ... ”
- "My database is my protected work!" Copyright is another concept to which one could devote a dozen pages of discussion (on top of the hundreds of thousands already written), but not mentioning it at all would also be wrong. One reader framed it like this: "Someone built a database of goods, spending considerable resources on finding the information, systematizing it, and loading it into the database. You, at a competitor's request, parse that database and hand it to that very competitor for money. You think there is no ethical problem here? As for legality, I do not know how it is in the Russian Federation, but in Ukraine a database can be subject to copyright."
However, responsibility for a service or product still lies with whoever obtains it and whatever they use it for: "...in Russia too. We provide a data-collection service, and it is the service we charge for; we do not sell the data itself. I warn all customers, by the way, that they may violate the law if they reuse, say, the descriptions." - "Formally you are right, but an article can still be found for you!" The Criminal Code of the Russian Federation (Article 146) describes only the scale of violations that lets a copyright infringement qualify as a criminal act; the rights themselves are described in the Civil Code. And on that scale, regular parsing intense enough to raise the question "will the site stay up?" can be stretched to qualify without much trouble. But the details matter:
- "Large scale" there is measured not in the number of parsed pages but in money. How do you even price parsing (and its regularity) as copyright infringement (!) in money? The way it is usually done in such cases, and the way fines of hundreds of thousands of dollars for one copy of a film appear, is computed "lost profit" with an appropriate multiplier. It can be derived from existing contracts: how much would it cost to buy the same information from you legally, and "dance" from there. But for a start you must actually have been selling it (rather than publishing it openly); a figure invented after the fact will not fly. Though there are risks here too: do you know how much a commercial license for, say, Consultant Plus costs? As soon as you go beyond the dozen basic laws, you quickly run into an offer to buy that very commercial version.
- Our story is definitely not material for a criminal case (and do not confuse fines with damages: smash a beer bottle over a hooligan and the damage is 30 rubles, the fine up to 1000, while a civil suit can then claim at least a trillion in "lost profits," which is no longer a fine). If you do not sell the prices at all, what exactly is the expert supposed to appraise? Certainly not a case where "a good lawyer will pull it off without problems."
Summing up: "- How did parsing come to equal copyright infringement? - It did not. The infringement is ordering parsing from us and then dumping the content onto your own site. And 'taking the site down' is a different article altogether."

Maxim Kulgin, xmldatafeed.com