
A peaceful botnet

Modern search engines can organize huge amounts of information on their own, letting you quickly find material on any topic. But when it comes to searching for goods in online stores, vacancies in the databases of recruiting agencies, or car offers on dealership sites, that is, searching any cataloged information on the Internet, there is little independence to speak of: in most cases the engines require the source sites to upload their catalogs (as a data feed) in a special format.

Automatically extracting facts from catalogs that have no semantic markup is not an easy task, but it is still much simpler than extracting facts from arbitrary unstructured text.

Technology


We have developed a technology for building full-fledged catalog-oriented search engines that do not require data to be submitted in any formalized form and, with the help of a somewhat unusual search robot, can independently extract information from arbitrarily structured web catalogs written in any language. What is unusual about the robot is that it is a JavaScript program that analyzes web pages by loading them into a frame (iframe) through a web proxy. This approach, controversial at first glance, has many advantages.
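Purely as an illustration of the idea (the /proxy endpoint and all names below are assumptions, not our actual code), loading a page through a same-origin web proxy is what lets a script inspect the rendered DOM of a foreign site despite the browser's same-origin policy:

```javascript
// A minimal sketch: load a catalog page into a hidden iframe through a
// same-origin web proxy so the robot can read the fully rendered DOM.
// The "/proxy" endpoint and the function names are hypothetical.
function loadViaProxy(targetUrl, onReady) {
  var frame = document.createElement('iframe');
  frame.style.display = 'none';
  // The proxy fetches the target page and serves it from our own origin,
  // so frame.contentDocument stays accessible to the robot's script.
  frame.src = '/proxy?url=' + encodeURIComponent(targetUrl);
  frame.onload = function () {
    onReady(frame.contentDocument);
  };
  document.body.appendChild(frame);
}

loadViaProxy('http://catalog.example.com/offers?page=1', function (doc) {
  console.log('Loaded:', doc.title);
});
```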
The JavaScript robot “sees” sites exactly as ordinary users see them. This lets it process even those sites whose content is partially or fully generated by JavaScript and is therefore inaccessible to a regular crawler. In addition, the ability to emulate various events (button clicks, for example) allows the JavaScript robot not only to view dynamic sites but also to navigate through them.
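For example, walking through a dynamic catalog can come down to firing the same events a user would. A sketch (the pagination selector here is an assumption about a typical site, not part of our code):

```javascript
// A sketch: emulate a user's click on a "next page" control inside the
// proxied frame so the site's own JavaScript handlers load the next page.
function clickNextPage(doc) {
  var next = doc.querySelector('a.next, .pagination a[rel="next"]');
  if (!next) return false;
  var evt = doc.createEvent('MouseEvents');
  evt.initEvent('click', true, true); // bubbles, cancelable
  next.dispatchEvent(evt);            // triggers the site's own handlers
  return true;
}
```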

When analyzing web pages, traditional search engines focus mainly on content, paying little attention to its presentation (the design). A catalog search robot has to act even more selectively, extracting only specific facts from that content. Analyzing catalog pages based on their design, or more precisely, on how a user sees them, lets the JavaScript robot do this job better.
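A crude sketch of what "analyzing by design" can mean: scoring DOM nodes by their computed style rather than by markup alone, on the assumption that titles and prices tend to stand out visually. The weights below are invented for illustration:

```javascript
// A sketch: rank elements by visual prominence (font size and weight).
function visualScore(el) {
  var cs = el.ownerDocument.defaultView.getComputedStyle(el, null);
  var size = parseFloat(cs.fontSize) || 0;
  var bold = (cs.fontWeight === 'bold' || parseInt(cs.fontWeight, 10) >= 700) ? 1 : 0;
  return size + 4 * bold; // invented weights, for illustration only
}

// Return the visually most prominent element on the page.
function mostProminent(doc) {
  var best = null, bestScore = -1;
  var nodes = doc.body.getElementsByTagName('*');
  for (var i = 0; i < nodes.length; i++) {
    var s = visualScore(nodes[i]);
    if (s > bestScore) { bestScore = s; best = nodes[i]; }
  }
  return best;
}
```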

Running an “army” of search robots requires considerable computational resources. Unlike traditional crawlers, which need dedicated software and hardware, the JavaScript robot can be embedded directly into the search engine's own site, which makes it possible to use the computing power of end users' browsers while they work with the site. The result is something between a botnet and a peer-to-peer network: the site provides the user with information, and the user helps the site with computing power.
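Schematically, this "peaceful botnet" part can be pictured as a simple task loop embedded in the search engine's pages. The endpoints and field names below are assumptions, and the loop reuses the loadViaProxy sketch above:

```javascript
// A sketch of the distributed-crawling loop: while the visitor browses the
// site, the embedded robot polls the server for tasks, processes them in
// the browser and posts the results back. Endpoints are hypothetical.
function workLoop() {
  fetch('/api/next-task')
    .then(function (r) { return r.json(); })
    .then(function (task) {
      if (!task || !task.url) {
        setTimeout(workLoop, 5000); // nothing to do, check again later
        return;
      }
      loadViaProxy(task.url, function (doc) {
        // Placeholder for real fact extraction (sketched further below).
        var facts = { title: doc.title };
        fetch('/api/result', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ taskId: task.id, facts: facts })
        }).then(function () { setTimeout(workLoop, 1000); });
      });
    });
}
```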

How does all this work?


While creating the technology, we followed this rule: if there is a task a machine can handle in a long but acceptable time, even though a person could solve it much faster, the task is given to the machine, because, first, human time is priceless and, second, our technology gives us free access to the enormous computing capacity of end users.

To connect a new catalog to the search engine, in most cases it is enough to specify the URL of the catalog's first page. If no such URL exists, you specify the “nearest” one instead. The system loads it into the frame through the web proxy and closely watches the actions of the user, who only has to demonstrate how to get to the beginning of the catalog as quickly as possible. A demonstration may also be needed if the site uses an unusual navigation scheme. In other words, connecting a catalog is exactly as hard as it is for an ordinary user to reach its first page.
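One way to picture such a demonstration (purely a sketch with invented names, not our actual mechanism): the system records which elements the user clicks inside the proxied frame and stores them as a path the robot can replay later.

```javascript
// A sketch: record the user's demonstration as a list of CSS paths of the
// elements they clicked; the robot could replay this path later to reach
// the first catalog page on its own.
function recordDemonstration(doc, onStep) {
  doc.addEventListener('click', function (e) {
    onStep(cssPath(e.target));
  }, true); // capture phase, so we see the click before the page handles it
}

// Build a simple (not necessarily unique) CSS path for an element.
function cssPath(el) {
  var parts = [];
  while (el && el.nodeType === 1 && el.tagName !== 'HTML') {
    var part = el.tagName.toLowerCase();
    if (el.id) { parts.unshift(part + '#' + el.id); break; }
    if (el.className) part += '.' + String(el.className).trim().split(/\s+/)[0];
    parts.unshift(part);
    el = el.parentNode;
  }
  return parts.join(' > ');
}
```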

Each new site then goes through a research stage, during which the robot uncovers its structural features. This later helps it recognize the required data more reliably and also makes it possible to track changes in the site's design and respond to them adequately. This stage is fully automatic and requires no human participation.
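The research stage can be imagined roughly like this (a sketch, not our actual algorithm): the robot looks for large groups of sibling elements with the same structure, which on catalog pages usually correspond to rows of offers rendered from a single template.

```javascript
// A sketch: find containers whose children repeat the same tag/class
// signature, a common sign of a catalog listing.
function findRepeatedBlocks(doc, minRepeats) {
  var result = [];
  var all = doc.getElementsByTagName('*');
  for (var i = 0; i < all.length; i++) {
    var counts = {};
    var children = all[i].children;
    for (var j = 0; j < children.length; j++) {
      var key = children[j].tagName + '.' + children[j].className;
      counts[key] = (counts[key] || 0) + 1;
    }
    for (var k in counts) {
      if (counts[k] >= minRepeats) {
        result.push({ container: all[i], signature: k, repeats: counts[k] });
      }
    }
  }
  return result;
}
```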

After the research is complete, the robot proceeds to extract information, relying on what it has learned while working with previously connected catalogs. If that knowledge is not enough and the robot cannot identify and extract all the necessary facts on its own, the system makes it possible to train the robot further on the new catalog: the problem page is opened through the web proxy, and the user only has to point out (and, if necessary, explain to the system) the facts it failed to recognize.
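Schematically, extraction with a fallback to human training might look like this (the rule format and field names are invented for illustration):

```javascript
// A sketch: apply previously learned extraction rules (selectors plus
// normalizers); anything that cannot be extracted is reported back so a
// human can point it out on the problem page.
function extractOffer(doc, rules) {
  var offer = {}, missing = [];
  for (var field in rules) {
    var node = doc.querySelector(rules[field].selector);
    if (node) {
      offer[field] = rules[field].normalize
        ? rules[field].normalize(node.textContent)
        : node.textContent.trim();
    } else {
      missing.push(field); // ask a human to show this fact on the page
    }
  }
  return { offer: offer, missing: missing };
}

// An example rule set a trained robot might have accumulated (illustrative).
var rules = {
  price:   { selector: '.price',
             normalize: function (s) { return parseFloat(s.replace(/[^\d.]/g, '')); } },
  address: { selector: '.address' }
};
```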

It may seem that our technology requires constant human participation. It does not; I have described the worst-case scenario. With each newly connected catalog the system gets smarter, and human involvement is needed less and less.

Security


Loading potentially unsafe pages of catalog sites into the end user's browser could result in the user's computer being “infected” with malicious software. We are aware of how serious this problem is. At the moment the robot is disabled in Internet Explorer (and, incidentally, on mobile platforms too, though for other reasons). We are also working on verifying downloaded resources with the Google Safe Browsing API. And since the robot opens all pages only through the web proxy, the proxy obviously has the ability to analyze their contents; we are now considering how best to use this to maximize the security of end users.

On the other hand, nothing prevents users themselves from trying to falsify the results and send bogus data to the server. To rule out forgery, a task is considered completed only when the same result has been received from several robots running on different computers.
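The anti-forgery check boils down to a simple consensus rule. Schematically (names, the serialization of results and the threshold are all illustrative):

```javascript
// A sketch of the consensus rule: a task is accepted only when the same
// result has been reported by enough robots on different machines.
function isTaskConfirmed(reports, requiredAgreement) {
  var votes = {};
  for (var i = 0; i < reports.length; i++) {
    // One vote per client, so a single machine cannot confirm its own result.
    var key = JSON.stringify(reports[i].result);
    votes[key] = votes[key] || {};
    votes[key][reports[i].clientId] = true;
  }
  for (var k in votes) {
    if (Object.keys(votes[k]).length >= requiredAgreement) {
      return JSON.parse(k); // the confirmed result
    }
  }
  return null; // not enough independent agreement yet
}
```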

Better to see once than hear a hundred times


To demonstrate the technology in action, we created the search engine Maperty, a map of rental property offers. For now Maperty is just a test bench for our experiments and a demonstration of the technology at work: while the user works with the map, the search robot loads and processes new offers from real estate agencies.

Right now the map is empty, but about 10 thousand offers in Moscow, St. Petersburg, Kiev, Belarus, Estonia, Poland and Ireland are waiting to be processed. We hope that with the help of the habraeffect this will take just a few hours, but the system's architecture is such that the data appear on the map not instantly but in portions, whose size gradually grows toward the end of the day.

Everyone interested is invited to www.maperty.ru. The robot is activated only in Chrome, Firefox, Safari and Opera, and it takes it about a minute to process one offer, so if you want to take part in our little experiment, please do not close the browser window just because you see an empty map.

What (how and where) was used


Google App Engine for Java (the robot's server, the Maperty server);
Google Web Toolkit (the robot, the Maperty interface);
Google Maps API (the Maperty interface);
Google Geocoding API (converting addresses to map coordinates in the robot);
Google Language API (helps the robot);
Google Safe Browsing API (the web proxy; work in progress);
Yahoo! Finance API (currency conversion in Maperty).

Source: https://habr.com/ru/post/103884/

