Developers from the Crawling and Indexing Team
reported on an important experiment that started recently. They upgraded the crawler and began testing a technology for intelligent processing of HTML forms. With the upgrade, the crawler is supposed to learn to discover hidden URLs and pages that are generated in response to form submissions on various sites and that cannot be reached any other way.
In practice, the technology works like this: when the crawler encounters a form, the form handler makes a number of test requests. For text fields, words from the site on which the form is located are automatically chosen as query values. Values for checkboxes and drop-down menus are taken directly from the page code. The program then tries to fetch the resulting URL, and if the page actually contains content, it is sent for indexing into the general search index.
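To make that workflow more concrete, here is a minimal, hypothetical Python sketch of form probing along these lines. It is not Google's actual crawler code; the names FormExtractor and candidate_urls are illustrative assumptions, and the logic is deliberately simplified to GET forms only.

```python
# A rough sketch (not Google's implementation) of probing a GET-based HTML form:
# text inputs are filled with words taken from the page itself, while values for
# checkboxes and <select> options come straight from the markup.
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormExtractor(HTMLParser):
    """Collect the form's action URL and candidate values for each field."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}            # field name -> list of candidate values
        self._current_select = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action", "")
        elif tag == "input":
            name = attrs.get("name")
            if not name:
                return
            if attrs.get("type") == "checkbox":
                # Checkbox values are taken directly from the page code.
                self.fields.setdefault(name, []).append(attrs.get("value", "on"))
            else:
                # Text fields are left empty here and filled later with
                # words found on the page.
                self.fields.setdefault(name, [])
        elif tag == "select":
            self._current_select = attrs.get("name")
        elif tag == "option" and self._current_select:
            # Drop-down option values are also taken directly from the markup.
            self.fields.setdefault(self._current_select, []).append(
                attrs.get("value", ""))

    def handle_endtag(self, tag):
        if tag == "select":
            self._current_select = None


def candidate_urls(page_url, page_html, page_words, limit=10):
    """Yield a handful of probe URLs built from the form on the page."""
    parser = FormExtractor()
    parser.feed(page_html)
    if parser.action is None:
        return
    base = urljoin(page_url, parser.action)
    for word in page_words[:limit]:
        params = {}
        for name, values in parser.fields.items():
            # Empty text fields get a word from the page; fields with
            # markup-supplied values keep the first such value.
            params[name] = values[0] if values else word
        yield base + "?" + urlencode(params)
```

In this sketch, each generated URL would then be fetched, and only responses that contain real content would be handed off for indexing, which mirrors the behavior described above.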
Despite its apparent simplicity, processing HTML forms is an important step toward bringing to light the so-called "Invisible Web" (Deep Web): huge amounts of information hidden in large databases that are exposed to the world only through HTML-form interfaces. These include legal databases, various directories (phone numbers, addresses, prices), and other data collections. By some
estimates, the Invisible Web contains hundreds of billions of pages and accounts for 90% of all content on the Internet. Notably, that is where the most valuable content hides, content still unavailable through standard search engines.
Still, a huge chunk of the Invisible Web will remain beyond Google's reach, because the crawler is forbidden to enter passwords or other personal information into form fields: that is the decision of the developers and of Google's guidelines. Many sites, however, open access to their information only after free registration. And from a legal point of view, Googlebot has no right to create a fictitious person just to register, since that would be fraud and contrary to the principles of the
always friendly Googlebot.
Incidentally, knowledgeable people have already
explained where the new crawling technology comes from. Most likely, it was created by the team of developers at Transformic, a small company that Google
acquired in 2005. For the past two and a half years they have been hard at work refining their technology and helping to integrate it into the Google crawler.