Internet Archive will crawl websites regardless of robots.txt settings

A website is, at its core, a set of files and folders residing on a server. Among these files there is almost always one called robots.txt, placed in the site root. It serves to instruct the "spiders": it tells search robots which parts of the site may be scanned and which may not. Webmasters often use such instructions to keep duplicate content (tag pages, category pages, etc.) out of search indexes for SEO reasons, and also to hide data that, for one reason or another, should not be discoverable online.
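For illustration, a minimal robots.txt of this kind might look like the sketch below (the /tag/ and /category/ paths are hypothetical, not taken from any particular site):

    # Applies to every crawler
    User-agent: *
    # Keep duplicate listing pages out of search indexes
    Disallow: /tag/
    Disallow: /category/

Each User-agent block names the crawlers it applies to, and the Disallow lines list path prefixes those crawlers are asked not to fetch.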

The idea of robots.txt appeared more than 20 years ago, and although the settings recognized by different search bots have evolved since then, the mechanism works the same way it did many years ago. The instructions stored in this file are respected by almost all search engines, as well as by the Internet Archive bot, which wanders the Internet looking for content to archive. Now the developers of the service believe the time has come to stop paying attention to what is posted in robots.txt.

The problem is that the domains of abandoned sites often "drop", that is, their registration is not renewed, or the content of the resource is simply destroyed. Such domains are then "parked" (for a variety of purposes, including collecting money from advertisements placed on the parked page). In the robots.txt file, the webmasters of a parked domain usually close off its entire contents. Worst of all, when the Internet Archive robot sees instructions closing a directory from indexing, it deletes the already-saved content of the site that previously lived on that domain.
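Parked domains typically carry the most restrictive form of the file, a blanket block such as this (a sketch of the common pattern, not any specific registrar's file):

    # Ask all crawlers to stay away from the entire site
    User-agent: *
    Disallow: /

It is exactly this kind of instruction that, until now, caused the Internet Archive robot to remove the previously archived copies of whatever site used to live on the domain.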
In other words, a site that was in the Internet Archive database vanishes from it, even though the domain now belongs to a different owner and the content the service once saved has long since sunk into oblivion elsewhere. As a result, unique data is deleted that could well be of great value to a certain category of people.

Internet Archive takes snapshots of sites. If a site exists for a long enough time, there can be many such snapshots, so the history of a site's development can be traced from the very beginning to the newest version; habrahabr.ru is one example. When access for bots is blocked via robots.txt, tracing that history, or getting any information at all, becomes impossible.
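Those snapshots can be listed programmatically through the Wayback Machine's public CDX API. The following is a minimal sketch using only the Python standard library; the query parameters follow the publicly documented CDX interface, and habrahabr.ru is used purely as an example:

    import json
    import urllib.request

    CDX_API = "http://web.archive.org/cdx/search/cdx"

    def list_snapshots(site, limit=5):
        """Return (timestamp, original_url) pairs for archived captures of `site`."""
        query = f"{CDX_API}?url={site}&output=json&limit={limit}&fl=timestamp,original"
        with urllib.request.urlopen(query) as resp:
            rows = json.load(resp)
        # With output=json the first row is a header; the rest are captures.
        return [tuple(row) for row in rows[1:]]

    for ts, original in list_snapshots("habrahabr.ru"):
        print(ts, original)

Each timestamp (YYYYMMDDhhmmss) identifies one capture, so a long-lived site yields a long list, effectively a timeline of its evolution.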

A few months ago, Internet Archive employees stopped following the instructions in this file on US government websites. That experiment was judged successful, and now the Internet Archive bot will stop paying attention to robots.txt instructions on any site. A webmaster who wants the contents of his resource removed from the archive can contact the Internet Archive administration by e-mail.

For now, the developers will monitor the robot's behavior and the operation of the service itself in light of the upcoming changes. If all goes well, the changes will become permanent.

Source: https://habr.com/ru/post/403391/

