The Internet Archive Wayback Machine is the largest and best-known web archive; it has been storing web pages since 1995. Besides it, there are about a dozen other services that archive the web: search engine indexes and specialized archives such as Archive-It, UK Web Archive, WebCite, ArchiefWeb, Diigo, and others. It is interesting to ask how many web pages end up in these archives relative to the total number of documents on the Internet.
It is known that as of 2011 the Internet Archive database contains more than 2.7 billion URIs, many of them in multiple copies made at different points in time. For example, the Habr main page has already been "photographed" 518 times, starting from July 3, 2006.
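As an illustration of how such capture counts can be obtained, the short Python sketch below queries the Wayback Machine's public CDX API and counts the snapshot rows it returns. This is only a minimal sketch of a present-day query, not anything used in the article.

```python
# Illustrative sketch: count Wayback Machine captures of a page
# via the public CDX API. Not part of the original study.
import json
import urllib.parse
import urllib.request

def count_captures(url: str) -> int:
    """Return the number of snapshots the Wayback Machine holds for `url`."""
    api = ("http://web.archive.org/cdx/search/cdx?output=json&url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api) as resp:
        body = resp.read().decode("utf-8").strip()
    rows = json.loads(body) if body else []
    # The first row is a column header; every other row is one capture.
    return max(len(rows) - 1, 0)

if __name__ == "__main__":
    print(count_captures("habrahabr.ru"))
```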
It is also known that Google's index of links crossed the mark of a trillion unique URLs five years ago, although many documents there are duplicates. Google is unable to analyze all URLs, so the company decided to consider the number of documents on the Internet to be infinite.
As an example of this "infinity of web pages," Google cites a web calendar application: it makes no sense to download and index all of its pages for millions of years ahead, because each page is generated on request.
Nevertheless, scientists find it interesting to determine, at least approximately, what part of the web is archived and preserved for posterity. Until now, no one could answer this question. Researchers from Old Dominion University in Norfolk conducted a study and obtained a rough estimate.
For data processing, they used the HTTP-based Memento framework, which uses the following concepts:
- URI-R to identify the address of the original resource.
- URI-M to identify the archived state of this resource at time t.
Accordingly, each URI-R may have zero or more URI-M states.
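To make the URI-R/URI-M terminology concrete, here is a minimal Python sketch that fetches a Memento TimeMap and lists the URI-M copies of a given URI-R. The public Memento aggregator endpoint and the simplified link-format parsing are assumptions for illustration; the researchers' actual tooling is not described in the article.

```python
# List the archived states (URI-Ms) of an original resource (URI-R)
# from a Memento TimeMap. Illustrative only; the aggregator endpoint
# and the simplified parsing below are assumptions.
import re
import urllib.request

TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/"  # assumed endpoint

def list_mementos(uri_r: str):
    """Return (URI-M, datetime) pairs for the archived copies of uri_r."""
    with urllib.request.urlopen(TIMEMAP + uri_r) as resp:
        timemap = resp.read().decode("utf-8")
    # Each memento entry looks roughly like:
    #   <URI-M>; rel="memento"; datetime="Mon, 03 Jul 2006 21:20:41 GMT",
    pattern = re.compile(
        r'<([^>]+)>;\s*rel="[^"]*\bmemento\b[^"]*";\s*datetime="([^"]+)"')
    return pattern.findall(timemap)

if __name__ == "__main__":
    copies = list_mementos("http://example.com/")
    print(f"{len(copies)} URI-M states found for this URI-R")
```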
From November 2010 to January 2011, an experiment was carried out to determine the proportion of publicly accessible pages that end up in the archives. Since the number of URIs on the Internet is infinite (see above), an acceptable sample had to be found that would be representative of the entire web. The scientists used a combination of several approaches:
- Sampling from the Open Directory Project (DMOZ).
- Random URI sampling from search engines, as described in Ziv Bar-Yossef and Maxim Gurevich, "Random sampling from a search engine's index" (Journal of the ACM, 55(5), 2008).
- The most recently added URIs from the Delicious social bookmarking site, obtained with the Delicious Recent Random URI Generator.
- The Bitly link-shortening service, with links selected using a hash generator (sketched below).
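The Bitly sampling idea can be sketched roughly as follows: generate random short-link codes and keep the ones that actually resolve, i.e. redirect to a long URL. This is an illustrative guess at the procedure, not the authors' generator; the code alphabet, code length, and reliance on bit.ly redirects are all assumptions.

```python
# Sample URIs via the Bitly URL shortener by probing random short codes.
# Illustrative reconstruction of the "hash generator" idea, not the
# authors' actual tool; alphabet, code length, endpoint are assumptions.
import random
import string
import urllib.error
import urllib.request

ALPHABET = string.ascii_letters + string.digits  # assumed code alphabet
CODE_LEN = 6                                     # assumed code length

def random_bitly_uri():
    """Probe one random bit.ly code; return the target URL if it resolves."""
    code = "".join(random.choice(ALPHABET) for _ in range(CODE_LEN))
    req = urllib.request.Request("http://bit.ly/" + code, method="HEAD")
    try:
        # urlopen follows the redirect; the final URL is the original resource
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.geturl()
    except (urllib.error.HTTPError, urllib.error.URLError):
        return None  # code does not exist or is unavailable

def sample_bitly(n: int):
    """Keep probing random codes until n resolvable URIs are collected."""
    sample = []
    while len(sample) < n:
        uri = random_bitly_uri()
        if uri:
            sample.append(uri)
    return sample
```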
For practical reasons, the size of each sample was limited to a thousand addresses. The results of the analysis are shown in the summary table for each of the four samples.

The study showed that, depending on the sample, from 35% to 90% of all URIs on the Internet have at least one copy in an archive. From 17% to 49% of URIs have 2 to 5 copies, from 1% to 8% of URIs are "photographed" 6-10 times, and from 8% to 63% of URIs have 10 or more copies.
With relative certainty, we can say that at least 31.3% of all URIs are archived once a month or more often, and that at least 35% of all pages have at least one copy in the archives.
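For a rough idea of how such summary figures are computed, the sketch below takes per-URI memento counts (for example, TimeMap lengths from the snippet above) and reports the share of URIs in disjoint buckets similar to those in the table. The data at the end is made up for illustration.

```python
# Summarize per-URI memento counts into coverage buckets.
# Illustrative only; the counts in the example are invented.
from collections import Counter

def archive_coverage(memento_counts):
    """memento_counts: dict mapping URI-R -> number of URI-M copies found.
    Returns the percentage of URIs falling into each bucket."""
    def bucket(n):
        if n == 0:
            return "not archived"
        if n == 1:
            return "1 copy"
        if n <= 5:
            return "2-5 copies"
        if n <= 10:
            return "6-10 copies"
        return "more than 10 copies"

    total = len(memento_counts)
    counts = Counter(bucket(n) for n in memento_counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()}

# Toy example (made-up counts, not the study's data):
print(archive_coverage({"a": 0, "b": 1, "c": 3, "d": 7, "e": 518}))
```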
Naturally, the above figures do not cover the so-called Deep Web, which commonly includes dynamically generated database pages, password-protected directories, social networks, paid archives of newspapers and magazines, Flash websites, digital books, and other resources hidden behind firewalls or closed access and/or inaccessible to search engine indexing. According to some estimates, the Deep Web may be several orders of magnitude larger than the surface web.