
On October 25, staff and supporters of the Internet Archive held a
ceremony to mark a milestone: the archive has passed 10 petabytes (10^16 bytes)
of stored data. Thanks to the archive and its
Wayback Machine, anyone can see how well-known sites looked years ago, find saved copies of web pages, or simply restore their own site from a “free backup”.
The Internet Archive also announced that it is making
an 80-terabyte crawl sample from 2011 available to anyone for research.
The WARC files contain about 2.7 billion URIs. They include all the text content and everything else that was saved, including images, video, Flash, and so on.
Sample:
Start date: March 9, 2011
End date: December 23, 2011
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
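
For readers who want to look inside such a sample, here is a minimal sketch of how records in an uncompressed WARC file can be read with the Python standard library alone. It assumes the usual record layout: a `WARC/` version line, named header fields, a blank line, then `Content-Length` bytes of payload. The function name `read_warc_records` and the sample record are illustrative assumptions, not part of the Internet Archive's tooling.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; real tooling does more validation.
    """
    while True:
        version = stream.readline()
        # Skip the blank lines that separate one record from the next.
        while version in (b"\r\n", b"\n"):
            version = stream.readline()
        if not version:
            return  # end of stream
        if not version.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")

        # Header fields run until the first blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# A made-up record, repeated twice, standing in for a real file.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
records = list(read_warc_records(io.BytesIO(sample * 2)))
for headers, payload in records:
    print(headers["WARC-Target-URI"], len(payload))
```

Published WARC files are normally gzip-compressed, so in practice the stream would likely come from `gzip.open(path, "rb")` rather than a raw file, and a dedicated WARC library is the safer choice for real work.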
The Heritrix crawler first downloaded
the 1 million most popular sites according to Alexa (Habr was already among them), and then followed the links found on those pages.
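
The seed-then-follow strategy described above can be sketched as a breadth-first traversal. This is only a toy illustration under stated assumptions: the function `crawl`, the `fetch_links` callback, and the in-memory `TOY_WEB` graph are hypothetical, and Heritrix's real frontier (politeness delays, scoping rules, deduplication) is far more elaborate.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages):
    """Visit the seed URLs first, then the pages they link to, breadth-first."""
    queue = deque(seeds)
    seen = set(seeds)      # URLs already queued, so nothing is fetched twice
    visited = []           # URLs in the order they were "downloaded"
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for the real web.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

order = crawl(["a.example", "b.example"], lambda u: TOY_WEB.get(u, []), max_pages=10)
print(order)
```

Here the two seeds play the role of the Alexa list: they are fetched first, and `c.example` and `d.example` are reached only because seed pages link to them.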

Another interesting fact was announced at the ceremony: for the first time, the entire literary heritage of a whole people has been fully digitized and put online. That people is the
Balinese.
The Internet Archive held the celebration in the presence of the legendary computer scientist Donald Knuth, who opened the ceremony by playing the organ.
