INTERNETARCHIVE.BAK: Internet Archive Service Data Archiving Project

Archive Team has decided to launch a new project: backing up the data currently stored on Internet Archive servers. The main idea voiced by the project's authors is to preserve important information that is now stored in only one place, the Internet Archive's data center. If something happens to that data center, this invaluable information will simply be lost.

It is worth noting that this recursive project could be of real practical importance, and it is not that difficult to implement. According to Archive Team estimates, the total volume of information stored on Internet Archive servers is relatively small: 21 petabytes. That is 42 thousand 500 GB hard drives, which are not very expensive now, and fewer still with the 1, 2, 6 and even 8 TB drives that are also available.
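The drive-count arithmetic can be checked with a short sketch. It assumes decimal units (1 PB = 10^15 bytes) and ignores any redundancy or filesystem overhead; the function name `drives_needed` is purely illustrative and not part of the project.

```python
# Decimal storage units, in bytes (an assumption; the article does not say
# whether it means decimal or binary petabytes).
GB = 10**9
TB = 10**12
PB = 10**15

ARCHIVE_SIZE = 21 * PB  # Archive Team's estimate quoted in the article


def drives_needed(total_bytes: int, drive_bytes: int) -> int:
    """Return the number of drives needed to hold total_bytes, rounded up."""
    if drive_bytes <= 0:
        raise ValueError("drive size must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // drive_bytes)


if __name__ == "__main__":
    for label, size in [("500 GB", 500 * GB), ("1 TB", TB), ("2 TB", 2 * TB),
                        ("6 TB", 6 * TB), ("8 TB", 8 * TB)]:
        print(f"{label}: {drives_needed(ARCHIVE_SIZE, size)} drives")
```

Under these assumptions, 21 PB corresponds to exactly 42,000 drives of 500 GB each, which matches the figure above; keeping more than one copy of the data multiplies the count accordingly.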

At the same time, Archive Team does not plan to buy all 42 thousand hard drives and build a new data center to hold this information. Instead, the authors propose a distributed system that stores the information in parts on the computers of users who agree to participate in the project. With a large number of participants, the information can (and should) be duplicated, reducing the likelihood that a single large-scale failure destroys unique information.

According to the plan, users who decide to join the project install the appropriate software and give it access to a portion of their storage (on a PC, laptop or external drive). That space will be used by a “spider” that saves information from the Internet Archive. There is one condition: the allocated space must not be encrypted and must be open to the bot system.
Once every three months, the client program will need to be launched to verify the stored data: information on the Internet Archive is updated and added constantly, so the backup cannot be static. If there are changes, the program adds or modifies files on the user's hard disk. If the client is not launched at the specified intervals, then after a certain time the distributed system marks that piece of data as obsolete, and it is lost to the system.
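The expiry rule described above might look roughly like the sketch below. The article gives only "once every three months" and "after a certain time", so the 90-day interval is an approximation and the 30-day grace period is a purely hypothetical value; `is_obsolete` is an illustrative name, not part of any real client.

```python
from datetime import datetime, timedelta

# "Once every three months", approximated as 90 days (assumption).
CHECK_INTERVAL = timedelta(days=90)
# The article says only "after a certain time"; 30 days is a made-up placeholder.
GRACE_PERIOD = timedelta(days=30)


def is_obsolete(last_verified: datetime, now: datetime) -> bool:
    """Return True if a stored piece has gone unverified for too long."""
    return now - last_verified > CHECK_INTERVAL + GRACE_PERIOD


if __name__ == "__main__":
    now = datetime(2015, 3, 1)
    print(is_obsolete(datetime(2015, 1, 1), now))  # verified recently
    print(is_obsolete(datetime(2014, 9, 1), now))  # long overdue
```

A real system would also have to decide who keeps the timestamps and how a returning client proves it still holds the data; the article does not go into that.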

The more users connect to the system, the lower the probability of losing such a data segment.
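A rough way to see this effect: if each copy of a segment is lost independently with the same probability, the segment disappears only when every copy is lost. The sketch below assumes exactly that simplified model; independence, the per-copy probability, and the name `loss_probability` are illustrative assumptions, not figures from the project.

```python
def loss_probability(p_single: float, copies: int) -> float:
    """Probability that all copies of a segment are lost.

    Assumes each copy fails independently with probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one copy is required")
    return p_single ** copies


if __name__ == "__main__":
    # Hypothetical per-copy loss probability of 50%.
    for n in (1, 2, 3, 5, 10):
        print(n, loss_probability(0.5, n))
```

In practice failures are not fully independent (volunteers may lose interest at the same time, for example), so the simple power law is an optimistic estimate.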

The structure of the system is still being worked out, and the project's authors are open to discussion. The likely implementation approaches fall into several options:

INTERNETARCHIVE.BAK / git-annex_implementation
INTERNETARCHIVE.BAK / torrents_implementation
INTERNETARCHIVE.BAK / ipfs_implementation

The Geektimes editorial team now plans to contact the project's initiators to ask for more information about the system. If you have questions, ask them in the comments, and in a couple of days we will forward them all to the authors of INTERNETARCHIVE.BAK.

Source: https://habr.com/ru/post/377169/
