There are probably few areas of computer technology that attract everyone's attention and are, at the same time, surrounded by as many myths and misunderstandings as long-term archival data storage. As someone who has had to deal professionally with reviving data many years old and with organizing long-term archives, I will venture to speak on the subject too.
A brief summary for those who are too lazy to read the whole article: there is no silver bullet.
The text below is for those interested in a more detailed discussion of the issue.
So, the transition to paperless information technology, the need for which the Bolsheviks spoke of for so long, has come to pass. The amount of data on digital media doubles every two years. Few young people today bother to print interesting texts or images (I myself, being middle-aged, also neglect paper, have almost forgotten how to write by hand, and would rather download a book from an online library to my smartphone than fetch the paper copy from the bookcase in the next room). But, unfortunately, digital convenience has a downside: the problem of long-term storage.
By long-term storage I mean a planning horizon of 25 to 100 years, that is, a period that lets a person save some private information in youth, return to it later in life, and even pass it on to descendants (hence the great-grandmother's selfie in the headline). For business, such long-term storage matters in a narrower way, since very few business processes work with data over comparable periods (although organizations with such processes certainly exist and are usually well aware of their specifics).
To a first approximation, the problem can be considered at three levels, listed here in order of decreasing attention from the general public.
1. Physical integrity of media and the unit cost of storage.
This is the most widely known level, and many publications go no further. Let us not rehash well-known things; in brief, three categories of archival media are in everyday use today:
- Optical discs (CD, DVD, BD, etc.) and flash drives. It is believed that data on such media can degrade within a few years and, in any case, is unlikely to be readable after 25 years.
- Magnetic media (hard drives and tapes). Here lies a long-running flame war between disk and tape supporters. In short, the disk camp reproaches the tape camp for exoticism, slow random access, and the high cost of read/write devices, while the tape camp reproaches disk arrays for fragile media, high energy consumption, and a high unit cost of storage for large volumes of data. Without going into the validity of particular arguments and counterarguments in the disk-versus-tape war, we note that archival magnetic media today often have a stated retention time of at least 30 years, although, of course, this figure was obtained by extrapolating the results of accelerated tests, not from 30 years of field observation.
- Network archives. The idea here is to hand the storage of your data over to specially trained people in specially authorized firms, and to treat such network storage as a black box with an Internet service as its interface. The advantage of this solution is that firms providing such services professionally are undoubtedly able to take much better care of data safety than an ordinary user (and to do so potentially indefinitely), while keeping storage costs low through economies of scale. The downside is risks that are outside the user's control. The main risk for long-term storage in a network archive is the sudden liquidation of the service company's business, against which, unfortunately, no one is insured. An additional risk is that various states and Internet service providers may in the future impose border, content, format, or other restrictions on the transmission of information over the Internet, which could make access to a remote archive impossible.
So, reasoning with moderate pessimism, we can conclude that the physical integrity of data can currently be ensured, with controlled risk, for about 30 years ahead.
2. Technical compatibility of media.
This issue is considered much less often. Let us take the estimate of physical longevity obtained above and run a thought experiment: on which medium could, if not my great-grandmother, then at least my mother have recorded her digital data 30 years ago?
So, 30 years ago was 1986. Depending on technical preferences, a user of that time might have considered the most trustworthy storage medium to be 9-track mainframe magnetic tape; the 5.25- or 8-inch floppy disks widely used on personal computers; or the then-newest 800-kilobyte 3.5-inch Sony floppy disk for the Macintosh (incompatible with the later 1.44-megabyte 3.5-inch drives). Even assuming the media are in perfect physical condition, reading any of them today is of course possible, but it would cost a considerable amount of time and money, which hardly anyone would spend for the sake of my mom's selfie. In another 30 years, the technology for reading these media will most likely be lost completely.
Maybe things were so bad 30 years ago only because computing was in its infancy, and today we are free of this problem? Let us look at modern media.
LTO-standard magnetic tapes are now clearly positioned as the medium for long-term archival storage. The LTO world is arranged so that every 2-3 years a new generation of the standard appears, with roughly double the capacity, and equipment is produced for that generation (the current one is LTO-7). However, the LTO standard prescribes (and manufacturers' common practice ensures) that a tape drive can read media only up to two generations back, and write media only one generation back. This means that a modern LTO-7 drive can read only LTO-7, LTO-6, or LTO-5 cartridges, and an LTO-7 tape recorded today will be incompatible with LTO-10 drives, which can be expected around 2022. In 10 years (in 2026), today's cartridge will not be readable by any device on the market. In this light, guarantees of 30-year durability for the tape itself look somewhat romantic.
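The compatibility rule described above (read two generations back, write one generation back) can be written down as a small Python sketch. The function names are hypothetical, and the rule is taken exactly as stated in this article; real drives of other generations may deviate from it, so check the vendor documentation before relying on it.

```python
def readable_generations(drive_gen: int) -> range:
    """Tape generations a drive can read: its own and up to two before it."""
    return range(max(1, drive_gen - 2), drive_gen + 1)


def writable_generations(drive_gen: int) -> range:
    """Tape generations a drive can write: its own and one before it."""
    return range(max(1, drive_gen - 1), drive_gen + 1)


def can_read(drive_gen: int, tape_gen: int) -> bool:
    """True if a drive of generation drive_gen can read a tape_gen cartridge."""
    return tape_gen in readable_generations(drive_gen)


def last_drive_able_to_read(tape_gen: int) -> int:
    """Newest drive generation that still reads the tape (assumed rule: +2)."""
    return tape_gen + 2


if __name__ == "__main__":
    print(list(readable_generations(7)))   # generations an LTO-7 drive reads
    print(can_read(drive_gen=10, tape_gen=7))
    print(last_drive_able_to_read(7))
```

Under this assumed rule, the migration deadline for a cartridge is set by the day the last compatible drive generation leaves the market, not by the stated lifetime of the tape.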
Suppose we take the side of the disk camp and record the information on a modern SATA or SAS hard drive. These interface standards have already been around for more than 10 years, and it is extremely unlikely that they will last even another 10. The same applies to USB in its current form. The lack of factual grounding makes any argument about the distant future of physical interfaces extremely speculative, but one can assume, for example, that in 10-20 years disk interfaces may well become optical, in which case they will be incompatible with modern devices at the level of the transmission medium itself.
Based on the foregoing, it is highly unlikely that modern magnetic media will be recognized by any standard computer device in 30 years.
Storing data in a network archive lets you pass these problems on to specially trained people, but it leaves the risks indicated in the previous section. It is worth recalling that most of the leaders of the computer market of 30 years ago no longer exist, with a few exceptions such as IBM, Apple, and Microsoft, which, however, have changed their lines of business very significantly since then.
3. Compatibility of data formats.
This issue is written about very rarely.
Since 30 years ago there was, in fact, no such thing as a digital selfie, let us imagine that we have obtained a simple electronic text document from 1986, and that we have managed to solve all the technical problems and copy it into a file on a modern computer.
Given the great variety of the computer world in 1986, there are many possible cases, so let us consider only a few:
- from a 1986 mainframe user, we may get a disk image of a virtual deck of punched cards, with fixed 80-character records in the EBCDIC (DKOI) encoding;
- from a Macintosh user, we will get a ClarisWorks document;
- from a PC user, we will get, for example, a document from a DOS text editor such as ChiWriter or WordPerfect, although with luck it may turn out to be a plain text file;
- and only with a Unix user are we almost certain to be lucky and get an ordinary readable text file (in the Russian KOI8-R encoding, or something even worse).
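For the two luckier cases in the list above, the decoding step itself is short. The following is a minimal Python sketch, not a recovery tool: the function names are hypothetical, and `cp037` (a common Latin EBCDIC code page shipped with Python) is only a stand-in, because the Cyrillic DKOI variant has no codec in the Python standard library and would need a custom mapping table.

```python
RECORD_LEN = 80  # one punched card = one fixed-length 80-byte record


def decode_card_deck(raw: bytes, encoding: str = "cp037") -> list[str]:
    """Split a card-image file into 80-byte records and decode each one.

    Card images have no line terminators, so records are cut by length.
    The default encoding is an assumption; a DKOI deck needs its own table.
    """
    if len(raw) % RECORD_LEN != 0:
        raise ValueError("file length is not a multiple of 80 bytes")
    cards = []
    for offset in range(0, len(raw), RECORD_LEN):
        record = raw[offset:offset + RECORD_LEN]
        # Cards are padded with blanks on the right; strip the padding.
        cards.append(record.decode(encoding).rstrip())
    return cards


def decode_koi8r(raw: bytes) -> str:
    """Decode a Unix text file stored in the KOI8-R encoding."""
    return raw.decode("koi8_r")


if __name__ == "__main__":
    deck = "HELLO WORLD".ljust(RECORD_LEN).encode("cp037")
    print(decode_card_deck(deck))
    print(decode_koi8r("Привет".encode("koi8_r")))
```

The sketch works only because both encodings are documented and the record layout is known. For a proprietary word-processor format, there is no equivalent one-line call.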
This is the situation with the most trivial type of document, plain text. If, say, a drawing from 1986 came down to us, we could say with near certainty that we would be unable to interpret the file at all.
What is the basis of our implicit confidence that, having escaped Alzheimer's embrace for half an hour, we will be able to show blurry photos from our 2016 vacation to our bored grandchildren? With a certain optimism, one may suppose that JPEG, thanks to its enormous prevalence today, will somehow be convertible into whatever image formats are in use in that bright Alzheimer's future (although there is no historical precedent for a format living that long). But this certainly will not hold for raw camera formats, office document formats like doc/docx, e-book formats like fb2/epub, and so on, simply because there is no party with both the aim and the ability to guarantee indefinite compatibility for such a format.
4. What to do?
Keeping a digital archive current is a fairly complex and time-consuming activity, regardless of its purpose and the technical means used. It should include a complete review of the archive every few years, with the transfer of all its contents to new storage media and, where necessary, the conversion of every document in an obsolete format to a new, current one.
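One piece of such a periodic review can be automated: checking that every file survived the move to new media unchanged. Below is a minimal Python sketch of that check using SHA-256 checksums. The function names and the manifest layout are hypothetical, and format conversion, which requires per-format tools and human judgment, is deliberately left out.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_copy(manifest: dict[str, str], new_root: Path) -> list[str]:
    """Compare a copy on new media against the manifest of the old archive.

    Returns a list of problems; an empty list means every file arrived intact.
    """
    problems = []
    for rel, expected in manifest.items():
        target = new_root / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != expected:
            problems.append(f"corrupt: {rel}")
    return problems
```

A typical use would be to call `build_manifest` on the old medium before copying, keep the manifest alongside the archive, and run `verify_copy` against the new medium before retiring the old one.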
It can be assumed that, since few private users or legal entities will take the trouble to do such things, we are on the threshold of a new stage in the development of human society. It will show certain features of a return to the pre-literate state, in which most reliable data about the personal and public past is lost within the lifetime of a single generation, and the few digital archives that remain current will become fairly easy to falsify owing to their considerable centralization.
At this point the lyrical digression can end, and the (commonplace) practical conclusion is that maintaining any archive requires active work to keep its data current, not just passively dropping files onto the information heap. People who practice such deliberate archiving, including in private life, exist and are well known, and nothing prevents you from adopting their practices.
Just in case, it is better to print the selfie for your great-grandchildren on photo paper.