Yet another user wants to write a new batch of data to his hard disk, but there is not enough free space for it. He also refuses to delete anything, because "everything is very important and needed." So what are we to do with him?
He is far from alone with this problem. Terabytes of information sit idle on our hard drives, and that amount shows no tendency to shrink. But how unique is it? In the end, all files are just sets of bits of a certain length, and most likely a new one does not differ much from something that is already stored.
Of course, searching the hard disk for already-stored pieces of information is, if not a hopeless task, then at least an inefficient one. On the other hand, if the difference is small, you could adjust things a little...
TL;DR: a second attempt to talk about a strange method of optimizing data storage using JPEG files, this time in a more understandable form.
If you take two completely random pieces of data, on average half of their bits will match. Indeed, among the possible pairings of each bit ('00', '01', '10', '11'), exactly half have equal values; it is that simple.
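To make this concrete, here is a tiny self-contained C sketch (my own illustration, not part of the article's code) that compares two random buffers bit by bit; on any run it prints a share close to 50%.

/* Compare two random buffers bit by bit: roughly half of the bits match. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    enum { N = 1 << 20 };                     /* 1 MiB of random data per buffer */
    unsigned char *a = malloc(N), *b = malloc(N);
    if (!a || !b) return 1;
    srand((unsigned)time(NULL));
    for (size_t i = 0; i < N; i++) { a[i] = rand() & 0xFF; b[i] = rand() & 0xFF; }

    size_t same = 0;
    for (size_t i = 0; i < N; i++) {
        unsigned char diff = a[i] ^ b[i];     /* set bits mark the mismatches */
        for (int bit = 0; bit < 8; bit++)
            same += !((diff >> bit) & 1);
    }
    printf("matching bits: %.2f%%\n", 100.0 * same / (N * 8.0)); /* ~50% */
    free(a); free(b);
    return 0;
}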
But of course, if we simply take two files and fit one to the other, we will lose one of them. If we store the changes instead, we are merely reinventing delta encoding, which already exists perfectly well without us, even though it is not usually applied for these purposes. We could try to embed the smaller sequence into the larger one, but even so we risk corrupting critical segments of data if we apply this thoughtlessly to everything.
So between what and what can the difference be eliminated? A new file written by the user is just a sequence of bits, and we cannot do anything with it by itself. We simply need to find bits on the hard disk that can be changed without having to store the difference, bits whose loss can be survived without serious consequences. And it makes sense to change not a file on the file system as such, but some less sensitive information inside it. But what, and how?
Lossy-compressed files come to the rescue. All those JPEGs, MP3s and the like, lossy as they already are, contain a bunch of bits that can be safely changed. There are advanced techniques that imperceptibly modify their components at different stages of encoding. Wait. Advanced techniques... imperceptible modification... swapping some bits for others... why, this is practically steganography!
Indeed, embedding one piece of information into another resembles its methods like nothing else, and so does the imperceptibility of the changes to the human senses. Where the paths diverge is secrecy: our task does not involve hiding the presence of additional information on the user's own hard drive; that would only hurt him. So we can forget about secrecy.
Therefore, although we can use these methods, they need some modification. I will describe and demonstrate the changes using one of the existing methods and a common file format as an example.
If we are going to work with compressed data, then with the most compressed format in the world. This is, of course, about JPEG. Not only does it have a ton of tools and existing methods for embedding data, it is also the most popular graphics format on the planet.
However, to keep the scope manageable, we have to narrow our field of activity within files of this format. Nobody likes the monochrome squares that appear from excessive compression, so we must limit ourselves to working with an already compressed file and avoid transcoding. More specifically, with the integer coefficients that remain after the operations responsible for data loss, DCT and quantization, as nicely shown in the encoding scheme (courtesy of the wiki of the Bauman National Library):
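As an illustration of what "working with the coefficients" means in practice, here is a minimal sketch (my own, not the author's code) of how the quantized DCT coefficients of the luminance component can be reached through libjpeg without ever decoding the image to pixels; the calls and structure fields are standard libjpeg, only the program itself is hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <jpeglib.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s file.jpg\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    struct jpeg_decompress_struct cinfo;
    struct jpeg_error_mgr jerr;
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, f);
    jpeg_read_header(&cinfo, TRUE);

    /* Read the quantized DCT coefficients directly, skipping dequantization and IDCT. */
    jvirt_barray_ptr *coefs = jpeg_read_coefficients(&cinfo);
    jpeg_component_info *luma = &cinfo.comp_info[0];      /* component 0 = luminance (Y) */

    long nonzero_ac = 0;
    for (JDIMENSION row = 0; row < luma->height_in_blocks; row++) {
        JBLOCKARRAY buf = (*cinfo.mem->access_virt_barray)
            ((j_common_ptr)&cinfo, coefs[0], row, 1, FALSE);
        for (JDIMENSION col = 0; col < luma->width_in_blocks; col++) {
            JCOEFPTR block = buf[0][col];                  /* 64 coefficients; block[0] is DC */
            for (int i = 1; i < DCTSIZE2; i++)
                if (block[i] != 0) nonzero_ac++;           /* candidates for embedding */
        }
    }
    printf("non-zero AC luminance coefficients: %ld\n", nonzero_ac);

    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(f);
    return 0;
}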
There are many possible methods for optimizing JPEG files. There is lossless optimization (jpegtran), and there is "lossless" optimization that actually does introduce losses, but neither of them concerns us. After all, if a user is ready to embed one piece of information into another in order to free up disk space, then he either optimized his images long ago, or does not want to do it at all for fear of losing quality.
A whole family of algorithms fits this condition; they can be found in this good presentation. The most advanced of them is the F5 algorithm by Andreas Westfeld, which works with the coefficients of the luminance component, since the human eye is least sensitive to changes in it. Moreover, it uses an embedding technique based on matrix coding, which makes it possible to perform fewer changes when embedding the same amount of information, the larger the container used.
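To give an idea of what matrix coding buys, here is a simplified sketch of the (1, 2^k - 1, k) scheme at its heart (my own illustration, not the author's implementation): k message bits go into n = 2^k - 1 cover bits, which in F5 are derived from non-zero quantized AC coefficients, at the cost of changing at most one of them.

/* Matrix encoding sketch: embed k bits into n = 2^k - 1 cover bits,
   changing at most one of them. */
#include <stdio.h>

/* cover: array of n = (1<<k)-1 bits (0/1); message: a k-bit value */
unsigned matrix_embed(unsigned char *cover, unsigned k, unsigned message) {
    unsigned n = (1u << k) - 1, syndrome = 0;
    for (unsigned i = 1; i <= n; i++)        /* 1-based positions */
        if (cover[i - 1]) syndrome ^= i;     /* XOR of positions holding a 1 */
    unsigned pos = syndrome ^ message;       /* which bit to flip (0 = none) */
    if (pos) cover[pos - 1] ^= 1;
    return pos;                              /* flipped position, 0 if no change needed */
}

/* extraction is just the syndrome of the (possibly modified) cover bits */
unsigned matrix_extract(const unsigned char *cover, unsigned k) {
    unsigned n = (1u << k) - 1, syndrome = 0;
    for (unsigned i = 1; i <= n; i++)
        if (cover[i - 1]) syndrome ^= i;
    return syndrome;
}

int main(void) {
    unsigned char bits[7] = {1, 0, 1, 1, 0, 0, 1};       /* n = 7, so k = 3 */
    unsigned msg = 5;                                     /* 3 bits to embed */
    matrix_embed(bits, 3, msg);
    printf("extracted: %u\n", matrix_extract(bits, 3));  /* prints 5 */
    return 0;
}

In the real algorithm a change that turns a coefficient into zero ("shrinkage") forces the same bits to be re-embedded, but the arithmetic of a single embedding step is exactly this syndrome trick.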
The changes themselves come down to decreasing the absolute value of a coefficient by one under certain conditions (that is, not always), and this is what allows F5 to be used to optimize data storage on a hard disk. The point is that after such a change, a coefficient is likely to occupy fewer bits after Huffman coding, thanks to the statistical distribution of values in JPEG, and the new zeros will give a gain when they are encoded with RLE.
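A rough way to see where the savings come from (an illustrative sketch, not the article's code): JPEG stores each non-zero AC coefficient as a Huffman-coded (zero-run, size-category) symbol followed by "size" extra magnitude bits, and the category depends only on the absolute value, so decrementing it tends to keep or shrink that cost, while a coefficient that becomes zero disappears from the stream and merely extends the preceding run of zeros. The helper name below is mine.

#include <stdio.h>
#include <stdlib.h>

static unsigned jpeg_category(int value) {  /* number of magnitude bits for a value */
    unsigned v = (unsigned)abs(value), bits = 0;
    while (v) { bits++; v >>= 1; }
    return bits;                            /* 0 -> 0, 1 -> 1, 2..3 -> 2, 4..7 -> 3, ... */
}

int main(void) {
    int samples[] = {4, 2, 1};
    for (size_t i = 0; i < sizeof samples / sizeof *samples; i++) {
        int v = samples[i];
        printf("|%d| needs %u magnitude bits, |%d| needs %u\n",
               v, jpeg_category(v), v - 1, jpeg_category(v - 1));
    }
    /* A coefficient dropping from 1 to 0 is not coded at all: it just
       lengthens the zero run handled by RLE. */
    return 0;
}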
The necessary modifications come down to removing the part responsible for secrecy (the password-based permutation), which saves resources and execution time, and adding a mechanism for working with many files instead of one at a time. The process of changing the algorithm in more detail is unlikely to interest the reader, so let us move on to the description of the implementation.
To demonstrate how this approach works, I implemented the method in pure C and carried out a number of optimizations both in execution speed and in memory (you cannot imagine how much these pictures weigh uncompressed, even before the DCT). Cross-platform support is achieved with the combination of the libjpeg, pcre and tinydir libraries, for which I thank their authors. All of this is built with 'make', so Windows users who want to try it out will have to install some Cygwin for themselves, or deal with Visual Studio and the libraries on their own.
The implementation is available in the form of a console utility and a library. More about using the latter can be found in the readme of the GitHub repository, a link to which I will attach at the end of the post.
Careful. The images used for packing are selected by a regular-expression search in a given root directory. Once packing is done, the files can be moved, renamed and copied at will within it, the file system and operating system can be changed, and so on. However, you should be extremely careful not to change the actual image content in any way. Losing the value of even one bit can make it impossible to recover the information.
Upon completion, the utility leaves behind a special archive file containing all the information needed for unpacking, including data about the images used. By itself it weighs on the order of a couple of kilobytes and has no significant effect on the occupied disk space.
You can analyse the possible capacity with the '-a' flag: './f5ar -a [search folder] [Perl-compatible regular expression]'. Packing is done with './f5ar -p [search folder] [Perl-compatible regular expression] [file to pack] [archive name]', and unpacking with './f5ar -u [archive file] [name of the restored file]'.
To show the effectiveness of the method, I downloaded a collection of 225 absolutely free photos of dogs from the Unsplash service and dug up a large 45-megabyte PDF of the second volume of Knuth's Art of Computer Programming in my documents.
The sequence is pretty simple:
$ du -sh knuth.pdf dogs/
44M     knuth.pdf
633M    dogs/

$ ./f5ar -p dogs/ .*jpg knuth.pdf dogs.f5ar
Reading compressing file... ok
Initializing the archive... ok
Analysing library capacity... done in 17.0s
Detected somewhat guaranteed capacity of 48439359 bytes
Detected possible capacity of upto 102618787 bytes
Compressing... done in 39.4s
Saving the archive... ok

$ ./f5ar -u dogs/dogs.f5ar knuth_unpacked.pdf
Initializing the archive... ok
Reading the archive file... ok
Filling the archive with files... done in 1.4s
Decompressing... done in 21.0s
Writing extracted data... ok

$ sha1sum knuth.pdf knuth_unpacked.pdf
5bd1f496d2e45e382f33959eae5ab15da12cd666  knuth.pdf
5bd1f496d2e45e382f33959eae5ab15da12cd666  knuth_unpacked.pdf

$ du -sh dogs/
551M    dogs/
The unpacked file can still (and should) be read:
As you can see, from the original 633 + 44 == 677 megabytes of data on the hard disk we arrive at a more pleasant 551. Such a radical difference is explained by the decrease in the coefficients, which affects their subsequent lossless compression: reducing a single one by one can easily shave a couple of bytes off the final file. Nevertheless, it is still data loss, however tiny, that one has to put up with.
Fortunately, the changes are completely invisible to the eye. Under the spoiler (since habrastorage cannot handle large files), the reader can judge the difference both by eye and by intensity, obtained by subtracting the values of the modified component from the original: the original, the one with the information inside, and the difference (the dimmer the colour, the smaller the difference within the block).
Looking at all these difficulties, buying a hard disk or uploading everything to the cloud may seem like a much simpler solution to the problem. But even though we live in such a wonderful time, there is no guarantee that tomorrow you will still be able to go online and upload all your extra data somewhere, or go to the store and buy yet another thousand-terabyte hard drive. Whereas you can always use what is already lying around at home.
-> GitHub
Source: https://habr.com/ru/post/453332/