
Reading tar in 26 lines of ANSI C code

Archivers are scary! Huge, arcane algorithms that an ordinary person will never understand! Rar, zip, gzip, tar: de facto modern standards, which surely means they are extremely complex and sophisticated things not worth trying to understand. Well, tar looks simpler; maybe it's not so complicated in there? We check out its source code from git. We see dozens of files, many tens of kilobytes. Hmm. Apparently a dead end.



In fact, it is not that difficult. The documentation explains that tar is just a way to write multiple files to tape, so everything should be simple: a set of supporting metadata for each file, followed directly by its contents. It is the understanding of this fact that made it possible to read tar files in 26 lines.


Why do we need tar at all in times of total zip dominance? For me, the question of using tar comes up when I want to get an archiver "for free" in my miniature C applications, i.e. with minimal growth of the executable file and without unnecessary dependencies. For example, the dxPmdxConverter utility can read BMP and convert it to PNG using LodePNG. That is, the application already contains functionality that "archives" an array of pixels into the compressed PNG format. PNG is compressed with the Deflate algorithm, which is also used in zip and gzip. Moreover, gzip uses it directly: a gzip header is written, then the Deflate data stream, then the crc checksum. The output is a ready-made .gz file that any archiver can open. However, gzip can compress only a single file, so before compressing, several files must somehow be merged into one. The most common tool for that is tar.


  (ASCII-art banner: "PNG / Deflate / Gzip")

The next time tar came in handy was a similar situation. To avoid storing the resources of the game Wordlase in plain form, it was decided to archive them. Packing can take as long as it likes on the developer's machine, but the resources get unpacked on every game launch, so the solution had to be fast. A public domain implementation of a compression algorithm was found on the Internet, but it too could pack only one file. And so the hero of this publication was born: dxTarRead.


Advantages of dxTarRead:

  - no memory allocation: the function simply returns a pointer into the tar data already in memory;
  - no dependencies: just 26 lines of plain ANSI C.



The main disadvantage is that the tar file must be read into memory in its entirety. On the other hand, the resources will be used anyway, i.e. they will be loaded sooner or later. So why not load them from disk right away and then, when needed, take them straight from the array with the tar data?


So, tar. Basic information on the standard can be found at GNU.org. Mostly it was the description of the "struct posix_header" structure that proved useful to me. The constants are taken from there:


  const int NAME_OFFSET = 0, SIZE_OFFSET = 124, MAGIC_OFFSET = 257;
  const int BLOCK_SIZE = 512, NAME_SIZE = 100, SZ_SIZE = 12, MAGIC_SIZE = 5;

These constants can be read as follows: if we move 124 bytes (SIZE_OFFSET) from the beginning of a tar block, then the next 12 bytes (SZ_SIZE) hold the size of the file stored inside the tar.


And don't forget to read the rest of the documentation. There you can learn that the size is stored in the octal number system ;-) That is, if we read the bytes of SZ_SIZE from the end, multiplying each digit's weight by 8 as we go, we get the size in the usual decimal form.


If we write the above in C, we get:


  const char* sz = tar + SIZE_OFFSET + currentBlockStart;
  long size = 0;

  for(i=SZ_SIZE-2, mul=1; i>=0; mul*=8, i--) /* Octal str to int */
      if( (sz[i]>='1') && (sz[i] <= '9') ) size += (sz[i] - '0') * mul;

Now we come to the topic of tar blocks. A block is just 512 bytes of data: either a tar header or the file's bytes written out one after another. If the last block of a file is shorter than 512 bytes, a full 512 bytes are still reserved for it. So each tar archive looks like this:


 +-------+-------+-------+-------+-------+-------+
 | tar 1 | file1 |  ...  | file1 | tar 2 | file2 | ...
 +-------+-------+-------+-------+-------+-------+

First comes a block with the tar header, which specifies the size of the stored file. It is followed by N blocks with the file's contents. So, to get to the next file in the tar, we need to advance by (N + 1) * 512 bytes. In code:


  newOffset = (1 + size/BLOCK_SIZE) * BLOCK_SIZE; /* trim by block size */
  if( (size % BLOCK_SIZE) > 0 ) newOffset += BLOCK_SIZE;

The algorithm is as follows:


  1. Read the file name and its size from the block.
  2. If the file name matches the one we are looking for, return the pointer to the user.
  3. Otherwise, jump to the next block and repeat from step 1.

To compare file names, I had to implement my own analogue of strncmp as a loop:


  i = 0;
  while((i<NAME_SIZE) && (fileName[i]!=0) && (name[i]==fileName[i])) i++;
  if( (i > 0) && (name[i] == 0) && (fileName[i] == 0) ) found = 1;

That's it: all the tricky parts have been covered. The full code of the function:


  const char* dxTarRead(const void* tarData, const long tarSize,
                        const char* fileName, long* fileSize)
  {
      const int NAME_OFFSET = 0, SIZE_OFFSET = 124, MAGIC_OFFSET = 257;
      const int BLOCK_SIZE = 512, NAME_SIZE = 100, SZ_SIZE = 12, MAGIC_SIZE = 5;
      const char MAGIC[] = "ustar"; /* Modern GNU tar's magic const */
      const char* tar = (const char*) tarData; /* From "void*" to "char*" */
      long size, mul, i, p = 0, found = 0, newOffset = 0;

      *fileSize = 0; /* will be zero if TAR wrong or there is no such file */

      do { /* "Load" data from tar - just point to passed memory */
          const char* name = tar + NAME_OFFSET + p + newOffset;
          const char* sz = tar + SIZE_OFFSET + p + newOffset; /* size string */
          p += newOffset; /* pointer to current file's data in TAR */

          for(i=0; i<MAGIC_SIZE; i++) /* Check for supported TAR version */
              if( tar[i + MAGIC_OFFSET + p] != MAGIC[i] ) return 0; /* = NULL */

          size = 0; /* Convert file size from string into integer */
          for(i=SZ_SIZE-2, mul=1; i>=0; mul*=8, i--) /* Octal str to int */
              if( (sz[i]>='1') && (sz[i] <= '9') ) size += (sz[i] - '0') * mul;

          /* Offset size in bytes. Depends on file size and TAR's block size */
          newOffset = (1 + size/BLOCK_SIZE) * BLOCK_SIZE; /* trim by block */
          if( (size % BLOCK_SIZE) > 0 ) newOffset += BLOCK_SIZE;

          i = 0; /* strncmp - compare file's name with that a user wants */
          while((i<NAME_SIZE) && (fileName[i]!=0) && (name[i]==fileName[i])) i++;
          if( (i > 0) && (name[i] == 0) && (fileName[i] == 0) ) found = 1;
      } while( !found && (p + newOffset + BLOCK_SIZE <= tarSize) );

      if( found ) {
          *fileSize = size;
          return tar + p + BLOCK_SIZE; /* skip header, point to data */
      } else {
          return 0; /* No file found in TAR - return NULL */
      }
  }

Conclusion


tar does not compress data; it stores it as-is. This is exactly what makes it possible not to allocate new memory but simply to return a pointer into the existing buffer.


A tar block is 512 bytes, and a tar header must be stored in addition to each file. That is, a file of a few bytes occupies a whole kilobyte in a tarball. If you need to store many small files without compressing the archive, tar is a bad choice.


dxTarRead on GitHub


P.S. I have seen the "I'm PR!" hub. But where is the "I'm looking for a job" hub? :)



Source: https://habr.com/ru/post/320834/

