Working with tar and gz archives using PHP

As often happens, it all started when I needed something to process tar.gz archives with PHP. After rummaging around the Internet, I was surprised to find that nothing acceptable had been published on the topic.

What do we have?

1. The PEAR package for PHP, http://pear.php.net/package/Archive_Tar. Fine, but unacceptable in my case, since I don't have access to the server settings. Reluctantly rejected.
2. Alexey Valeev's excellent article “Working with tar.gz archives in php”. Exactly what was needed, but alas: I needed a solution with a “transparent” license that could raise no questions, so using the library from Bitrix did not work either.
Actually, that's all.

Further sifting through search engines turned up nothing reasonable. After some thought, I dug into the code of the popular net2ftp, which, as I remembered, handles tar archives perfectly. It turned out that there is a library called pcltar.lib.php by Vincent Blavet, dated 2001, under the GNU license. Everything as it should be. But! For a start, I was put off by the size of the library itself: 127 kilobytes. I have a quirk from the old days: I still count bytes. Then, I wanted the result in the form of a class rather than separate functions. On top of that, my curiosity was piqued. I wanted to understand it thoroughly.

As a result, I had to find a description of the tar archive structure (for those interested, the header block is well described here) and study the code. The result is below. I understand that the task is specific, but perhaps it will come in handy for someone.

So, as you know, tar is not an archiver in the modern sense. Designed for saving data to tape, it cannot compress; it simply combines several files into one, adding its own headers and padding the result to a whole number of 512-byte blocks. The result can then be compressed by any archiver. Which one? Even rar, it makes no difference. Traditionally, though, gzip and bzip2 are used for this, since they themselves cannot combine several files into one (this is dictated by the “one program, one action” policy adopted in Unix systems). Support for gzip and bzip2 in PHP is provided by third-party libraries and is not important to us. Tar is what matters.

Let's briefly go over the file structure. As you would expect, the header comes first. After reviewing the documentation, I discovered that there are “old” and “new” header formats. The new one occupies 512 bytes and was obtained by adding extra fields to the old one. In theory they are compatible, but we will focus on the modern one. Let's take it apart. Here is the gist:

100 bytes name - file name (may contain a relative path)
8 bytes mode - file mode (permissions)
8 bytes uid - user ID
8 bytes gid - group ID
12 bytes size - file size in bytes (encoded in octal)
12 bytes mtime - date and time of the last modification, in seconds since the UNIX epoch (encoded in octal)
8 bytes chksum - checksum of the header (not of the file!)
1 byte typeflag - tells a file from a directory: file - 0, directory - 5
100 bytes linkname - name of the linked file
- the fields of the “new” format follow -
6 bytes magic - contains the word “ustar”, i.e. the mark of the “new” format
2 bytes version - version of the new format (may be absent)
32 bytes uname - owner name
32 bytes gname - owner group name
8 bytes devmajor - major device number
8 bytes devminor - minor device number
155 bytes prefix - prefix of the name (for paths that do not fit into the name field)

Unused bytes must be empty (NUL), although the code 0x20 (space) is also allowed.
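
To keep the offsets straight, the field table above can be written down as data. The sketch below is only an illustration: the array `$ustarFields` and the helper `ustarUnpackFormat()` are hypothetical names introduced here, not part of pcltar.lib.php. It adds up the field sizes (the fields do not fill the whole 512-byte block, so the tail of the block stays empty) and builds a format string of the kind unpack() expects.

```php
<?php
// Field sizes of the "new" (ustar) header, in the order they appear.
// Hypothetical helper data, for illustration only.
$ustarFields = [
    'name' => 100, 'mode' => 8, 'uid' => 8, 'gid' => 8,
    'size' => 12, 'mtime' => 12, 'chksum' => 8, 'typeflag' => 1,
    'linkname' => 100, 'magic' => 6, 'version' => 2,
    'uname' => 32, 'gname' => 32, 'devmajor' => 8, 'devminor' => 8,
    'prefix' => 155,
];

// Build an unpack() format such as "a100name/a8mode/...".
function ustarUnpackFormat(array $fields): string
{
    $parts = [];
    foreach ($fields as $name => $length) {
        $parts[] = 'a' . $length . $name;
    }
    return implode('/', $parts);
}

$used    = array_sum($ustarFields);  // bytes occupied by the fields
$padding = 512 - $used;              // the rest of the block stays empty
$format  = ustarUnpackFormat($ustarFields);
```

Keeping the layout in one array means the same table can drive both packing and unpacking, which makes an off-by-one in a field size easier to spot.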

Most of this data is generally not required. Personally, I was interested in name, size and date.

Next comes the data itself, padded (note!) with empty bytes to a multiple of 512 bytes. Then everything repeats for the next file. As you can see, everything is simple.

In essence, this knowledge is enough to try to pack the file.

1. Open the archive with fopen().

2. The header. This is the hardest part of the task. I did not reinvent the wheel and used the function from the pcltar.lib.php library mentioned above, slightly optimized. I will not give all the code here because of its bulk, but it boils down to the following steps:
- Determine the file name, its size, modification time and permissions. For directories, set the size to zero;
- Declare the unused parameters empty;
- Convert the numeric parameters (size, date) to octal;
- Format each parameter according to the declared size of its field. There is one trick here that I did not figure out right away: the meaningful part of each field must be one byte shorter than the field itself. The last byte must stay empty (NUL or space). Otherwise the archive is unreadable;
- Pack all the parameters into two separate strings. Two, because the header checksum has to sit between them;
- Calculate this checksum (the sum of all header bytes, with the checksum field itself counted as spaces), format it by the same rules and pack it;
- Finally, write three strings to the archive file: the first part of the parameters, the checksum and the second part of the parameters.

Done! Here is an example for the modification time:

$mtime = sprintf("%11s ", DecOct(filemtime($filename)));
…
pack("a100a8a8a8a12a12", …, …, …, …, …, $mtime);
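
The header steps of item 2 can be sketched end to end as follows. This is a minimal illustration, not the code from pcltar.lib.php or from the library described here: the function name `buildTarHeader()` and its parameters are hypothetical, uid, gid and the owner names are simply left empty, and names longer than 100 bytes are not handled. Unlike the `sprintf("%11s ", …)` line above, it lets pack() fill the last byte of each numeric field with NUL.

```php
<?php
// Sketch of a ustar header builder. The name and signature are assumptions
// made for this example, not an existing library API.
function buildTarHeader(string $name, int $size, int $mtime, int $mode = 0644, string $typeflag = '0'): string
{
    // Octal digits fill each numeric field except its last byte,
    // which pack("a…") pads with NUL.
    $first = pack(
        'a100a8a8a8a12a12',
        $name,                      // truncated by pack() if longer than 100 bytes
        sprintf('%07o', $mode),
        sprintf('%07o', 0),         // uid, not needed here
        sprintf('%07o', 0),         // gid, not needed here
        sprintf('%011o', $size),
        sprintf('%011o', $mtime)
    );
    // typeflag, linkname, magic, version, uname, gname,
    // devmajor, devminor, prefix, and 12 bytes of padding.
    $last = pack(
        'a1a100a6a2a32a32a8a8a155a12',
        $typeflag, '', 'ustar', '00', '', '', '', '', '', ''
    );

    // The checksum is the sum of all header bytes, with the checksum
    // field itself counted as eight spaces.
    $sum = 0;
    foreach ([$first, str_repeat(' ', 8), $last] as $part) {
        for ($i = 0, $n = strlen($part); $i < $n; $i++) {
            $sum += ord($part[$i]);
        }
    }
    // Six octal digits followed by NUL and a space: 8 bytes in total.
    $chksum = sprintf('%06o', $sum) . "\0 ";

    return $first . $chksum . $last;
}

$header = buildTarHeader('hello.txt', 5, 1388534400);
```

The three parts are concatenated in the same order in which the text writes them to the archive: first part, checksum, second part.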

3. The file body is quite simple. Vincent Blavet's library passes it through the pack function as well, but I ran several experiments with various files and saw no distortion during packing and unpacking without it. So, for the sake of performance, I skipped that step: there is no point in it. Just open the file, read the data and write it to the archive. Since the files in my case could be quite large, I do it in blocks. I chose a block size of 50 KB.

$infile = fopen($filename, 'rb');
// Copy the file body into the archive in 50 KB blocks.
while (!feof($infile)) {
    $fr = fread($infile, 51200);
    if ($this->tarmode == "tar")
        @fputs($this->tarfile, $fr);
    else
        @gzputs($this->tarfile, $fr);
}
fclose($infile);

4. Now we pad the data to a block boundary. To do this, we need to know how many bytes are missing. If the file is smaller than 512 bytes, subtract its size from 512. If it is larger, take the remainder of dividing the file size by 512 and subtract it from 512. The result is packed into a binary string of empty bytes.

We should also allow for the case when the file size is already a multiple of 512 bytes: some programs pad their files to the required size themselves. In that case, of course, nothing needs to be added.

Here is the resulting code:

$ffs = filesize($filename);
if ($ffs > 512)
    $tolast = 512 - fmod($ffs, 512);
else
    $tolast = 512 - $ffs;
// 512 or 0 means the size is already block-aligned: nothing to add.
if ($tolast != 512 && $tolast != 0) {
    $fdata = pack("a".$tolast, "");
    …
}
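
The two branches and the final check can also be folded into a single expression. The following is a sketch under that assumption; `tarPadding()` is a hypothetical helper name, not a function of the library.

```php
<?php
// Hypothetical helper: number of NUL bytes needed to reach the next
// 512-byte boundary (0 when the size is already a multiple of 512).
function tarPadding(int $size): int
{
    return (512 - $size % 512) % 512;
}

// The bytes to append after a 200-byte file body.
$pad = str_repeat("\0", tarPadding(200));
```

The outer `% 512` is what turns the "already aligned" case into zero, so no separate check is needed before writing.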

The result is a tar archive. You can now repeat the operation for the next file or close the archive.
If the Zlib library is available, the archive can be compressed as it is created, giving “tar.gz” or “tgz”, whichever you prefer. The easiest way to check for the library is to check whether the FORCE_GZIP constant is defined. To automate the process, I added such a check to every operation on the archive file. Like this:

if (defined('FORCE_GZIP'))
    // Zlib is available: write a compressed archive ("+" is not supported by gzopen).
    $resopen = @gzopen($this->tarname, 'ab'.$this->tarlevel);
else
    $resopen = @fopen($this->tarname, 'a+b');

In practice, I use this check to decide the extension of the future file and then pick the functions to call based on that extension, but this is not important here.
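
Choosing the functions by extension could look roughly like this. It is a sketch only: `openArchiveForAppend()` and the returned array shape are assumptions made for the example, and it uses `extension_loaded('zlib')` as the availability check instead of the FORCE_GZIP constant.

```php
<?php
// Hypothetical helper: pick the open function from the archive extension.
function openArchiveForAppend(string $path, int $level = 9): array
{
    $wantsGzip = (bool) preg_match('/\.(tgz|tar\.gz)$/i', $path);
    if ($wantsGzip && extension_loaded('zlib')) {
        // gzopen() takes the compression level as part of the mode string.
        return ['mode' => 'tar.gz', 'handle' => gzopen($path, 'ab' . $level)];
    }
    // Fall back to a plain tar file when Zlib is missing or not requested.
    return ['mode' => 'tar', 'handle' => fopen($path, 'ab')];
}
```

The caller can then branch on `mode` once and use the matching write and close functions for the rest of the operation. Note that with Zlib missing, a path ending in “.tar.gz” would get uncompressed content, so a real implementation would probably adjust the name as well.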

The remaining operations are much simpler. Since I did not need functions such as deleting files from the archive or searching inside it, my class got only the automatic detection of the Zlib library described above, getting the list of files, and unpacking any one of them. While writing this article, it occurred to me to add a separate function that unpacks the whole archive.

You can get the list of files in the archive by finding and reading all the headers. To do this, read the first 512 bytes of the archive (this is always a header) and unpack them with the unpack() function. Since unpack() returns an associative array, we give the parameters clear names at the same time. Like this:

unpack("a100name/a8perms/…", …)

The modification time and size must be converted back to decimal.

The resulting parameters can be returned to the caller. All that remains is to move the pointer in the archive file forward by the size of the packed file plus the padding up to the 512-byte boundary. It now points to the beginning of the next header, and the operation can be repeated.
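
For an uncompressed archive, the listing loop described above might be sketched like this. `listTarEntries()` and the keys of the returned arrays are hypothetical, only the first header fields are decoded, and the checksum is not verified. The sketch also stops at an all-zero block, which standard tar tools write at the end of an archive.

```php
<?php
// Hypothetical helper: list the entries of an uncompressed tar file.
function listTarEntries(string $path): array
{
    $format = 'a100name/a8mode/a8uid/a8gid/a12size/a12mtime/a8chksum/a1typeflag';
    $entries = [];
    $fh = fopen($path, 'rb');
    while (strlen($block = (string) fread($fh, 512)) === 512) {
        if (trim($block, "\0") === '') {
            break;                       // an all-zero block marks the end
        }
        $h = unpack($format, $block);
        // Octal text fields may be terminated by NULs or spaces.
        $size = (int) octdec(trim($h['size'], "\0 "));
        $entries[] = [
            'name'  => rtrim($h['name'], "\0"),
            'size'  => $size,
            'mtime' => (int) octdec(trim($h['mtime'], "\0 ")),
            'isDir' => $h['typeflag'] === '5',
        ];
        // Skip the body plus its padding to land on the next header.
        fseek($fh, (int) (ceil($size / 512) * 512), SEEK_CUR);
    }
    fclose($fh);
    return $entries;
}
```

In current PHP versions the `a` code keeps trailing NUL bytes when unpacking, which is why the name and the numeric fields are trimmed before use.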

Unpacking a particular file comes down to finding its header with the previous function, creating a file with that name at the specified location, moving the archive pointer to the start of the file data, reading as many bytes as the file is long, and writing them to the created file. For directories, creating them is all there is to do.
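
A sketch of that procedure for a plain tar file is given below. The function `extractTarEntry()` is a hypothetical name, it re-reads the headers itself to stay self-contained, and it does not validate entry names, so a real implementation would at least need to reject names containing “..”.

```php
<?php
// Hypothetical helper: extract one entry from an uncompressed tar file.
// Returns the path that was created, or null if the entry was not found.
function extractTarEntry(string $archive, string $wanted, string $destDir): ?string
{
    $format = 'a100name/a8mode/a8uid/a8gid/a12size/a12mtime/a8chksum/a1typeflag';
    $fh = fopen($archive, 'rb');
    $result = null;
    while (strlen($block = (string) fread($fh, 512)) === 512 && trim($block, "\0") !== '') {
        $h = unpack($format, $block);
        $name = rtrim($h['name'], "\0");
        $size = (int) octdec(trim($h['size'], "\0 "));
        if ($name !== $wanted) {
            // Not the entry we want: jump over its body and padding.
            fseek($fh, (int) (ceil($size / 512) * 512), SEEK_CUR);
            continue;
        }
        $target = rtrim($destDir, '/') . '/' . rtrim($name, '/');
        if ($h['typeflag'] === '5') {
            if (!is_dir($target)) {
                mkdir($target, 0777, true);      // a directory only needs creating
            }
        } else {
            if (!is_dir(dirname($target))) {
                mkdir(dirname($target), 0777, true);
            }
            $out = fopen($target, 'wb');
            $left = $size;
            // Copy exactly $size bytes in blocks of up to 50 KB.
            while ($left > 0 && ($chunk = fread($fh, min(51200, $left))) !== false && $chunk !== '') {
                fwrite($out, $chunk);
                $left -= strlen($chunk);
            }
            fclose($out);
        }
        $result = $target;
        break;
    }
    fclose($fh);
    return $result;
}
```

Only the file body is read; the padding after it is never touched, because the function stops as soon as the wanted entry has been written.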

The only two difficulties here are related to the specifics of the Zlib library.

First, it turned out that the gzopen function of this library does not implement the “+” modifier, which in fopen opens a file for reading and writing at the same time. I had to give up opening and closing the archive file just once and instead repeat these operations on every call, with the mode the task requires.

Second, the documentation states (and I confirmed it myself) that the gzseek function, the counterpart of fseek, is “emulated, but works extremely slowly.” I had to give up moving the pointer directly to the desired position in the archive file and replace it with “empty” reads, at the cost of performance. If the job were limited to plain tar archives, this could have been avoided.
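
An “empty” read of this kind can be sketched as a small loop that reads and discards data. The helper name `skipBytes()`, its `$isGz` flag and the 8192-byte chunk size are assumptions made for the example.

```php
<?php
// Hypothetical helper: advance a read handle by $count bytes without seeking.
// Returns the number of bytes actually skipped.
function skipBytes($handle, int $count, bool $isGz = false): int
{
    $skipped = 0;
    while ($skipped < $count) {
        $want = min(8192, $count - $skipped);
        $chunk = $isGz ? gzread($handle, $want) : fread($handle, $want);
        if ($chunk === false || $chunk === '') {
            break;                    // end of stream reached early
        }
        $skipped += strlen($chunk);   // the data itself is thrown away
    }
    return $skipped;
}
```

To move from one header to the next, the count passed in would be the file size plus its padding up to the 512-byte boundary.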

That's all. As a result, I ended up with a fully universal library of just over 11 KB of uncompressed code. You can download the library here: Archivator_tar-tar_gz.zip.

Always yours, PunkerPoock

Source: https://habr.com/ru/post/207470/
