Not so long ago I wrote about extracting text from various file formats, be it DOC or PDF. Today we will look at an equally interesting format: the RAR archive format. I won't raise false hopes: today we will only read the list of files, and we will do it without any additional PHP extensions. So, if you are interested, read on...
RAR is a good “bad” archiver
Let me remind you that RAR is developed by our compatriot Evgeny Roshal, from whom it takes its name: Roshal ARchiver. The format is closed, which has not hindered its spread at all, both in Russia and around the world. Almost every workstation I have come across had the RAR archiver installed, sometimes as a cracked copy.
Over its lifetime the archiver has grown to version 3 (I suppose a 4th version will appear soon), which hit most of the home-made unpackers: the third version introduced new compression algorithms, which sent those unpackers into paranoia and heresy. Nevertheless, the developer's site offers a fair amount of source code for unpacking RAR files on different platforms and in different development environments.
As for PHP, the PECL extension has reached a “stable” first version and is rarely installed on hosting systems. The extension, by the way, uses the very same “unrar” whose source code is available on the program's website. Moreover, I honestly confess that I did not manage to get the extension to work under PHP 5.3 (on Windows); php_rar.dll did load under 5.2.11, but most archives could not be read. I would not be surprised if all the compiled Windows builds of the library were made for “some” other version, but I did not feel like compiling it myself... So one evening I sat down to see what unrar.dll is and what can be built from the sources on the site.
Rar - how is it?
Because the format is closed, documentation on it is scarce, even though source code for unpacking data is available. No wonder: very few people will wade through about 600 KB of source code. Nevertheless, there are still enthusiasts (God forbid you think I mean myself :) ), which is how the UniquE RAR File Library project came about; it cut the source code needed to unpack files created by version 2 of the archiver several-fold.
So, I came across the sources of the above-mentioned library, as well as minimal, but at least some, documentation on the aged 2.02 version of the archiver. Well, let's dive into what RAR archives look like inside.
A RAR archive consists of variable-length blocks, each starting with a 7-byte header. Any archive contains at least two blocks: MARK_HEAD and MAIN_HEAD. The first one tells us that we are looking at a RAR file and looks like "52 61 72 21 1a 07 00" in hex. The third byte, 0x72, indicates that this is a Marker Header. The last two bytes, 07 00, form a little-endian word (0x0007) that holds the length of the block: exactly those 7 bytes.
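To make the marker check and the little-endian word concrete, here is a minimal sketch. The helper names `looksLikeRar()` and `readWordLE()` are made up for this illustration and are not part of any library; `unpack('v', ...)` is PHP's standard way to decode an unsigned 16-bit little-endian value.

```php
<?php
// Hypothetical helper: true if the data starts with the 7-byte RAR marker block.
function looksLikeRar($firstBytes)
{
    return substr($firstBytes, 0, 7) === "Rar!\x1a\x07\x00";
}

// Hypothetical helper: reads an unsigned 16-bit little-endian word at $offset.
// Returns null when fewer than two bytes are available there.
function readWordLE($data, $offset)
{
    if (strlen($data) < $offset + 2) {
        return null;
    }
    $word = unpack('v', substr($data, $offset, 2));
    return $word[1];
}

$marker = "\x52\x61\x72\x21\x1a\x07\x00";
var_dump(looksLikeRar($marker));  // bool(true)
var_dump(readWordLE($marker, 5)); // int(7): the block length stored as 07 00
```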
The second block, the Main Header, begins immediately after the first; it must be 13 bytes long and has the type byte 0x73. After it comes the actual data: a compressed file (marker 0x74 in the third byte of the block header), an archive comment, additional information or, for example, a recovery record.
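The 7-byte header shared by all blocks can be decoded in a single `unpack()` call. The sketch below assumes the layout implied by the text: a 2-byte word, the type byte, a 2-byte flags word and the 2-byte length. The function name `parseBlockHeader()`, the array keys and the label FILE_HEAD are chosen for this illustration only; the first word is called `crc` here as an assumption, since the article does not describe it.

```php
<?php
// Hypothetical helper: decodes the 7-byte header that starts every block.
// Returns null if fewer than 7 bytes are supplied.
function parseBlockHeader($bytes)
{
    if (strlen($bytes) < 7) {
        return null;
    }
    // v = unsigned 16-bit little-endian word, C = unsigned byte
    $h = unpack('vcrc/Ctype/vflags/vsize', substr($bytes, 0, 7));

    // Only the three block types mentioned in the article are named here.
    $names = array(0x72 => 'MARK_HEAD', 0x73 => 'MAIN_HEAD', 0x74 => 'FILE_HEAD');
    $h['name'] = isset($names[$h['type']]) ? $names[$h['type']] : 'other';

    // Bit 15 of the flags word: additional data follows the block header.
    $h['has_add_size'] = ($h['flags'] & 0x8000) !== 0;
    return $h;
}

$header = parseBlockHeader("\x52\x61\x72\x21\x1a\x07\x00");
printf("%s, %d bytes\n", $header['name'], $header['size']); // MARK_HEAD, 7 bytes
```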
The algorithm for obtaining the list of files is not complicated (leaving aside archives with an encrypted directory structure, which are outside the scope of this article). All offsets below are counted from the start of the block, beginning at zero.
- Read the seven bytes of the block header, find the header length in it (the word at offset 5) and read the header to the end;
- Check whether the block is a “file” block;
- If it is, the DWORD at offset 7 is the size of the packed file (and also the amount of data to skip before the next block), the next DWORD is the size of the original file, offset 28 holds the file attributes (DWORD), and offsets 26 and 32 hold the length of the file name (2 bytes) and the name itself. The block also contains the file's date, the code of the OS on which the file was archived and the CRC;
- If the block is not a “file” block, read the word at offset 3 and check its 15th bit, which signals additional data following the block. If the bit is set to “1”, skip ADD_SIZE bytes (ADD_SIZE is the first DWORD after the 7-byte block header);
- And so on until the end of the file...
Complicated? Not really, compared to some DOC files.
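The file-block offsets listed above can be checked against a synthetic block. This is only a sketch: `parseFileBlock()` and its array keys are hypothetical, the fields between the sizes and the name length (OS code at offset 15, CRC at 16, date at 20, two single bytes at 24 and 25) are filled in on the assumption that they follow the layout of the old format notes, and the sample block is fabricated with `pack()` purely to exercise the parser.

```php
<?php
// Hypothetical helper: decodes the fixed part of a "file" block (type 0x74).
// $block must contain the whole block header, including its first 7 bytes.
function parseFileBlock($block)
{
    if (strlen($block) < 32 || ord($block[2]) !== 0x74) {
        return null;
    }
    // V = unsigned 32-bit little-endian, v = 16-bit little-endian, C = byte.
    // The 25 bytes at offsets 7..31 hold the fields named below.
    $f = unpack(
        'Vpack_size/Vunp_size/Chost_os/Vfile_crc/Vftime/Cunp_ver/Cmethod/vname_size/Vattr',
        substr($block, 7, 25)
    );
    // The file name starts at offset 32 and is name_size bytes long.
    $f['name'] = substr($block, 32, $f['name_size']);
    return $f;
}

// Synthetic block, invented for this example only.
$name  = 'readme.txt';
$head  = pack('vCvv', 0, 0x74, 0x8000, 32 + strlen($name));
$fixed = pack('VVCVVCCvV', 120, 300, 2, 0x1234ABCD, 0, 20, 0x33, strlen($name), 0x20);
$info  = parseFileBlock($head . $fixed . $name);

echo $info['name'], ': ', $info['pack_size'], ' -> ', $info['unp_size'], "\n";
// readme.txt: 120 -> 300
```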
Source
// Reads the list of files from the RAR archive $filename without using
// the PECL rar extension. Returns an array of entries, or false on failure.
function rar_getFileList($filename) {
    // Helper that reads COUNT bytes from a string as a little-endian number.
    // It is declared inside the parent function, so it only comes into
    // existence when rar_getFileList() is called for the first time.
    if (!function_exists("temp_getBytes")) {
        function temp_getBytes($data, $from, $count) {
            $string = substr($data, $from, $count);
            $string = strrev($string);
            return hexdec(bin2hex($string));
        }
    }

    // Try to open the file
    $id = fopen($filename, "rb");
    if (!$id)
        return false;

    // Check whether the file is a RAR archive
    $markHead = fread($id, 7);
    if (bin2hex($markHead) != "526172211a0700") {
        fclose($id);
        return false;
    }

    // Try to read the MAIN_HEAD block
    $mainHead = fread($id, 7);
    if (strlen($mainHead) < 7 || ord($mainHead[2]) != 0x73) {
        fclose($id);
        return false;
    }
    $headSize = temp_getBytes($mainHead, 5, 2);

    // Move to the first "significant" block in the file
    fseek($id, $headSize - 7, SEEK_CUR);

    $files = array();
    while (!feof($id)) {
        // Read the block header
        $block = fread($id, 7);
        $headSize = temp_getBytes($block, 5, 2);
        if ($headSize <= 7)
            break;

        // Read the rest of the block header, using the header length
        $block .= fread($id, $headSize - 7);

        // If it is a file block, process it
        if (ord($block[2]) == 0x74) {
            // See how much space the packed file takes in the archive
            // and skip over its data to the next block
            $packSize = temp_getBytes($block, 7, 4);
            fseek($id, $packSize, SEEK_CUR);

            // Read the file attributes: r - read only, h - hidden,
            // s - system, d - directory, a - archived
            $attr = temp_getBytes($block, 28, 4);
            $attributes = "";
            if ($attr & 0x01)
                $attributes .= "r";
            if ($attr & 0x02)
                $attributes .= "h";
            if ($attr & 0x04)
                $attributes .= "s";
            // Append rather than assign, so earlier flags are not lost
            if ($attr & 0x10 || $attr & 0x4000)
                $attributes .= "d";
            if ($attr & 0x20)
                $attributes .= "a";

            // Store the file name, sizes before and after packing, CRC and attributes
            $files[] = array(
                "filename" => substr($block, 32, temp_getBytes($block, 26, 2)),
                "size_compressed" => $packSize,
                "size_uncompressed" => temp_getBytes($block, 11, 4),
                "crc" => temp_getBytes($block, 16, 4),
                "attributes" => $attributes,
            );
        } else {
            // Not a file block: skip it, taking into account the possible
            // additional offset ADD_SIZE
            $flags = temp_getBytes($block, 3, 2);
            if ($flags & 0x8000) {
                $addSize = temp_getBytes($block, 7, 4);
                fseek($id, $addSize, SEEK_CUR);
            }
        }
    }
    fclose($id);

    // Return the file list
    return $files;
}
You can get the code with comments on GitHub.
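Since rar_getFileList() returns either false or an array of entries, the caller has to handle both cases. Below is a small sketch of how the result might be printed. The helper `formatRarList()` is hypothetical, and the sample entry is invented data that merely mirrors the array keys used in the listing, so that the snippet runs on its own.

```php
<?php
// Hypothetical helper: renders the value returned by rar_getFileList().
function formatRarList($files)
{
    if ($files === false) {
        return "Not a readable RAR archive\n";
    }
    $out = '';
    foreach ($files as $f) {
        // attributes, original size, packed size, CRC in hex, file name
        $out .= sprintf(
            "%-5s %10d %10d  %08x  %s\n",
            $f['attributes'],
            $f['size_uncompressed'],
            $f['size_compressed'],
            $f['crc'],
            $f['filename']
        );
    }
    return $out;
}

// Invented sample entry, shaped like one element of the function's result.
$sample = array(array(
    'filename'          => 'docs\\readme.txt',
    'size_compressed'   => 120,
    'size_uncompressed' => 300,
    'crc'               => 0x1234ABCD,
    'attributes'        => 'a',
));
echo formatRarList($sample);
```

With the listing above loaded, the same helper could be fed the real result, for example `echo formatRarList(rar_getFileList('archive.rar'));`, where the file name is a placeholder.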
Perspectives
As for reading the files themselves from archives... in theory this can be done in PHP by porting the library from UniquE, but that only works for archives created by versions up to 2.90. The library will not read newer archives, and you can imagine what it takes to dig through half a thousand kilobytes of code.