Working with CAB archives via IStream

Some time ago I needed to compress data directly in memory without using anything third-party, that is, relying only on built-in Windows capabilities. I settled on Cabinet.dll for compression and on the IStream interface for working with data in memory. I couldn't find anything like this on the Internet, so I decided to share my work.

Introduction

I didn't want to use third-party solutions because I would have to ship extra libraries or include their source code in the project. Windows itself offers only a small set of compression and decompression tools: Cabinet.dll, ZipFldr.dll (compressed Zip folders), and RtlCompressBuffer / RtlDecompressBuffer. I couldn't find any clear documentation on compressed Zip folders, and RtlCompressBuffer / RtlDecompressBuffer support only LZ compression in versions of Windows up to and including Windows 7. Cabinet.dll, on the other hand, has been present in the system from Windows 95 to the present day.

For file and memory operations, the documentation suggests using standard C library functions or Windows API functions such as CreateFile / CloseHandle / ReadFile / WriteFile. Since all my file operations had to happen in memory, I decided to use IStream instead.

A little about Cabinet.dll

The library is functionally divided into two parts: FCI (File Compression Interface) and FDI (File Decompression Interface); both are covered in Microsoft's documentation. The two interfaces use essentially the same functions for working with files and memory, but for some reason Microsoft gave them different prototypes in FCI and FDI. Nothing prevents you from expressing one set through the other, though, as shown further on.

To use the library, include FCI.h and/or FDI.h as needed and link against Cabinet.lib. All of these files ship with the Windows SDK.

Implementation of the compression interface

The simplest code that implements compression looks like this:

    /* Input:
         IStream* pIStreamFile - stream holding the data to compress
         char* szFileName      - name the file will have inside the archive */
    ERF erf;
    CCAB ccab = {MAXINT, MAXINT};
    *(IStream**)ccab.szCabPath = SHCreateMemStream(0, 0); // stream that receives the archive
    HFCI hFCI = FCICreate(&erf, fPlaced, fAlloc, fFree, fOpen, fRead, fWrite,
                          fClose, fSeek, fDelete, fTemp, &ccab, 0);
    if (hFCI) {
        FCIAddFile(hFCI, (PSZ)pIStreamFile, szFileName, 0, fGetNext, fStatus, fInfo, tcompTYPE_MSZIP);
        FCIFlushFolder(hFCI, fGetNext, fStatus);
        FCIFlushCabinet(hFCI, 0, fGetNext, fStatus);
        FCIDestroy(hFCI);
    }
    /* Output: *(IStream**)ccab.szCabPath - stream containing the cab archive.
       Do not forget to call Release() on it when you are done! */

That is, the code itself is quite simple. The real work is in the callbacks passed when creating the FCI context and in the subsequent calls. Their parameters and return values are covered in Microsoft's documentation, so only the essentials are given for each function below.

One caveat: our file descriptors are non-standard, since they are pointers to IStream. Because of this, you have to be careful about how such a "descriptor" is passed. For example, the CCAB structure has two fields, szCabPath and szCab, and it might seem logical to put the pointer into the second one, but that does not work. FCI concatenates the two strings (or rather, it thinks it is concatenating strings, but we know better), so the resulting file "name" is szCabPath, which therefore also serves as the descriptor.
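The idea of hiding a pointer inside a path buffer can be illustrated without the Windows headers. The sketch below is a portable model, not the Cabinet API: FakeStream, storeHandle and loadHandle are hypothetical names, and std::memcpy is used in place of the pointer casts from the article.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for IStream; only its address matters here.
struct FakeStream { int id; };

// Write the pointer value into the first bytes of a "path" buffer,
// the way the article stores an IStream* in ccab.szCabPath.
inline void storeHandle(char* path, std::size_t pathSize, FakeStream* s) {
    assert(pathSize >= sizeof s);
    std::memcpy(path, &s, sizeof s);
}

// Read the pointer back, as an "open" callback would,
// and return it as an integer handle.
inline std::intptr_t loadHandle(const char* path) {
    FakeStream* s = nullptr;
    std::memcpy(&s, path, sizeof s);
    return reinterpret_cast<std::intptr_t>(s);
}
```

Storing a stream with storeHandle and then calling loadHandle on the same buffer yields the original address, which is all that fOpen relies on.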

fPlaced

Called every time a new file is added to the archive.

 FNFCIFILEPLACED(fPlaced){ return 0; }

Returning -1 signals an error; other values are application-defined. The callback can be used, for example, to report each file as it is added.

fGetNext

Called before creating a new archive volume.

 FNFCIGETNEXTCABINET(fGetNext){ return 1; }

If successful, returns TRUE ; otherwise, returns FALSE . Nothing remarkable.

fStatus

It is called at several stages of file processing: block compression, adding a compressed block and recording an archive.

 FNFCISTATUS(fStatus){ return typeStatus == statusCabinet ? cb2 : 0; }

On error, return -1; otherwise return any value, except when typeStatus == statusCabinet, in which case you must return the size of the archive, which is passed in the cb2 parameter.

fInfo

Sets file attributes.

 FNFCIGETOPENINFO(fInfo){ *pattribs = 0; return (INT_PTR)pszName; }

IStream carries neither dates nor file attributes, so the value at pattribs should be set to 0; otherwise you risk getting files with strange attributes in the archive (or no archive at all).

Returning -1 signals an error; otherwise return a handle to the opened file.

fTemp

Creating a temporary file.

 FNFCIGETTEMPFILE(fTemp){ *(IStream**)pszTempName = SHCreateMemStream(0, 0); return 1; }

Returns TRUE on success and FALSE otherwise. The file name (here, a pointer to IStream) is returned through the pszTempName parameter.

fDelete

Delete the file.

 FNFCIDELETE(fDelete){ (*(IStream**)pszFile)->Release(); return 0; }

Returns 0 on success and -1 on failure. Deleting a file here means freeing the resources held by the stream, so we simply call Release().

fAlloc, fFree

Allocation / release of memory.

    FNFCIALLOC(fAlloc) { return new char[cb]; }
    FNFCIFREE(fFree)   { delete[] static_cast<char*>(memory); } // array form matches new[]

It's all very simple, so I even combined these functions in one section.

fOpen

Opening file (stream).

 FNFCIOPEN(fOpen){ return *(INT_PTR*)pszFile; }

Since the file name in our case is equivalent to the file descriptor, we return the name as the descriptor (or -1 if an error occurred).

fClose

Close the file descriptor.

 FNFCICLOSE(fClose){ LARGE_INTEGER li = {}; ((IStream*)hf)->Seek(li, 0, 0); return 0; }

Returns 0 on success and -1 on failure. Why not Release()? Because that would "delete the file", i.e. destroy the stream, whereas we only need to close it. So we just reset the pointer to the beginning.
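The difference between closing and deleting can be modeled without COM. This is only a sketch under stated assumptions: FakeStream, closeStream and deleteStream are hypothetical names, and the refs counter merely imitates the COM reference count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for an in-memory IStream.
struct FakeStream {
    std::string data;      // stream contents
    std::size_t pos = 0;   // current read/write position
    int refs = 1;          // imitates the COM reference count
};

// "Close": the contents must survive, so only the position is rewound.
inline int closeStream(FakeStream* s) {
    s->pos = 0;
    return 0;
}

// "Delete": drop the reference; the data is gone once the count reaches zero.
// Returns the remaining count so the caller can see what happened.
inline int deleteStream(FakeStream* s) {
    int left = --s->refs;
    if (left == 0) delete s;
    return left;
}
```

After closeStream the data is still there and can be read again from the start; after deleteStream drops the last reference the object no longer exists.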

fRead, fWrite

Read / write data from file / to file.

    FNFCIREAD(fRead) {
        ULONG ul;
        HRESULT hr = ((IStream*)hf)->Read(memory, cb, &ul);
        return (hr && hr != S_FALSE) ? -1 : ul; // S_OK and S_FALSE both count as success
    }
    FNFCIWRITE(fWrite) {
        ULONG ul;
        HRESULT hr = ((IStream*)hf)->Write(memory, cb, &ul);
        return (hr && hr != S_FALSE) ? -1 : ul;
    }

Each returns the number of bytes read or written, or -1 on error; for reads, 0 means the end of the file has been reached.

fSeek

Positioning the pointer in the file.

    FNFCISEEK(fSeek) {
        LARGE_INTEGER liDist;
        liDist.QuadPart = dist; // sign-extends, so negative offsets work
        ULARGE_INTEGER uliPos;
        HRESULT hr = ((IStream*)hf)->Seek(liDist, seektype, &uliPos);
        return hr ? -1 : (long)uliPos.LowPart;
    }

Returns -1 on error; otherwise, a new pointer position.
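One detail worth keeping in mind: the seek callback returns a long, which is 32 bits wide on Windows, while IStream::Seek reports a 64-bit position, so the code above returns only the low 32 bits. A more defensive variant, sketched here as portable C++ with a hypothetical helper name (toSeekResult) and std::int32_t standing in for the Windows long, treats any position that does not fit as an error.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Models the return value of a seek callback whose result type
// is a 32-bit signed integer (std::int32_t stands in for `long` on Windows).
// ok     - whether the underlying seek succeeded
// newPos - the 64-bit position reported by the stream
inline std::int32_t toSeekResult(bool ok, std::uint64_t newPos) {
    const std::uint64_t maxPos =
        static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
    if (!ok || newPos > maxPos)
        return -1;                           // error, or position not representable
    return static_cast<std::int32_t>(newPos);
}
```

With this mapping a position of 2 GB or more is reported as a failure rather than being silently truncated.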

Unpacking interface implementation

The unpacking code looks like this:

    /* Input: IStream* pIStrCab - stream containing the cab archive */
    ERF erf;
    HFDI hFDI = FDICreate(fAlloc, fFree, fnOpen, fnRead, fnWrite, fnClose, fnSeek, cpuUNKNOWN, &erf);
    if (hFDI) {
        IStream *pIStrSrc = SHCreateMemStream(0, 0);
        if (FDICopy(hFDI, (PSZ)&pIStrCab, (PSZ)&pIStrCab, 0, fnNotify, 0, &pIStrSrc)) {
            // work with the extracted data in pIStrSrc
        }
        pIStrSrc->Release();
        FDIDestroy(hFDI);
    }
    pIStrCab->Release();
    /* Output (available inside the if block above):
       IStream* pIStrSrc - stream with the extracted data */

Things are not so simple here. The extraction of all files from the archive is driven by a single function, FDICopy, which calls fnNotify as it runs, and that is where all the magic happens. But more on that later.

In general, the process is similar: we create an FDI context and a stream for the output data, extract the file from the archive into this stream (in my example only a single file had to be extracted), and destroy the context. (PSZ)&pIStrCab must be passed twice because the function concatenates both parameters as it works, and omitting one of them causes an error (yes, I stepped on that rake too).

Now a little about the functions. In general, they are similar to the FCI functions, except that they lack the last two parameters; the memory allocation and release functions are identical, so there is no point in describing them again. To reduce the amount of code, you could instead express the FCI functions through the FDI ones so as not to pass the extra zero parameters.

fnOpen, fnClose

Open / close file (stream).

    FNOPEN(fnOpen)   { return *(INT_PTR*)pszFile; }
    FNCLOSE(fnClose) { return fClose(hf, 0, 0); }

It is easier to duplicate fnOpen than to call fOpen, while fnClose calls the FCI fClose function with zeros for the last two parameters, which this implementation does not use.

fnRead, fnWrite, fnSeek

Reading / writing data and positioning the pointer.

    FNREAD(fnRead)   { return fRead(hf, pv, cb, 0, 0); }
    FNWRITE(fnWrite) { return fWrite(hf, pv, cb, 0, 0); }
    FNSEEK(fnSeek)   { return fSeek(hf, dist, seektype, 0, 0); }

Returned values are the same as for FCI.

fnNotify

The most important function.

    FNFDINOTIFY(fnNotify) {
        if (fdint == fdintCOPY_FILE)
            if (!lstrcmpA(pfdin->psz1, "Data")) // name of the file to extract
                return *(INT_PTR*)pfdin->pv;    // the IStream* behind pvUser, pointer-sized read
        return fdint == fdintCLOSE_FILE_INFO;
    }

The function is described in full in Microsoft's documentation; a few points need explaining here.
In most cases the function returns 0 to indicate success (the exception is fdintCLOSE_FILE_INFO, where it returns TRUE). When fdint == fdintCOPY_FILE, the return value means the following: 0 skips the file, -1 signals an error (FDICopy is aborted), and any other value is the handle of the stream into which the data should be extracted.

Now the fun begins, because if we create the streams inside this function, we cannot reach them from outside. There are at least two solutions, and both rely on the so far unused and therefore inconspicuous last parameter of FDICopy, pvUser. It lets you pass user data, and it is this value that arrives in pfdin->pv. The first approach applies when you have a fixed list of file names to extract from the archive: pass it as an array of structures, each containing the required file name and a pointer to the IStream to extract into. The second approach applies when the number of files is unknown and you need to extract them all: in that case pvUser can carry the address of a container (for example, std::vector) in which the names and descriptors of the extracted files are stored.
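Both approaches can be modeled in portable C++. This is a sketch, not the FDI API: FakeStream, Extracted, pickFromFixedList, collectAll and releaseAll are hypothetical names, and the user argument plays the role of pfdin->pv.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct FakeStream { std::string data; };   // hypothetical stand-in for IStream

struct Extracted {          // a file name paired with its target stream
    std::string name;
    FakeStream* stream;
};

// First approach: the caller prepares a fixed list of wanted files.
// Returns the matching stream as a handle, or 0 to skip the file.
inline std::intptr_t pickFromFixedList(void* user, const char* fileName) {
    auto* wanted = static_cast<std::vector<Extracted>*>(user);
    for (const Extracted& e : *wanted)
        if (e.name == fileName)
            return reinterpret_cast<std::intptr_t>(e.stream);
    return 0;
}

// Second approach: extract everything. A new stream is created for each
// file and remembered in the container so the caller can reach it later.
inline std::intptr_t collectAll(void* user, const char* fileName) {
    auto* files = static_cast<std::vector<Extracted>*>(user);
    FakeStream* s = new FakeStream{};
    files->push_back({fileName, s});
    return reinterpret_cast<std::intptr_t>(s);
}

// The caller owns the collected streams and frees them when done.
inline void releaseAll(std::vector<Extracted>& files) {
    for (Extracted& f : files) delete f.stream;
    files.clear();
}
```

In both cases the container lives in the caller, so the streams remain reachable after the copy operation returns.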

Afterword

This method is suitable when the resulting data is not particularly large, on the order of a hundred megabytes. With 8+ GB of memory that is not much of an expense, but remember that reallocating memory is not the fastest operation and also leads to memory fragmentation, so you may suddenly find that no sufficiently large contiguous block of memory is available.

As an alternative, you can use structured storage (which exposes the same IStream) or file streams created with SHCreateStreamOnFile / SHCreateStreamOnFileEx. This makes it possible to combine in-memory input/output with the same operations on files, since the IStream interface can be used in both cases without any additional work.

If you have any questions about the implementation, I am ready to answer them in the comments.

Source: https://habr.com/ru/post/314832/
