Still the best ZIP bomb

The article shows how to create a non - recursive zip-bomb , which provides a high degree of compression by overlapping files inside a zip-container. “Non-recursive” means that it does not depend on the recursive decompressing by decompressor of files embedded in zip-archives: there is only one round. The output size increases quadratically from the input, reaching a compression ratio of over 28 million (10 MB → 281 TB) within the zip format. An even larger expansion is possible with 64-bit extensions. The design uses only the most common compression algorithm DEFLATE and is compatible with most zip parsers.

zbsm.zip 42 kB → 5.5 GB
zblg.zip 10 MB → 281 TB
zbxl.zip 46 MB → 4.5 PB (Zip64, less compatible with parsers)

Source:

  git clone https://www.bamsoftware.com/git/zipbomb.git

zipbomb-20190702.zip

Data and sources of illustrations:

  git clone https://www.bamsoftware.com/git/zipbomb-paper.git

		non-recursive		recursive
	archive size	uncompressed size	ratio	uncompressed size	ratio
Quine Cox	440	440	1.0	∞	∞
Quine Ellingsen	28,809	42 569	1.5	∞	∞
42.zip	42 374 *	558 432	13.2	4 507 981 343 026 016	106 billion
this technique	42 374	5 461 307 620	129 thousand	5 461 307 620	129 thousand
this technique	9,893,525	281 395 456 244 934	28 million	281 395 456 244 934	28 million
this technique (zip64)	45 876 952	4 507 981 427 706 459	98 million	4 507 981 427 706 459	98 million

_{* There are two versions of 42.zip: the old 42,374 bytes, and the newer one at 42,838 bytes.} _{The difference is that a new password is required before unpacking.} _{We compare only with the old version.} _{Here is a copy of the file, if you need one: 42.zip .} _')_{** I would like to know and indicate the author of 42.zip, but could not find it - let me know if you have any information.}

Zip bombs must overcome the fact that the DEFLATE compression algorithm most often supported by parsers cannot exceed a compression ratio of 1032 to 1. For this reason, zip bombs usually rely on recursive decompression, putting zip files into zip files to get an additional factor 1032 with each layer. But the trick works only in implementations that unpack recursively, but most do not. The most famous 42.zip bomb expands to a formidable 4.5 PB, if all six layers are recursively unpacked, but on the top layer, it has a trifling 0.6 MB. Zip quains, like those of Cox and Ellingsen , give out a copy of themselves and, thus, expand infinitely with recursive unpacking. But they are also completely safe when unpacking once.

This article shows you how to create a non-recursive zip-bomb whose compression ratio exceeds the DEFLATE limit of 1032. It works by overlapping files inside a zip-container to refer to the “core” of highly compressed data in several files without making multiple copies. The output size of the zip-bomb grows quadratically from the input size; that is, the compression ratio improves as the size of the bomb increases. The design depends on the features of zip and DEFLATE: it is not transferred directly to other file formats or compression algorithms. The bomb is compatible with most zip parsers, except for “streaming”, which analyze files in one pass, without checking the central directory of the zip file. We try to balance two conflicting goals:

Increase the compression ratio. We define the compression ratio as the sum of the sizes of all files in the archive divided by the size of the zip file itself. It does not take into account file names or other file system metadata, but only the content.
Save compatibility. Zip is a complex format, and parsers are different, especially in border situations and with additional features. Do not use techniques that work only with certain parsers. We will note some ways to improve the efficiency of a zip-bomb with a certain loss of compatibility.

Zip file structure

A zip file consists of a central directory of file links.

The central directory is at the end of the zip file. This is a list of headings for the central catalog . Each central directory header contains single-file metadata, such as the file name and CRC-32 checksum, as well as a reverse pointer to the local file header. The central directory header has a length of 46 bytes plus the length of the file name.

A file consists of a local file header followed by compressed file data. The length of the local file header is 30 bytes plus the length of the file name. It contains a redundant copy of the metadata from the header of the central catalog, as well as the sizes of the compressed and uncompressed data files behind it. Zip is a container format, not a compression algorithm. The data of each file is compressed using the algorithm specified in the metadata, usually DEFLATE .

This description of the zip format omits many details that are not needed to understand the zip-bomb. For full details, see section 4.3 APPNOTE.TXT or the “Structure of the PKZip File” by Florian Buchholz, or see the source code .

Considerable redundancy and a lot of ambiguities in the zip format open up possibilities for the mischief of various kinds. Zip-bomb - this is only the tip of the iceberg. Links for further reading:

“Ten Thousand Security Traps: ZIP Format” , a lecture by Ginvael Coldwind
The ambiguous zip parsing allows you to hide Firefox add-on files from the linter and people : my vulnerability is on addons.mozilla.org

First discovery: overlapping files

By compressing a long string of duplicate bytes, we can create a highly compressed data core . By itself, the kernel compression ratio cannot exceed the DEFLATE limit of 1032, so we need a way to reuse the kernel in many files, without creating a separate copy of it in each file. We can do this by overlapping files: make sure that many of the headers of the central directory point to a single file whose data is the core.

Consider an example of how this design affects the degree of compression. Suppose a 1000 byte kernel is unpacked into 1 MB. Then the first megabyte of output is “worth” 1078 bytes of input data:

31 bytes for the local file header (including the 1-byte file name)
47 bytes for the central directory header (including the 1-byte file name)
1000 bytes per core

But every 1 MB of output after the first one is only 47 bytes - we don’t need another local file header or another copy of the kernel, just an additional header of the central directory. Thus, while the first link from the core has a compression ratio of 1,000,000 / 1078 ≈ 928, each additional link moves the factor closer to 1,000,000 / 47 ≈ 21,277, and the large core raises the ceiling.

The problem with this idea is the lack of compatibility. Since many central directory headers point to a single local file header, the metadata (in particular, the file name) cannot be the same for each file. Some parsers swear at this . For example, Info-ZIP UnZip (standard unzip program in Unix) extracts files, but with warnings:

  $ unzip overlap.zip
   inflating: A
 B: mismatching "local" filename (A),
          continuing with "central" filename version
   inflating: B
 ...

Python zipfile also throws an exception :

  $ python3 -m zipfile -e overlap.zip.
 Traceback (most recent call last):
 ...
 __main __. BadZipFile: File name in directory 'B' and header b'A 'differ.

Next, we look at how to change the design for consistency of file names, while retaining most of the advantages of overlapping files.

Second discovery: quoting local file headers

We need to separate the headers of the local files for each file, while reusing one core. Simply merging all the headers does not work, because the zip parser will find that local file header where it waits for the beginning of the DEFLATE stream. But the idea will work, with a few changes. We will use the DEFLATE function of uncompressed blocks to “quote” local file headers so that they appear to be part of the same DEFLATE stream that ends in the kernel. Each local file header (except the first one) will be interpreted in two ways: as a code (part of the zip file structure) and as data (part of the file's contents).

The DEFLATE stream is a sequence of blocks , where each block can be compressed or uncompressed. We usually think only of compressed blocks, for example, the core is one large compressed block. But there are uncompressed ones that start with a 5-byte header with a length field, which means simply: “output the next n bytes verbatim”. Unpacking an uncompressed block only means deleting a 5-byte header. Compressed and uncompressed blocks can be freely mixed in the DEFLATE stream. The output is a concatenation of the results of unpacking all the blocks in order. The term “uncompressed” has meaning only at the DEFLATE level; File data is still considered “compressed” at the zip level, regardless of which blocks are used.

The easiest way to present this construction as an internal overlap, from the last file to the first. We start by inserting the kernel, which will form the end of the data file for each file. Add the header of the local LFH _N file and the header of the central CDH _N directory that points to it. Set the “compressed size” metadata field in LFH _N and CDH _N to the compressed kernel size. Now add a 5-byte header of an uncompressed block (green on the diagram), the length field of which is equal to the size of the LFH _N. Add the second header of the local LFH file _{N −1} and the header of the central directory CDH _{N −1} , which points to it. Set the “compressed size” metadata field as the new header of the compressed kernel size plus the size of the uncompressed block header (5 bytes) plus the size of the LFH _N.

At the moment, the zip file contains two files with the names Y and Z. Let's go through what the parser sees when parsing. Suppose that the compressed size of the kernel is 1000 bytes, and the size of the LFH _N is 31 bytes. We start with CDH _{N −1} and follow the pointer to LFH _{N −1} . The name of the first file is Y, and the compressed size of its data file is 1036 bytes. Interpreting the next 1036 bytes as a DEFLATE stream, we first encounter a 5-byte header of an uncompressed block that says to copy the next 31 bytes. We write the next 31 bytes, which are LFH _N , which we unpack and add to the Y file. Moving further in the DEFLATE stream, we find the compressed block (core), which we unpack to the Y file. Now we have reached the end of the compressed data and finished with the Y file.

Moving on to the next file, we follow the pointer from CDH _N to LFH _N and find a file named Z, the compressed size of which is 1000 bytes. Interpreting these 1000 bytes as a DEFLATE stream, we immediately encounter a compressed block (core again) and unpack it into file Z. Now we have reached the end of the final file and finished. The output file Z contains the unpacked kernel; the output file Y is the same, but additionally with a prefix of 31 bytes LFH _N.

We complete the construction by repeating the citation procedure until the zip archive includes the necessary number of files. Each new file adds a central directory header, a local file header and an uncompressed block for quoting the next local file header itself. Compressed file data is usually a chain of uncompressed DEFLATE blocks (quoted local file headers), followed by a compressed kernel. Each byte in the core contributes about 1032 N to the output size, because each byte is part of all N files. Output files are also of different sizes: the earlier ones are larger than the later ones, because they cite local file headers more. The content of the output files does not make much sense, but no one said that it should make sense.

This citation design with interleaving has better compatibility than the full overlap design from the previous section, but compatibility is achieved due to the degree of compression. There, each file added only cost the central directory header, here it stands for the central directory header, the local file header, and another 5 bytes for the citation header.

Optimization

Having obtained the basic design of the zip-bomb, we will try to make it as efficient as possible. We want to answer two questions:

What is the maximum compression ratio for a given zip file size?
What is the maximum compression ratio, given the limitations of the zip format?

Kernel compression

It is beneficial for us to compress the kernel as much as possible, because each unpacked byte is multiplied by N. For this purpose, we use a custom DEFLATE compressor called bulk_deflate, specialized in compressing a string of duplicate bytes.

All decent DEFLATE archivers are approaching a compression ratio of 1032 on an endless stream of duplicate bytes, but we are concerned about a specific size. In our archive size, bulk_deflate holds more data than general-purpose archivers: about 26 KB more than zlib and Info-ZIP, and about 15 KB more than Zopfli , which sacrifices speed for the sake of compression quality.

The price of the high compression ratio bulk_deflate is the lack of versatility. It can compress only strings of duplicate bytes and only a certain length, namely 517 + 258 k for an integer k ≥ 0. In addition to good compression, bulk_deflate works quickly, performing work at almost the same time regardless of the size of the input data, not counting the job of actually writing a compressed string.

File names

For our purposes, file names are almost dead weight. Although they contribute to the output size, being part of the quoted local file headers, but the byte in the file name contributes much less than the byte in the kernel. We want the file names to be as short as possible, while being different, not forgetting compatibility.

Each byte spent on a file name means two bytes not spent on the kernel (two, because each file name appears twice: in the header of the central directory and in the header of the local file). The file name byte results in an average of only ( N + 1) / 4 bytes of output, while the byte in the kernel is counted as 1032 N. Examples: 1 , 2 , 3 .

The first compatibility consideration is encoding. The zip format specification states that file names should be interpreted as CP 437 or UTF-8 if a certain flag bit is set ( APPNOTE.TXT, Appendix D ). This is the main point of incompatibility between zip parsers, which can interpret file names in some fixed or locale-specific encoding. Therefore, for compatibility, it is better to limit to characters with the same encoding in both CP 437 and UTF-8. Namely, these are 95 printable US-ASCII characters.

We are also bound by file system naming restrictions. Some file systems are case-insensitive, so 'a' and 'A' are not considered different names. Common file systems, such as FAT32, prohibit certain characters , such as '*' and '?'.

As a safe, but not necessarily optimal compromise, our zip-bomb will use 36-character alphabet file names, which do not include special characters and symbols from different registers:

  0 1 2 3 4 5 6 7 8 9 ABCDEFGHIJKLMNOPQRSTU VWXYZ

File names are generated in an obvious way, all positions in turn, with the addition of a position at the end of the cycle:

  "0", "1", "2", ..., "Z",
 "00", "01", "02", ..., "0Z",
 ...,
 "Z0", "Z1", "Z2", ..., "ZZ",
 "000", "001", "002", ...

There are 36 single-character file names, 36² two-character names, and so on. Four bytes is enough for 1,727,604 different file names.

Given that the file names in the archive will usually have different lengths, how should they be better arranged: from the shortest to the longest, or vice versa? If you think a little, it is better to put the longest names last. This ordering adds more than 900 MB of output to zblg.zip , compared with the ordering from the longest to the shortest. However, this is a minor optimization, because 900 MB is only 0.0003% of the total issue size.

Core size

The overlap citation design allows you to place a compressed data core, and then cheaply copy it many times. For a specific size of the zip file X , how much space is optimally allocated for storing the kernel, and how much to create copies?

To find the optimal balance, you need to optimize only one variable N , the number of files in the zip archive. Each N value requires a certain amount of overhead for headings in the central catalog, local file headers, citation block headers and file names. The rest of the space will be occupied by the core. Since N must be an integer, and you can put only a certain number of files before the kernel size drops to zero, it is enough to check each possible value of N and choose what gives the greatest output.

The application of the optimization procedure to X = 42,374 for 42.zip finds a maximum at N = 250. These 250 files require 21,195 bytes of service data, leaving 21,179 bytes for the kernel. A kernel of this size is unpacked into 21,841,249 bytes (a ratio of 1031.3 to 1). 250 copies of the unpacked kernel plus slightly quoted local file headers give a total unpacked output of 5,461,307,620 bytes and a compression ratio of 129,000.

zbsm.zip 42 kB → 5.5 GB

  zipbomb --mode = quoted_overlap --num-files = 250 --compressed-size = 21179> zbsm.zip

The optimization resulted in an almost uniform distribution of space between the core and the file headers. This is no coincidence. Consider a simplified model of the construction of quotation with overlap. In the simplified model, we ignore the file names, as well as a slight increase in the size of the output file due to quoting local file headers. An analysis of the simplified model will show that the optimal distribution between the core and file headers is approximately even, and the output size with optimal distribution grows quadratically depending on the size of the input.

Defining some constants and variables:

X		zip file size (considered fixed)
N		number of files in the zip archive (variable for optimization)
CDH	= 46	size of the central directory header (without file name)
Lfh	= 30	local file header size (without file name)
Q	= 5	uncompressed header size of a DEFLATE block
C	≈ 1032	kernel compression ratio

Let H (N) be the amount of overhead for headers for N files. See diagram to understand the essence of the formula.

H (N) = N \cdot (C D H + L F H) + (N - 1) \cdot Q

$H (N) = N ⋅ (CDH + LFH) + (N - 1) ⋅ Q$

For the core, there is an X - H (N) space. The total unpacked size S _X (N) is equal to the size of N copies of the kernel, unpacked with the C ratio (in this simplified model we ignore a slight additional extension from the mentioned local file headers).

$$ display $$ S_X (N) = (X - H (N)) CN \\ = (X - (N ⋅ (CDH + LFH) + (N - 1) Q)) CN \\ = - (CDH + LFH + Q) CN ^ 2 + (X + Q) CN $$ display $$S _X (N) is a polynomial in part N, therefore the maximum should be where the derivative of S ′ _X (N) is zero. Taking the derivative and finding zero gives us N _OPT , the optimal number of files.$$ display $$ S′X (N_ {OPT}) = −2 (CDH + LFH + Q) C N_ {OPT} + (X + Q) C \\ 0 = −2 (CDH + LFH + Q) C N_ {OPT} + (X + Q) C \\ N_ {OPT} = (X + Q) / (CDH + LFH + Q) / 2 $$ display $$H (N _OPT ) gives the optimal amount of space to accommodate file headers. It is independent of CDH, LFH and C and is close to X / 2 .$$ display $$ H (N_ {OPT}) = N_ {OPT} (CDH + LFH) + (N_ {OPT} - 1) ⋅ Q \\ = (X - Q) / 2 $$ display $$S _X (N _OPT ) - total unpacked size with optimal distribution. From this we see that the size of the output grows quadratically with increasing input data.

S_{X} (N_{O P T}) = (X + Q)^{2} C / (C D H + L F H + Q) / 4

$S_X (N_ {OPT}) = (X + Q) ^ 2C / (CDH + LFH + Q) / 4$

Increasing the size of the zip file, we will eventually encounter the limits of the zip format: there can be no more than 2 ¹⁶ −1 files in the archive no larger than 2 ³² −1 bytes each. Worse, some implementations take maximum values as an indicator of the presence of 64-bit extensions , so our limits are in fact 2 ¹⁶ −2 and 2 ³² −2. It happens that the first we are faced with a limit on the size of an uncompressed file. With a zip file size of 8,319,377 bytes, a naive optimization will give us a number of files of 47,837 and a maximum file of 2,332,111 bytes.

(Actually, everything is a bit more complicated, because the exact limits depend on the implementation. Python's zipfile ignores the number of files, archive / zip in Go allows you to increase the number of files until they have the lower 16 bits. But for general compatibility, we must adhere to established limits).

If we cannot infinitely increase N or the size of the core, we would like to find the maximum degree of compression within the zip format. You need to make the kernel as large as possible with the maximum number of files. Despite the fact that we can no longer maintain a roughly even separation between the kernel and the file headers, each added file still increases the compression ratio — just not as fast as if we could continue to grow the kernel. In fact, as we add files, we will need to reduce the size of the kernel in order to make room for a file of maximum size, which grows a little with each file added.

The plan leads to a zip archive with 2 ¹⁶ −2 files and a kernel, which is unpacked into 2 ³² −2 178 825 bytes. Files will be larger towards the beginning of the zip archive — the first and largest file is unpacked into 2 ³² −56 bytes. This is as close as we can get using coarse output parameters of bulk_deflate - encoding the final 54 bytes will cost more than their size (the zip file generally has a compression ratio of 28 million, and the final 54 bytes will receive a maximum of 54 10 ³² ( 2 ¹⁶ - 2) ≈ 36? 5 million bytes, so it only helps if 54 bytes can be encoded into one byte - and I could not encode less than two. Therefore, if you cannot encode 54 bytes into 1 byte, it only reduces the degree of compression). The size of this output zip-bomb: 281 395 456 244 934 bytes, 99.97% of the theoretical maximum (2 ³² - 1) × (2 ¹⁶ - 1). Any significant improvement in the compression ratio can only be achieved by reducing the size of the input signal, and not increasing the output signal.

zblg.zip 10 MB → 281 TB

  zipbomb --mode = quoted_overlap --num-files = 65534 --max-uncompressed-size = 4292788525> zblg.zip

Efficient CRC-32 Calculation

Among the metadata in the header of the central directory and the header of the local file is a checksum of uncompressed file data - CRC-32 . This creates a problem because the amount of CRC-32 computing for each file is proportional to its size, which is by default very large (after all, it's a zip-bomb). We would prefer to do work that is at least proportional to the size of the archive. Two factors work in our favor: all files have a common core, and an uncompressed kernel is a string of duplicate bytes. Imagine CRC-32 as a matrix product - this will allow not only to quickly calculate the kernel checksum, but also to reuse calculations between files. The method described in this section is a small extension of the crc32_combine function in zlib, which Mark Adler explains here .

CRC-32 can be modeled as a state machine, updating a 32-bit status register for each input bit. The basic update operations for bits 0 and 1 are:

 uint32 crc32_update_0(uint32 state) { // Shift out the least significant bit. bit b = state & 1; state = state >> 1; // If the shifted-out bit was 1, XOR with the CRC-32 constant. if (b == 1) state = state ^ 0xedb88320; return state; } uint32 crc32_update_1(uint32 state) { // Do as for a 0 bit, then XOR with the CRC-32 constant. return crc32_update_0(state) ^ 0xedb88320; }

If you represent the state register as a 32-element binary vector and use XOR for addition and multiplication, then crc32_update_0 is a linear map ; that is, it can be represented as multiplication by a 32 × 32 binary transition matrix . To understand why, note that multiplying a matrix by a vector is simply the sum of the columns of the matrix after multiplying each column by the corresponding element of the vector. The shift operation state >> 1 simply takes each bit i of the state vector and multiplies it by a vector that is zero everywhere except for bit i - 1 (the numbering of the bits is from right to left). Relatively speaking, the final XOR state ^ 0xedb88320 occurs only when the bit b is equal to one. This can be represented as the first multiplication of b by 0xedb88320, and then XORing into a given state.

In addition, crc32_update_1 is just a crc32_update_0 plus (XOR) constant.This makes crc32_update_1 an affine transformation : multiplication of a matrix followed by a mapping (i.e., vector addition). We can imagine matrix multiplication and mapping in one step, if we increase the size of the transformation matrix to 33 × 33 and add an additional element to the state vector, which is always equal to 1 (this representation is called homogeneous coordinates ).

The transformation matrices are 33 × 33 M ₀ and M ₁ , which compute the change in state of the CRC-32 performed by bits 0 and 1, respectively. Column vectors are stored with the most significant bit at the bottom of: reading the first column from the bottom up, you see a constant polynomial CRC-32 edb8832016 = 111 0 11 0 110 111 000 1 00000 11 00 1 00000 ₂ . These two matrices differ only in the final column, which represents the translation vector in homogeneous coordinates. In M _{0, the} translation is zero, and in M ₁ , the edb88320 ₁₆ is the polynomial constant CRC-32. Units are immediately above the diagonal status of the operationstate >> 1

Both operationscrc32_update_0andcrc32_update_1can be represented by the transition matrix of 33 × 33. Matrices M₀ and M₁ shown. The advantage of the matrix representation is that matrices can be multiplied. Suppose we want to see a state change made by processing an ASCII character 'a', whose binary representation is 01100001 ₂ . We can represent the cumulative state change of the CRC-32 of these eight bits in one transformation matrix:

M_{a} = M_{0} M_{1} M_{1} M_{0} M_{0} M_{0} M_{0} M_{1}

$M_a = M_0 M_1 M_1 M_0 M_0 M_0 M_0 M_1$

And we can imagine changing the state of the row of repeating 'a' by multiplying many copies of M _a - raising the matrix to a power. We can quickly do this using the fast exponentiation algorithm , which allows us to calculate M ^{n in} just a log ₂ ⁿ steps. For example, here is a matrix for changing the state of a string of 9 characters 'a':

(M_{a})^{9} = M_{a} M_{a} M_{a} M_{a} M_{a} M_{a} M_{a} M_{a} M_{a} = (M_{a} M_{a} M_{a} M_{a})^{2} M_{a} = ((M_{a} M_{a})^{2})^{2} M_{a} = (((M_{a})^{2})^{2})^{2} M_{a}

$(M_a)^9 = M_a M_a M_a M_a M_a M_a M_a M_a M_a\\ = (M_a M_a M_a M_a)^2 M_a\\ = ((M_a M_a)^2)^2 M_a\\ = (((M_a)^2)^2)^2 M_a$

The fast matrix multiplication algorithm is useful for calculating the M _kernel , the matrix for an uncompressed kernel, since the kernel is a string of duplicate bytes. To get the CRC-32 checksum from the matrix, multiply the matrix by the zero vector (the zero vector is in homogeneous coordinates, that is, 32 zeros and then one; here we omit the minor complication of pre- and post-processing the checksum to check for consistency). To calculate the checksum for each file, we work in the opposite direction. We start with the initialization M: = M _kernel . The kernel checksum is also the checksum of the final file N , so multiply Ma zero vector and stores the received checksum in the CDH _N and LFH _N . The data of file N − 1 is the same as the file data of file N , but with the prefix LFH _N added . Therefore, we calculate

, state change matrix for LFH_Nand update

M_{L F H_{N}}

$M_{L{FH_N}}$

M := M M_{L F H_{N}}

$M := M M_{L{FH_N}}$ .Now M represents a cumulative change of state from LFH _N processing by core. We calculate the checksum for the file N − 1 by multiplying M by the zero vector again . We continue the procedure, accumulating state change matrices in M , until all files are processed.

Extension: Zip64

Previously, we ran into an extension problem due to the limitations of the zip format - it was impossible to issue more than 281 TB, no matter how cleverly the zip file was packed. You can exceed these limits using Zip64, a zip file extension that increases the size of some header fields to 64 bits. Zip64 support is by no means universal, but it is one of the most commonly implemented extensions. As for the compression ratio, the effect of Zip64 is to increase the size of the central directory header from 46 to 58 bytes, and the size of the local directory header from 30 to 50 bytes. Looking at the formula for the optimal expansion coefficient in the simplified model, we see that the zip-bomb in the Zip64 format still grows quadratically, but slower because of the larger denominator - this is seen in the diagram below. Due to the loss of compatibility and slower growth, almost any restrictions on file size are removed.

Suppose we need a zip-bomb that expands to 4.5 PB, like 42.zip. How big should the archive be? Using a binary search, we find that the minimum size of such a file is 46 MB.

zbxl.zip 46 MB → 4,5 PB (Zip64, less compatible)

 zipbomb --mode = quoted_overlap --num-files = 190023 --compressed-size = 22982788 --zip64> zbxl.zip

4.5 petabyte - approximately as much data was recorded by the Telescope of the event horizon for the first image of a black hole : racks and racks with hard drives in the data center.

With Zip64, it’s almost uninteresting to consider the maximum compression ratio, because we can simply continue to increase the size of the zip file and the compression ratio with it, until even a compressed zip file becomes prohibitively large. An interesting threshold, however, is 2 ⁶⁴ bytes (18 EB or 16 EiB) - so much data will not fit in most file systems. Binary search finds the smallest zip bomb that produces at least as much output: it contains 12 million files and a compressed 1.5 GB kernel. The total size of the zip file is 2.9 GB, and it is unpacked in 2 ⁶⁴+11,727,895,877 bytes with a compression ratio of over 6.2 billion to one. I did not upload this file for download, but you can generate it yourself using the source code . It has files of such size that even a bug in Info-ZIP UnZip 6.0 was revealed.

 zipbomb --mode = quoted_overlap --num-files = 12056313 --compressed-size = 1482284040 --zip64> zbxxl.zip

Extension: bzip2

DEFLATE is the most common compression algorithm for the zip format, but this is only one of many options. Probably the second most common algorithm is bzip2 . Although it is not as compatible as DEFLATE. Theoretically, in bzip2, the maximum compression ratio is about 1.4 million to one, which allows for a closer packing of the core.

bzip2 first of all encodes a “run step” (run-length encoding), reducing the length of a string of duplicate bytes by 51 times. Then the data is divided into blocks of 900 KB and each block is compressed separately. Theoretically, one block can shrink to 32 bytes. 900,000 × 51/32 = 1 434 375.

Ignoring the loss of compatibility, does bzip2 allow a more efficient bomb to be made?

Yes - but only for small files. The problem is that in bzip2 there is nothing like the uncompressed DEFLATE blocks that we used to cite local file headers. Thus, it is impossible to overlap files and reuse the kernel - for each file you need to write your own copy, so the overall compression ratio will be no better than the coefficient for any single file. On the graph below, we see that without overlapping bzip2 exceeds DEFLATE only for files of about a megabyte.

There is only hope for an alternative means of citing headlines in bzip2, which is discussed in the next section. Also, if you know that a particular zip parser supports bzip2 andadmits a file name mismatch, you can use a full overlap construction that does not need quoting.

Comparison of compression ratio for different zip-bombs. Pay attention to the logarithmic axis. Each design is shown with and without Zip64. In structures without overlap, the linear growth rate is evident from the constant ratio of the axes. The vertical displacement of the bzip2 graphic means that the compression ratio of bzip2 is about a thousand times greater than that of DEFLATE. The DEFLATE citation constructs have a quadratic growth rate, as evidenced by a slope to the 2: 1 axes. The Zip64 option is slightly less efficient, but allows you to issue more than 281 TB. Graphs for bzip2 with quotations through an additional field are transferred from quadratic to linear when either the maximum file size is reached (2 ³² −2 bytes), or the maximum allowed number of files

Expansion: quoting through an additional field

So far, we have used the DEFLATE function to quote local file headers, and we just saw that this trick does not work in bzip2. However, there is an alternative way of quoting, somewhat more limited, which uses only functions of the zip format and is independent of the compression algorithm.

At the end of the local file header structure there is an additional variable-length field for storing information that does not fit into regular header fields ( APPNOTE.TXT, section 4.3.7). Additional information may include, for example, a time stamp or a uid / gid from Unix. Zip64 information is also stored in an additional field. The additional field is represented as a length-value structure; if you increase the length without adding a value, then the additional field will include what is behind it in the zip file, namely the next header of the local file. Using this method, each local file header can “quote” the following headers, wrapping them in its own additional field. Compared to DEFLATE there are three advantages:

Quoting through an extra field requires only 4 bytes, not 5, leaving more room for the kernel.
It does not increase the size of the files, which means a larger kernel size, given the limitations of the zip format.
It provides bzip2 citations.

Despite these advantages, quoting through additional fields is less flexible. This is not a chain, as in DEFLATE: each header of a local file must contain not only the next header itself, but all subsequent headers. Additional fields increase as you get closer to the beginning of the zip file. Since the maximum field length is 2 ¹⁶ −1 bytes, you can only quote up to 1808 local file headers (or 1170 in Zip64), assuming that the names are assigned as expected(in the case of DEFLATE, you can use the additional field to cite the first (shortest) headers of local files, and then switch to the citing DEFLATE for the rest). Another problem: in order to conform to the internal data structure of the additional field, you must select the 16-bit tag for the type ( APPNOTE.TXT, section 4.5.2 ) preceding the citation data. We want to select a type tag that will cause parsers to ignore data in quotes, rather than trying to interpret them as meaningful metadata. Zip parsers should ignore tags of an unknown type, so we can choose tags randomly, but there is a risk that in the future some tag will break the compatibility of the structure.

The previous diagram illustrates the possibility of using additional fields in bzip2, c and without Zip64. On both graphs there is a break point when growth goes from quadratic to linear. Without Zip64, this happens where the maximum uncompressed file size is reached (2 ³²−2 bytes); then you can increase only the number of files, but not their size. The graph completely ends when the number of files reaches 1809, then we have a place in the additional field for citing additional headers. With Zip64, a fracture occurs on 1171 files, after which only the size of the files can be increased, but not their number. The additional field helps in the case of DEFLATE, but the difference is so small that it is not visually noticeable. It increases the compression ratio of zbsm.zip by 1.2%; zblg.zip by 0.019%; and zbxl.zip by 0.0025%.

Discussion

In their work on this topic, Pletz and colleagues use an overlap of files to create an almost self-replicating zip archive. The file overlap itself was suggested earlier (slide 47) by Ginvael Coldwind.

We have developed a zip-bomb design with overlapping citations with regard to compatibility - a number of differences in implementations, some of which are shown in the table below. The resulting construction is compatible with zip-parsers that work in the usual way, that is, first checking the central directory and using it as an index of files. Among them, a unique zip-parser from Nailwhich is automatically generated from formal grammar. However, the construction is incompatible with “streaming” parsers that analyze the zip file from beginning to end in one run without first reading the central catalog. By their nature, streaming parsers do not allow file overlap in any way. Most likely, they will extract only the first file. In addition, they may even give an error, as in the case of sunzip , which analyzes the central directory at the end and checks it for consistency with the headers of the local files that it has already seen.

If you want the extracted files to start with a specific prefix, different from the bytes of the local file header, you can insert a DEFLATE block in front of the uncompressed block that quotes the next header. Not every file in a zip archive should be involved in creating a bomb: you can include regular files in the archive, if necessary, to fit some special format (in the source code there is a parameter --templatefor this use case). Many formats use zip as a container, for example, Java JAR documents, Android APK and LibreOffice.

PDFMuch like a zip. It has a cross-reference table at the end of the file that points to previous objects, and it supports compressing objects through the FlateDecode filter. I haven't tried it, but you can use the idea of overlap citation to make a PDF bomb. Perhaps there is not even much work to do: binaryhax0r writes in a blog that you can simply specify several FlateDecode layers on one object, after which creating a PDF bomb becomes trivial.

Determining the specific class of the zip-bomb described in the article is easy: just find overlapping files. Mark Adler wrote a patchfor Info-ZIP Unzip, which does exactly that. However, in general, blocking overlapping files does not protect against all classes of zip bombs. It is difficult to predict in advance whether the file is a zip-bomb or not, if you do not have accurate knowledge of the internal components of the parsers that will be used to analyze it. Peering into the headers and summation of the “uncompressed size” fields of all files does not work , because the value in the headers may not coincide with the actual uncompressed size (in the compatibility table, see the line “allows too small file size”). Reliable protection against zip-bombs includes limits on time, memory and disk space in a zip-parser during its operation. Take parsing of zip-files, as well as any complex operations with unreliable data, with caution.

	Info-ZIP UnZip 6.0	Python 3.7 zipfile	Go 1.12 archive/zip	yauzl 2.10.0 (Node.js)	Nail examples/zip	Android 9.0.0 r1 libziparchive	sunzip 0.4 (streaming)
DEFLATE	✓	✓	✓	✓	✓	✓	✓
Zip64	✓	✓	✓	✓	-	-	✓
bzip2	✓	✓	-	-	-	-	✓
		-	✓	✓	✓	-	✓
CRC-32		-		✓	-	✓	-
	✓	-	-	-	-	-	-
2 ³² −1	✓	✓	✓	-	✓	✓	✓
2 ¹⁶ −1	✓	✓	✓	-	✓	✓	✓
overlap.zip		-	✓	✓	✓	-	-
zbsm.zip zblg.zip	✓	✓	✓	✓	✓	✓	-
zbxl.zip	✓	✓	✓	✓	-	-	-

zip-, , zip-. DEFLATE Zip64, , CRC 32- 16- .

Acknowledgments

Thanks to Mark Adler , Russ Cox , Brandon Enright , Marek Maykovsky , Josh Wolfe, and reviewers of the USENIX WOOT 2019 for comments on the draft of this article. Kaolan McNamara evaluated the impact of zip-bombs on the safety of LibreOffice.

A version of this article has been prepared for the USENIX WOOT 2019 seminar . Source code is available. Artifacts for presentation at the seminar are in the file zipbomb-woot19.zip .

Did you find a system that falls into one of the bombs? Did the bombs help you find a bug or make money in a bug search program? Let me know, I will try to mention it here.

LibreOffice 6.1.5.2

After renaming zblg.zip to zblg.odt or zblg.docx, the LibreOffice program creates and deletes a series of temporary files of about 4 GB each, trying to determine the file format. Ultimately, he finishes doing this and deletes the temporary files as they arrive, so the zip-bomb only causes a temporary DoS, without filling the disk. Kaolan McNamara responded to my error message.

Mozilla addons-server 2019.06.06

I tried zip-bombs against the local addons-server installation, which is part of the addons.mozilla.org software. The system processes the bomb gracefully, imposing a time limit of 110 seconds on extracting files. The zip-bomb expands as fast as the disk allows up to this time limit, but then the process is killed and the unpacked files are eventually automatically cleared.

UnZip 6.0

Mark Adler wrote a patch for UnZip to discover this class of zip bombs.

July 5, 2019: I noticed that vulnerability CVE-2019-13232 was assigned to UnZip. Personally, I would argue that the ability / inability of UnZip (or any zip parser) to handle this kind of zip-bomb necessarily represents a vulnerability or even a bug. This is a natural implementation that does not violate the specification, what can I say. The type of bomb in this article is just one type, and there are many other ways that you can puzzle the zip parser. As mentioned above, if you want to protect against resource depletion attacks, you should not attempt to list, detect, and block every single known attack; rather, it is necessary to establish external restrictions on time and other resources so that the parser does not get into such a situation, no matter what attack it encounters. There is nothing wrong with trying to detect and reject certain constructions as a first pass optimization,but you can not stop there. If you ultimately do not isolate and restrict unreliable data operations, your system is probably still vulnerable. Consider the analogy with HTML cross-site scripting: the right defense is not to try to filter out specific bytes that can be interpreted as code, but to avoid everything correctly.

Antivirus engines

Twitter user @TVqQAAMAAAAEAAA reports : "McAfee Antivirus on my test machine just exploded." I did not check it myself and I do not have such details as the version number.

Tavis Ormandi points out that in VirusTotal for zblg.zip there is a series of timeouts ( screenshot dated June 6, 2019 ): AhnLab-V3, ClamAV, DrWeb, Endgame, F-Secure, GData, K7AntiVirus, K7GW, MaxSecure, McAfee, McAfee-GVtiVirus, K7GW, MaxSecure, McAfee, McAfee-GVtiVirus, K7GW, MaxSecure, McAfee, McAfee-GVtiVirus, K7GW, MaxSecure, McAfee, McAfee-GVtiVirus, K7GW, MaxSecure, McAfee, McAfee-Mi-Vee-Virus -Edition, Panda, Qihoo-360, Sophos ML, VBA32. Results for zbsm.zip ( screenshot dated June 6, 2019) similar, but with a different set of engines that fell out in a timeout: Baido, Bkav, ClamAV, CMC, DrWeb, Endgame, ESET-NOD32, F-Secure, GData, Kingsoft, McAfee-GW-Edition, NANO-Antivirus, Acronis. Interestingly, there are no timeouts in the results of zbxl.zip ( screenshot dated June 6, 2019 ). Perhaps some antiviruses do not support Zip64? Several engines detect files as a kind of “compression bomb”. It is interesting to see whether they will still do this with small changes, such as changing file names or adding an ASCII prefix to each file.

Closing statement

It's time to put an end to Facebook. This is not a neutral job for you: every day when you come to work, you do something wrong. If you have a Facebook account, delete it. If you work on Facebook, be content.

And do not forget that the National Security Agency must be destroyed.

Source: https://habr.com/ru/post/459254/

All Articles