High-performance DEFLATE compression with optimized genomic data sets

igzip is a high-performance library for performing gzip or DEFLATE compression. It was originally described in the DEFLATE High Performance Compression article for Intel architecture processors . This article describes a related release of source code that contains optional (at build time) optimizations to increase the degree of compression of genomic data sets in the BAM and SAM formats. igzip works about 4 times faster than Zlib when configured for maximum speed, and with about the same degree of compression for genomic data. We believe that igzip can be similarly optimized for other applications where data sets are different from regular text data.

Overview

You can build two main versions of igzip: igzip0c and igzip1c. The first one works faster, and the second one - a bit slower, but with a higher compression ratio. Both of these versions are much faster than standard gzip -1 or Zlib -1, but differ in a slightly lower degree of compression. These implementations use the SSE 4.2 instruction set. They are intended for Intel processors that support this instruction set.

In the article above, we described the compression ratio and igzip performance when working on general-purpose data. After that, we expanded the implementation to efficiently process the genomic data. In the computational process used in determining the genomic sequence, a serious limitation of performance is data compression. There are scenarios for use with long-term storage, when a very high degree of compression is required, especially if we take into account the considerable size of the genomic data sets. However, at various stages of genomic sequence determination, there are operations that require very fast compression with a moderate degree of compression. In this study, we focused on the two main data formats used to determine the genomic sequence; These are the SAM and BAM formats.
')
An analysis of these formats showed that the content of the data is very specific, is found only in genomic data and differs significantly from ordinary data (for example, from the data sets in the University of Calgary collection). We identified the following basic properties of genomic data: a very small length of matches when processing LZ77 (~ 90% matches less than 8 bytes long), very small offsets (~ 90% matches shifts less than 1024 bytes) and a very diverse distribution of literals (for example, SAM is a significant number of A, T, G, C characters from processed sequence lines). We changed the Huffman tables in igzip to take into account these features, and achieved approximately the same degree of compression as Zlib-1, but 4 times faster.

Programming interface

API:

struct LZ_Stream2 { UINT8 *next_in; // Next input byte UINT32 avail_in; // number of bytes available at next_in UINT32 total_in; // total number of bytes read so far UINT8 *next_out; // Next output byte UINT32 avail_out; // number of bytes available at next_out UINT32 total_out; // total number of bytes written so far UINT32 end_of_stream; // non-zero if this is the last input buffer LZ_State2 internal_state; }; void init_stream(LZ_Stream2 *stream); void fast_lz(LZ_Stream2 *stream);

The basic principle is that next_in points to the input buffer, and avail_in indicates the length of this buffer. Similarly, next_out points to an empty output buffer, avail_out indicates the size of this buffer.

The total_in and total_out fields start at 0 and are updated with the fast_lz () function. They correspond to the total number of bytes read or written at the current time in case such information is needed by the calling application.

The init_stream () function statically initializes the data structure of the stream, and in particular the internal state.
The fast_lz () call will take data from the input buffer (by updating next_in and avail_in ) and write the compressed stream to the output buffer (by updating next_out and avail_out ). The function returns when either avail_in or avail_out returns zero (that is, when the input data has run out or when the output buffer is full).

After the last input buffer has been transferred, the end_of_stream flag should be set. In this case, the procedure will process the bitstream to the end of the input buffer, if there is enough space in the output buffer.
Simple example

 LZ_Stream2 stream; UINT8 inbuf[LEN_IN], outbuf[LEN_OUT]; init_stream(&stream); stream.end_of_stream = 0; do { stream.avail_in = (UINT32) fread(inbuf, 1, LEN_IN, in); stream.end_of_stream = feof(in); stream.next_in = inbuf; do { stream.avail_out = LEN_OUT; stream.next_out = outbuf; fast_lz(&stream); fwrite(outbuf, 1, LEN_OUT - stream.avail_out, out); } while (stream.avail_out == 0); assert(stream.avail_in == 0); } while (! stream.end_of_stream);

The Zlib FLUSH_SYNC operation is not currently supported.

Source Code Versions

The options in the options.inc file determine which version will be built. This determines the number of characters in the syntax of YASM. The Perl script converts this file to options.h for use in code C. You should not manually edit the options.h file. (If Perl is not available, you can manually edit options.h, but then make sure that the options in the options.h and options.inc files match.)

As mentioned above, the two main versions are 0c and 1c. The version is selected by editing the options.inc file, which can be one of the following lines.

 %define MAJOR_VERSION IGZIP0C

 %define MAJOR_VERSION IGZIP1C

(In this release, IGZIP0 and IGZIP1 versions are not supported.)

In any of these versions, you can select one of three Huffman tables. The default table is intended for use with most files. There are two other tables optimized for use with genomic applications, in particular for SAM and BAM files. To select one of these tables, uncomment the relevant rows (removing the starting semicolon).

 ;%define GENOME_SAM ;%define GENOME_BAM

If both lines are commented out (as shown above), default tables are used. Do not uncomment both of these lines at the same time.

By default, igzip creates a GZIP file (RFC 1952). If string

 ;%define ONLY_DEFLATE

uncommented, then the output of the DEFLATE format will be generated, that is, without the GZIP headers.

Figure 1. Supported (tested) versions in this release

Code can be compiled under Linux or Windows. There is a makefile for Linux. The code combines C and assembler. To assemble the assembler code, use the YASM assembler. The path to the YASM in the Makefile can be changed according to the installation location of the YASM in the user environment.
You can collect the code (under Linux) as a static library, for example make lib0c or make lib1c, or as a shared object, for example, make slib0c or make slib1c.

results

This release of igzip supports not only two main versions (igzip0c and igzip1c), but also additional optimization for genomic data sets, in particular for BAM and SAM files. These files are quite different from regular files, and the prevalence of statistics for each type of these files means that you can optimize Huffman tables for these statistics and achieve a higher degree of compression.

To illustrate, the three test files were compressed with different configurations. One of the test files was a progl from the University of Calgary collection (called the Calgary Corps). It corresponds to a regular file. The other two files are arbitrary examples of genomic files of the BAM and SAM formats. In each test, the degree of compression was calculated as a quotient of dividing the size of a compressed file by the size of an uncompressed file (the smaller, the better). For brevity, only the compression ratio for the “preferred” version of igzip is shown (that is, when you run igzip0c-bam for a BAM file). As expected, the preferred version provides a higher compression ratio than other versions (for example, when running igzip0c for a SAM file instead of igzip0c-sam). In addition, for simplicity, we show igzip performance only for preferred data types. Performance is not too different due to the different Huffman tables in igzip, it depends more on the input data type.

Note. Tests and performance evaluations are carried out on specific computer systems and components and correspond to the approximate performance of Intel products in these tests. Any differences in system hardware and in software or configuration may affect actual performance. When choosing products to purchase, refer to other information and performance tests. For more information about performance tests and performance of Intel products, see here .

The performance results in this section were obtained when measured on an Intel Core i7 4770 (Haswell) processor. The tests were conducted with Intel Turbo Boost Technology turned off and represent performance without using Intel Hyperthreading (Intel HT) technology on a single core. We provide results for in-memory data, excluding file I / O.

We compiled Zlib implementations using the GCC compiler with aggressive optimization parameters. As a reference, we used Zlib version 1.2.8 for comparison.

Timing metrics were measured using the rdtsc () function, which returns a processor time stamp (TSC) counter. TSC is the number of clock cycles since the last reset. TSC_initial - the TSC value recorded before the function was called. After the function is completed, rdtsc () is called again to write the new TSC_final cycle counter. The actual number of cycles for the called procedure is calculated using the expression:
number of cycles = (TSC_final - TSC_initial) .

For each file, a significant number of such measurements were made, followed by averaging their results.

Figure 2. The difference in speed and compression ratio igzip0c * compared to gzip / Zlib -1

Table 1. Compression performance (cycles per byte) and compression ratio

The table shows the compression ratio and performance for the “preferred” version of igzip for the corresponding file type. It is seen that, in particular, for test files BAM and SAM, the compression ratio of igzip is very close with zlib -1. For SAM files, the compression rate and speed turned out to be the same as for some regular files from the Calgary University dataset, whereas for the BAM format, both the compression rate and speed are lower than for SAM (this applies to both igzip and Zlib). In the BAM format, a significant part of these fields is already packaged in binary format, which is why compression using LZ77 algorithms is insignificant. The results show that the igzip speed is about 4 times faster than Zlib (when set to the maximum speed in both cases).

Conclusion

igzip is a library for high-speed compression using the DEFLATE / gzip algorithm. Its various configurations are possible with a different ratio of compression ratio and speed. In addition, there are additional optimization options for BAM and SAM genomic files. At optimization, the compression ratio is close to gzip -1, but the speed is much higher. After examining the SAM and BAM formats, we believe that it is possible to analyze various data sets from other applications, which can differ quite a lot from standard text data, and develop customized igzip implementations with similar results.

For more information on compiler optimization, see our optimization notice .

Download igzip source code.

Source: https://habr.com/ru/post/257273/

All Articles