
Using GPGPU for data compression (Part I)

Hello, dear Habra community.



Many have probably already heard about GPGPU computing (general-purpose computation on video cards); there are currently many implementations of this programming technique. We will focus on two of them: the well-known CUDA from Nvidia and the slightly less popular, but also well-known, OpenCL framework. There are already plenty of articles on Habr describing the basic principles of these technologies, so we will not dwell on that. In this article, I want to share the results obtained using GPGPU versus the CPU for data compression.



Background



I have always been interested in parallel computing, so at university I chose this direction for my research work. When it came time to pick a thesis topic, after reading various articles on GPGPU and the performance gains it offers, I decided my thesis would be related specifically to GPGPU. I decided to try it on data compression (one could argue this is not the best task for GPGPU); the static Huffman algorithm was chosen as the compression algorithm.



Standard Huffman Algorithm for Data Compression



For those not familiar with the algorithm, here is a small example of how it works.

Consider the algorithm using the following sequence of characters as an example:

aaaaaaaccaabccbbbb

Step 1:

Read the file completely and count how many times each character from the extended ASCII set occurs.

For our example, the character frequencies are:

a - 9, b - 5, c - 4.
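Step 1 can be sketched in a few lines. This is a minimal illustration in Python (the article's actual implementation is in C# and counts over the full 256-entry extended ASCII set):

```python
# Step 1 sketch: count how often each byte value occurs in the input.
# A minimal illustration; the real version reads a whole file and keeps
# a 256-entry frequency table for the extended ASCII alphabet.
from collections import Counter

data = b"aaaaaaaccaabccbbbb"
freq = Counter(data)  # maps byte value -> number of occurrences

print({chr(k): v for k, v in sorted(freq.items())})  # {'a': 9, 'b': 5, 'c': 4}
```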

Step 2:

After counting the frequency of each character, a binary tree must be built: repeatedly take the two nodes with the lowest frequencies and merge them into a new node whose frequency is their sum, until a single node, the root, remains.



The result is a binary tree whose root is the last node created.
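The merging procedure above can be sketched as follows. This is a textbook priority-queue construction in Python, not the article's C# code; which sibling ends up on the left or right of a merged node depends on tie-breaking, but the code lengths come out the same:

```python
# Step 2 sketch: build the Huffman tree by repeatedly merging the two
# lowest-frequency nodes until a single root remains.
import heapq
import itertools

def build_tree(freq):
    tie = itertools.count()  # tie-breaker so heap entries stay comparable
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    return heap[0][2]  # leaves are strings, inner nodes are (left, right) pairs

def code_lengths(node, depth=0):
    # The depth of each leaf equals the length of its Huffman code.
    if isinstance(node, str):
        return {node: max(depth, 1)}
    left, right = node
    return {**code_lengths(left, depth + 1), **code_lengths(right, depth + 1)}

tree = build_tree({"a": 9, "b": 5, "c": 4})
print(dict(sorted(code_lengths(tree).items())))  # {'a': 1, 'b': 2, 'c': 2}
```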

Step 3:

Using the resulting tree, we can encode the file. To assign each character its new bit sequence, we traverse the tree from the root down to the character's leaf, appending 0 for every left branch and 1 for every right branch. In our example, the characters are encoded as follows:

a - 0, b - 10, c - 11.

Note that before compression according to the ASCII table, the characters had the form:

a - 01100001, b - 01100010, c - 01100011.

When encoding, we replace each character with its new bit sequence.
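The replacement step, using the example code table above, can be sketched like this (Python for brevity; a real encoder would also pack the resulting bits into bytes):

```python
# Step 3 sketch: replace each character with its Huffman code from the
# example table (a=0, b=10, c=11).
CODES = {"a": "0", "b": "10", "c": "11"}

def encode(text):
    return "".join(CODES[ch] for ch in text)

bits = encode("aaaaaaaccaabccbbbb")
# 9*1 + 5*2 + 4*2 = 27 bits instead of 18*8 = 144 bits
print(len(bits), "bits instead of", 8 * len("aaaaaaaccaabccbbbb"))
```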

To be able to decode the file, we must also save the tree into it. Decoding is performed by walking the tree from the root to the leaves.
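The root-to-leaf walk can be sketched as follows (again an illustrative Python sketch; the tree literal below matches the example codes a=0, b=10, c=11):

```python
# Decoding sketch: walk the tree from the root, going left on 0 and right
# on 1; on reaching a leaf, emit its character and restart from the root.
TREE = ("a", ("b", "c"))  # leaves are strings, inner nodes are (left, right)

def decode(bits, tree):
    out, node = [], tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = tree
    return "".join(out)

print(decode("01011", TREE))  # 0 -> a, 10 -> b, 11 -> c
```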


GPGPU Compression Algorithm



The encoding program is logically divided into a preprocessor (steps 1 and 2 of the algorithm) and the actual coding (step 3). Steps 1 and 3 of the algorithm were offloaded to the GPGPU.

The first two preprocessing stages are standard when implementing the Huffman method:

  1. counting the frequencies with which the different characters of the alphabet occur;
  2. sorting the frequency array;


Then you need to build a Huffman tree (step 2). In our case, instead of the Huffman tree, an additional data structure is constructed which contains:

  1. array of character codes;
  2. array of lengths of character codes.


The decision to replace the standard Huffman tree with this data structure was made because none of the technologies used allows a full-fledged pointer-based tree to be implemented on the GPGPU.

Of all the preprocessing steps described above, only the counting of symbol frequencies depends on the size of the input data; the remaining steps take negligible time.
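The tree-replacing structure might look like the sketch below: two flat arrays indexed by byte value, one holding each symbol's code and one its bit length. The names and layout here are assumptions for illustration, not the article's actual code:

```python
# Sketch of the GPU-friendly replacement for the tree: two flat arrays
# indexed by byte value. An encoding thread reads code[b] and length[b]
# for its input byte b, with no pointer chasing, which maps well to
# CUDA/OpenCL kernels. Names and layout are assumptions, not the
# article's implementation.
code = [0] * 256    # each symbol's Huffman code, stored as an integer
length = [0] * 256  # the bit length of each symbol's code

# Fill in the example table: a=0 (1 bit), b=10 (2 bits), c=11 (2 bits).
for sym, bits in {"a": "0", "b": "10", "c": "11"}.items():
    code[ord(sym)] = int(bits, 2)
    length[ord(sym)] = len(bits)

print(code[ord("b")], length[ord("b")])  # 2 2
```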



Results



C# was chosen as the programming language. For comparison, two versions of the program were written for the CPU, CPU (Async), a parallel version, and CPU (Sync), a sequential version, as well as two versions for GPGPU (OpenCL and CUDA).



The following equipment was used for testing:



To measure the speed of the programs, two files were selected: one 200 MB in size (210,596,352 bytes), the other 603 MB (633,128,511 bytes). The Stopwatch class available in C# was used to measure time. The accuracy of the measurement depends on the processor frequency; theoretically, on the computer used, it is 1/36902109016524975 = 3.0e-16.

The measurements were performed ten times for each version of the program and for each file. Figures 1-4 show graphs of the average times across these measurements.



Average program execution time on a 200 MB file (without writing to disk)

Figure 1.

In this test, OpenCL and CUDA outperform CPU (Sync) by a factor of 8.5 and CPU (Async) by a factor of 2.5.



Average program execution time on a 200 MB file (with writing to disk)

Figure 2.

In this test, OpenCL and CUDA outperform CPU (Sync) by a factor of 4.5 and CPU (Async) by a factor of 1.5.



Average program execution time on a 603 MB file (without writing to disk)

Figure 3.

In this test, OpenCL and CUDA outperform CPU (Sync) by a factor of 8 and CPU (Async) by a factor of 2.5.



Average program execution time on a 603 MB file (with writing to disk)

Figure 4.

In this test, OpenCL and CUDA outperform CPU (Sync) by a factor of 4.3 and CPU (Async) by a factor of 1.3.



Conclusion



CUDA and OpenCL proved their effectiveness against the CPU in these tests.

Based on the results, it can be concluded that for tasks that do not involve heavy mathematical computation, the running time of programs written with OpenCL is essentially equal to that of programs written with CUDA. In the second part of the article, I will cover the technical aspects of the implementation.

Source: https://habr.com/ru/post/124210/


