Compressing mobile graphics to the ETC1 format and an open-source utility

With the development of free-to-play mobile games, new graphics are regularly added along with new features. Some of the graphics ship with the distribution, and some are downloaded during the game. To let the application run on devices with a small amount of RAM, developers use hardware-compressed textures.

Support for the ETC1 format is mandatory on all Android devices with OpenGL ES 2.0, which makes it a good starting point for optimizing RAM consumption. Unlike PNG, JPEG and WebP, ETC1 textures are loaded without intensive calculations, using plain memory copying. Game performance also improves because less texture data is transferred from slow to fast memory.

On any device with OpenGL ES 3.0, it is possible to use textures in the ETC1 format, which is a subset of the improved ETC2 format.

Using ETC1 Compressed Textures

The ETC1 format contains only RGB color components, so it is suitable for opaque backgrounds, which are best drawn with alpha blending disabled.

What about graphics with transparency? For these, we use two ETC1 textures (hereinafter 2xETC1):

- in the first texture we store the original RGB;
- in the second texture we store the original alpha (hereinafter - A), copying it into the RGB components.

Then in the 2xETC1 pixel shader, we restore the colors in this way:

uniform sampler2D u_Sampler;
uniform sampler2D u_SamplerAlpha;
varying vec2 v_TexCoords;
varying vec4 v_Color;

void main()
{
    vec4 sample = texture2D(u_Sampler, v_TexCoords);
    sample.a = texture2D(u_SamplerAlpha, v_TexCoords).r;
    gl_FragColor = sample * v_Color;
}
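The preparation of the two source images for 2xETC1 can be sketched in C++ as follows. This is only an illustration under assumptions: the SplitImages structure and the SplitRgba function are hypothetical names that do not come from EtcCompress, and the input is assumed to be tightly packed RGBA with 8 bits per channel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of EtcCompress): the two RGB8 images
// that are later compressed to ETC1 independently.
struct SplitImages {
    std::vector<uint8_t> color;  // original R, G, B
    std::vector<uint8_t> alpha;  // original A copied into R, G, B
};

// Splits a tightly packed RGBA8 buffer into the two RGB8 buffers.
SplitImages SplitRgba(const std::vector<uint8_t>& rgba) {
    assert(rgba.size() % 4 == 0);  // 4 bytes per pixel are expected
    const std::size_t pixels = rgba.size() / 4;
    SplitImages out;
    out.color.resize(pixels * 3);
    out.alpha.resize(pixels * 3);
    for (std::size_t i = 0; i < pixels; ++i) {
        const uint8_t* src = &rgba[i * 4];
        out.color[i * 3 + 0] = src[0];
        out.color[i * 3 + 1] = src[1];
        out.color[i * 3 + 2] = src[2];
        // The shader above reads only .r, but all three components get A
        // so that the alpha image is a plain gray-scale picture.
        out.alpha[i * 3 + 0] = src[3];
        out.alpha[i * 3 + 1] = src[3];
        out.alpha[i * 3 + 2] = src[3];
    }
    return out;
}
```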

Preparing atlases before compression to ETC1 format

The ETC1 format uses independent 4x4 pixel blocks, so it is desirable to align the positions of the elements placed in the atlas to 4 pixels, so that different elements do not fall into the same block.
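A minimal sketch of such alignment, assuming the atlas packer works with non-negative integer pixel coordinates. The names kEtc1BlockSize and AlignUpToBlock are illustrative and are not taken from the article's scripts.

```cpp
#include <cassert>

// ETC1 blocks are 4x4 pixels.
const int kEtc1BlockSize = 4;

// Rounds a non-negative coordinate up to the next multiple of the block size,
// so that an element placed there starts on a block boundary.
inline int AlignUpToBlock(int value) {
    assert(value >= 0);
    return (value + kEtc1BlockSize - 1) / kEtc1BlockSize * kEtc1BlockSize;
}
```

For example, an element that a packer would place at x = 13 is moved to x = 16, so it no longer shares the block column 12..15 with its left neighbor.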

When placed in the atlas, every element grows slightly in area, because it needs an additional protective border 1-2 pixels thick. This is due to fractional rendering coordinates (when sprites move smoothly) and bilinear texture filtering. The mathematical justification of these effects deserves a separate article.

In the case of polygonal atlases, the elements are spaced apart by an acceptable distance. Every 4x4 ETC1 block consists of a pair of 2x4 or 4x2 strips, so even a distance of 2 pixels can provide good isolation.

How can you compress to ETC1 format with high quality?

There is a choice among free utilities:

- ETC2Comp;
- Mali GPU Texture Compression Tool;
- PVRTexTool;
- rg-etc1.

For high-quality graphics compression, you have to select a perceptual metric, which takes the characteristics of human perception into account, and choose the slowest modes, best and slow. Once you have tried to compress a 2048x2048 texture at high quality, you realize that this is a long process... Perhaps that is why many developers limit themselves to the faster alternatives, medium and fast. Is it possible to do better?

The story of EtcCompress, written from scratch by one of Playrix's programmers, began in January 2014, when the final compression of graphics to ETC1 format took longer than a three-hour trip out on a visit.

Ideas for high-quality compression in ETC1 format

The ETC1 format is a format with independent blocks. Therefore, we use the classical approach of compressing individual blocks, which parallelizes well. Of course, you can try to improve how neighboring blocks fit together by considering sets of blocks, but then you need to know which atlas element each block belongs to, and the computational complexity of the problem increases dramatically.

The dssim utility is suitable for comparing compression results.

For each block, you have to go through all 4 possible encoding modes to find the best one; in the code, this is the CompressBlockColor function:

- two 2x4 strips, each with its own 4-bit base color; in the code, this is the call CompressBlockColor44(..., 0);
- two 4x2 strips, each with its own 4-bit base color; in the code, this is the call CompressBlockColor44(..., 1);
- two 2x4 strips, the first with a 5-bit base color and the second with a base color that differs from the first within a 3-bit range; in the code, this is the call CompressBlockColor53(..., 2);
- two 4x2 strips, the first with a 5-bit base color and the second with a base color that differs from the first within a 3-bit range; in the code, this is the call CompressBlockColor53(..., 3).
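The four modes listed above differ in only two things: how the block is cut into strips and how the base colors are stored. The following sketch shows both, based on the ETC1 format itself rather than on the EtcCompress sources; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Returns 0 or 1: the strip that pixel (x, y) of a 4x4 block belongs to.
// flip == false: two 2x4 strips side by side (left and right halves).
// flip == true:  two 4x2 strips stacked (top and bottom halves).
inline int StripIndex(int x, int y, bool flip) {
    assert(x >= 0 && x < 4 && y >= 0 && y < 4);
    return flip ? (y >> 1) : (x >> 1);
}

// Expands a 4-bit base color component to 8 bits by bit replication.
inline uint8_t Expand4(int c) {
    assert(c >= 0 && c < 16);
    return static_cast<uint8_t>((c << 4) | c);
}

// Expands a 5-bit base color component to 8 bits by bit replication.
inline uint8_t Expand5(int c) {
    assert(c >= 0 && c < 32);
    return static_cast<uint8_t>((c << 3) | (c >> 2));
}

// In the 5-bit + 3-bit modes the second base color is stored as a signed
// 3-bit difference in [-4, 3], and the sum must stay inside [0, 31].
inline bool IsValidDelta(int base5, int delta3) {
    assert(base5 >= 0 && base5 < 32);
    const int second = base5 + delta3;
    return delta3 >= -4 && delta3 <= 3 && second >= 0 && second < 32;
}
```

The restriction checked by IsValidDelta is what makes the base colors of the two strips dependent in modes 2 and 3.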


Figure: the four block encoding modes: 2x4, 444 + 444; 4x2, 444 + 444; 2x4, 555 + 333; 4x2, 555 + 333.

Speaking of error, many utilities use the classic PSNR. We also use this metric. The weights are taken from the table.

 PixelError = 0.715158 * (dstG - srcG)^2 + 0.212656 * (dstR - srcR)^2 + 0.072186 * (dstB - srcB)^2

We switch to integer values by multiplying the coefficients by 1000 and rounding. Then the initial error of a 4x4 block is kUnknownError = (255^2) * 1000 * 16 + 1, where 255 is the maximum error of a color component, 1000 is the fixed sum of the weights, and 16 is the number of pixels. This error fits into int32_t. Note that squaring the integer differences is close in effect to accounting for gamma 2.2.
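The integer form of the metric can be written down as a short sketch. The rounded weights follow directly from the formula above; the Rgb structure and the names kWeightG, kWeightR, kWeightB, PixelError and BlockError are assumptions made for this illustration, and only kUnknownError is a name from the article.

```cpp
#include <cassert>
#include <cstdint>

// Weights from the formula above, multiplied by 1000 and rounded.
const int32_t kWeightG = 715;  // 0.715158
const int32_t kWeightR = 213;  // 0.212656
const int32_t kWeightB = 72;   // 0.072186
static_assert(kWeightG + kWeightR + kWeightB == 1000, "weights must sum to 1000");

// One more than the largest possible error of a 4x4 block: 1040400001.
const int32_t kUnknownError = 255 * 255 * 1000 * 16 + 1;

struct Rgb { uint8_t r, g, b; };

// Weighted squared error of a single pixel.
inline int32_t PixelError(Rgb dst, Rgb src) {
    const int32_t dg = dst.g - src.g;
    const int32_t dr = dst.r - src.r;
    const int32_t db = dst.b - src.b;
    return kWeightG * dg * dg + kWeightR * dr * dr + kWeightB * db * db;
}

// Sum of pixel errors over at most one block (16 pixels), so the
// result always stays below kUnknownError and cannot overflow int32_t.
inline int32_t BlockError(const Rgb* dst, const Rgb* src, int count) {
    assert(count >= 0 && count <= 16);
    int32_t sum = 0;
    for (int i = 0; i < count; ++i) sum += PixelError(dst[i], src[i]);
    return sum;
}
```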

PSNR has weak points. For example, when encoding a solid fill color c0, choosing either c1 = c0 - d or c2 = c0 + d from the palette introduces the same error d^2. The choice between c1 and c2 is then arbitrary, which produces all sorts of checkerboard patterns.

To improve the result, the final decision within a block is made by SSIM. In the code, this is done in the ComputeTableColor function using the SSIM_INIT, SSIM_UPDATE, SSIM_CLOSE, SSIM_OTHER, SSIM_FINAL macros. The idea is that among all solutions with the best PSNR (in the encoding mode found), the one with the highest SSIM is chosen.

For each block encoding mode, you have to go through all possible combinations of base colors. In the case of independent base colors, the CompressBlockColor44 function compresses the strips independently with two calls to GuessColor4.

The GuessColor4 function iterates over the deviations and the base color component:

 for (int q = 0; q < 8; q++)
   for (int c0 = 0; c0 < c0_count; c0++) // G, c0_count <= 16
     for (int c1 = 0; c1 < c1_count; c1++) // R, c1_count <= 16
       for (int c2 = 0; c2 < c2_count; c2++) // B, c2_count <= 16
         ComputeErrorGRB(c, q);

In the case of dependent base colors, the algorithmic complexity increases because the loops over the two strips are nested. The CompressBlockColor53 function iterates over the deviations:

 for (int qa = 0; qa < 8; qa++)
   for (int qb = 0; qb < 8; qb++)
     AdjustColors53(qa, qb);

The AdjustColors53 function iterates through the components of the two base colors:

 for (int a0 = 0; a0 < a0_count; a0++) // G, a0_count <= 32
   for (int a1 = 0; a1 < a1_count; a1++) // R, a1_count <= 32
     for (int a2 = 0; a2 < a2_count; a2++) // B, a2_count <= 32
     {
       ComputeErrorGRB(a, qa);
       for (int d0 = Ld0; d0 <= Hd0; d0++) // G, d0_count <= 8
         for (int d1 = Ld1; d1 <= Hd1; d1++) // R, d1_count <= 8
           for (int d2 = Ld2; d2 <= Hd2; d2++) // B, d2_count <= 8
           {
             b = a + d;
             ComputeErrorGRB(b, qb);
           }
     }

The exhaustive search presented here is no faster than the best compression modes of similar utilities, but it is our own complete exhaustive search, which is greatly accelerated below.

In the case of 2xETC1 graphics, fully transparent pixels can in general have an arbitrary RGB color, since it will be multiplied by zero alpha.

We can ignore insignificant pixels, so let's filter them out at the very beginning; in the code, these are the calls to FilterPixelsColor. On the other hand, not every transparent pixel is insignificant: recall at least the protective border of 1-2 pixels and the effect of whitened edges.

Therefore, we build a stencil in which zero marks an insignificant pixel and a positive value marks a significant one. The stencil is created from channel A by applying an outline, usually 1 or 2 pixels wide; in the code, this is the OutlineAlpha function.
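The stencil construction can be illustrated with a simple dilation of the non-zero alpha area. This is a sketch under assumptions, not the actual OutlineAlpha implementation: the function name MakeStencil is hypothetical, and a square neighborhood is assumed for the outline.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for OutlineAlpha: marks every pixel that lies within
// `radius` pixels (square neighborhood, an assumption) of a pixel with A > 0.
// Returns 1 for significant pixels and 0 for insignificant ones.
std::vector<uint8_t> MakeStencil(const std::vector<uint8_t>& alpha,
                                 int width, int height, int radius) {
    assert(width >= 0 && height >= 0 && radius >= 0);
    assert(alpha.size() == static_cast<std::size_t>(width) * height);
    std::vector<uint8_t> stencil(alpha.size(), 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            if (alpha[static_cast<std::size_t>(y) * width + x] == 0) continue;
            // Spread the visible pixel over its neighborhood, clamped to the image.
            const int y0 = std::max(0, y - radius);
            const int y1 = std::min(height - 1, y + radius);
            const int x0 = std::max(0, x - radius);
            const int x1 = std::min(width - 1, x + radius);
            for (int ny = y0; ny <= y1; ++ny)
                for (int nx = x0; nx <= x1; ++nx)
                    stencil[static_cast<std::size_t>(ny) * width + nx] = 1;
        }
    }
    return stencil;
}
```

With radius 1, a single visible pixel in the middle of a row of five produces the stencil 0 1 1 1 0, so its immediate transparent neighbors still take part in the error calculation.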

Practice has shown that with a stencil the compressed edges of objects improve, and the invisible blocks quickly turn black, which zip compresses well. It is the stencil idea that gives a noticeable quality gain over separate compression of RGB and A, including with the utilities listed above.

Thus, 2xETC1 compression can be represented by the following steps, implemented in the EtcMainWithArgs function:

1) compress channel A to ETC1 format;
2) unpack the compressed channel A back;
3) outline the visible area, where A > 0, to obtain a stencil;
4) compress the RGB channels in the ETC1 format, taking into account the stencil.

Ideas for speeding up quality compression in ETC1 format

For the utility to be useful, running time matters as well as the quality of the result. The block compression algorithm considered above calls for a fast initial heuristic estimate and useful cut-offs during the search, including ones based on greedy algorithms.

For a format with independent blocks, incremental compression is easy to implement, for example when the previous compression results have been saved.

In this case, the packer tries to read the output file, unpack it and calculate the existing error; this becomes the initial solution. If there is no file, the initial solution is all zeros. In the code, this is LoadEtc1, CompressBlockColor, MeasureHalfColor.

Subsequent steps should try to improve the existing solution with algorithms of increasing complexity. Therefore, the fast CompressBlockColor44 is called first, and only then the slow CompressBlockColor53. Such a chained design will later allow compression to the ETC2 format to be integrated.

Before starting the search with nested loops, it makes sense to find a solution for each color component separately. The point is that the best solution cannot have an error smaller than the sum of the errors of the best solutions for each of the components G, R, B. Often the resulting error is significantly larger, which reflects the nonlinearity and complexity of the ETC1 algorithm.
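The per-component bound turns into a cut-off inside the nested loops: a combination is evaluated in full only if the sum of its per-component errors is still below the best error found so far. The sketch below shows this idea under assumptions; ChannelNode and SearchWithBounds are hypothetical names, and the expensive full evaluation (ComputeErrorGRB in the article) is passed in as a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Per-component candidate: a base color value and the smallest error
// that this component alone can reach with it (a lower bound).
struct ChannelNode { int32_t error; int color; };

// Returns the best full error found, starting from the known solution `best`.
// fullError(g, r, b) is the expensive exact evaluation of one combination.
int32_t SearchWithBounds(const std::vector<ChannelNode>& g,
                         const std::vector<ChannelNode>& r,
                         const std::vector<ChannelNode>& b,
                         const std::function<int32_t(int, int, int)>& fullError,
                         int32_t best) {
    assert(best > 0);
    for (const ChannelNode& ng : g) {
        if (ng.error >= best) continue;  // G alone is already too expensive
        for (const ChannelNode& nr : r) {
            const int32_t boundGR = ng.error + nr.error;
            if (boundGR >= best) continue;  // G + R cannot beat the best
            for (const ChannelNode& nb : b) {
                if (boundGR + nb.error >= best) continue;  // optimistic bound fails
                const int32_t e = fullError(ng.color, nr.color, nb.color);
                if (e < best) best = e;
            }
        }
    }
    return best;
}
```

If the candidate lists are sorted by ascending error, each `continue` can become a `break`, because all remaining candidates in that list are at least as expensive.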

The per-component solutions are represented by the GuessStateColor and AdjustStateColor structures. For each value from the deviation table g_table, the errors of the Half strips are calculated and stored in the fields node0, node1, node2. In GuessStateColor, the indexes [0x00..0x0F] store the calculated errors for all possible base colors from g_colors4, and the index [0x10] holds the best solution. In AdjustStateColor, the best solution is stored at index [0x20], and all possible base colors are taken from g_colors5.

The error calculation for the color components is performed by the ComputeLevel, GuessLevels, AdjustLevels functions based on the g_errors4, g_errors5 tables previously calculated by the InitLevelErrors function.

It is worthwhile to search through the color components in ascending order of the error introduced by them; for this, the node0, node1, node2 fields are sorted by the functions SortNodes10 and SortNodes20.

To speed up the sorting itself, sorting networks are used, generated on a dedicated website.
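A sorting network is a fixed sequence of compare-and-swap operations that sorts any input of a given size without data-dependent loops. As an illustration, here is the well-known 5-comparator network for 4 elements; the real SortNodes10 and SortNodes20 use larger networks, and the Node structure and function names below are assumptions for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

struct Node { int32_t error; int color; };

// Orders a pair so that the smaller error comes first.
inline void CompareSwap(Node& a, Node& b) {
    if (b.error < a.error) std::swap(a, b);
}

// Sorts 4 nodes by ascending error with a fixed network of 5 comparators.
inline void SortNodes4(std::array<Node, 4>& n) {
    CompareSwap(n[0], n[1]);
    CompareSwap(n[2], n[3]);
    CompareSwap(n[0], n[2]);  // smallest element reaches position 0
    CompareSwap(n[1], n[3]);  // largest element reaches position 3
    CompareSwap(n[1], n[2]);  // order the two middle elements
    assert(n[0].error <= n[1].error && n[1].error <= n[2].error &&
           n[2].error <= n[3].error);
}
```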

Before sorting, it makes sense to discard large errors that exceed the solution already found. This considerably reduces the number of elements in the node0, node1, node2 fields, which speeds up both the sorting and the subsequent search.
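The filter-then-sort step might look like the sketch below. For brevity it uses std::sort instead of a sorting network, and the names ErrorNode and PruneAndSort are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct ErrorNode { int32_t error; int color; };

// Drops candidates whose error exceeds the best solution found so far,
// then sorts the survivors by ascending error.
inline void PruneAndSort(std::vector<ErrorNode>& nodes, int32_t best) {
    assert(best >= 0);
    nodes.erase(std::remove_if(nodes.begin(), nodes.end(),
                               [best](const ErrorNode& n) { return n.error > best; }),
                nodes.end());
    std::sort(nodes.begin(), nodes.end(),
              [](const ErrorNode& a, const ErrorNode& b) { return a.error < b.error; });
}
```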

You can try to cut off the third nested loop by color components G, R, B by finding the best solution for the current G, R with the ComputeErrorGR function, which is 2 times faster than the ComputeErrorGRB function. This, by the way, is a hot spot in the profiler.

In the dependent base color mode, searching for the best solution for each half gives a good speedup, because the error found often exceeds the optimistic estimate for the color components and at the same time serves as a cut-off criterion.

In the code, Walk and Bottom do this.

The 64 calls of the AdjustColors53 function can lead to repeated calls of ComputeErrorGR and ComputeErrorGRB with the same base color parameters, so we cache the results of the calls. In turn, to initialize the cache quickly, lazy evaluation on the third color component can be used.
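One possible reading of this caching scheme is sketched below: a reset clears only one flag per (G, R) pair, and the 32 slots for the third component are cleared the first time that pair is touched. The class LazyErrorCache and its layout are assumptions for illustration and may differ from the ErrorsG, ErrorsGR, ErrorsGRB fields of the actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

const int kLevels = 32;          // 5-bit base color components
const int32_t kEmptySlot = -1;   // real errors are never negative

// Hypothetical memo table for an expensive error function of (g, r, b).
class LazyErrorCache {
public:
    explicit LazyErrorCache(std::function<int32_t(int, int, int)> compute)
        : compute_(std::move(compute)),
          rowReady_(kLevels * kLevels, 0),
          values_(static_cast<std::size_t>(kLevels) * kLevels * kLevels) {}

    // Cheap reset: 1024 flags are cleared instead of 32768 cached values.
    void Reset() { std::fill(rowReady_.begin(), rowReady_.end(), 0); }

    int32_t Get(int g, int r, int b) {
        assert(g >= 0 && g < kLevels && r >= 0 && r < kLevels &&
               b >= 0 && b < kLevels);
        const int row = g * kLevels + r;
        int32_t* slots = &values_[static_cast<std::size_t>(row) * kLevels];
        if (!rowReady_[row]) {
            // Lazy part: the B slots are cleared on first use of (g, r).
            std::fill(slots, slots + kLevels, kEmptySlot);
            rowReady_[row] = 1;
        }
        if (slots[b] == kEmptySlot) slots[b] = compute_(g, r, b);
        return slots[b];
    }

private:
    std::function<int32_t(int, int, int)> compute_;
    std::vector<uint8_t> rowReady_;
    std::vector<int32_t> values_;
};
```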

In the AdjustStateColor structure, the ErrorsG, ErrorsGR fields and the ErrorsGRB field, which are cleared by LazyGR, provide significant performance gains.

After the various algorithmic improvements, it is time to use SIMD; the published solution uses integer SSE4.1. The data of one pixel is stored as int32x4_t.

The _mm_adds_epu8 and _mm_subs_epu8 commands are convenient for calculating a four-color palette from the base color and deviations.
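The saturating behavior of these two instructions matches the clamping to [0, 255] that ETC1 requires for palette colors. The sketch below builds the four palette entries of one strip in a single pass; the modifier table is the standard ETC1 one, while the function name BuildPalette, the entry order and the R, G, B, 0 byte layout are assumptions for this example. A scalar path is included for targets without SSE2.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define PALETTE_USE_SSE2 1
#endif

// ETC1 intensity modifier magnitudes {weak, strong} for the 8 tables.
static const uint8_t kModifiers[8][2] = {
    {2, 8}, {5, 17}, {9, 29}, {13, 42}, {18, 60}, {24, 80}, {33, 106}, {47, 183}};

// Fills out[i] = {R, G, B, 0} for base + weak, base + strong,
// base - weak, base - strong, each component clamped to [0, 255].
void BuildPalette(const uint8_t base[3], int table, uint8_t out[4][4]) {
    assert(table >= 0 && table < 8);
    const int weak = kModifiers[table][0];
    const int strong = kModifiers[table][1];
#ifdef PALETTE_USE_SSE2
    // One pixel per 32-bit lane; the fourth byte of each lane stays zero.
    const int pixel = base[0] | (base[1] << 8) | (base[2] << 16);
    const __m128i color = _mm_set1_epi32(pixel);
    const __m128i plus = _mm_set_epi32(0, 0, strong * 0x010101, weak * 0x010101);
    const __m128i minus = _mm_set_epi32(strong * 0x010101, weak * 0x010101, 0, 0);
    // Saturating add and subtract perform the clamping for free.
    const __m128i result = _mm_subs_epu8(_mm_adds_epu8(color, plus), minus);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(&out[0][0]), result);
#else
    const int deltas[4] = {weak, strong, -weak, -strong};
    for (int i = 0; i < 4; ++i) {
        for (int c = 0; c < 3; ++c) {
            const int v = base[c] + deltas[i];
            out[i][c] = static_cast<uint8_t>(v < 0 ? 0 : (v > 255 ? 255 : v));
        }
        out[i][3] = 0;
    }
#endif
}
```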

The ComputeErrorGRB and ComputeErrorGR functions first use partially unrolled loops optimized with the _mm_madd_epi16 instruction, since in most cases its bit width is sufficient. For large errors, a second loop runs on the “slow” _mm_mullo_epi32 instructions.

The ComputeLevel function calculates the error for four base color values at once.

To compress the single channel A, the resulting RGB compression code can be simplified: there are noticeably fewer nested loops and performance is better.

Results achieved

The described approaches make it possible to reduce the RAM requirements of the Android versions of games by using compressed textures in the hardware ETC1 format.

In the scripts for the formation of atlases and the compression utility itself, attention is paid to the issues of preventing artifacts and improving the quality of compressed graphics.

Surprisingly, along with the improved quality of the compressed graphics, we managed to speed up the compression itself! In our Gardenscapes project, compressing the atlases to the ETC1 format on an Intel Core i7 6700 processor takes 24 seconds. This is faster than generating the atlases themselves and several times faster than the previous compression utility in fast mode. The proposed incremental compression takes 19 seconds.

In conclusion, here is an example of compressing an 8192x8192 RGB texture with the presented EtcCompress utility under Win64 on an Intel Core i7 6700 processor:

 x:\>EtcCompress
 Usage: EtcCompress [/retina] src [dst_color] [dst_alpha] [/debug result.png]

 x:\>EtcCompress 8192.png 1.etc /debug 1.png
 Loaded 8192.png
 Image 8192x8192, Texture 8192x8192
 Compressed 4194304 blocks, elapsed 10988 ms, 381716 bps
 Saved 1.etc
 Texture RGB wPSNR = 42.796053, wSSIM_4x2 = 0.97524678
 Saved 1.png

 x:\>EtcCompress 8192.png 1.etc /debug 2.png
 Loaded 8192.png
 Image 8192x8192, Texture 8192x8192
 Loaded 1.etc
 Compressed 4194304 blocks, elapsed 6487 ms, 646570 bps
 Saved 1.etc
 Texture RGB wPSNR = 42.796053, wSSIM_4x2 = 0.97524678
 Saved 2.png

 x:\>fc /b 1.png 2.png
   1.png  2.png
 FC:

We hope that the utility will help to quickly and efficiently compress mobile graphics.

Source: https://habr.com/ru/post/310484/
