Should I install Gentoo for the sake of speed?

Maybe some of you have heard someone say: "I plan to install Gentoo; it will make better use of my processor's capabilities and squeeze the maximum out of it." Well, let's figure it out...

What kinds of processor optimizations are there?

Usually this means using additional instruction sets such as MMX, SSE, AES and AVX when compiling applications. However, if you dig deeper, there are other optimizations, and not only for applications.
I identified the following groups of optimizations:

Code optimization
- Code optimization when compiling for additional x86 instruction sets: MMX, SSE, AES, AVX, etc.
- Code optimization through static analysis during compilation: unrolling tail recursion, removing unused portions of code, dropping meaningless conditions, etc.
- Optimization for better cache hits.
- Kernel-level code optimization: cryptographic methods from the Cryptographic API.

Optimizations for additional instruction sets are best covered on the page Intel 386 and AMD x86-64 GCC Options. MMX became available with the Pentium MMX, then AMD made 3DNow!, then SSE appeared in the Pentium III, and things took off from there. Intel Haswell, which arrived this year, supports: MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA, BMI, BMI2 and F16C.
Floating-point arithmetic (FPU) is also indirectly related to additional instruction sets, because the compiler can use SSE for it. This is faster than x87 instructions and does not block MMX. More on this below.
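
To see which of these instruction sets your own processor reports, here is a minimal sketch. It assumes an x86 Linux system where /proc/cpuinfo contains a "flags" line; the helper names cpu_flags and has_flag are made up for this illustration. Note that the kernel reports SSE3 under the name pni.

```shell
# Print the flag list of the first CPU from a cpuinfo-style file
# (default: /proc/cpuinfo; assumes an x86 Linux "flags" line).
cpu_flags() {
    grep -m1 '^flags' "${1:-/proc/cpuinfo}" 2>/dev/null | cut -d: -f2
}

# Succeed if flag $1 appears as a whole word in the space-separated list $2.
has_flag() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

flags=$(cpu_flags)
# "pni" is the kernel's name for SSE3.
for f in mmx sse sse2 pni ssse3 sse4_1 sse4_2 avx avx2 aes; do
    if has_flag "$f" "$flags"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

The whole-word match matters: a plain substring search for "sse" would also match "sse2".
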
One more important point should be noted. When the abbreviations SSE, MMX and AES are mentioned, people familiar with them very often picture the hard life of C programmers hand-writing assembler inserts into their software. In fact, there are three ways to use these instruction sets: automatically, by the compiler during static code analysis; manually, through special compiler functions (intrinsics); and manually, through assembler inserts (for example: How to optimize the code for MMX processors). Exactly when GCC uses the instruction sets automatically, if they are allowed, is not clear, but the manual clearly states that such cases exist (hopefully someone will explain in the comments).

Code optimization through static analysis is best covered on the Options That Control Optimization page. There are a lot of options, but to keep things manageable they are grouped into meta flags: -O0, -O1, -O2, -O3, -Ofast. More information on these flags can be found in the articles linked below.
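
To get a feel for what each meta flag switches on, GCC can list its optimizer options together with their state via -Q --help=optimizers. The following is a minimal sketch, assuming gcc is installed; the helper name count_enabled is made up for this illustration.

```shell
# Count how many options in a `gcc -Q --help=optimizers` listing
# are marked as enabled. Reads the listing from standard input.
count_enabled() {
    grep -c '\[enabled\]'
}

if command -v gcc >/dev/null 2>&1; then
    for level in -O0 -O1 -O2 -O3 -Ofast; do
        n=$(gcc -Q "$level" --help=optimizers 2>/dev/null | count_enabled)
        printf '%s: %s optimizer flags enabled\n' "$level" "$n"
    done
fi
```

Running gcc -Q --help=optimizers without any -O flag shows what a plain invocation of the compiler uses by default.
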

Optimization for better cache hits. This cannot be explained quickly, so I refer the reader to another article: Bubbles, caches, and branch predictors. I can only say that programmers can use Intel VTune Performance Analyzer and AMD CodeAnalyst to find the places that can be optimized. The Intel C++ compiler (ICC) also seems able to do such optimizations automatically in some cases; as for how GCC handles this today, I hope knowledgeable people will add that in the comments.

Kernel-level code optimization. This speeds up functions in the Cryptographic API framework, such as AES encryption, Twofish and others, by using additional instruction sets such as SSE, AVX and AES. These functions can be used by other kernel modules and can also be called from outside the kernel by applications.
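
To check which implementations of an algorithm the running kernel currently offers, you can look at /proc/crypto. Below is a minimal sketch, assuming a Linux system where that file lists "name" and "driver" lines for each registered algorithm; the helper name crypto_drivers is made up for this illustration.

```shell
# List the distinct drivers registered for one algorithm name.
# $1 = algorithm name (e.g. aes), $2 = file to read (default: /proc/crypto)
crypto_drivers() {
    awk -F' *: *' -v alg="$1" '
        $1 == "name"                  { name = $2 }
        $1 == "driver" && name == alg { print $2 }
    ' "${2:-/proc/crypto}" | sort -u
}

if [ -r /proc/crypto ]; then
    echo "AES drivers currently registered:"
    crypto_drivers aes
fi
# If only a generic driver shows up, an accelerated module may exist but
# not be loaded; on CPUs with AES-NI it can typically be loaded as root
# with: modprobe aesni_intel
```
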

With the theory out of the way, let's move on to how it is used in practice.

If you have Ubuntu

Suppose you are running Ubuntu. Depending on whether the operating system is 32-bit or 64-bit, you get packages with the suffix i386 or amd64 (example). i386 does not mean that the package will work on any processor starting from the 386; it simply means that the package targets the 32-bit x86 platform. In turn, amd64 means the 64-bit x86-64 platform. We can easily check this by typing in the console:

gcc -dumpmachine

On a 32-bit Ubuntu 12.04 LTS Server we will see i686-linux-gnu, and on a 64-bit one we should see x86_64-linux-gnu.
Suppose you have a 32-bit Pentium 4. MMX, SSE and SSE2 are available to you, but they were not used when the packages were built, because the same packages must also work on an Intel Celeron, which has only MMX, and perhaps even on a Pentium Pro, which does not even have MMX.
Additional instruction sets will be used only by packages that detect the processor at run time and switch to a faster algorithm for it. The good news is that almost all multimedia packages do this.
It is also unclear which code optimizations 32-bit Ubuntu was built with. Judging by GCC's output, it is a little from -O1, a little from -O2 and a little from -O3. If Ubuntu packages for a given release are built on that same system with the default compilation options, then apparently they are not built in the most optimal (within reason) way.
Finally, the functions in the kernel Cryptographic API are not optimized. Functions optimized for additional instruction sets are present in the system only as modules, and only for i586 and AES (for VIA Nano), and they are not loaded by default. It is also unclear what a 586 offers that could be used for optimization.

In 64-bit Ubuntu 12.04, things are much better. First, the default gcc for 64-bit systems uses the MMX, SSE and SSE2 extensions, so the code can be somewhat optimized. Second, for x86-64 the default is -mfpmath=sse, which speeds up floating-point arithmetic.
Optimized kernel Cryptographic API functions for additional instruction sets are present in the system as modules, but are not loaded by default. At least they can be enabled.
Finally, gcc builds packages with the same strange set of optimizations as on 32-bit Ubuntu.
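
The compiler defaults can be checked through GCC's predefined macros: gcc -dM -E dumps them, and macros such as __MMX__, __SSE2__ and __SSE2_MATH__ appear when the corresponding extension, or SSE floating-point math, is in effect. Below is a minimal sketch, assuming gcc is installed on an x86 system; the helper name isa_macros and the selection of macros it keeps are choices made for this illustration.

```shell
# Keep only the predefined macros that indicate enabled instruction-set
# extensions or SSE floating-point math. Reads a macro dump from stdin.
isa_macros() {
    grep -E '^#define __(MMX|SSE[0-9_]*|SSE2?_MATH|SSSE3|AVX2?|AES)__ ' | sort
}

if command -v gcc >/dev/null 2>&1; then
    echo "Default target:"
    gcc -dM -E -x c /dev/null 2>/dev/null | isa_macros
    echo "With -march=native:"
    gcc -march=native -dM -E -x c /dev/null 2>/dev/null | isa_macros
fi
```

Comparing the two listings shows which extensions a distribution-wide build leaves unused on your particular processor.
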

If you have Gentoo

Then most likely you have read this page from the manual. So you have set -O2 and -march=native (or the correct processor) for yourself. But most likely, first, you did not look into the Cryptographic API section when configuring the kernel and did not enable any accelerated implementations, and at the very least AES is worth accelerating. Second, you most likely did not set the USE flags for the additional processor instructions available to you: 3dnow, mmx, sse, sse2, sse3. Or you did not set all of them. This means that for applications that deliberately expose these optimizations through USE flags, you are left without the extra speedup.
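
As a sketch of what these settings look like in /etc/portage/make.conf: the selection of USE flags here is only an example and assumes a CPU that supports all of them, so keep only the ones your processor actually reports.

```shell
# /etc/portage/make.conf (fragment)
# Compiler flags as recommended in the handbook.
CFLAGS="-O2 -march=native"
CXXFLAGS="${CFLAGS}"

# Global USE flags for additional instruction sets.
# Example only: add these to your existing USE line and list
# just the flags your CPU supports.
USE="mmx sse sse2 sse3"
```
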

In addition to the global flags, there are also local flags that enable additional instructions in some applications, such as: 3dnowext, ssse3, sse4, sse4_1, avx, avx128fma, avx256 and aes-ni. Everything your processor supports is worth setting too.
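
To work out which of these flags apply to your machine, the CPU flags reported by the kernel can be translated into USE flag names. The following is a minimal sketch, assuming an x86 Linux system; the helper name cpu_use_flags is made up, and the name mapping is an assumption based on the flags listed above (the kernel reports SSE3 as pni and AES-NI as aes). It covers only flags whose mapping is unambiguous.

```shell
# Translate cpuinfo flag names into the USE flag names used above.
# $1 = space-separated cpuinfo flag list; prints matching USE flags.
cpu_use_flags() {
    out=""
    for f in $1; do
        case "$f" in
            mmx|sse|sse2|ssse3|sse4_1|avx|3dnow|3dnowext) out="$out $f" ;;
            pni) out="$out sse3" ;;    # kernel name for SSE3
            aes) out="$out aes-ni" ;;  # kernel name for AES-NI
        esac
    done
    echo "${out# }"
}

flags=$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null | cut -d: -f2)
echo "USE candidates: $(cpu_use_flags "$flags")"
```

The result is only a list of candidates: whether a given package actually has the flag still has to be checked per package.
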

The current amd64 stage3 sets the following by default: bindist, mmx, sse, sse2. Unfortunately, bindist disables additional instructions in some packages for the sake of portability. If you need bindist, also use cpudetection to offset the drawbacks of the bindist flag in some applications.

Which Gentoo packages can get a boost?

app-arch/libzpaq
app-emulation/bochs
media-libs/freeverb3 (audio)
media-libs/libpostproc (video)
media-libs/libvpx (video VP8)
media-plugins/vdr-softdevice (video)
media-sound/mpg123
media-video/ffmpeg
media-video/libav
media-video/mplayer
media-video/mplayer2
media-video/vlc
net-libs/cyassl
net-misc/bfgminer (bitcoin)
sci-biology/raxml
sci-libs/fftw
sci-chemistry/gromacs
sys-fs/loop-aes
x11-libs/pixman

In addition, the orc flag, which is already set by default, enables the use of additional processor instructions in:
media-libs/gstreamer (audio + video)

In addition, the cpudetection flag, which is not set by default, enables the use of additional processor instructions selected at run time in:
media-sound/jack-audio-connection-kit
media-video/ffmpeg
media-video/libav
media-video/mplayer
media-video/mplayer2
sci-libs/mpir

Conclusions

In 32-bit systems, the greatest gain from Gentoo is on the latest processor models.
In 64-bit systems, a gain from Gentoo comes only from using newer compiler versions and -O2 optimization.
Even on Gentoo, even after reading the official documentation, it does not hurt to revisit your settings.
Systems other than Gentoo can be sped up too.

Additional material

ICC material

Addendum: user Kekekeks shared a recipe for rebuilding some Ubuntu packages with optimization.

Source: https://habr.com/ru/post/186098/
