
Fat programs - speed factors

Picture from the movie “Size Matters”, 2009

This article began in April 2016, after my computer once again started responding slower than I could click the mouse. It is essentially a compilation of many tests (some dating back to 2010) and discussions I took part in. It cannot be called fully finished, since these are not final conclusions but intermediate points showing what to pay attention to and where to dig further.

The title is partly borrowed from Niklaus Wirth's article "Down with Fat Programs", which turned exactly 10 years old in 2016 and has not lost its relevance - rather, it has reached a new level. If you do not know it, read it.
Consider various aspects that affect the performance of systems and programs.

Language aspect
Memory aspects
Aspects of the real world
Non-language factors
Human aspect

Language aspect


a) Influence of programming language and compiler


As I said earlier in the comments, using a “good” language gives a decent difference in speed.

We take good old C as the point of reference. Clearly, with enough diligence you can do better in assembler, but let the assembler world and its associated fine tuning of higher orders remain an unattainable ideal.

Let us evaluate this aspect in two variants: a business application that depends little on floating-point calculations, and a purely computational problem.

Then we will walk through the gradations between them.

The test for integer calculations is Dhrystone (I know that it is old and bad, but it is quite suitable for estimating the depth of a hole — it has a good set of fundamental operations of the language).

I posted copies and modified sources here: github.com/Siemargl/FatProgs
The extreme points of the range are an optimized C program and an interpreted Python 3.6 program (not because I dislike Python, but because it lost everything in previous similar tests, so it serves as point B, and the necessary tests for it already exist).

Dhrystone results
> gcc -O2 -DTIME -DHZ= dhry_1.c dhry_2.c -o gcc_dry2

D:\VSProjects\_pl_benchmark\Dhrystone> gcc --version

gcc.EXE (tdm-1) 4.9.2
> gcc_dry2 500000000
User time = 27 sec
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 18518518.0
DMIPS 10539.9
> python.exe pystone.py

Pystone (1.2) time for 5000000 passes = 46.4937
This machine benchmarks at 107542 pystones/second

The difference in speed is 172 times. And if you look at the sources, pystone mirrors the original Dhrystone one-to-one.

To rule out any cheating, let us also take TinyCC, which cannot optimize at all.
> cc_dry2.exe 500000000

User time = 55 sec
Microseconds for one run through Dhrystone: 0.1
Dhrystones per Second: 9090909.0
DMIPS 5174.1
In total: 84 times faster than the interpreter, and half the speed of the optimized version.
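For reference, and to check the numbers: DMIPS is obtained by dividing the raw Dhrystones-per-second figure by 1757, the score of the VAX 11/780 that is conventionally taken as 1 MIPS.

18518518 / 1757 ≈ 10540 DMIPS (gcc -O2)
9090909 / 1757 ≈ 5174 DMIPS (TinyCC)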

The floating-point test is, for example, SciMark 2 (it is also old and is still used, for instance, in the pyperformance package).

Scimark results
Running just this test looks like this.
D:\VSProjects\Python36\python.exe pyperformance run -o py3z.json -b=scimark

Python version: 3.6.0 (64-bit) revision 41df79263a11
Report on Windows-7-6.1.7601-SP1
Number of logical CPUs: 4
Start Date: 2017-01-19 03:03:35.583496
End date: 2017-01-19 03:07:00.042133
### scimark_fft ###
Median ± std dev: 863 ms ± 4 ms
### scimark_lu ###
Median ± std dev: 462 ms ± 10 ms
### scimark_monte_carlo ###
Median ± std dev: 247 ms ± 2 ms
### scimark_sor ###
Median ± std dev: 577 ms ± 6 ms
### scimark_sparse_mat_mult ###
Median ± std dev: 10.3 ms ± 0.1 ms

Results for the same in C, matched to the Python test's iteration counts:
> gcc -mfpmath=sse -march=native -O2
FFT ms*50: 2.06 (N=1024)
SOR ms*10: 0.59 (100 x 100)
MonteCarlo ms*1e5: 1.51
Sparse matmult ms: 0.79 (N=1000, nz=50000)
LU ms: 0.32 (M=100, N=100)

Decoding: the Python test reports the time for one run of a test, but a run may contain a different number of identical iterations; for example, the Fourier transform is computed 50 times, which is reflected by the multiplier in the C results.

Here the difference across the tests can reach 1500 times (for example, LU: 462 ms in Python versus 0.32 ms in C is a factor of roughly 1400). I don't like this - it looks suspiciously large - so anyone who wants to find errors in the testing is welcome.

Turning off optimization and SIMD in C gives a 3-4 times slower result:
FFT ms*50: 8.80 (N=1024)
SOR ms*10: 0.66 (100 x 100)
MonteCarlo ms*1e5: 4.04
Sparse matmult ms: 2.87 (N=1000, nz=50000)
LU ms: 1.11 (M=100, N=100)

Now about the gradation.

If you set aside pure interpreters like Python and old PHP, all compilers and most JIT machines fit into a 2x-4x range, with the best JITs (Java) lagging by less than 2x even in the worst cases; and wrong compilation options for a C program computing FFT, for example, slow it down fivefold.

On the other hand, if you look here, in the modern world of JavaScript the interpreter is much closer to you than you think.

The way to measure this factor is given above.

Evaluation. Since there are still many factors ahead, I would rate this speed factor as insignificant, except for a narrow circle of purely computational problems.

b) The influence of the programming paradigm


Take C++/D as an example. These languages add two major paradigms: OOP and template metaprogramming. I will not sing their praises - let's see how this affects performance.

Using metaprogramming: templates, macros, and partly generics (though the object-flavored ones carry the worst karma). Templates unfold into code with no loss of performance, or even with a gain in speed, since functions are inlined and call overhead disappears. The only thing their use threatens is code bloat, due to generating copies of the same algorithm's code for different data types.
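A minimal sketch of the effect (my own illustration, not part of the original tests): the classic comparison of C's qsort, which calls its comparator through a function pointer, against std::sort, whose template comparator can be inlined. On typical compilers the template version wins noticeably, at the cost of an extra instantiation per element type.

// inline_vs_fnptr.cpp - hypothetical sketch; build with: g++ -O2 inline_vs_fnptr.cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>

static int cmp_int(const void* a, const void* b) {        // called via pointer, cannot be inlined
    return *(const int*)a - *(const int*)b;
}

int main() {
    std::vector<int> a(5'000'000), b;
    for (auto& x : a) x = rand();
    b = a;

    auto t0 = std::chrono::steady_clock::now();
    qsort(a.data(), a.size(), sizeof(int), cmp_int);       // C: indirect call per comparison
    auto t1 = std::chrono::steady_clock::now();
    std::sort(b.begin(), b.end(),
              [](int x, int y) { return x < y; });         // template: comparator is inlined
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("qsort:     %.1f ms\n", ms(t1 - t0).count());
    printf("std::sort: %.1f ms\n", ms(t2 - t1).count());
}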

Using OOP. The problem here is that as soon as you take an object out of a framework, it "invites" its dad, mom, and all the other relatives into your program, even if they are never used (linkers and class loaders are not as smart as one would like). Together they eat up time loading from disk and space in memory and in the processor cache. This is also a memory aspect.

I almost forgot: some technologies introduced alongside OOP, such as exceptions and other advanced features, are also not entirely free.

How to measure this aspect: there are the Stepanov tests, which show a speedup of up to 2x when calls are inlined and a loss of up to 2% on virtual calls, but the impact of code bloat is extremely difficult to assess - see the memory aspect below.
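A rough sketch in the spirit of such a measurement (my own simplification, assuming the compiler cannot devirtualize the call because the dynamic type is chosen at run time):

// virt_vs_direct.cpp - hypothetical sketch; build with: g++ -O2 virt_vs_direct.cpp
#include <chrono>
#include <cstdio>

struct Op  { virtual long apply(long x) const = 0; virtual ~Op() {} };
struct Inc : Op { long apply(long x) const override { return x + 1; } };
struct Dec : Op { long apply(long x) const override { return x - 1; } };

static long inc_direct(long x) { return x + 1; }           // plain function the compiler can inline

int main(int argc, char**) {
    const long N = 200'000'000;
    const Op* op = (argc > 1) ? static_cast<const Op*>(new Dec)
                              : static_cast<const Op*>(new Inc);

    auto t0 = std::chrono::steady_clock::now();
    long s1 = 0;
    for (long i = 0; i < N; ++i) s1 = op->apply(s1);       // virtual call per iteration
    auto t1 = std::chrono::steady_clock::now();
    long s2 = 0;
    for (long i = 0; i < N; ++i) s2 = inc_direct(s2);      // direct call; after inlining the compiler may even fold the loop
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("virtual: %.0f ms (s=%ld)\n", ms(t1 - t0).count(), s1);
    printf("direct:  %.0f ms (s=%ld)\n", ms(t2 - t1).count(), s2);
    delete op;
}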

The estimation of this speed factor is from insignificant to noticeable.

c) Byte Code Machines (BCVM)


In performance they sit between machine-code compilers and interpreters, closer to the former.

The essential difference here is whether the intermediate bytecode is compiled by a JIT or AOT compiler, or interpreted. A mixed variant is also possible, when the BCVM was designed for another language and part of the code cannot be translated into that machine's P-code.

In any case, we pay with either load time or memory. For example, the Java JIT costs about 100 MB of memory, which is already significant, although the resulting code is very fast and uses SIMD (tested recently in the article about FFT).

How to measure: use benchmarks; often it is enough to search the Internet for something like this.

Evaluation of this speed factor: if the JIT is good, it is insignificant (plus the memory aspect), but if the bytecode machine is an interpreter or is "alien" to the language, then alas.

Memory aspects


a) Stack allocation


It costs nothing, whereas going to the dynamic memory manager often pulls in a system call. Coupled with garbage collection and memory fragmentation, this can become a problem for 24/7 systems.

If your language allows temporary local objects to be created without using a heap, you will benefit.

Measured with benchmarks; by itself it costs little, but it can drag in the aspects below.
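A minimal comparison sketch (my own illustration): a small temporary buffer created on the stack versus the same buffer taken from the heap on every iteration. The exact ratio depends heavily on the allocator, but the heap version also drags in the factors discussed below.

// stack_vs_heap.cpp - hypothetical sketch; build with: g++ -O2 stack_vs_heap.cpp
#include <chrono>
#include <cstdio>
#include <vector>

volatile int sink;                                         // keeps the work from being optimized away

static void use(int* buf, int n) { for (int i = 0; i < n; ++i) buf[i] = i; sink = buf[n - 1]; }

int main() {
    const int N = 2'000'000, LEN = 64;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) { int buf[LEN]; use(buf, LEN); }                      // stack: just a pointer bump
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i) { std::vector<int> buf(LEN); use(buf.data(), LEN); }  // heap: allocator call each time
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("stack: %.0f ms\n", ms(t1 - t0).count());
    printf("heap:  %.0f ms\n", ms(t2 - t1).count());
}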

b) Garbage Collector (GC)


It is not clear whether this is a blessing for newcomers or a curse for the "big ones", but since you sometimes have to fight it, count it as a negative factor.

Measuring it with benchmarks is difficult; it requires intricate diagnostics of GC behavior using the BCVM's own tools.

Estimation of the factor - for large systems, significant unpredictable lags may occur.

c) Dynamic loading


When we bring in a new class - for example, we open a new window in the program - two things happen: we need to load a chunk of related code (machine code, or a framework in the case of a BCVM - remember OOP and the mother of all relatives) and run the JIT over the loaded code.

A reality check for a BCVM: you can do a simple test - start loading a file and intensively click around the interface, and now your program runs at swap speed.
Full testing is difficult; it is easy to measure only the load time and to look at the program's maximum potential appetite - the VSZ size on Linux and the "Allocated Virtual Memory" counter on Windows. Studying the hard page fault statistics helps a little more.
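For Linux, a hedged sketch of how a program can report its own appetite: getrusage() gives the peak resident set and the hard (major) page fault count. VSZ can be read from /proc/self/status, and Windows has the corresponding process counters in Performance Monitor.

// mem_report.cpp - Linux-only sketch; build with: g++ mem_report.cpp
#include <sys/resource.h>
#include <cstdio>

void report(const char* tag) {
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    // ru_maxrss is in kilobytes on Linux; ru_majflt counts hard page faults (disk I/O required)
    printf("[%s] peak RSS: %ld KB, hard page faults: %ld, soft: %ld\n",
           tag, ru.ru_maxrss, ru.ru_majflt, ru.ru_minflt);
}

int main() {
    report("start");
    // ... load the "heavy" part of the program here and call report() again ...
    report("after load");
}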

It might seem the problem should be solved by cheap, plentiful memory and preloading, but if one program consumes an extra two, three, or ten MB, a typical system runs dozens of them - and oops.
Impact assessment: possible unpredictable lags.

d) CPU cache


The more code your program (and not only yours) executes and the more data it processes, the stronger the influence of the processor's cache size. And the cache is needed both by the numerous template and generic instantiations and by the JIT.

Testing: you need to run the same program on processors of the same family with different cache sizes, which is not available to everyone.

Impact: difficult to assess. This article argues that keeping data localized in the cache gives a significant speedup.
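A quick way to feel the effect on any machine (my own illustration, not taken from the cited article): summing the same matrix by rows (cache-friendly, sequential access) and by columns (roughly one cache miss per element once the matrix exceeds the cache).

// cache_order.cpp - hypothetical sketch; build with: g++ -O2 cache_order.cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 8192;                                    // 8192*8192*4 bytes = 256 MB, far larger than any cache
    std::vector<int> m((size_t)N * N, 1);
    long s1 = 0, s2 = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < N; ++i)                            // row-major walk: sequential addresses
        for (int j = 0; j < N; ++j) s1 += m[(size_t)i * N + j];
    auto t1 = std::chrono::steady_clock::now();
    for (int j = 0; j < N; ++j)                            // column-major walk: stride of N*4 bytes
        for (int i = 0; i < N; ++i) s2 += m[(size_t)i * N + j];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("by rows:    %.0f ms (s=%ld)\n", ms(t1 - t0).count(), s1);
    printf("by columns: %.0f ms (s=%ld)\n", ms(t2 - t1).count(), s2);
}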

Aspects of the real world


a) System calls


On different platforms a system call - say, taking a mutex or creating a thread - can have a very different cost, so the difference in speed can be severalfold due to this factor alone. One recent example was on Habr.
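A hedged sketch of the kind of measurement meant here (my own illustration): timing thread creation and an uncontended mutex lock/unlock, two operations whose cost differs greatly between platforms (the uncontended mutex may not enter the kernel at all, depending on the implementation).

// syscall_cost.cpp - hypothetical sketch; build with: g++ -O2 syscall_cost.cpp -pthread
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>

int main() {
    using clk = std::chrono::steady_clock;
    using us  = std::chrono::duration<double, std::micro>;

    const int T = 2000;
    auto t0 = clk::now();
    for (int i = 0; i < T; ++i) { std::thread t([] {}); t.join(); }   // create and join an empty thread
    auto t1 = clk::now();
    printf("thread create+join: %.1f us each\n", us(t1 - t0).count() / T);

    const int M = 10'000'000;
    std::mutex mtx;
    auto t2 = clk::now();
    for (int i = 0; i < M; ++i) { mtx.lock(); mtx.unlock(); }         // uncontended: usually stays in user space
    auto t3 = clk::now();
    printf("uncontended lock/unlock: %.3f us each\n", us(t3 - t2).count() / M);
}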

Measurement: testing on the different platforms. Impact assessment: roughly as for the choice of programming language - insignificant, except for specific tasks.

b) How important is abstract performance?


In most cases, the real iron you end up working on is slower. Either the speed is quite enough for the user's reaction time, or your site is not visited enough for the delay to matter.

c) Hardware matters


Perhaps you have an ARM processor and floating point is bad.
Perhaps tight memory limits do not allow a complex algorithm to fit.
Perhaps the network is slow or unstable.
Maybe the video accelerator is very specific in its functionality.
This requires preliminary testing on a mock-up of the target system.
The impact can be very strong.
Evaluation: think, for example, of the battery in your phone, which has to feed all that memory and all those gigahertz.

Non-language factors


Algorithms decide everything. For example, SQL loses on its purely linguistic computational characteristics (it is an ordinary interpreter), but against the background of good hardware and long data-access times it looks fine overall. However, in some specific cases an overly generalized SQL approach loses to direct navigation, such as M/Caché or NoSQL solutions.

Well-polished libraries negate the shortcomings of a particular language: they account for most of the time spent on the task and let you almost not think about it. These are the math libraries and, almost always, the runtime libraries.

Verification and evaluation: here only a solid education in algorithms will help.

Human aspect


Ignorance of the language, laziness about writing things properly, and too high a level of abstraction lead to bad decisions on all fronts. For example, it can be easier to scan the entire array or database with a standard iterator than to reorganize the data structure, or easier to copy a couple of tens of megabytes for a temporary analysis...
Examples can be infinitely many.
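A sketch of the first case (my own illustration): looking values up by scanning the whole array with a standard algorithm versus reorganizing the data into a hash index once.

// scan_vs_index.cpp - hypothetical sketch; build with: g++ -O2 scan_vs_index.cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Record { int id; int payload; };

int main() {
    const int N = 1'000'000, LOOKUPS = 1000;
    std::vector<Record> db(N);
    for (int i = 0; i < N; ++i) db[i] = {i, i * 2};

    auto t0 = std::chrono::steady_clock::now();
    long found = 0;
    for (int q = 0; q < LOOKUPS; ++q) {                    // lazy way: full scan per lookup, O(N)
        auto it = std::find_if(db.begin(), db.end(),
                               [&](const Record& r) { return r.id == N - 1 - q; });
        found += it->payload;
    }
    auto t1 = std::chrono::steady_clock::now();

    std::unordered_map<int, int> index;                    // reorganized: build an index once, O(1) lookups
    for (const auto& r : db) index[r.id] = r.payload;
    long found2 = 0;
    for (int q = 0; q < LOOKUPS; ++q) found2 += index[N - 1 - q];
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    printf("full scans: %.1f ms, index build + lookups: %.1f ms (%ld/%ld)\n",
           ms(t1 - t0).count(), ms(t2 - t1).count(), found, found2);
}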

So what is one supposed to do about all this?


The first answer is: nothing - if you are constrained by operational restrictions, inherited code, or simply lack of experience.

The second answer is to write in assembly language and use only the tools and APIs native to the system.

If that is too tedious or too difficult for you, you have a choice. I offer the following ladder of steps from the programmer's paradise down to the fall:

0. Using C. Pure C increases your program only by the size of the runtime library (libc) in memory, and a modern optimizer will not spoil the executable code.

In fact, it does not have to be C - any compiled language to your taste will do: Ada, Modula, Oberon, even Pascal/Delphi, preferably with an optimizing compiler.

-1. Using metaprogramming: templates, macros, partly generics (though the OOP-flavored ones already carry worse karma). Templates unfold into code with no loss of performance. All their use threatens is code bloat, due to generating copies of the same algorithm's code for different data types.

-2. Using compiled object languages: C++, D, Rust, Go. What is bad about them? On the one hand, the fashion for objects has already passed everywhere, and many sober-minded developers criticize the object approach for not giving a decisive gain in development. On the other hand, there are massive, tried-and-true, convenient frameworks. The problem is that most of a framework's objects will be linked in even if they are not used in your program. Oops - and your executable has exceeded a dozen megabytes. These megabytes occupy your disk, your memory, and the processor cache. And although the overhead of function calls is measured in a couple of percent, the effect on CPU cache utilization is hard to measure.

-1024. Interpreted languages: PHP, CPython, partly JavaScript. Thanks to them, a browser session now exceeds a hundred megabytes, and losing 100x in computation speed is business as usual.

-π. Virtual machines. These are languages with intermediate bytecode on the .NET or Java platforms; PyPy and V8 can also be included here. You always carry the bytecode with you: it is with you on disk, it is with you in memory. Although, if you believe the benchmarks, everything seems almost fine - a loss of tens of percent on computation, up to a maximum of 100% in the worst cases. Only the benchmarks, as a rule, do not account for the startup cost of the virtual machine itself. And that is memory, memory, and again memory - tens and hundreds of megabytes.
The developers themselves know about this problem and are moving toward compilation to native code: ART instead of Dalvik on Android, and Microsoft's .NET Native.

P.S. For a final example: I have three user programs that are similar in functionality:

One is written in C++/Qt/WebKit
Another in C#/.NET 4.5
The third in Python3/wx

The most responsive of them is the Python one, and the slowest is the .NET one.

Conclusion: there are many aspects, and it is useless to take only one of them into account.

Source: https://habr.com/ru/post/326592/

