Running a sack race, blindfolded and backwards

Which programming language is the fastest? It is not always a practical question, but it is a very interesting one. The benchmarksgame site is devoted to exactly that: it compares the speed of programming languages on a number of typical tasks. I must say that the results are not always predictable. What if JavaScript turns out to be as fast as C? Scandal!

Pride and Prejudice

The power of doing anything with quickness is always much prized by the possessor, and often without any attention to the imperfection of the performance. - Jane Austen

Benchmarksgame is often cited to prove the advantages or disadvantages of a particular programming language. However, you need to be careful. Those who measure performance professionally know that there are many pitfalls here, and that it is easy to get into trouble. For example, a Java virtual machine needs some time to warm up, so on tests that are too short the results will be unrepresentative. Fortunately, as far as statistics go, the site takes a very systematic approach.
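To make the warm-up pitfall concrete, here is a minimal sketch of a measurement harness in Python. It is not how the benchmarksgame measures anything: the function name `measure` and the `warmup` and `repeats` parameters are my own assumptions for illustration. The idea is simply to throw away the first few runs and report the median of the rest.

```python
import statistics
import time


def measure(func, warmup=3, repeats=10):
    """Time func() several times, discarding the first `warmup` runs.

    Hypothetical helper for illustration; returns (median, samples) in seconds.
    """
    # Warm-up runs: executed but not timed, so one-off start-up costs
    # (JIT compilation, cold caches) do not distort the samples.
    for _ in range(warmup):
        func()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to a single slow outlier than the mean.
    return statistics.median(samples), samples
```

CPython itself has little to warm up, but the same structure applies to runtimes with a JIT compiler, where the discarded runs matter much more.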


But the numbers still cannot be trusted, and here is why.

Imagine that your favorite programming language is, say, C. In one of the comparisons C loses to Java, and by a wide margin: a factor of two. Injustice! You open the C solution and see that it is not written very carefully, and that a lot can obviously be improved and optimized. If a free evening turns up and a couple of beers are on the table, a patch is inevitable. This approach is the main problem.

The idea of the site is to compare standard, non-optimized solutions. Ideally, all programs should implement the same typical algorithm, without tricks, hacks, non-standard libraries and the like. Alas, the current maintainer of the project, Isaac Gouy, despite his obvious professionalism and thoroughness in other matters, still accepts such solutions.

For this reason I will not give tables or graphs. Instead, I will try to analyze several tasks in detail, at the level of the code of the various solutions.

Task: n-body

King Oscar II of Sweden was an enlightened monarch. He worried about many important questions, for example: will the Moon fall onto the Earth? In 1885 he announced a mathematical competition that posed the three-body problem: modeling the Earth-Moon-Sun system.

The winning solution was submitted by Henri Poincaré. Although it was not exact, it made a significant contribution to the development of mathematics and, in particular, to chaos theory. In the general case, the task is called the N-body problem.

The n-body task at the benchmarksgame is a simulation of the Sun-Jupiter-Saturn-Uranus-Neptune system using a finite-difference method, i.e. stepping the system forward in small time increments. From a technical point of view, the task is a series of arithmetic operations on a small number of double variables in a nested loop.
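To show what "arithmetic on a few doubles in a nested loop" looks like, here is a rough Python sketch of one time step. It is my own illustration, not code from the site: the dictionary layout, the function names, the unit system with G = 1 and the toy two-body system are all assumptions, and the update order (velocities first, then positions) is just one common choice of scheme.

```python
import math


def advance(bodies, dt):
    """Advance all bodies by one time step dt (units chosen so that G = 1)."""
    # Nested loop over every pair: update velocities from mutual gravity.
    for i, a in enumerate(bodies):
        for b in bodies[i + 1:]:
            delta = [pa - pb for pa, pb in zip(a["pos"], b["pos"])]
            dist2 = sum(d * d for d in delta)
            mag = dt / (dist2 * math.sqrt(dist2))
            for k in range(3):
                a["vel"][k] -= delta[k] * b["mass"] * mag
                b["vel"][k] += delta[k] * a["mass"] * mag
    # Then move every body along its updated velocity.
    for body in bodies:
        for k in range(3):
            body["pos"][k] += dt * body["vel"][k]


def energy(bodies):
    """Total energy: kinetic minus pairwise gravitational potential."""
    total = 0.0
    for i, a in enumerate(bodies):
        total += 0.5 * a["mass"] * sum(v * v for v in a["vel"])
        for b in bodies[i + 1:]:
            total -= a["mass"] * b["mass"] / math.dist(a["pos"], b["pos"])
    return total


def simulate(bodies, dt, steps):
    """Run `steps` time steps in place."""
    for _ in range(steps):
        advance(bodies, dt)


def two_body_system():
    """Hypothetical toy system: two unit masses on a circular orbit."""
    v = math.sqrt(0.5)
    return [
        {"pos": [0.5, 0.0, 0.0], "vel": [0.0, v, 0.0], "mass": 1.0},
        {"pos": [-0.5, 0.0, 0.0], "vel": [0.0, -v, 0.0], "mass": 1.0},
    ]
```

A useful sanity check for any such solution is that the total energy before and after the simulation stays almost the same, which is also what the benchmark programs print.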

It is logical that interpreted languages show mediocre results: 3 minutes for Erlang, followed by PHP, Lua, Perl and Ruby, with Python closing the series at 13 minutes.

Honest solutions in compiled languages form a tight group between 20 and 22 seconds, in the following order: Chapel, C#, Go, OCaml, Swift, Java, Free Pascal. Node.js, TypeScript, Lisp and Dart lag slightly behind, all at around 27 seconds.

The leaders

Rust showed an unexpectedly good result: 13 seconds with an honest solution. True, the solution was submitted by the Rust language development team, so perhaps its simplicity is only apparent.

The undisputed winner is Fortran with 8 seconds, but there is a catch here too. The best solution is the result of four iterations of code improvement by various developers. Whether the final code is what any developer would write is still debatable.

Cheaters

The Haskell solution runs in 21 seconds, but it uses all 4 processor cores, which is not entirely fair.

In C, the program was optimized down to 8 seconds by using the __m128d data type and hand-written SSE2 instructions. It is difficult to call this an honest solution. With standard arithmetic, C runs in the same 20 seconds.

Findings

Compiled programming languages, which here include JavaScript (V8) and related languages, are almost equally fast at numerical computation. So if you suddenly need to simulate planetary motion in your application, do it in the browser. It will be just as fast and much more economical in terms of server resources.

Task: binary-trees

Although this task lacks an equally deep historical context, it is no less curious than the previous one. It consists of building a series of complete binary trees, in which every parent has exactly two children, with depths from 6 to 22. Each constructed tree must be traversed depth-first to compute a simple checksum, which is the output of the algorithm.

The goal is to measure the standard memory management mechanisms, so the rules require explicitly allocating and freeing memory for each node using the basic facilities of the language. Accordingly, the rules explicitly forbid allocating an array of 8388608 minus 1 elements up front, and anything in that spirit.
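The task is easy to sketch. Below is a small Python illustration of my own (the function names and the tuple representation of a node are assumptions, not the site's reference code): one allocation per node, a depth-first walk that counts nodes as the checksum, and the arithmetic behind the forbidden array size.

```python
def bottom_up_tree(depth):
    """Build a complete binary tree; every node is a separate (left, right) tuple."""
    if depth == 0:
        return (None, None)  # leaf: no children
    return (bottom_up_tree(depth - 1), bottom_up_tree(depth - 1))


def item_check(node):
    """Depth-first walk that counts the nodes; the count serves as the checksum."""
    left, right = node
    if left is None:
        return 1
    return 1 + item_check(left) + item_check(right)


def node_count(depth):
    """A complete binary tree of the given depth has 2^(depth + 1) - 1 nodes."""
    return 2 ** (depth + 1) - 1


def checksums(min_depth, max_depth):
    """Build one tree per depth and return (depth, checksum) pairs."""
    return [(d, item_check(bottom_up_tree(d))) for d in range(min_depth, max_depth + 1)]
```

For depth 22 the formula gives 2^23 - 1 = 8388607 nodes, which is exactly the "8388608 minus 1" array that the rules forbid allocating in advance.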

The leaders

It is not surprising that the best honest result belongs to Java (a little over 12 seconds), since the task fits the JVM garbage collection model very well. Because memory is allocated sequentially and then freed in one large chunk, the cost of allocating memory on the Java heap is in this case comparable to the cost of allocating on the stack, i.e. practically free.

It could only be made faster by switching to a Zero GC, i.e. by disabling the garbage collector completely. Why not, if there is enough memory? The idea, by the way, is not new. The first Lisp implementation, in 1958, used exactly such a garbage collector: memory was allocated as long as the system had any free, and the implementation of the garbage collection algorithm was postponed until better times.

By comparison, an honest C solution that calls malloc and free for each node takes as long as 37 seconds. Well, that is the kind of task it is.

Cheaters

OCaml, 10 seconds - allocates memory by layers:

let workers = Array.init ((max_depth - d) / 2 + 1) (fun i -> let d = d + i * 2 in (d, invoke worker d))

Rust, 6 seconds - uses an arena and multithreading:

let long_lived_arena = Arena::new();

let long_lived_tree = bottom_up_tree(&long_lived_arena, max_depth);

...

thread::spawn(move || inner(depth, iterations))

Once again Rust, 4 seconds - uses an arena and a parallel iterator (rayon::prelude):

let arena = Arena::new();

let depth = max_depth + 1;

let tree = bottom_up_tree(&arena, depth);

...

let chk: i32 = (0..iterations).into_par_iter().map(|_| {

...

Finally, C, 2 and a half seconds - uses APR memory pools (apr_pools) and OpenMP parallelization via a #pragma directive:

apr_pool_t *thread_Memory_Pool;

apr_pool_create_unmanaged(&thread_Memory_Pool);

...

#pragma omp parallel for

for (current_Tree_Depth = minimum_Tree_Depth; ...

Findings

A memory management model with a garbage collector, as in Java / C#, can be significantly more efficient than naive manual memory management on certain tasks.

Garbage collection in Dart, Node.js and Go may need improvement: their result is about 40 seconds, whereas they could run as fast as Java. It is likely, though, that garbage collector speed in these languages is deliberately sacrificed to minimize memory consumption.

By optimizing memory management by hand, you can gain at least a 2-fold speedup, and it is not too difficult.
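For readers who have not met the arena trick used by the Rust and C solutions, here is a toy Python sketch of the idea. It is only an illustration under my own assumptions (the class `TreeArena` and its methods are hypothetical, and Python will not show the speedup): nodes are not allocated one by one, they are appended to shared storage and all released together.

```python
class TreeArena:
    """Index-based arena: nodes live in two parallel lists and are freed all at once."""

    NO_CHILD = -1

    def __init__(self):
        self.left = []
        self.right = []

    def alloc(self, left=NO_CHILD, right=NO_CHILD):
        """'Allocate' a node by appending to the lists; returns the node's index."""
        self.left.append(left)
        self.right.append(right)
        return len(self.left) - 1

    def reset(self):
        """Release every node in one step instead of freeing them individually."""
        self.left.clear()
        self.right.clear()


def build(arena, depth):
    """Build a complete binary tree inside the arena and return the root index."""
    if depth == 0:
        return arena.alloc()
    left = build(arena, depth - 1)
    right = build(arena, depth - 1)
    return arena.alloc(left, right)


def check(arena, node):
    """Count the nodes reachable from `node`."""
    left = arena.left[node]
    if left == TreeArena.NO_CHILD:
        return 1
    return 1 + check(arena, left) + check(arena, arena.right[node])
```

The point is that this sidesteps exactly what the task is supposed to measure: the per-node cost of the language's standard allocator.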

Task: thread-ring

You must create 503 threads connected in a ring: the 1st thread refers to the 2nd, the 2nd to the 3rd, and so on, and the 503rd refers back to the 1st. A token must be passed between the threads in order 50,000,000 times, and then the number of the thread that received the token last is printed. A kind of hot potato game.

Honest solutions

A neat solution would be to create 503 threads, connect them into a ring with 503 channels and pass a message around through them. For Java this would be a BlockingQueue, for Go a channel, for Erlang the built-in inter-process messages.
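Here is roughly what such a neat solution looks like, sketched in Python with one `queue.Queue` per thread playing the role of the channel. This is my own illustration, not one of the submitted programs: the function name `thread_ring`, the countdown token and the `None` shutdown message are assumptions made for the example.

```python
import queue
import threading


def thread_ring(n_threads, n_passes):
    """Pass a token around a ring of real threads; return the 1-based number
    of the thread that holds the token after n_passes hand-offs."""
    inboxes = [queue.Queue() for _ in range(n_threads)]
    winner = []
    done = threading.Event()

    def worker(index):
        inbox = inboxes[index]
        outbox = inboxes[(index + 1) % n_threads]  # the last thread closes the ring
        while True:
            token = inbox.get()  # blocks until the neighbour sends something
            if token is None:    # shutdown message from the main thread
                return
            if token == 0:       # no passes left: this thread is the answer
                winner.append(index + 1)
                done.set()
            else:
                outbox.put(token - 1)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()

    inboxes[0].put(n_passes)  # thread 1 receives the initial token
    done.wait()

    for inbox in inboxes:     # tell every thread to stop
        inbox.put(None)
    for t in threads:
        t.join()
    return winner[0]
```

For example, `thread_ring(503, 1000)` returns 498, because 1000 mod 503 = 497 and the token starts at thread 1. Every hand-off here really wakes up another operating-system thread, which is what the task is meant to measure.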

For Rust this takes about 3 minutes, for Ruby 5 minutes, for C# 6 minutes. Unfortunately, there are no honest solutions for other languages.

Cheaters

Java, using LockSupport.park() and volatile, managed to get down to a little over 3 minutes. A similar approach (using a mutex) for Python 3, OCaml, Lisp and C takes up to 2 and a half minutes. It is curious that in all cases the four processors are loaded at about 30% each on average. Since only one thread can hold the token at any moment, the ideal load would be 25% per processor, i.e. the overhead of semi-active waiting is about 5%.

The Erlang (43 seconds), Smalltalk (39 seconds), Chapel (27 seconds), Go (13 seconds) and Haskell (9 seconds) solutions do not count, because they in fact used exactly one processor core, which says nothing about the real inter-thread communication performance of these languages. The Go solution even states explicitly: runtime.GOMAXPROCS(1). This is not serious. One might just as well have written a plain loop of fifty million iterations.
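To see why a single-core ring says so little, note that without real parallelism the whole exercise collapses into the following loop. This is a sketch of my own (the function names are hypothetical), together with the closed form that the loop computes.

```python
def ring_winner_by_loop(n_threads, n_passes):
    """Simulate the ring with a plain loop: no threads, no messages."""
    current = 1  # the token starts at thread 1
    for _ in range(n_passes):
        current = current % n_threads + 1  # hand over to the next thread, 503 -> 1
    return current


def ring_winner_closed_form(n_threads, n_passes):
    """The same answer without any loop at all."""
    return n_passes % n_threads + 1
```

For the benchmark parameters the closed form gives 50,000,000 mod 503 + 1 = 292, so the expected answer is known before a single message is sent.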

Another hack is C++: 29 seconds. The solution is built on asio.hpp, an asynchronous I/O library, which is interesting in itself but has nothing to do with the task of passing a message between threads. Apparently the F# solution (18 seconds) works on the same principle, because it uses the async primitive to define a deferred function instead of a thread.

Conclusions instead of a leader

There is, alas, no leader, because for languages like Go or Erlang, where an honest solution would be expected to show good results, no such solution has been submitted.

Multi-threaded communication is much more efficient on lightweight user-level threads (Erlang processes, goroutines), especially when it runs on a single physical core. Juggling real operating-system threads, with saving and restoring the full context and with prioritization by a common scheduler across all processes, is much slower.
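The lightweight variant can be sketched in Python with asyncio tasks. This is only a loose analogy of my own to Erlang processes or goroutines, not an equivalent: the function name `coroutine_ring` is hypothetical, all tasks are cooperative and share a single operating-system thread, and a hand-off is just a switch inside the event loop.

```python
import asyncio


async def coroutine_ring(n_tasks, n_passes):
    """Pass a token around a ring of asyncio tasks on one OS thread; return the
    1-based number of the task that holds the token after n_passes hand-offs."""
    inboxes = [asyncio.Queue() for _ in range(n_tasks)]
    winner = asyncio.get_running_loop().create_future()

    async def worker(index):
        inbox = inboxes[index]
        outbox = inboxes[(index + 1) % n_tasks]
        while True:
            token = await inbox.get()  # yields to the event loop, no OS context switch
            if token == 0:
                winner.set_result(index + 1)
                return
            outbox.put_nowait(token - 1)

    tasks = [asyncio.create_task(worker(i)) for i in range(n_tasks)]
    inboxes[0].put_nowait(n_passes)  # task 1 receives the initial token
    result = await winner

    # The remaining tasks are still waiting on their inboxes: cancel them.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return result
```

It can be run with `asyncio.run(coroutine_ring(503, 1000))`. No kernel scheduler is involved in any hand-off, which is why results obtained this way cannot be compared with those of real threads.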

Asynchronous I/O instead of real threads is a great thing, but we have known that since nginx and node.js.

Grand total

I am fast, I am very fast... In the bedroom, I hit the switch and manage to get into bed before the light goes out... I am very fast. - Muhammad Ali

The desire to embellish reality, unfortunately, defeats common sense in the souls of developers, at least on the benchmarksgame site. As a result, instead of a way to really compare various aspects of programming language performance, we have a zoo of quite sophisticated code optimization techniques. That is curious, of course, but a little bit wrong. And yet it seems it would be easy to restore order there.

As for the various studies based on the benchmarksgame: do not believe them, read the code.

Source: https://habr.com/ru/post/346684/
