
Haskell and Java: a comparison on a real task (satellites, ICFP Contest)

Today on Habré there was an article about Nemerle and ICFP 2009. I would like to share my own recent research on this topic. My goal was to write a compiler that turns the VM from the task into ideal code, to do it in Haskell, and, most importantly, to compare the speed of the resulting Java and Haskell code. This is not a complete solution for the ICFP problem: that problem is solved by brute-force search, and the VM is its innermost loop, the place the performance of the whole search-based solution depends on, which is exactly what makes it interesting.



Introduction to the task: the organizers provide a program in the form of byte code, together with its specification. The byte code contains no jump instructions, only instructions of the form m1 <- m2 + m3 and the like, including an instruction for conditional selection between two cells. There are three addressable kinds of memory, each behaving like an array of doubles: the input-parameter memory (read-only), the working memory (read/write), and the output-parameter memory (write-only). One pass of the program produces the output coordinates of the celestial bodies at the next point in time; the working memory must be preserved between passes, because it holds the state of the world. Using this "world simulator", one has to feed control inputs to a certain satellite, which is supposed to fly around and do things in this world. All coordinates inside are computed by well-known, close-to-real formulas, and everything comes out very nicely. It is these formulas that are encoded in the VM. Using this VM, one must ultimately control the satellite optimally, and those control sequences are what may or may not win the prize.
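To make the model concrete, here is a minimal sketch of such a VM in Haskell. The instruction names (Add, Mul, Select) and the Map standing in for the double arrays are my own simplification, illustrative rather than the official instruction set:

import Data.Map (Map)
import qualified Data.Map as Map

type Addr = Int
type Mem  = Map Addr Double

-- No jumps: just arithmetic over cells plus a conditional selection.
data Instr
  = Add    Addr Addr Addr       -- m1 <- m2 + m3
  | Mul    Addr Addr Addr       -- m1 <- m2 * m3
  | Select Addr Addr Addr Addr  -- m1 <- if m2 /= 0 then m3 else m4

-- One full pass of the program; working memory is carried between passes.
step :: [Instr] -> Mem -> Mem
step prog mem0 = foldl exec mem0 prog
  where
    get m a = Map.findWithDefault 0 a m
    exec m (Add d a b)      = Map.insert d (get m a + get m b) m
    exec m (Mul d a b)      = Map.insert d (get m a * get m b) m
    exec m (Select d c a b) = Map.insert d (if get m c /= 0 then get m a else get m b) m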
Before comparing Haskell and Java, I would like to clarify something about the comparison of OCaml and C/C++ program speeds that was discussed in the comments there. In the original article, a C++ interpreter was compared with the JAM compiler for OCaml, hence the conclusion about OCaml's superiority in speed. In our variant, we compare identical optimal virtual machines generated ahead of time by a compiler.

What can I say about speed? On this task Haskell is more than twice as fast as Java. It is well known that generating C code instead would raise performance by another factor of 5-10: we tried this with non-ideal generated code even during the contest.

So, the task: problem 4001 in full, that is, all objects: 12 passive satellites + the Moon + our satellite. We measure iterations per second.

Hardware: a non-overclocked Intel Q6600, Arch Linux 64-bit, GHC 6.10.3, JDK 1.6u14, everything single-threaded: the point is to measure the quality of Haskell as a compiler. The Java measurements were taken after HotSpot had warmed up, which happened within the first seconds, after which the result did not change.

Results (iterations per second; 1 iteration = the whole world of tasks 4001..4004 computed in full):

* Java version (as sent to the contest, simple compilation): 22K iter/sec
* Java version (perfect compilation): 44K iter/sec
* Java version (perfect compilation, 32-bit JDK): 32K iter/sec
* Haskell version (with lazy data in VMState): 1.73K iter/sec, up to 8.1K iter/sec ==> fail.
* Haskell version (with strict data in VMState): 96K iter/sec.

A perfect VM in Java looks like this (a representative snippet):

public class Vm {
    double m2039;                 // persistent state lives in members
    { m2039 = 0.0; }              // instance initializer sets the initial values

    public void fw(double[] i, double[] o) {
        double t60 = ((6.67428e-11) * (t59 + t41));
        double t61 = (((((t53 * t53) * t53) != 0) ? t60 / ((t53 * t53) * t53) : 0));
        m2039 = (t12 + 1.0);
        o[0] = ((((Math.sqrt(((t1989 * t1989) + (t1987 * t1987)))) - 6357000.0) < 0) ? ...
    }
}


What do we have here? Everything temporary within an iteration goes into local variables, everything persistent into members; expressions are folded into nested parenthesized form (except reusable ones, which go into local variables). Constants become constants. The ports are kept in an array; one could give them individual names, but there are few of them, so this does not affect performance. One function call moves the object to the next state (tick + 1). A nice property of the task is the absence of forward references to temporary variables.

By contrast, an imperfectly compiled virtual machine looks like this: members instead of temporary variables, and the result of every step stored into a member instead of nested parenthesized expressions. That is, almost one-to-one with the specification issued as a PDF with the task:

d57 = d2042;
d59 = (d13 == 0.0 ? d56 : d57);
d60 = d34 * d59;
d61 = (d55 != 0.0 ? d60 / d55 : 0.0);
d62 = d45 * d61;
d63 = d62 + d41;
d64 = d63 * d4;
d65 = d2114;
d66 = d9 * d9;
d67 = d11 * d11;


Now let us look at how this is done in Haskell. Mutating an object is not the done thing there, and there are no simple means for it. Instead, we describe a record (an algebraic data type), then a function that creates the initial instance, and a function that, from one instance, creates another that is one tick ahead of it. The code comes out similar to the Java one, but with let bindings instead of temporary variables. The test consisted of printing the entire state (to force all the lazy chains) after 45,000 iterations.

data VMState = VMState { m2039,...,o101 :: Double }  -- the state record
initState = VMState { m2039 = 0.0, .... }            -- the initial instance
nextState x (i16000, i3, i2) =                       -- one tick of the world
  let ....                                           -- let bindings instead of temporaries
      t18 = (if t13 == 0 then 3.84399e8 else m2051 x) :: Double
  in  -- a copy of x with the changed fields
  x { m2039 = .... , o39 = (if (t1699 - (t12 + t14 * 2)) == 0 then 0.0 else 1.0) }

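For reference, a driver of roughly this kind can run the test; this is my own sketch (the control inputs are stubbed with zeros, and only two fields are printed, whereas the real test printed the whole state):

import System.CPUTime (getCPUTime)

main :: IO ()
main = do
  t0 <- getCPUTime
  let final = iterate (\s -> nextState s (0.0, 0.0, 0.0)) initState !! 45000
  print (m2039 final, o101 final)            -- printing forces the thunks
  t1 <- getCPUTime                           -- getCPUTime is in picoseconds
  putStrLn $ "iter/sec: " ++ show (45000 / (fromIntegral (t1 - t0) / 1e12))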

In the VMState type, the constructor parameters (the record fields) are lazy. If a VMState instance was produced from the previous one via nextState, it contains lazy thunks living on the heap. And if we have to compute 45,000 ticks, we go through 45,000 instances, each with a good hundred fields, all of them lazy. In short, this code gives us 1.73K iterations per second, which is terrible. But it is enough to replace them with strict ones,

data VMState = VMState { m2039,...,o101 :: !Double }  -- strict Double fields


and everything starts to run fast (96K iter/sec). In the profiler we see that there is almost no extra memory allocation (only the VMState instances), and the load on the GC is minimal...
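The effect is easy to reproduce on a toy example of my own (nothing contest-specific): with a lazy field every tick stores an unevaluated "+ 1" thunk, so n ticks keep an n-deep chain alive, while with a strict field each tick stores a plain number. (With -O2, GHC can additionally unbox strict Double fields, which the {-# UNPACK #-} pragma makes explicit.)

data Lazy   = Lazy   { ticksL :: Double }    -- lazy field: accumulates thunks
data Strict = Strict { ticksS :: !Double }   -- strict field: evaluated each tick

stepL :: Lazy -> Lazy
stepL s = s { ticksL = ticksL s + 1 }

stepS :: Strict -> Strict
stepS s = s { ticksS = ticksS s + 1 }

main :: IO ()
main = do
  print (ticksL (iterate stepL (Lazy   0) !! 45000))  -- forces a 45,000-deep chain
  print (ticksS (iterate stepS (Strict 0) !! 45000))  -- already evaluated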

In principle, these measurements might already be enough, but perhaps intermediate variants will give us some hope that laziness can deliver acceptable speeds?

To check this, we allow ourselves a subtask: in most cases in tasks 4001-4004 we only need our own spatial position in response to our impulse, and the satellite coordinates can be computed once and for all (in advance). Let us require from the VM only our position, not the whole structure with the planets' coordinates, etc. Our position corresponds to the fields o2, o3 (output ports 0x2, 0x3 from the specification). Since Haskell is a lazy language, we can use the same function (nextState, the lazy version) for this narrow purpose. In this case, the computation speed is 4.5K iter/sec, more than twice as fast, because the other values are simply never computed!
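In code, narrowing the demand costs nothing: we simply touch only the two fields, and laziness leaves everything else unevaluated. A sketch against the generated nextState shown above (inputs stubbed with zeros again):

positionAfter :: Int -> (Double, Double)
positionAfter n =
  let final = iterate (\s -> nextState s (0.0, 0.0, 0.0)) initState !! n
  in (o2 final, o3 final)  -- only the thunk chains feeding o2 and o3 get forced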

Now let us make these two fields strict and leave all the other fields lazy. The speed is 8.1K iter/sec! This is probably because the garbage collector has a little less work to do: once a strict field is assigned, the chain of lazy computations behind it can be released immediately, rather than only after the result is printed at the end of all iterations. Collection becomes faster and easier.

Why is Haskell faster than Java on our task? Probably because it compiles to a native binary, even granting that the assembly it generates is terrible compared to what C/C++ produces. And, probably, the compiler also does a better job here: thanks to purity it can be sure that identical subexpressions (for example, in conditions like m12345 x == 0) need to be computed only once. That is a luxury Java cannot afford. In Java we would have to write extra code spelling out these details by hand; in Haskell this is unnecessary, the compiler is smart enough.
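This is the pattern visible in the generated code above: (t53*t53)*t53 appears both in the guard and in the division, and the compiler is free to compute it once. A toy illustration of the same shape (my own function, not contest code):

-- r3 is bound once and used twice; purity guarantees both uses see the
-- same value, so the sharing is always safe.
gravity :: Double -> Double -> Double
gravity gm r =
  let r3 = (r * r) * r
  in if r3 /= 0 then gm / r3 else 0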

Comments on the compiler code (which can be found here):

1. it generates the fully lazy variant (the strictness annotations were added by hand on top of the compiler output, for research purposes);
2. constants are emitted in a human-readable format, not bit-exact (bit-exact would be the correct thing to do);
3. there is code inside that can also generate Java; it is currently disabled, but it is there;
4. the compiler is 238 lines, counting comments, blank lines and the Java branch;
5. the compiler is fast enough on bin4.obf (task 4); its code is not optimized (readability was preferred).

Improvements and notes on the compiler are welcome. If needed, I can write a similar article about the compiler itself, namely about translating flat expressions into nested parenthesized ones, and how simply that is done.
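As a teaser, the core of that translation fits in a dozen lines: count the uses of each temporary, inline the single-use ones into their use sites, and keep the rest as bindings. A toy version with my own types (the real compiler handles the full instruction set and keeps the output ports as roots):

data Expr = Var String | Lit Double | Bin String Expr Expr
  deriving Show

-- A flat program: each line binds one temporary to a shallow expression.
type Flat = [(String, Expr)]

-- Inline every temporary that is referenced exactly once; multi-use
-- temporaries stay as bindings (they become local variables / lets).
inline :: Flat -> Flat
inline prog = [ (v, subst e) | (v, e) <- prog, uses v /= 1 ]
  where
    uses v = length [ () | (_, e) <- prog, v `elem` varsOf e ]
    subst (Var v) | uses v == 1, Just e <- lookup v prog = subst e
    subst (Bin op a b) = Bin op (subst a) (subst b)
    subst e = e
    varsOf (Var v)     = [v]
    varsOf (Bin _ a b) = varsOf a ++ varsOf b
    varsOf _           = []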

Thanks for your attention.

Source: https://habr.com/ru/post/67225/
