Principles of Fast Haskell under GHC

GHC (Glasgow Haskell Compiler) is a standard Haskell compiler. GHC is one of the coolest compilers in the world, but unfortunately without additional gestures, the programs compiled by it resemble interpreted ones in terms of speed, that is, they work very slowly. However, if you unlock the full potential of the compiler, Haskell is close in performance to the same code in C.

In this article, I summarize the experience of squeezing the maximum out of GHC when creating the Yarr dataflow framework .

The article consists of rather trivial separately observations, but I think it is suitable as a reminder or a start in the subject. The principles described are empirical, I do not know the internal structure of the GHC. I don’t know the compiler theory either, so if for some reason there are generally accepted terms, suggest in the comments.

The current GHC version is 7.6.1.
')
One final note: in fact, with “default” GHC, everything is not as bad as described in the first paragraph, even if you just compile with the -O2 option. Sometimes the compiler makes a fast machine code without a hint of a programmer, but it depends solely on the position of the stars , it is enough to fix one letter in the program so that the GHC suddenly stops understanding how to optimize it.

There are 2 main areas of optimization that are conducted by GHC:

Splitting data types into elementary pieces and transforming functions for working directly with them (see Unboxed types and primitive operations )
Inline and Function Specialization

They support each other, that is, if the data structures do not split, then functions are built in worse, and vice versa. Therefore, it makes no sense to follow only a couple of the following tips, there will be almost no result.

Data types

Avoid multiple constructor data types.

Including Maybe, Either and lists , no matter how “standard” they are. The reason is obvious - such data types cannot be decomposed into components, because it is not statically known what their structure is.

The only exception is enumeration types with several constructors without parameters: Bool , Ordering , etc.

If you think about it, the lists in Haskell are simply connected, which are practically not used in imperative languages because of their slowness. So why not use vectors instead of Haskell (instead of standard strings - ByteString ), which are hardly inferior to the lists in terms of convenience and capabilities. For replacing short lists with several elements, vectors of statically known length are even better suited.

See also: HaskellWiki - Performance / Data types .

Declare fields in structures to be strict, if you don’t know exactly why they should be lazy.

A normal, lazy field is a reference to a closure that returns a field value. Strict field - a link to the field value. The built-in (unboxed) field is the value itself (if it is also a structure, then it is fully embedded in the parent). The built-in field is obtained by adding the strict directive {-# UNPACK #-} .

Without the first option, as in the case of lists, it’s actually a pretty good life. In each case, make an informed choice in favor of 2 or 3 options. Fields by reference are sometimes more efficient in memory, faster when creating structures, checking for equality. Built-in fields are faster when reading. If built-in fields are most often needed (most likely), compile with the -funbox-strict-fields flag and add the {-# NOUNPACK #-} directive to the "reference" fields.

See also: Haskell Style Guide / Dealing with laziness .

Explicitly calculate structures before reuse.

In the example, the structure t many times when displaying a vector:

 import Data.Vector as V ... data T = T !Int !Int fSum (T xy) = x + y ... -- t :: T V.map (+ (fSum t)) longVectorOfInts

GHC does not take structure calculations out of a “loop”:

 --   GHC    V.map (\x -> case t of (T yz) -> x + y + z) longVectorOfInts

In the machine code at each iteration there will be a link following and reading the same fields.

However, if you “tell” the compiler outside the loop, it does not repeat the calculations already carried out somewhere higher on the ASD (abstract syntax tree):

 case t of (T xy) -> V.map (+ (fSum t)) longVectorOfInts -- GHC  : case t of (T xy) -> let s = x + y in V.map (\x -> x + s) longVectorOfInts

Instead of writing cases manually, you can use the generalized control flow from the deepseq package.

The example is not very successful, because the structure t used once and you can simply select fSum t as a variable, but I think the essence of the technique is clear.

Functions

Rule number 1: add to all small, often (on runtime) called, living far from the root of the ASD functions directive INLINE . If functions are not built in, the tips in this section become harmful.

More features, lambda, currying!

Or, smartly, use the continuation passing style .

The advantage of closures over data structures is that the GHC is much more positively configured to merge and reduce them.

For example, in the Yarr framework, I use a whole system of functions to parameterize calculations instead of configuration structures. The diagram shows the hierarchy of function types , the arrow “A → B” means “B is obtained by partial application from A”, burgundy arguments are selected, which in turn are also functions:

To ensure that the partially applied function will be specialized, in the definition, break the arguments with lambda:

 f :: A -> B -> C {-# INLINE f #-} fab = ... -- body where ... -- declarations --> fa \b -> let ... -- declarations in ... -- body

This trick is described in the INLINE pragma section of the GHC documentation.

Be careful with generic functions.

A call to a generalized function (a la true polymorphism in imperative languages) cannot be inline-defined by definition. Therefore, such functions must be specialized at the call site. If the GHC knows the exact type of the argument and the generic function itself is supplied with the INLINE directive, then it usually does so. When extending generalization to several levels of functions, it does not specialize.

Raise everything you can to the level of types

If the goal is to get a productive program, it is obviously better to compile it longer than to execute. In Haskell there are many mechanisms for transferring computations to the compilation stage .

In Yarr, I used marker type indices for static double dispatching , static arithmetic (ready-made arithmetic and logic is in the type-level library) and some other things.

Compilation flags

-O2 - always.

About -funbox-strict-fields is written above.

Now, with the exception of rare cases, faster machine code generates an LLVM -fllvm -optlo-O3 : -fllvm -optlo-O3 .

In order for GHC not to be engaged in fortune telling on the coffee grounds and embed everything that is said, Ben Lippmeier recommends using the -funfolding-use-threshold1000 -funfolding-keeness-factor1000 .

Source: https://habr.com/ru/post/171425/

All Articles