
A survey of the situation on Reddit showed that hardly anyone is seriously doing image processing in Haskell, even though the fairly popular Repa library suggests working with images as one of its main applications. I hope the Yarr library can change that (documentation, github).
I call Yarr a dataflow framework because it is generalized to processing arrays (from one-dimensional to three-dimensional) of elements of any type, including vectors of numbers, such as coordinates or complex numbers. But the main intended use is processing two-dimensional arrays of vectors of color components, i.e. images. The framework does not itself contain image processing algorithms, but it provides a powerful infrastructure for writing them.
The essence
The framework defines a family of array representation types (UArray). Representations fall into two kinds:
- Explicit: the elements of such an array really reside somewhere in memory. There are representations based on low-level GHC arrays and on raw malloc'ed memory blocks (the latter only for elements of the Storable class).
- Implicit representations encapsulate a function of type (index -> element).
There is also a large number of functions, in three main areas:
- Mapping (map), combining (zipWith) and kernel convolution functions take one or several arrays as input and return an implicit array.
- Computing (Load) functions turn an implicit array into an explicit one, i.e. they write the former into memory in some form.
- Traversal (Walk) functions walk over an array, optionally with state, e.g. folds (fold).
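The division into implicit and explicit representations can be sketched in a few lines of plain Haskell. This is a toy model to convey the idea, not the real Yarr API (the names Delayed, dmap, dzip, loadD are illustrative):

```haskell
import Data.Array

-- Implicit: nothing in memory, just an (index -> element) function.
newtype Delayed e = Delayed (Int -> e)

-- Explicit: the elements really reside in memory.
type Manifest e = Array Int e

-- map and zipWith on implicit arrays only compose functions, no work is done:
dmap :: (a -> b) -> Delayed a -> Delayed b
dmap f (Delayed g) = Delayed (f . g)

dzip :: (a -> b -> c) -> Delayed a -> Delayed b -> Delayed c
dzip f (Delayed g) (Delayed h) = Delayed (\i -> f (g i) (h i))

-- "Load": force an implicit array into memory over a given extent.
loadD :: Int -> Delayed e -> Manifest e
loadD n (Delayed g) = listArray (0, n - 1) [ g i | i <- [0 .. n - 1] ]
```

The point of this structure is that a chain of maps and zips builds a single composed function, and only the final load walks the indices and touches memory.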
Those familiar with the Repa library will notice that Yarr is very similar to it. The main differences:
- Most functions have "component-wise" variants, which is convenient when working with arrays of vectors, such as images in RGB format.
- The fold capabilities (fold) are greatly expanded.
- Yarr is often faster than Repa. For example, a Canny edge detector on Yarr runs about 2 times faster than one on Repa.
Examples
Code that uses the framework turns out to be a mix of imperative and functional rather than purely functional, because literally everything is immersed in the IO monad. This is the price of high performance.
Computing a lightness histogram is one of the stages of equalization (the first arrow in the picture above).
import Data.Yarr
import Data.Yarr.Walk
import Data.Yarr.Shape as S
import Data.Yarr.Repr.Foreign

...

computeHist :: UArray F L Dim2 Float -> IO (UArray F L Dim1 Int)
computeHist pixelValues =
    walkP caps
          (mutate (S.unrolledFill n8 noTouch) incHist)
          (newEmpty 256)
          joinHists
          pixelValues
  where
    incHist :: UArray F L Dim1 Int -> Float -> IO ()
    incHist hist value = do
        let level = truncate (value * 255.99)
        c <- index hist level
        write hist level (c + 1)

    joinHists h1 h2 = do
        loadS S.fill (dzip2 (+) h1 h2) h1
        return h1
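The core of computeHist is mapping a lightness value in [0, 1] to one of 256 levels and bumping a counter. The same logic can be checked against a plain sequential version in standard Haskell (a reference sketch that does not use Yarr; the function name computeHistSeq is mine):

```haskell
import Data.Array

-- Sequential reference for the histogram: each value in [0, 1] falls into
-- one of 256 levels via truncate (v * 255.99), exactly as incHist does.
computeHistSeq :: [Float] -> Array Int Int
computeHistSeq vals =
  accumArray (+) 0 (0, 255) [ (truncate (v * 255.99), 1) | v <- vals ]
```

The joinHists step of the parallel version then corresponds to zipping two such per-thread arrays together with (+).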
This article gives measurements of the minimum execution time over many runs, normalized per image pixel, for different numbers of threads. The unit of measurement is a processor clock cycle. Yarr version 1.2.3; test configuration, common Makefile, test image (2560 × 1600, 6 MB).

Computing the direction and magnitude of the gradient at a point of a smoothed grayscale image is one of the stages of edge detection (the second arrow in the picture above).
The gradientMagOrient function (code excerpt). I do not think it makes sense to explain every line; the examples are given to show what Yarr code looks like in general.
{-# LANGUAGE QuasiQuotes #-}
...
import Data.Yarr.Convolution
...

noOrient = 0   :: Word8
posDiag  = 64  :: Word8
vert     = 128 :: Word8
negDiag  = 192 :: Word8
horiz    = 255 :: Word8

gradientMagOrient
    :: Word8
    -> UArray F L Dim2 Float
    -> FImage (VecTuple N2 Word8)
    -> IO ()
gradientMagOrient !threshLow image target =
    loadP S.fill caps delayedMagOrient target
  where
    delayedMagOrient = dzip2 magnitudeAndOrient gradientX gradientY

Here is a comparison of the performance of complete edge detectors, on Yarr and on OpenCV, the most popular image processing library. Code on Yarr, code on OpenCV. OpenCV does not support parallel processing of a single image.

Conclusion: detecting edges on a large number of ordinary images will be faster with OpenCV (if you spread the images across threads), while on a single giant photo of a couple of gigapixels it will be faster in Haskell.
Do-it-yourself optimizations
Automatic parallelization
In the examples above you may have noticed functions with the -P suffix and the first argument caps: walkP, loadP. They are responsible for the "magic" speedup of programs when the number of cores available to the process is specified on the command line. caps is an abbreviation for the getNumCapabilities function from the standard Control.Concurrent module. Instead of caps you can write, for example, (threads 2), with the obvious result.
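Conceptually, the -P functions split the index range of the array across worker threads. Here is a hand-rolled sketch of the idea in plain Haskell using only the standard Control.Concurrent primitives (the real scheduling inside Yarr is more sophisticated; parallelFill is a name I made up):

```haskell
import Control.Concurrent
import Control.Monad

-- Split [0, len) into nThreads contiguous chunks, run the fill action
-- on each chunk in its own thread, then wait for all of them to finish.
parallelFill :: Int -> (Int -> Int -> IO ()) -> Int -> IO ()
parallelFill nThreads fill len = do
  dones <- forM [0 .. nThreads - 1] $ \t -> do
    done <- newEmptyMVar
    let start = (t * len) `div` nThreads
        end   = ((t + 1) * len) `div` nThreads
    _ <- forkIO (fill start end >> putMVar done ())
    return done
  mapM_ takeMVar dones    -- block until every chunk is done
```

Passing getNumCapabilities as nThreads gives caps-like behavior; passing a constant corresponds to (threads 2).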
Tools for maximally utilizing the capabilities of a specific processor
If the per-element computation is fairly simple and the target processor has a wide pipeline, you can unroll the loops over several iterations to fill the pipeline better. An example is in the first line of the computeHist function (the histogram computation is unrolled over 8 iterations). The performance of this computation with other unrolling factors, and without unrolling at all (the scale is roughly logarithmic):

As you can see, in this case unrolling speeds the program up about three times. Intuitively this is because my processor has 3 parallel execution pipelines. With a long unroll they all work at full capacity; without unrolling, 2 of the 3 sit idle.
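What unrolledFill generates can be shown by writing the same thing by hand: a plain fill loop next to a manually 4-way unrolled one (a sketch; the fillPlain and fillUnrolled4 names are mine, Yarr produces the unrolled body from a type-level step instead):

```haskell
-- Plain fill loop: one action per iteration.
fillPlain :: (Int -> IO ()) -> Int -> Int -> IO ()
fillPlain f start end = go start
  where
    go i | i < end   = f i >> go (i + 1)
         | otherwise = return ()

-- The same loop unrolled by 4: each iteration issues several independent
-- actions, giving the CPU pipelines more work to overlap.
fillUnrolled4 :: (Int -> IO ()) -> Int -> Int -> IO ()
fillUnrolled4 f start end = go start
  where
    lim = start + ((end - start) `div` 4) * 4   -- end of the unrolled part
    go i | i < lim   = f i >> f (i + 1) >> f (i + 2) >> f (i + 3) >> go (i + 4)
         | i < end   = f i >> go (i + 1)        -- leftover tail, one at a time
         | otherwise = return ()
```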
Unfortunately, neither GHC's native backend nor LLVM (the one intended to be used with Yarr) actually unrolls loops by itself, although LLVM at least theoretically should.
Conversely, when processing arrays of vectors, if the computations on the individual components are independent and contain a lot of arithmetic, you can try computing the components separately in time (in the parallel version they will also be distributed across different threads) to relieve the pipeline and the register file. For example, on my processor this gives a performance gain of 15% for Gaussian blur of a color image with a 5 × 5 kernel (blur.hs). The component-wise variants of the computation and traversal functions have the infix -Slices- or -SlicesSeparate-.
2D unrolling of kernel convolution loops to reduce read operations
As an example, take Gaussian blur of a grayscale image with a 3 × 3 kernel:

If the values of 4 neighboring pixels are computed in a row, LLVM can prove that the source pixel values cannot change between the repeated reads (since there are no writes between them) and reuse them.

Above is the original two-dimensional array of pixel intensities; the red, blue, yellow and green rectangles are the kernels for computing the smoothed pixel values at coordinates (y = 2, x = 2), (2, 3), (3, 2) and (3, 3) respectively. The kernel weights are written out only for the first pixel.
If the processor has enough general-purpose registers, LLVM compiles such a computation with 16 read operations instead of 36: the GVN (Global Value Numbering) optimization does its job.
In the example, the computation is unrolled by 2 indices horizontally and 2 vertically. Of course, this is not always optimal. The speed of this blur depending on the unrolling steps:

The framework provides a function for parameterized unrolling of two-dimensional array computations in both dimensions: dim2BlockFill.
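For reference, here is what the convolution itself computes, as a naive pure-Haskell sketch over Data.Array (not the Yarr implementation; Data.Yarr.Convolution generates this with the reads unrolled and shared, while this version re-reads every neighbor). Border pixels are simply copied through:

```haskell
import Data.Array

-- Naive 3x3 Gaussian blur with the kernel [1 2 1; 2 4 2; 1 2 1] / 16.
gauss3x3 :: Array (Int, Int) Float -> Array (Int, Int) Float
gauss3x3 img = listArray bnds [ pix y x | (y, x) <- range bnds ]
  where
    bnds@((y0, x0), (y1, x1)) = bounds img
    kernel = [ ((-1, -1), 1), ((-1, 0), 2), ((-1, 1), 1)
             , (( 0, -1), 2), (( 0, 0), 4), (( 0, 1), 2)
             , (( 1, -1), 1), (( 1, 0), 2), (( 1, 1), 1) ]
    pix y x
      | y == y0 || y == y1 || x == x0 || x == x1 = img ! (y, x)  -- border
      | otherwise =
          sum [ w * img ! (y + dy, x + dx) | ((dy, dx), w) <- kernel ] / 16
```

In this naive form each interior pixel performs 9 reads; computing a 2 × 2 block of output pixels together is exactly what lets the compiler share the overlapping reads, as described above.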
Disadvantages
Loops are not vectorized. As with unrolling, LLVM promises to but in practice vectorizes nothing. LLVM 3.3 announced improvements in this direction, so in the foreseeable future we can hope for even greater speedups of the framework. Currently the stable backend for GHC 7.6.1 is considered to be LLVM 3.1.
There is no support for stream fusion of filtering and combining arrays (streaming). This feature will appear in Repa 4 (a version under development). It will not be implemented in Yarr, because that would require rewriting the library from scratch, and nobody is willing to. It cannot be bolted onto the current architecture. Then again, Ben Lippmeier is also rewriting Repa 3 from scratch, so the motivation was found there :)
Implementation
Most of the mapping/combining functions (maps, zips) and the functions that convert an array from an implicit representation to an explicit one (Load) are defined in type classes with several parameters. Simplified: for mappings, the first parameter is the type of the source array and the second is the type of the resulting implicit array; for computations, the first parameter is the type of the input implicit array and the second is the type of the resulting explicit array. As a result, implicit arrays produced by mapping carry information about the "underlying" explicit sources. During computation, knowing the structure of the input and output arrays makes it possible to choose the best way to iterate.
The result is a kind of static double dispatch. The topic is covered in more detail in my term paper, most of which is devoted to it.
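The double dispatch can be illustrated with a toy multi-parameter class: which strategy runs is decided by the pair of representation types, entirely at compile time. All names here are illustrative stand-ins, not Yarr's actual classes:

```haskell
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE FlexibleInstances #-}

-- Stand-ins for two representations.
newtype DelayedRep = DelayedRep [Int]   -- an implicit (delayed) array
newtype ForeignRep = ForeignRep [Int]   -- a malloc-backed explicit array

-- The instance is selected by BOTH the source and the target type,
-- so each (src, tgt) pair can get its own optimal iteration strategy.
class ToyLoad src tgt where
  loadKind :: src -> tgt -> String      -- which strategy would be picked

instance ToyLoad DelayedRep ForeignRep where
  loadKind _ _ = "delayed -> foreign: sequential fill"

instance ToyLoad ForeignRep ForeignRep where
  loadKind _ _ = "foreign -> foreign: memcpy-like copy"
```

Because resolution happens at compile time, the dispatch itself costs nothing at run time; GHC simply inlines the chosen instance.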
There is a separate post on fighting the slowness of GHC.
Functions with loop unrolling, such as unrolledFill, take the unrolling step not as an ordinary number (Int) but as one lifted to the type level. In the computeHist function (from the example above), n8 is a value of type N8, which belongs to the Arity class from the fixed-vector library. In general, the framework essentially builds on this library; it is used very actively.
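The idea of a type-level unrolling step can be shown with modern GHC.TypeLits instead of fixed-vector's N8-style naturals (a sketch of the technique, not Yarr's actual machinery; unrollFactor is my name):

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Proxy
import GHC.TypeLits

-- The unroll factor lives in the type, so it is a compile-time constant:
-- code parameterized by it can be fully unrolled by the compiler, unlike
-- code parameterized by a runtime Int.
unrollFactor :: forall n. KnownNat n => Proxy n -> Int
unrollFactor _ = fromIntegral (natVal (Proxy :: Proxy n))
```

A value like n8 in the histogram example plays the same role as (Proxy :: Proxy 8) here: it exists only to pass the number 8 through the type system.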
Inside functions like unrolledFill, a vector of actions of the given arity is formed. GHC reduces the vector completely; after compilation a flat sequence of actions remains, that is, exactly the unrolling. Amazingly, GHC even copes with reducing doubly nested vectors, on which the dim2BlockFill function is built!
Development
The hardest part of developing the framework was inventing the architecture, the type system. I drastically changed it about 8 times, and now look at my own note on design in Haskell through tears of self-irony. Abstractions leaked or outright broke, the architecture would not let me implement the necessary functions or was incompatible with the GHC optimizer, and I would create a new folder and start writing from scratch.
The first 3-4 approaches were attempts to bolt component-wise operations onto Repa. When I finally realized this was impossible, I deliberately moved as far away from the Repa architecture as I could. The remaining "iterations" gradually brought the design back towards Repa; as a result, from the outside the framework again resembles simply an extension of Repa.
It is funny that Aleksey Khudyakov published the fixed-vector library only in November last year, i.e. 2 months after I decided to write the framework (I took it as my term-paper topic). But since I only started in December, it worked out without much trouble. You could say I was very lucky :)