
Yarr: a dataflow framework for image processing in Haskell



A sounding of the situation on Reddit showed that hardly anyone is seriously doing image processing in Haskell, despite the fact that the fairly popular Repa library suggests working with images as one of its main applications. I hope the Yarr library can change the situation (documentation, github).

I call it a dataflow framework because it is generalized to processing arrays (from one-dimensional to three-dimensional) of elements of any type, including vectors of numbers such as coordinates and complex numbers. But the main intended use is processing two-dimensional arrays of vectors of color components, i.e. images. The framework does not directly contain image processing algorithms, but provides a powerful infrastructure for writing them.
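For a sense of the types involved, here is a hedged sketch (the RGBImage alias is mine, not the library's; the representation letters F and L match the examples below):

    import Data.Word (Word8)
    import Data.Yarr                    -- UArray, Dim2
    import Data.Yarr.Repr.Foreign       -- F: arrays in foreign (pinned) memory
    import Data.Yarr.Utils.FixedVector  -- fixed-arity vectors: VecTuple, N3

    -- An image as a 2D array of 3-component color vectors,
    -- stored in foreign memory with linear (L) indexing:
    type RGBImage = UArray F L Dim2 (VecTuple N3 Word8)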


The essence


The framework defines a family of types, the array representations ( UArray ). Representations are divided into two kinds: explicit ones, whose elements are materialized in memory, and implicit (delayed) ones, whose elements are computed on demand.


There are also a great many functions, in three main areas: mapping and combining arrays (maps, zips), computing arrays from implicit representations into explicit ones (loads), and reductions (walks).


Those familiar with the Repa library will notice that Yarr is very similar to it. The main differences:


Examples


Code written with the framework turns out to be a mixture of imperative and functional styles rather than purely functional, because literally everything is immersed in the IO monad. This is the price of high performance.

Computing a histogram of lightness, one of the stages of equalization (the first arrow in the picture above):
    import Data.Yarr
    import Data.Yarr.Walk
    import Data.Yarr.Shape as S
    import Data.Yarr.Repr.Foreign

    ...

    computeHist :: UArray FS L Dim2 Float -> IO (UArray F L Dim1 Int)
    computeHist pixelValues =
        walkP caps (mutate (S.unrolledFill n8 noTouch) incHist)
              (newEmpty 256) joinHists pixelValues
      where
        incHist :: UArray F L Dim1 Int -> Float -> IO ()
        incHist hist value = do
            let level = truncate (value * 255.99)
            c <- index hist level
            write hist level (c + 1)

        joinHists h1 h2 = do
            loadS S.fill (dzip2 (+) h1 h2) h1
            return h1


This article gives measurements of the minimum execution time over many runs, normalized per pixel of the image, for various numbers of threads. The unit of measurement is processor clock cycles. Yarr version: 1.2.3; test configuration, general Makefile, test image (2560 × 1600, 6 MB).


Computing the direction and magnitude of the gradient at each point of a smoothed grayscale image, one of the stages of edge detection (the second arrow in the picture above).
The gradientMagOrient function:
I do not think it makes sense to explain every line; the examples are given to show what Yarr code looks like in general.
    {-# LANGUAGE QuasiQuotes #-}

    ...
    import Data.Yarr.Convolution
    ...

    noOrient = 0   :: Word8
    posDiag  = 64  :: Word8
    vert     = 128 :: Word8
    negDiag  = 192 :: Word8
    horiz    = 255 :: Word8

    gradientMagOrient
        :: Word8 -> UArray F L Dim2 Float -> FImage (VecTuple N2 Word8) -> IO ()
    gradientMagOrient !threshLow image target =
        loadP S.fill caps delayedMagOrient target
      where
        delayedMagOrient = dzip2 magnitudeAndOrient gradientX gradientY

        -- Sobel-X operator application
        gradientX = dConvolveLinearDim2WithStaticStencil
            [dim2St| -1  0  1
                     -2  0  2
                     -1  0  1 |]
            image

        -- Sobel-Y
        gradientY = dConvolveLinearDim2WithStaticStencil
            [dim2St|  1  2  1
                      0  0  0
                     -1 -2 -1 |]
            image

        magnitudeAndOrient :: Float -> Float -> VecTuple N2 Word8
        magnitudeAndOrient gX gY =
            VT_2 (mag, if mag < threshLow then noOrient else orient gX gY)
          where
            mag = floatToWord8 $ sqrt (gX * gX + gY * gY)

            orient :: Float -> Float -> Word8
            orient gX gY
                | atan_1q < 0.33 = horiz
                | atan_1q > 0.66 = vert
                | otherwise      = posDiag + diagInv
              where
                rr = gY / gX
                (r, diagInv) = if rr < 0
                                   then (negate rr, negDiag - posDiag)
                                   else (rr, 0)
                -- 2nd order Taylor series of atan,
                -- see http://stackoverflow.com/a/14101194/648955
                br = 0.596227 * r
                num = br + (r * r)
                atan_1q = num / (1.0 + br + num)




Here is a comparison of the performance of complete edge detectors in Yarr and in OpenCV, the most popular image processing library. Code in Yarr, code in OpenCV. OpenCV does not support parallel processing of a single image.

Conclusion: detecting edges in a large number of ordinary images will be faster with OpenCV (if you spread the images across threads), while on a single giant photo of a couple of gigapixels it is faster in Haskell.

Do-it-yourself optimizations


Automatic parallelization

In the examples above you may have noticed functions with the -P suffix and caps as the first argument: walkP , loadP . They are the ones responsible for the "magic" speedup of programs when the number of cores available to the process is specified from the console. caps is an abbreviation of the getNumCapabilities function from the standard Control.Concurrent module. Instead of caps you can write, for example, (threads 2) , with the obvious result.
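A minimal sketch of the difference (the array names here are hypothetical); only the thread-count argument changes:

    -- Run on all capabilities granted to the process (+RTS -N<k>):
    loadP S.fill caps delayedImage targetImage

    -- Or pin the computation to exactly 2 threads:
    loadP S.fill (threads 2) delayedImage targetImage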

Tools for making the most of a specific processor

If the per-element computation is simple enough and the target processor has a wide pipeline, loops can be unrolled over several iterations to fill the pipeline better. An example is in the 1st line of the computeHist function (the histogram computation is unrolled over 8 iterations). The performance of this computation with other unroll factors, and entirely without unrolling (the scale is roughly logarithmic):

As you can see, in this case unrolling speeds the program up by about a factor of three. Intuitively this is explained by the fact that my processor's pipeline contains 3 parallel execution units. With a long unroll they all work at full capacity, while without unrolling, 2 of the 3 sit idle.

Unfortunately, neither GHC's own backend nor LLVM (which is the backend Yarr is supposed to be used with) actually unrolls loops by itself, although LLVM at least theoretically should.
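Since the compilers will not unroll for you, the factor is chosen by hand; it is a one-token change, because the factor is simply a value passed to unrolledFill . A hedged sketch (I assume n1 and n4 are available alongside the n8 used in computeHist ):

    S.unrolledFill n1 noTouch   -- no unrolling
    S.unrolledFill n4 noTouch   -- unroll over 4 iterations
    S.unrolledFill n8 noTouch   -- unroll over 8, as in computeHist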

Conversely, when processing arrays of vectors, if the computations of the individual components are independent and at the same time contain a lot of arithmetic, you can try computing the components at separate times (in the parallel version they will also be distributed across different threads) to relieve the pipeline and the register file. For example, on my processor this gives a performance gain of 15% for Gaussian smoothing of a color image with a 5 × 5 kernel ( blur.hs ). The componentwise variants of the computation and iteration functions carry the infix -Slices- or -SlicesSeparate- .
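A hedged sketch of the componentwise variant, assuming a loadSlicesP name formed by the -Slices- convention just described (the arrays are hypothetical):

    -- Compute each color component as a separate slice; in the parallel
    -- version the slices are also spread across threads:
    loadSlicesP S.fill caps delayedRGB targetRGB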

2D unrolling of convolution kernel loops to reduce read operations

For example, let's take Gaussian smoothing of a grayscale image with a 3 × 3 kernel:
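The illustration is not reproduced here; for reference, the standard 3 × 3 Gaussian stencil (weights summing to 16), written as a sketch with the same quasiquoter as the Sobel stencils above (the normalization step is assumed):

    blurred = dConvolveLinearDim2WithStaticStencil
        [dim2St| 1  2  1
                 2  4  2
                 1  2  1 |]
        image
    -- ... followed by dividing each result by 16 to normalize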



If the values of 4 neighboring pixels are computed in a row, LLVM can prove that the source pixel values cannot change between the repeated reads (since there are no write operations in between) and reuse the values already read.



Above is the original two-dimensional array of pixel intensities; the red, blue, yellow, and green rectangles are the kernels for computing the smoothed pixel values at coordinates (y = 2, x = 2), (2, 3), (3, 2) and (3, 3) respectively. The kernel multipliers are written out only for the first pixel.

If the processor has enough general-purpose registers, LLVM compiles such a computation with 16 read operations instead of 36 (4 kernels × 9 reads each naively, versus the 4 × 4 = 16 distinct pixels that the four windows jointly cover): the GVN (Global Value Numbering) optimization kicks in.

In the example, the computation is unrolled over 2 indices horizontally and 2 vertically. Of course, this is not always optimal. The speed of such smoothing depending on the unroll steps:

The framework has a function for parameterized unrolling of two-dimensional array computations in both dimensions: dim2BlockFill .
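A hedged usage sketch (the argument list is simplified and the array names are hypothetical); the two type-level numbers give the horizontal and vertical unroll steps of the block:

    -- Compute a delayed convolution with a 2 × 2 block unroll:
    loadP (S.dim2BlockFill n2 n2 noTouch) caps delayedBlurred result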

Disadvantages


Loops are not vectorized. As with unrolling, LLVM promises it but in practice vectorizes nothing. LLVM 3.3 announced improvements in this direction, so in the foreseeable future we can hope for even greater speedups of the framework. Currently LLVM 3.1 is considered the stable backend for GHC 7.6.1.

There is no support for stream filtering and combining of arrays (streaming). This feature will appear in Repa 4 (a version in development). It will not be implemented in Yarr, because that would require rewriting the library from scratch, and nobody is willing to do it. Bolting it onto the current architecture will not work. Then again, Ben Lippmeier is also rewriting Repa 3 from scratch, so motivation was found there :)


Implementation


Most of the mapping/combining functions (maps, zips) and the functions that compute an array from an implicit representation into an explicit one ( Load ) are defined in multi-parameter type classes. Simplified: for mappings, the 1st parameter is the type of the source array and the 2nd is the type of the resulting implicit array; for computation, the 1st parameter is the type of the input implicit array and the 2nd is the type of the resulting explicit array. As a result, implicit arrays obtained by mapping carry information about their "underlying" explicit sources, and at computation time knowledge of the structure of both the input and the output arrays makes it possible to choose the best way to iterate.

The result is a kind of static double dispatch. The topic is covered in more detail in my term paper, most of whose notes are devoted to it.
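A hedged, self-contained sketch of what "static double dispatch" means here (hypothetical, heavily simplified declarations, not Yarr's real classes):

    {-# LANGUAGE MultiParamTypeClasses, TypeFamilies #-}

    -- One data instance per representation; indexes simplified to Int.
    data family Arr rep a

    class Source r where
        index' :: Arr r a -> Int -> IO a

    class Target tr where
        write' :: Arr tr a -> Int -> a -> IO ()

    -- The load class sees BOTH representations at once, so every
    -- (source, target) pair may override the generic default with its
    -- own, most efficient iteration strategy, resolved at compile time:
    class (Source r, Target tr) => Load' r tr where
        load' :: Arr r a -> Arr tr a -> Int -> IO ()
        load' src tgt n = mapM_ (\i -> index' src i >>= write' tgt i) [0 .. n - 1]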

A separate post is devoted to the fight against GHC's slowness.

Functions with loop unrolling, for example unrolledFill , take the unroll step not as an ordinary number ( Int ), but as one raised to the type level. In the computeHist function (from the example above), n8 is a value of type N8 , which belongs to the Arity class from the fixed-vector library. The framework is in fact built on this library; it is used very actively.

Inside functions like unrolledFill , a vector of actions of the given arity is formed. GHC reduces the vector completely; after compilation a flat sequence of actions remains, i.e. precisely an unrolled loop. Amazingly, GHC even copes with reducing doubly nested vectors, on which the dim2BlockFill function is built!
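A hedged, self-contained sketch of the trick (hypothetical names, not Yarr's real code): the unroll step is a type-level number, one action is produced per successor layer, and with INLINE pragmas GHC reduces the recursion away, leaving a flat sequence of actions.

    {-# LANGUAGE EmptyDataDecls, ScopedTypeVariables #-}

    data Z      -- type-level zero
    data S n    -- type-level successor

    class Unroll n where
        unroll :: n -> (Int -> IO ()) -> Int -> IO ()  -- step, body, base index

    instance Unroll Z where
        {-# INLINE unroll #-}
        unroll _ _ _ = return ()

    instance Unroll n => Unroll (S n) where
        {-# INLINE unroll #-}
        unroll _ body i = body i >> unroll (undefined :: n) body (i + 1)

    n4 :: S (S (S (S Z)))
    n4 = undefined

    -- unroll n4 body i  ==>  body i >> body (i+1) >> body (i+2) >> body (i+3)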

Development


The hardest part of developing the framework was inventing the architecture, the type system. I drastically changed it about 8 times, looking at my note on design in Haskell through tears of self-irony. Abstractions leaked, or rather broke outright: the architecture would not allow implementing the necessary functions or was incompatible with the GHC optimizer, and I would create a new folder and start writing from scratch.

The first 3-4 approaches were attempts to bolt componentwise operations onto Repa. When I finally realized this was impossible, I deliberately moved as far away from the Repa architecture as possible. The remaining "iterations" brought the design back closer to Repa; as a result, from the outside the framework again resembles simply an extension of Repa.

It's funny that Aleksey Khudyakov published the fixed-vector library only in November of last year, i.e. 2 months after I had decided to write the framework (I took it as my term paper topic). But since the real work started in December, it made little difference. You could say I was very lucky :)

Source: https://habr.com/ru/post/170571/

