📜 ⬆️ ⬇️

We are trying to make a PDF book from a web comic with Haskell using the example of xkcd

After reading the article I decided to check how suitable Haskell is for this. I’ll say right away that Haskell itself is quite good, but, having run through hackage.haskell.org , I immediately found problems with libraries for working with PDF, which put an end to full implementation.
But I decided to still do some of the work in order to show how the same task could be done in Haskell, if only if ...

Get comic strip info

Since we will have to request information in the form of JSON, we will immediately write a useful function:
rethrow :: ( Show e ) => Exceptional ea -> IO a<br>
rethrow = switch ( throwIO . userError . show ) return<br>
<br>
jsonAt url = simpleHTTP ( getRequest url ) >>= getResponseBody >>= rethrow . decode_ json<br>
<br>
str s = member_ ( literal s ) JSON . string<br>
num n = member_ ( literal n ) JSON . number<br>


The rethrow function rethrow throw an exception in case of erroneous parsing, and str and num will simplify specifying the required fields in JSON.

Then the code for the latest comic and comic by number will look like this:
comics = jsonAt "http://xkcd.com/info.0.json" >>= fmap ( fromIntegral . numerator ) . rethrow . decode_ ( num "num" ) <br>
<br>
comic n = jsonAt ( concat [ "http://xkcd.com/" , show n , "/info.0.json" ] ) >>= rethrow . decode_ ( str "img" <&> str "title" ) <br>

The decode_ function decode_ received data in accordance with the specified pattern. In the case of num "num"
num "num"
we get from JSON a numeric term named num , in the case of str "img" <&> str "title"
str "img" <&> str "title"
- stupid of two lines for the picture and the title, respectively.

The code for getting the image by URL:
image url = simpleHTTP ( getRequest url ) >>= getResponseBody<br>

')
Download stream

Let's write down download of one comic book in separate function.
retrieve ch li = tryGet `onException` onFail where <br>
onFail = do <br>
writeChan ch ( i , Nothing ) <br>
writeLogger l Error $ "Comic " ++ show i ++ " failed to download" <br>
tryGet = do <br>
( imgUrl , title ) <- comic i<br>
imgData <- image imgUrl<br>
jpg <- writeBinaryFile fname imgData >> readJpegFile fname >>= either ( throwIO . userError . show ) return<br>
writeChan ch ( i , Just ( jpg , title ) ) <br>
writeLogger l Info $ "Comic " ++ show i ++ " downloaded" <br>
fname = show i ++ ".jpg" <br>


Here we used the ch channel ( Control.Concurrent.Chan ), to which we will send the download results, as well as the thread-safe l log.

Because of the curve of the HPDF library, you must first save the image to a file, and then load it from there again. It’s not at all clear to me why the author wrote JPEG parsing from scratch himself (and only from a file), and did not use the finished library.

PDF generation

Now it’s worth writing a function that, according to the list of pictures, will generate the resulting PDF for us.
pdf imgs = runPdf "Xkcd.pdf" doc ( PDFRect 0 0 800 600 ) $ forM_ imgs genPage where <br>
genPage ( jpeg , title ) = do <br>
img <- createPDFJpeg jpeg<br>
page <- addPage Nothing <br>
drawWithPage page ( drawText ( text ( PDFFont Times_Roman 12 ) 0 0 ( toPDFString title ) ) >> drawXObject img ) <br>
doc = PDFDocumentInfo { <br>
author = toPDFString "xkcd" , <br>
subject = toPDFString "xkcd" , <br>
pageMode = UseNone , <br>
pageLayout = OneColumn , <br>
viewerPreferences = standardViewerPrefs , <br>
compressed = False } <br>

In general, there is nothing interesting in this function. We call corresponding functions from libraries. It is only the nuance that the list of pictures is lazy that is important, so the function starts as soon as the first picture appears.

Putting it together

In the main function, we initialize the log, create a channel to which lightweight threads will write the result, and call the PDF generation with a lazy list of pictures from this channel.
main = bracket ( newLogger Console ) closeLogger $ \ l -> do <br>
n <- comics<br>
writeLogger l Info $ "Number of comics to download: " ++ show n<br>
ch <- newChan<br>
mapM_ ( fork . retrieve ch l ) [ 1 .. n ] <br>
cts <- fmap ( take n ) $ getChanContents ch<br>
let imgs = catMaybes $ mapMaybe ( `lookup` cts ) [ 1 .. n ] <br>
pdf imgs `onException` ( writeLogger l Error "Unable to generate PDF" ) <br>
writeLogger l Info "PDF generated." <br>

The bracket function is similar to using , guaranteeing the closure of the log.
mapM_ ( fork . retrieve ch l ) [ 1 .. n ] <br>
mapM_ ( fork . retrieve ch l ) [ 1 .. n ] <br>
we create an injection stream for each number, i.e. call retrieve ch li in a separate thread.
fmap ( take n ) $ getChanContents ch<br>
returns us a lazy list with the first n results. Take the whole list does not make sense, since the channel is endless.
Then we use the lookup function, with each index in order, from 1 to n . It is necessary to get a lazy list as a result, but in which the pictures are strictly in order. Thus, we will always write pictures in the correct order.

Full listing

main = bracket ( newLogger Console ) closeLogger $ \ l -> do <br>
n <- comics<br>
writeLogger l Info $ "Number of comics to download: " ++ show n<br>
ch <- newChan<br>
mapM_ ( fork . retrieve ch l ) [ 1 .. n ] <br>
cts <- fmap ( take n ) $ getChanContents ch<br>
let imgs = catMaybes $ mapMaybe ( `lookup` cts ) [ 1 .. n ] <br>
pdf imgs `onException` ( writeLogger l Error "Unable to generate PDF" ) <br>
writeLogger l Info "PDF generated." <br>
<br>
retrieve ch li = tryGet `onException` onFail where <br>
onFail = do <br>
writeChan ch ( i , Nothing ) <br>
writeLogger l Error $ "Comic " ++ show i ++ " failed to download" <br>
tryGet = do <br>
( imgUrl , title ) <- comic i<br>
imgData <- image imgUrl<br>
jpg <- writeBinaryFile fname imgData >> readJpegFile fname >>= either ( throwIO . userError . show ) return<br>
writeChan ch ( i , Just ( jpg , title ) ) <br>
writeLogger l Info $ "Comic " ++ show i ++ " downloaded" <br>
fname = show i ++ ".jpg" <br>
<br>
pdf imgs = runPdf "Xkcd.pdf" doc ( PDFRect 0 0 800 600 ) $ forM_ imgs genPage where <br>
genPage ( jpeg , title ) = do <br>
img <- createPDFJpeg jpeg<br>
page <- addPage Nothing <br>
drawWithPage page ( drawText ( text ( PDFFont Times_Roman 12 ) 0 0 ( toPDFString title ) ) >> drawXObject img ) <br>
doc = PDFDocumentInfo { <br>
author = toPDFString "voidex" , <br>
subject = toPDFString "xkcd" , <br>
pageMode = UseNone , <br>
pageLayout = OneColumn , <br>
viewerPreferences = standardViewerPrefs , <br>
compressed = False } <br>
<br>
rethrow :: ( Show e ) => Exceptional ea -> IO a<br>
rethrow = switch ( throwIO . userError . show ) return<br>
<br>
jsonAt url = simpleHTTP ( getRequest url ) >>= getResponseBody >>= rethrow . decode_ json<br>
<br>
str s = member_ ( literal s ) JSON . string<br>
num n = member_ ( literal n ) JSON . number<br>
<br>
comics = jsonAt "http://xkcd.com/info.0.json" >>= fmap ( fromIntegral . numerator ) . rethrow . decode_ ( num "num" ) <br>
<br>
comic n = jsonAt ( concat [ "http://xkcd.com/" , show n , "/info.0.json" ] ) >>= rethrow . decode_ ( str "img" <&> str "title" ) <br>
<br>
image url = simpleHTTP ( getRequest url ) >>= getResponseBody<br>
<br>
writeBinaryFile fname str = withBinaryFile fname WriteMode ( \ h -> hPutStr h str ) <br>


Swearing

Unfortunately, due to the lack of a decent library for working with PDF, the result did not pay off.
Most of the pictures HPDF refuses to accept (thanks to the regular bicycle implementation of the JPEG download), I did not even begin to understand the scaling of the pictures.

Praises

It was very convenient to test the request right in GHCi, then disassemble one of them, download and save the picture. All development was carried out there, and then the code was transferred to a file. Multithreading was bolted on without adding interfaces and any extra code. Instead of returning the result, we simply write it to the channel, on the other end of which is the handler. And we add a fork to the asynchronous function. In the general case, not everything is so simple, of course, but from my own experience I will say that I have never had to change the architecture for this.

In general, look at hackage.haskell.org , look for the necessary libraries, and if found, do not miss the chance to write everything in Haskell!

Source: https://habr.com/ru/post/113001/


All Articles