
Detecting corridors (rivers) in text

A corridor (river) is a layout defect in which word spaces line up along a vertical or slanted line across three or more consecutive lines of text. The defect itself is fairly easy to fix; the hard part is detecting it automatically.
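To make the definition concrete, here is a minimal sketch in Python that works on plain monospaced text rather than on a rendered page. It is only an illustration under strong assumptions: each character occupies one cell, a "slanted" line means a shift of at most one column per line, and glyph shapes are ignored entirely. The function name `longest_river` is made up for this example.

```python
def longest_river(lines):
    """Length of the longest chain of interior spaces that continues from
    line to line while shifting by at most one column per line."""
    prev = {}  # column -> chain length ending at that column on the previous line
    best = 0
    for line in lines:
        stripped = line.rstrip()
        first = len(stripped) - len(stripped.lstrip())  # skip the leading indent
        cur = {}
        for col in range(first, len(stripped)):
            if stripped[col] == " ":
                # Extend the best chain that ended directly above or one column aside.
                cur[col] = 1 + max(prev.get(col + d, 0) for d in (-1, 0, 1))
                best = max(best, cur[col])
        prev = cur
    return best


aligned = ["the cat sat", "big dog ran", "old owl fly"]
shifted = ["the cat sat", "a big dog ran", "old owl fly"]

# By the definition above, a corridor needs three or more adjacent lines.
has_river_aligned = longest_river(aligned) >= 3
has_river_shifted = longest_river(shifted) >= 3
```

In the first sample the spaces sit in the same columns on all three lines, so the chain reaches length 3; in the second the middle line breaks every chain.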

A corridor is produced not only by a particular arrangement of spaces but also by the shapes of the surrounding glyphs. For example, take two texts with spaces in exactly the same positions: in the first, two corridors are clearly visible, while the second shows no defect.



A natural approach here is to render the text to a raster image and then apply image processing.
When the problem was discussed on StackExchange, two simple and effective solutions were proposed. They may be useful to others as well.

1. Apply a morphological opening to the black-and-white image with an nPix-by-1 mask, where nPix roughly corresponds to the line spacing, that is, the number of pixels between lines (13 in this example).

opImg = imopen(bwImg,ones(13,1)); 



2. Open the image again with a 1-by-mPix mask, where mPix is the minimum corridor width (5 pixels in this example). This removes lines that are too thin.

 opImg = imopen(opImg,ones(1,5)); 
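For readers without MATLAB, the effect of these two openings can be sketched in plain Python on a toy mask. The sketch relies on the fact that opening with a straight line of n pixels keeps exactly those runs of white pixels that are at least n long. All names here (`open_runs`, `open_vertical`, `open_horizontal`, `to_mask`) are invented for the illustration, pixels outside the image are treated as background, and handling at the image border may therefore differ from `imopen`.

```python
def open_runs(cells, min_len):
    """1-D binary opening: keep only runs of True at least min_len long."""
    out = [False] * len(cells)
    start = None
    for i, value in enumerate(list(cells) + [False]):  # sentinel ends the last run
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_len:
                out[start:i] = [True] * (i - start)
            start = None
    return out


def open_vertical(img, n_pix):
    """Opening with an n_pix-by-1 mask: filter every column."""
    columns = [open_runs(col, n_pix) for col in zip(*img)]
    return [list(row) for row in zip(*columns)]


def open_horizontal(img, m_pix):
    """Opening with a 1-by-m_pix mask: filter every row."""
    return [open_runs(row, m_pix) for row in img]


def to_mask(rows):
    """'.' is white space (True), anything else is ink (False)."""
    return [[ch == "." for ch in row] for row in rows]


page = to_mask([
    "#..#.#",
    "#..###",
    "#..#.#",
    "##.#..",
])
step1 = open_vertical(page, 3)     # keep white runs spanning 3+ rows
step2 = open_horizontal(step1, 2)  # then drop runs narrower than 2 pixels
```

On this toy page only the two-pixel-wide gap in the first three rows survives both steps; the isolated gaps on the right disappear after the vertical opening, and the one-pixel tail in the last row disappears after the horizontal one.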



3. Remove horizontal “corridors”, which can be caused either by a first-line indent or by the spacing between paragraphs. Also remove large “lakes” by opening with a mask slightly larger than nPix-by-nPix and subtracting the result. In this step we also discard “rivulets” whose area is smaller than (nPix + 2) * (mPix + 2) * 4 pixels, which corresponds to about three lines.

 %# horizontal river: just look for rows that are all true
 opImg(all(opImg,2),:) = false;
 %# open with line spacing (nPix)
 opImg = imopen(opImg,ones(13,1));
 %# remove lakes with nPix+2
 opImg = opImg & ~imopen(opImg,ones(15,15));
 %# remove small fry
 opImg = bwareaopen(opImg,7*15*4);
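Two parts of this step, clearing fully white rows and dropping small connected regions, can be sketched in plain Python as follows. This is a rough equivalent rather than a port of the toolbox functions: the names `clear_full_rows` and `remove_small_components` are made up, 8-connectivity is an assumption of the sketch, and the lake removal is left out.

```python
from collections import deque


def clear_full_rows(img):
    """Blank every row that is white across the whole width."""
    return [[False] * len(row) if all(row) else list(row) for row in img]


def remove_small_components(img, min_area):
    """Keep only connected white regions with at least min_area pixels
    (8-connectivity is assumed here)."""
    h = len(img)
    w = len(img[0]) if img else 0
    out = [[False] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not img[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill collects one connected region.
            region = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(region) >= min_area:
                for y, x in region:
                    out[y][x] = True
    return out


mask = [[bool(v) for v in row] for row in [
    [1, 1, 1, 1, 1],   # paragraph gap: white across the whole width
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],   # the pixel on the right is an isolated speck
    [0, 1, 0, 0, 0],
]]
cleaned = remove_small_components(clear_full_rows(mask), 3)
```

After both calls only the three-pixel vertical strip in the second column remains.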



4. If the thickness of a corridor matters as well as its length, we can build a skeleton from the points equidistant from the corridor's boundaries and weight each skeleton point by the corridor's width at that location.

 dt = bwdist(~opImg);
 sk = bwmorph(opImg,'skel',inf);
 %# prune the skeleton a bit to remove branches
 sk = bwmorph(sk,'spur',7);
 riversWithWidth = dt.*sk;
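The distance map is what carries the width information, so it may help to see it on a tiny example. Below is a brute-force Python sketch of a Euclidean distance-to-background map; the function name `distance_to_background` is invented here, the quadratic search is only suitable for toy images, and the skeleton step is not reproduced.

```python
import math


def distance_to_background(img):
    """For every white pixel, the Euclidean distance to the nearest
    background pixel; background pixels get 0."""
    h = len(img)
    w = len(img[0]) if img else 0
    background = [(r, c) for r in range(h) for c in range(w) if not img[r][c]]
    dist = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if img[r][c]:
                dist[r][c] = min(
                    (math.hypot(r - br, c - bc) for br, bc in background),
                    default=math.inf,  # no background anywhere in the image
                )
    return dist


# A vertical corridor three pixels wide, with ink on both sides.
band = [[False, True, True, True, False] for _ in range(3)]
dt = distance_to_background(band)
```

Along the middle column of this band the distance is 2, against 1 at the edges, so multiplying the map by a skeleton mask keeps values that grow with the local width of the corridor.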



In Mathematica, the same can be done in a few lines of code using erosion and the Hough transform.

 (*Get Your Images*)
 i = Import /@ {"http://i.stack.imgur.com/4ShOW.png", "http://i.stack.imgur.com/5UQwb.png"};
 (*Erode and binarize*)
 i1 = Binarize /@ (Erosion[#, 2] & /@ i);
 (*Hough transform*)
 lines = ImageLines[#, .5, "Segmented" -> True] & /@ i1;
 (*Ready, show them*)
 Show[#[[1]], Graphics[{Thick, Orange, Line /@ #[[2]]}]] & /@ Transpose[{i, lines}]


Source: https://habr.com/ru/post/170485/

