
Compress and improve handwritten notes

I wrote a program to clean up scanned notes while reducing the file size.

Original image and result:


Left: 300 DPI source scan, 7.2 MB PNG / 790 KB JPG. Right: the result at the same resolution, 121 KB PNG 1
Note: The process described here more or less matches what the Office Lens app does, and there are other similar programs. I do not claim to have invented anything radically new; this is just my implementation of a useful tool.

If you are in a hurry, just check out the GitHub repository or jump to the results section, where you can play with interactive 3D diagrams of the color clusters.

Motivation


Some of the courses I teach have no textbook. For those, I like to arrange weekly "note-taking" duty, where students share their notes with the rest of the class and in the process check how well they have learned the material. The notes are posted on the course website in PDF format.

The department has a "smart" copier that scans straight to PDF, but the results of such scans are... less than pleasant. Here are some examples from scans of student homework:



Seemingly at random, the copier chooses one of two evils: either harsh binarization (the x characters) or terrible JPEG block artifacts (the square-root characters). Obviously, this result can be improved.

Overview


Let's start with a scan of this beautiful page of student notes:



The original 300 DPI PNG weighs about 7.2 MB; a JPG at quality level 85 takes about 790 KB 2. Since a PDF is usually just a container for a PNG or JPG, converting to PDF will not shrink the file any further. 800 kilobytes per page is quite a lot, and to speed up page loads I would like something closer to 100 KB 3.

Although this student takes very neat notes, the scan looks a bit sloppy (through no fault of his). The writing on the back of the sheet bleeds through noticeably, which distracts the reader and prevents JPG or PNG from compressing as effectively as a uniform background would.

This is what the noteshrink.py program produces:



This tiny PNG file is only 121 KB. And do you know what I like most? Not only is it smaller, the notes have also become more legible!

Basics of color and processing


Here are the steps to create a compact and clean image:

  1. Determine the background color of the original scanned image.
  2. Isolate the foreground by thresholding the remaining pixels on their distance from the background color.
  3. Convert to an indexed-color PNG by choosing a small number of "representative colors" for the foreground.

Before delving further, it is useful to recall how color images are stored digitally. Because the human eye has three types of color-sensitive cells, we can reproduce any color by varying the intensities of red, green, and blue light 4. The result is a system that maps colors to points in a three-dimensional RGB color space 5.



Although a true vector space allows infinitely many continuously varying intensity values, we discretize colors for digital storage, typically allotting 8 bits each to the red, green, and blue channels. Treating colors as points in a continuous three-dimensional space nonetheless gives us access to powerful analysis tools, as the rest of the process will show.

Determining the background color


Since most of the page is blank, we can expect the paper color to be the most frequent color in the scanned image. If the scanner represented every point of blank paper as the identical RGB triplet, we would have no problem. Unfortunately, that is not the case. Random color variations arise from dust and smudges on the glass, variation in the paper itself, sensor noise, and so on. As a result, the real "page color" may be spread across thousands of distinct RGB values.

The original scan is 2081 × 2531, for a total of 5,267,011 pixels. We could examine every pixel, but it is much faster to work with a representative sample of the image. noteshrink.py by default samples 5% of the input image (more than enough for a 300 DPI scan), but let's look at an even smaller subset of 10,000 pixels chosen at random from the source image:



It looks nothing like the actual scan (there is no text at all), but the color distribution is nearly identical. Both images are mostly grayish white, with a handful of red, blue, and dark gray pixels. Here are the same 10,000 pixels sorted by brightness (that is, by the sum of the R, G, and B intensities):



Viewed from afar, the bottom 80-90% of the image looks like a single color, but closer inspection reveals quite a lot of variation. In fact, the most frequent color in the image, with RGB value (240, 240, 242), appears in only 226 of the 10,000 pixels, less than 3% of the total.
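The random sampling step described above can be sketched in a few lines of pure Python (a simplified illustration, not noteshrink.py's actual code, which operates on image arrays):

```python
import random

def sample_pixels(pixels, fraction=0.05, seed=0):
    """Randomly sample a fraction of an image's pixels.

    `pixels` is a flat list of (r, g, b) tuples; the 5% default mirrors
    the sampling rate mentioned in the text.
    """
    rng = random.Random(seed)
    n = max(1, int(len(pixels) * fraction))
    return rng.sample(pixels, n)

# Toy "image": 200 near-white background pixels plus a few ink pixels.
image = [(240, 240, 242)] * 200 + [(71, 73, 71)] * 5 + [(219, 83, 86)] * 3
subset = sample_pixels(image)
print(len(subset))  # 10 (5% of 208 pixels, rounded down)
```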

Since the mode accounts for such a small share of the sample, it is fair to ask how reliably it characterizes the color distribution. Determining the page's background color becomes easier if we first reduce the bit depth. Here is what the sample looks like when we go from 8 to 4 bits per channel by zeroing out the four least significant bits:



Now the most common value is RGB (224, 224, 224), accounting for 3,623 (36%) of the sampled points. By reducing the bit depth, we have effectively grouped similar pixels into larger "bins", making it easier to find a strong peak 6.
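The bit-depth reduction and mode search can be sketched like this (a simplified stand-in for the real implementation, shown with 4 bits per channel as above):

```python
from collections import Counter

def quantize(pixel, bits_keep=4):
    """Zero out the low-order bits of each 8-bit channel."""
    shift = 8 - bits_keep
    return tuple((c >> shift) << shift for c in pixel)

def background_color(pixels, bits_keep=4):
    """Most common color after reducing the bit depth."""
    counts = Counter(quantize(p, bits_keep) for p in pixels)
    return counts.most_common(1)[0][0]

pixels = [(240, 240, 242), (241, 243, 240), (247, 245, 246),
          (71, 73, 71), (219, 83, 86)]
# The three near-white pixels collapse into one bin at 4 bits per channel.
print(background_color(pixels))  # (240, 240, 240)
```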

There is a tradeoff between reliability and precision here: small bins give better color precision, while larger bins are much more reliable. In the end I settled on 6 bits per channel for selecting the background color, which seemed like a good compromise between the two extremes.

Isolating the foreground


Once we have determined the background color, we need to threshold the remaining pixels by how close they are to it. A natural way to measure the similarity of two colors is the Euclidean distance between their coordinates in RGB space. But this simple method misclassifies some colors:



Here is a table with colors and Euclidean distance from the background:

Color    Where found                    RGB              Distance from background
white    background                     (238, 238, 242)  -
gray     bleed-through from back page   (160, 168, 166)  129.4
black    written on front page          (71, 73, 71)     290.4
red      written on front page          (219, 83, 86)    220.7
pink     vertical line at left          (243, 179, 182)  84.3

As you can see, the dark gray of the bleed-through ink, which should be classified as background, is actually farther from the page white than the pink of the margin line, which we want to classify as foreground. Any Euclidean-distance threshold that makes pink a foreground color is guaranteed to include the bleed-through ink as well.
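The problem is easy to reproduce with the numbers from the table (a quick check using the same RGB values as above):

```python
import math

def rgb_dist(a, b):
    """Euclidean distance between two colors in RGB space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

bg = (238, 238, 242)              # page white
gray = (160, 168, 166)            # bleed-through we want in the background
pink = (243, 179, 182)            # margin line we want in the foreground
print(round(rgb_dist(bg, gray)))  # 129
print(round(rgb_dist(bg, pink)))  # 84
# Any threshold low enough to exclude gray also throws away pink.
```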

You can work around this problem by switching from RGB to hue-saturation-value (HSV) space, which maps the RGB cube onto a cylinder, shown here in cutaway 7.



In the HSV cylinder, the rainbow of hues runs around the circumference of the outer top edge; hue corresponds to the angle around the circle. The central axis of the cylinder runs from black at the bottom to white at the top, with shades of gray in between; the entire axis has zero saturation (intensity of color), while the vivid hues on the outer rim have saturation 1.0. Finally, value describes the overall brightness of a color, from black at the bottom to bright shades at the top.

Now let's look at our colors in the HSV model:

Color    Value   Saturation   Value diff from background   Saturation diff from background
white    0.949   0.017        -                            -
gray     0.659   0.048        0.290                        0.031
black    0.286   0.027        0.663                        0.011
red      0.859   0.621        0.090                        0.604
pink     0.953   0.263        0.004                        0.247

As expected, white, black, and gray differ significantly in value, but all three have similarly low saturation, much lower than either red or pink. Using the extra information the HSV model provides, we can successfully mark a pixel as foreground if it meets either of these criteria:

  * its value differs from the background value by more than a threshold (0.3 here), or
  * its saturation differs from the background saturation by more than a threshold (0.2 here).


The first criterion catches the black pen ink, and the second catches the red ink and the pink line. Both criteria successfully exclude the gray bleed-through pixels from the foreground. Different saturation/value thresholds may work better for other images; see the results section for more information.
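Using the standard library's colorsys module, the two criteria can be sketched as follows (the threshold values are my reading of the table above; noteshrink.py exposes them as command-line options):

```python
import colorsys

def is_foreground(pixel, bg, value_thresh=0.3, sat_thresh=0.2):
    """Foreground if value OR saturation differs enough from the background."""
    def hsv(p):
        return colorsys.rgb_to_hsv(*(c / 255.0 for c in p))
    _, s, v = hsv(pixel)
    _, bg_s, bg_v = hsv(bg)
    return abs(v - bg_v) > value_thresh or abs(s - bg_s) > sat_thresh

bg = (238, 238, 242)                       # page white
print(is_foreground((160, 168, 166), bg))  # False: bleed-through gray
print(is_foreground((71, 73, 71), bg))     # True: black ink (value differs)
print(is_foreground((219, 83, 86), bg))    # True: red ink (saturation differs)
print(is_foreground((243, 179, 182), bg))  # True: pink margin line
```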

Choosing a set of representative colors


Once we have isolated the foreground, we have a new set of colors corresponding to the marks on the page. Let's visualize it, but this time treating the colors not as a set of pixels but as 3D points in RGB color space. The resulting diagram looks rather "clumpy", with several bands of related colors.



An interactive version of the diagram is available at the link

Now our goal is to convert the original 24-bit image to indexed color, choosing a small number of colors (eight in this case) to represent the entire image. First, this shrinks the file, since each color now needs only three bits to identify (because 8 = 2³). Second, the resulting image becomes more visually cohesive, because similarly colored inks are likely to be assigned the same color in the final image.

To do this, we take a data-driven approach based on the "clumpy" diagram mentioned above. If we pick the colors corresponding to the centers of the clusters, we get a set of colors that accurately represents the underlying data. In technical terms, we are solving the color quantization problem (itself a special case of vector quantization) via cluster analysis.

For this job I chose a particular methodological tool: the k-means method. Its goal is to find a set of means, or centers, that minimizes the average distance from each point to the nearest center. Here is what you get by using it to pick seven different clusters in our dataset 8.



Here the dots with black outlines are the foreground colors, and the colored lines connect them to their nearest center in RGB space. When the image is converted to indexed color, each foreground color is replaced by the color of its nearest center. Finally, the large circles show the distance from each center to the farthest color associated with it.
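The clustering step can be illustrated with a minimal pure-Python k-means (the real program uses a library implementation; this sketch just shows the mechanics on a toy set of ink colors):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: find k centers minimizing distance to the nearest center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # Move each center to the mean of its assigned points.
        centers = [tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

# Two obvious ink clusters: dark gray/black and red.
points = [(70, 70, 70), (75, 72, 71), (68, 74, 70), (220, 85, 85), (215, 80, 90)]
centers = sorted(kmeans(points, k=2))
print([tuple(round(ch, 1) for ch in c) for c in centers])
# [(71.0, 72.0, 70.3), (217.5, 82.5, 87.5)]
```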

Bells and whistles


Besides the value and saturation thresholds, noteshrink.py has several other notable features. By default, it increases the brightness and contrast of the palette by rescaling the minimum and maximum intensity values to 0 and 255, respectively. Without this step, the eight-color palette of our scanned sample would look like this:



The adjusted palette is brighter:


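The brighten/contrast adjustment amounts to a linear rescale of the palette intensities (a sketch, assuming the global minimum and maximum channel values are remapped to 0 and 255 as described above):

```python
def stretch_palette(palette):
    """Remap intensities so the palette's darkest channel value becomes 0
    and its brightest becomes 255."""
    lo = min(min(color) for color in palette)
    hi = max(max(color) for color in palette)
    scale = 255.0 / (hi - lo)
    return [tuple(round((c - lo) * scale) for c in color) for color in palette]

# A murky three-color palette: dark ink, off-white paper, dull red.
palette = [(40, 42, 45), (230, 230, 235), (180, 60, 62)]
print(stretch_palette(palette))
# [(0, 3, 7), (248, 248, 255), (183, 26, 29)]
```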

There is also an option to force the background to white after the foreground colors have been isolated. To shrink the files even further, noteshrink.py can automatically run PNG optimization tools such as optipng, pngcrush, and pngquant.

The program produces multi-image PDF files like these using ImageMagick's conversion tool. As an added bonus, noteshrink.py automatically sorts file names numerically in ascending order (rather than alphabetically, as the shell's glob does). This helps when your dumb scanning program 9 emits file names like scan 9.png and scan 10.png and you want the pages of the PDF to be in order.
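The numeric file-name ordering can be achieved with a sort key like this (an illustration of the behavior, not noteshrink.py's exact code):

```python
import re

def numeric_key(name):
    """Sort key that compares embedded digit runs numerically, so that
    'scan 9.png' comes before 'scan 10.png'."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

files = ["scan 10.png", "scan 9.png", "scan 1.png"]
print(sorted(files, key=numeric_key))
# ['scan 1.png', 'scan 9.png', 'scan 10.png']
```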

Results


Here are a few more examples of the program's output. The first one (PDF) looks great with the default thresholds:



Visualization of color clusters:



The next one (PDF) required lowering the saturation threshold to 0.045, because the blue-gray lines are so faint:



Color Clusters:



Finally, an example of a scan on graph paper (PDF). For this one I set the value threshold to 0.05, because the contrast between the lines and the background is so low:



Color Clusters:



Together, the four PDF files come to about 788 KB, averaging roughly 130 KB per page.

Conclusions and future work


I am glad I managed to create a useful tool for preparing the PDF note files for my courses. Beyond that, I really enjoyed preparing this article, especially because it pushed me to try to improve on the 2D visualizations shown in the Wikipedia article on color quantization, and to finally learn three.js (a very fun tool that I will definitely use again).

If I ever return to this project, I would like to experiment with alternative quantization schemes. This week I had the idea of using spectral clustering on a nearest-neighbor graph of the set of colors; it seemed like an exciting new idea, but then I found a 2012 paper proposing exactly this approach. Oh well.

One could also try the EM algorithm to fit a Gaussian mixture model describing the color distribution. I am not sure how often that has been done in the past. Other interesting ideas include clustering in a "perceptually uniform" color space such as L*a*b*, and automatically determining the optimal number of clusters for a given image.

On the other hand, I need to get to other topics for this blog, so I will set the project aside for now and invite you to take a look at the noteshrink.py repository on GitHub.

Notes


1. The note samples are shown with the generous permission of my students Ursula Monaghan and John Larkin. ↑

2. The image shown here is actually reduced to 150 DPI so that the page loads faster. ↑

3. The only thing that our copier does well is reduce the size of PDF to 50-75 KB per page for documents of this type. ↑

4. Red, green, and blue are the primary colors of the additive color model. Your elementary-school art teacher may have told you that the primary colors are red, yellow, and blue. This is a lie [a simplification commonly taught in schools - translator's note]. However, there are indeed three subtractive primary colors: yellow, magenta, and cyan. The additive primaries describe combinations of light (which monitors emit), whereas the subtractive primaries describe combinations of pigment in inks and paints. ↑

5. Image by Maklaan on Wikimedia Commons. License: CC BY-SA 3.0 ↑

6. See the histogram of tips on Wikipedia for another example of why increasing the bin size is useful. ↑

7. Image by user SharkD on Wikimedia Commons. License: CC BY-SA 3.0 ↑

8. Why k = 7 and not 8? Because we need eight colors in the final image, and we have already determined the background color... ↑

9. Yes, I'm looking at you, Mac OS Image Capture... ↑

Source: https://habr.com/ru/post/351266/

