Introduction
Recently I came across an article on Habrahabr devoted to comparing "similar" images: Looks like. How perceptual hash works. Since I have been working on this topic for quite a long time (I am the author of the AntiDupl program), I wanted to share my experience here. In this article I will describe two variants of an algorithm for comparing similar images: a basic one and an improved one. Both have been tested by the author in practice within the above project. The presentation will avoid rigorous proofs, complex formulas and specialized mathematical terminology; I hope readers will forgive me for that.
Basic Algorithm
Measure of image similarity
When comparing similar images, the first question is: what should be taken as the measure of image similarity? Obviously, this quantity is the inverse of how much the images differ from each other. Consequently, we need to choose some metric that characterizes the difference between images; similar images are then those whose difference is below a certain threshold. For images of the same size, this measure of difference is usually the root-mean-square (RMS) difference of the pixels of one image from the other, although nothing prevents us from choosing another metric, for example the averaged absolute difference of the pixels.
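As an illustration, here is a minimal sketch of both metrics, assuming the grayscale pixels are stored row by row in a std::vector<uint8_t>; the function names are my own, not part of AntiDupl:

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Root-mean-square (RMS) difference between two grayscale images of equal size.
    // Assumes a.size() == b.size() and both are non-empty.
    double RmsDifference(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b)
    {
        double sum = 0.0;
        for (size_t i = 0; i < a.size(); ++i)
        {
            double d = double(a[i]) - double(b[i]);
            sum += d * d;
        }
        return std::sqrt(sum / a.size());
    }

    // An alternative metric: the averaged absolute difference of the pixels.
    double MeanAbsDifference(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b)
    {
        double sum = 0.0;
        for (size_t i = 0; i < a.size(); ++i)
            sum += std::abs(double(a[i]) - double(b[i]));
        return sum / a.size();
    }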
Pictures of different sizes
The question becomes somewhat more complicated if we want to compare images of different sizes. A fairly obvious solution is to bring all images to the same size. It remains to choose that size: if it is too small, the differences between images will be smoothed out and we will get many false positives; if it is too large, the comparison becomes needlessly resource-intensive. Experience shows that the best reduced size lies between 16x16 and 64x64. Experience also shows that the color characteristics of an image affect the visual impression of similarity much less than brightness does, so they can be discarded.
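One possible way to do this reduction is simple box averaging: each pixel of the small image is the mean of the source pixels that map onto it. The sketch below assumes the source has already been converted to 8-bit grayscale (for example, by a weighted sum of the color channels); the function name Downscale is illustrative:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Reduce a grayscale image of arbitrary size to dstSize x dstSize by averaging
    // the source pixels that fall into each destination cell. The source is stored
    // row by row as 8-bit grayscale.
    std::vector<uint8_t> Downscale(const std::vector<uint8_t>& src, int srcW, int srcH, int dstSize)
    {
        std::vector<uint8_t> dst(dstSize * dstSize);
        for (int y = 0; y < dstSize; ++y)
        {
            int y0 = y * srcH / dstSize, y1 = std::max(y0 + 1, (y + 1) * srcH / dstSize);
            for (int x = 0; x < dstSize; ++x)
            {
                int x0 = x * srcW / dstSize, x1 = std::max(x0 + 1, (x + 1) * srcW / dstSize);
                long long sum = 0;
                for (int sy = y0; sy < y1; ++sy)
                    for (int sx = x0; sx < x1; ++sx)
                        sum += src[sy * srcW + sx];
                dst[y * dstSize + x] = uint8_t(sum / ((y1 - y0) * (x1 - x0)));
            }
        }
        return dst;
    }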
Basic steps
So, the simplest variant of the algorithm for comparing similar images includes the following steps:
- Bring all images to the same size (for definiteness, 32x32).
- Discard color information (convert to grayscale).
- Compute the root-mean-square difference for each pair of reduced grayscale images.
- Compare the resulting RMS difference with a chosen threshold.
Here the threshold determines the measure of image similarity. With a threshold of 0%, the algorithm finds only completely identical images. With a 5% threshold it can also find visually similar images that differ in resolution, compression quality, small inscriptions, cropping, etc., with a small number of false positives. With a threshold above 10%, the number of false positives may, as a rule, exceed the number of true positives. Undoubtedly, this algorithm in one form or another has to be implemented in any program that searches for similar images. Variations usually concern the way the reduced image is obtained, its size, the way it is quantized over the color palette, and the choice of metric that measures the degree of difference between images.
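Put together, the basic search is a straightforward comparison of every pair of thumbnails against the threshold. The sketch below reuses the RmsDifference helper from the earlier sketch; the container layout and function name are my own assumptions:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // From the earlier sketch.
    double RmsDifference(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b);

    // Naive search for similar pairs: every 32x32 thumbnail is compared with every
    // other one. The threshold is given as a fraction of the maximum brightness
    // (e.g. 0.05 for the 5% threshold discussed above).
    std::vector<std::pair<size_t, size_t>> FindSimilarPairs(
        const std::vector<std::vector<uint8_t>>& thumbs, double threshold)
    {
        std::vector<std::pair<size_t, size_t>> pairs;
        for (size_t i = 0; i < thumbs.size(); ++i)
            for (size_t j = i + 1; j < thumbs.size(); ++j)
                if (RmsDifference(thumbs[i], thumbs[j]) <= threshold * 255.0)
                    pairs.emplace_back(i, j);
        return pairs;
    }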
The disadvantages of the basic algorithm
The disadvantages of this algorithm include its poor sensitivity when comparing low-contrast images, as well as images without large-scale features (contour drawings, images of text). The weak sensitivity comes from the fact that, once reduced to 32x32, such images as a rule become indistinguishable from one another. Another disadvantage of this approach is the quadratic dependence of the running time on the total number of images, since every pair of images has to be compared. Comparing one pair of normalized images requires about 10^3 arithmetic operations, so for a thousand images we already need about 10^9 operations, and for a million images about 10^15. Of course, not every user has a collection of a million images, but even comparing 100 thousand images (not such a rarity for amateur photographers) requires about 10^13 operations, which can take considerable time even on modern hardware.
Fast algorithm
As already mentioned, the basic implementation of the algorithm has two problems: low accuracy on images without large-scale features and the quadratic dependence of the running time on the total number of images. The first problem is quite specific and affects a fairly narrow class of images, so below we will focus on solving the second one. Obviously, this algorithm parallelizes and vectorizes well; however, in this article I would like to dwell not on software optimization but on algorithmic optimization.
Aspect ratio
Visually identical images, as a rule, have close width-to-height ratios. Hence a logical conclusion: compare only images with close aspect ratios. In practice, most images in a collection usually share the same aspect ratio (for example, 3:4) but differ in orientation (portrait or landscape). We thus get two isolated classes of images that need not be compared with each other, which speeds up the algorithm roughly by a factor of two.
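A possible form of this check is a simple relative tolerance on the aspect ratios; the tolerance value and the function name below are my own illustrative assumptions:

    #include <algorithm>
    #include <cmath>

    // Images whose aspect ratios differ noticeably never need a full comparison.
    bool CloseAspectRatios(int w1, int h1, int w2, int h2, double tolerance = 0.1)
    {
        double r1 = double(w1) / h1, r2 = double(w2) / h2;
        return std::abs(r1 - r2) <= tolerance * std::max(r1, r2);
    }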
Preliminary comparison
The first and most obvious step: if we have already reduced the image once, why not reduce it again? From the 32x32 image we make a 4x4 image, where each point of the reduced image is the average brightness of the corresponding 8x8 block of the original. It can be proved rigorously that the root-mean-square difference of these reduced images is no greater than the root-mean-square difference of the originals. Therefore, a preliminary comparison of the 4x4 thumbnails acts as a filter that rejects most candidates before the full comparison (provided that most images in the sample are relatively unique). As is easy to calculate, this simple technique yields nearly a 50-fold speedup. Obviously, though, the higher the comparison threshold, the less effective this filter becomes.
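A sketch of this filter, under the same vector-based thumbnail layout as above (function names are illustrative): the 4x4 image is built by averaging 8x8 blocks, and a pair is rejected whenever the 4x4 difference already exceeds the threshold:

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Reduce a 32x32 grayscale thumbnail to 4x4: every pixel of the result is the
    // average brightness of the corresponding 8x8 block of the source.
    std::vector<uint8_t> Reduce32To4(const std::vector<uint8_t>& src) // src.size() == 32*32
    {
        std::vector<uint8_t> dst(4 * 4);
        for (int y = 0; y < 4; ++y)
            for (int x = 0; x < 4; ++x)
            {
                int sum = 0;
                for (int sy = y * 8; sy < (y + 1) * 8; ++sy)
                    for (int sx = x * 8; sx < (x + 1) * 8; ++sx)
                        sum += src[sy * 32 + sx];
                dst[y * 4 + x] = uint8_t(sum / 64);
            }
        return dst;
    }

    // Since the RMS difference of the 4x4 images does not exceed that of the 32x32
    // ones, a pair whose 4x4 difference is already above the threshold can be
    // rejected without the full comparison.
    bool PassesPreliminaryFilter(const std::vector<uint8_t>& a4, const std::vector<uint8_t>& b4, double threshold)
    {
        double sum = 0.0;
        for (size_t i = 0; i < a4.size(); ++i)
        {
            double d = double(a4[i]) - double(b4[i]);
            sum += d * d;
        }
        return std::sqrt(sum / a4.size()) <= threshold * 255.0;
    }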
One-dimensional image indexing
As is well known, a search in an ordered array can be carried out in O(log N) time, as opposed to O(N) for an unordered one. So if we manage to order the images in some way, we can in theory reduce the quadratic search complexity O(N^2) to a quasilinear one, O(N*log N). Let us therefore try to create an index by which to sort the images. To do this, take our reduced 4x4 picture and reduce it by another factor of two, to 2x2. From this picture it is easy to form the first index:
i[0] = (p[0][0] + p[0][1] + p[1][0] + p[1][1]) / 4,
which is simply the average brightness of the image (in effect, the image reduced to 1x1). The RMS difference of the original images is no less than the difference of their average brightnesses. Thus, if the comparison threshold is t, all possible candidates must have an average brightness in the range from i[0] - t to i[0] + t. So if we sort all images by the index i[0], we only need to compare against a much smaller fraction of the images, 2*t/255. It is easy to see that for a typical threshold of 5% this gives a theoretical gain of a factor of 10. In practice the gain is smaller, since the statistical distribution of images by average brightness is not uniform over the range [0...255] but bell-shaped, with a maximum near 127 and falling off towards the edges; its effective width, and hence the practical gain relative to the theoretical one, is about 70-80%.
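With the images sorted by i[0], the candidate range can be found by binary search; a minimal sketch, assuming an Entry structure of my own invention that pairs the index with an image id:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct Entry { uint8_t i0; size_t image; }; // average brightness plus image id

    // Given entries sorted by i0, return the half-open range [first, last) of
    // candidates whose average brightness lies within +/- t of the query value
    // (t in brightness units, 0..255); only they can be closer than the threshold.
    std::pair<size_t, size_t> CandidateRange(const std::vector<Entry>& sorted, int i0, int t)
    {
        auto lo = std::lower_bound(sorted.begin(), sorted.end(), i0 - t,
            [](const Entry& e, int v) { return e.i0 < v; });
        auto hi = std::upper_bound(sorted.begin(), sorted.end(), i0 + t,
            [](int v, const Entry& e) { return v < e.i0; });
        return { size_t(lo - sorted.begin()), size_t(hi - sorted.begin()) };
    }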
Multidimensional Image Indexing
In addition to the average brightness i[0], three more indices can be constructed from the 2x2 image:
i[1] = 128 + (p[0][0] - p[0][1] + p[1][0] - p[1][1]) / 4,
i[2] = 128 + (p[0][0] + p[0][1] - p[1][0] - p[1][1]) / 4,
i[3] = 128 + (p[0][0] - p[0][1] - p[1][0] + p[1][1]) / 4,
which have a simple physical meaning: i[1] shows how much the left half of the image is brighter than the right, i[2] how much the upper half is brighter than the lower, and i[3] how much the diagonals differ. The images can be sorted by all of these indices just as by the first one, and all possible candidates will lie in the ranges from i[j] - t to i[j] + t. That is, in the four-dimensional space formed by these indices, the search region is a 4-dimensional cube with side 2*t centered at the coordinates given by the indices of the query image. The theoretical speedup, equal to the ratio of the volume of this cube to the total volume of the index space, is (2*t/255)^4 = 1/10000 for a 5% threshold. The practical speedup is much more modest, since the effective width of the distribution of images over i[1] and i[2] is about 40-50%, and over i[3] only 20-30%, of the maximum possible. Because of this narrow actual range, using the last index is often not worthwhile in practice, and one can limit oneself to the first three. Nevertheless, the achievable speedup reaches two orders of magnitude (for a 5% difference threshold).
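Computing all four indices from the 2x2 image, following the formulas above (the function name is illustrative):

    #include <array>
    #include <cstdint>

    // The four indices of a 2x2 image p: average brightness, left-right contrast,
    // top-bottom contrast and diagonal contrast.
    std::array<int, 4> MakeIndices(const uint8_t p[2][2])
    {
        std::array<int, 4> i;
        i[0] = (p[0][0] + p[0][1] + p[1][0] + p[1][1]) / 4;
        i[1] = 128 + (p[0][0] - p[0][1] + p[1][0] - p[1][1]) / 4;
        i[2] = 128 + (p[0][0] + p[0][1] - p[1][0] - p[1][1]) / 4;
        i[3] = 128 + (p[0][0] - p[0][1] - p[1][0] + p[1][1]) / 4;
        return i;
    }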
The main stages of the improved algorithm
So, let us summarize the reasoning above and list the main steps of the fast image comparison algorithm (a sketch combining the per-candidate checks follows the list).
- Bring all images to the same size (32x32) and discard color information.
- Build the 4x4 thumbnail for the preliminary comparison.
- Build the n-dimensional image index from the image used for the preliminary comparison.
- Determine the boundaries of the search region in the n-dimensional index space.
- Check the aspect ratios of the candidates from this region.
- Perform the preliminary comparison with the candidates that passed the previous step.
- Perform the full comparison with the candidates that passed the previous steps.
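The per-candidate checks (steps 5-7 above) can be arranged from cheapest to most expensive. A sketch, assuming a per-image record of my own devising and reusing the RmsDifference and PassesPreliminaryFilter helpers sketched earlier; the aspect ratio tolerance is also an assumption:

    #include <algorithm>
    #include <array>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Helpers sketched earlier in the article.
    double RmsDifference(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b);
    bool PassesPreliminaryFilter(const std::vector<uint8_t>& a4, const std::vector<uint8_t>& b4, double threshold);

    // Per-image data assumed by this sketch.
    struct ImageRecord
    {
        double aspect;                 // width / height of the original image
        std::array<int, 3> index;      // first three indices of the 2x2 image
        std::vector<uint8_t> thumb4;   // 4x4 thumbnail for the preliminary comparison
        std::vector<uint8_t> thumb32;  // 32x32 thumbnail for the full comparison
    };

    // For a candidate found inside the index search box, run the cheap checks first
    // and the full 32x32 comparison last.
    bool SimilarRecords(const ImageRecord& a, const ImageRecord& b, double threshold)
    {
        if (std::abs(a.aspect - b.aspect) > 0.1 * std::max(a.aspect, b.aspect))
            return false;
        if (!PassesPreliminaryFilter(a.thumb4, b.thumb4, threshold))
            return false;
        return RmsDifference(a.thumb32, b.thumb32) <= threshold * 255.0;
    }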
Disadvantages of the fast algorithm
In general, the efficiency of indexing depends strongly on the image similarity threshold: at a zero threshold, the complexity of searching for similar images approaches linear, O(N), while for a large threshold it approaches quadratic, O(N^2). Nevertheless, this is significant progress, since in the basic version the search complexity is always quadratic, O(N^2). It is also worth noting that, in practice, sorting over an n-dimensional index space requires organizing an n-dimensional structure holding lists of images that fall into given index ranges. This introduces additional overhead, makes too fine-grained indexing impractical, and makes such indexing inefficient for small image collections.
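One possible organization of such a structure is a map from coarse index cells to lists of image ids (a std::map here rather than a literal n-dimensional array); the cell size of 16 and the IndexTable name are my own assumptions:

    #include <array>
    #include <cstddef>
    #include <map>
    #include <vector>

    const int CELL = 16;               // quantization step in brightness units
    using Cell = std::array<int, 3>;   // quantized (i[0], i[1], i[2])

    struct IndexTable
    {
        std::map<Cell, std::vector<size_t>> buckets; // cell -> ids of images in it

        void Add(const std::array<int, 3>& index, size_t imageId)
        {
            Cell cell = { index[0] / CELL, index[1] / CELL, index[2] / CELL };
            buckets[cell].push_back(imageId);
        }

        // Collect ids from all cells overlapping the box [index[j] - t, index[j] + t].
        // The cells are coarse, so a few extra candidates may be returned; they are
        // rejected later by the exact checks.
        std::vector<size_t> Candidates(const std::array<int, 3>& index, int t) const
        {
            std::vector<size_t> result;
            for (int a = (index[0] - t) / CELL; a <= (index[0] + t) / CELL; ++a)
                for (int b = (index[1] - t) / CELL; b <= (index[1] + t) / CELL; ++b)
                    for (int c = (index[2] - t) / CELL; c <= (index[2] + t) / CELL; ++c)
                    {
                        Cell cell = { a, b, c };
                        auto it = buckets.find(cell);
                        if (it != buckets.end())
                            result.insert(result.end(), it->second.begin(), it->second.end());
                    }
            return result;
        }
    };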
Conclusion
In general, summing up all the improvements used in the improved version of the algorithm, we get a gain of 3-4 orders of magnitude over the basic implementation. Tests on large collections of images confirm this. The direct comparison of images with each other takes about 30 seconds on a single-core processor for a collection of 100 thousand images. In practice, the comparison speed becomes an insignificant factor in the overall search process: the images still have to be found, read from disk, and turned into thumbnails (although of course the last two steps can be done only once, since thumbnails weighing only about 1000 bytes are clearly worth saving for reuse).
I hope the readers found my presentation interesting, and perhaps even useful. I look forward to your feedback.
P.S. Software implementation
A lot of time has passed since this article was written, so I decided to make a small addition to it. As part of the Simd project, a C++ class, ImageMatcher, was written that implements the algorithm described in this article. Enjoy!