
Image search by fragment



In his talk, Alexander Krainov described how Yandex.Images clusters duplicate images: in other words, how duplicates are isolated and filtered out. The main idea was to extract the image contours with a DoG filter, then find key points and compute their descriptors.
Clustering the duplicates then reduces to searching for matching descriptors. The descriptors are the "digital format" of the key points from the article Clustering duplicates in image search [5].

But I would like to understand in more detail what these descriptors are and how to search for them.

Descriptors


Key points are points in the image that, ideally, should not move or disappear when the image is modified or distorted.
Descriptors are, roughly speaking, a condensed summary of a key point's characteristics: a representation in a format that can be checked for a match.
Effective ways to extract key points, compute their descriptors, and check them for matches are still an open research topic.

But we have to start somewhere, so let's turn to the OpenCV library for help.

The first thing to look at is SURF descriptors, which promise remarkable accuracy, and tests confirm it.

But there are a few nuances.
SURF descriptors are vectors of 128 (or 64) floating-point numbers per key point. A match is checked by searching for the nearest point (or even the two nearest), and the closer the point, the better the match.

So an image with 1,000 key points requires 128,000 floating-point numbers.
On top of that, detecting the points is itself a fairly heavy operation and takes considerable time, which makes the algorithm hard to use effectively on small devices. Moreover, the algorithm itself is closed and patented (in the USA).
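
As a rough illustration of this matching scheme, here is a minimal OpenCV sketch; it assumes an opencv-contrib build with the non-free xfeatures2d module enabled, and the file names are placeholders:

    import cv2

    # SURF lives in the non-free contrib module (the algorithm is patented),
    # so this needs an OpenCV build with non-free algorithms enabled.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

    img1 = cv2.imread('original.jpg', cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread('cropped.jpg', cv2.IMREAD_GRAYSCALE)

    kp1, des1 = surf.detectAndCompute(img1, None)  # des1: N x 64 floats
    kp2, des2 = surf.detectAndCompute(img2, None)

    # Nearest-neighbor matching: take the two closest candidates and keep
    # a match only if the nearest is clearly better (Lowe's ratio test).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]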

After getting acquainted with SIFT and SURF, I wanted something simple to implement, something that could run on a small server or device.

Perceptual hashes


That is how I came across perceptual hashes.
The idea is that a slight change in the image produces only a slight change in its hash.

Matching is checked by counting the number of differing bit positions between two hashes, i.e., by computing the Hamming distance [3]. The smaller the distance, the fewer differing bits and the better the match.
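
For 64-bit hashes this check is trivial; a minimal Python sketch:

    def hamming(h1: int, h2: int) -> int:
        # Number of differing bit positions between two hashes.
        return bin(h1 ^ h2).count('1')

    # The smaller the distance, the better the match:
    # hamming(0b1011, 0b1001) == 1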

This method is designed to find full or partial duplicates of an image: a significant change to the image's framing, or tampering with its content, makes the check useless, since the hashes will differ substantially.

In other words, perceptual hashes are not suitable for searching for semi-duplicates.

Based on this, an attempt was made to combine SURF descriptors and perceptual hashes to solve the problem of finding fuzzy semi-duplicates.

The method clustered key points so that the cluster centers coincided between the original and a cropped image; then a fragment was cut out around each such point and a perceptual hash computed from it. As a result, one image yielded about 200 crops and their hashes.
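
A rough sketch of that first hybrid is below; it uses the third-party imagehash library as a stand-in for pHash and ORB's FAST detector instead of SURF (which would need the non-free module), so the details are assumptions rather than the exact original code:

    import cv2
    import imagehash
    from PIL import Image

    def fragment_hashes(path, n_points=200, patch=32):
        # Cut a square patch around each key point and pHash it.
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        detector = cv2.ORB_create(nfeatures=n_points)
        hashes = []
        for kp in detector.detect(img, None):
            x, y = int(kp.pt[0]), int(kp.pt[1])
            crop = img[max(0, y - patch):y + patch,
                       max(0, x - patch):x + patch]
            if crop.size:
                hashes.append(imagehash.phash(Image.fromarray(crop)))
        return hashes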

This method has two and a half significant drawbacks:
1. low matching speed on a large set of hashes: searching through 1 million hashes took 20 seconds;
2. low speed of extracting the key points;
3. low accuracy: many false positives, high requirements on the target base, not suitable for all pictures, requires pre-moderation, and so on.

The very idea that a certain number of fingerprints would stand out from the image, which could be simply compared with each other, was fascinating.

Therefore, it was decided to try to find solutions to these problems.

Low lookup speed

Searching by Hamming distance over a large data set is a problem in its own right and requires its own approach.
After some research on the subject, it turned out that there are many solutions to this problem.
The most effective available one, an algorithm named HEngine, was chosen and implemented, speeding up selection from the database by a factor of about 60 [6].

SURF and key points

Since we are already working with binary hashes, or fingerprints, and matches are measured by Hamming distance, it is odd to keep using such a colossus as SURF; it is worth considering other ways to obtain key points and descriptors.

In general, OpenCV provides two types of descriptors:

- floating-point descriptors
- binary descriptors

Binary descriptors suit our task like nothing else, because they, too, use the Hamming distance for match checking.

ORB: an efficient alternative to SIFT or SURF

OpenCV already includes an excellent alternative to SURF that is not only open and free of licensing restrictions, but is also simpler and faster [1].

ORB is Oriented FAST and Rotated BRIEF — an improved version and combination of FAST key point detector and BRIEF binary descriptors.

ORB has one nuance that matters for us: its descriptors are 32 bytes per point.
A match is checked by summing the Hamming distances over the bytes of the descriptor (the first byte is compared with the first, the second with the second, and so on).

Our scheme assumes that one point yields one value, but here it yields 32 values that must, moreover, be compared index-by-index (first with first, second with second, and so on) against the target database.
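
For completeness, this byte-wise Hamming comparison is exactly what OpenCV's brute-force matcher does when given the Hamming norm; a short sketch (file names are placeholders):

    import cv2

    orb = cv2.ORB_create(nfeatures=1000)

    img1 = cv2.imread('original.jpg', cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread('fragment.jpg', cv2.IMREAD_GRAYSCALE)

    kp1, des1 = orb.detectAndCompute(img1, None)  # des1: N x 32 bytes
    kp2, des2 = orb.detectAndCompute(img2, None)

    # NORM_HAMMING sums bit differences over all 32 bytes of a descriptor,
    # byte by byte: first with first, second with second, and so on.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)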

Since our hash is a 64-bit number, the 32 bytes of the descriptor need to be squeezed into 8 bytes without losing much accuracy.

After some tests, it was decided to represent these 32 bytes as a 16x16 bit matrix and pass that matrix through the perceptual hash pHash. The result should be exactly a 64-bit number.
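
A minimal sketch of that transformation, assuming a standard DCT-based pHash over the bit matrix (the exact implementation lives in the prototype [7]):

    import numpy as np
    from scipy.fftpack import dct

    def descriptor_to_hash64(desc):
        # 32 bytes -> 256 bits -> 16x16 bit matrix
        bits = np.unpackbits(np.asarray(desc, dtype=np.uint8)).reshape(16, 16)
        # 2D DCT, the same way pHash treats image pixels
        freq = dct(dct(bits.astype(np.float64), axis=0, norm='ortho'),
                   axis=1, norm='ortho')
        # Keep the 8x8 low-frequency corner and threshold it by its median.
        low = freq[:8, :8].flatten()
        h = 0
        for bit in (low > np.median(low)):
            h = (h << 1) | int(bit)
        return h  # a 64-bit integer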

Now we come to the full description of the concept.

How indexing works


1. Obtain the key points and ORB descriptors, choosing the required number of points per image.
2. Represent each resulting 32-byte descriptor as a 16x16 bit matrix.
3. Convert the matrix into a 64-bit number using pHash.
4. Save the 64-bit prints in MySQL.
5. Choose the required Hamming distance limit and start the HEngine daemon, which will perform the search.
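
Steps 1 - 4 put together might look roughly like this (reusing descriptor_to_hash64 from the sketch above; the connection object and table schema are hypothetical, in the style of pymysql):

    import cv2

    def index_image(path, image_id, db, n_points=1000):
        # Steps 1-3: key points -> 32-byte ORB descriptors -> 64-bit prints
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        orb = cv2.ORB_create(nfeatures=n_points)
        _, descriptors = orb.detectAndCompute(img, None)
        if descriptors is None:
            return
        # Step 4: save the prints in MySQL (hypothetical schema)
        cur = db.cursor()
        for d in descriptors:
            cur.execute("INSERT INTO prints (image_id, hash) VALUES (%s, %s)",
                        (image_id, descriptor_to_hash64(d)))
        db.commit()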

How search works


1 - 3. Perform the same steps 1 - 3 as in indexing, but on the query image.
4. Send a request to the HEngine daemon, which returns all hashes within the specified limit.
5. If required, filter out irrelevant results.
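
HEngine's request protocol is not shown here, so a brute-force stand-in makes the search steps concrete (the vote threshold in step 5 is an arbitrary assumption):

    def search(query_hashes, db_prints, limit=7, min_votes=3):
        # db_prints: iterable of (image_id, hash64) pairs from the database.
        votes = {}
        for q in query_hashes:
            for image_id, h in db_prints:
                if bin(q ^ h).count('1') <= limit:  # within the Hamming limit
                    votes[image_id] = votes.get(image_id, 0) + 1
        # Step 5: drop candidates with too few matching prints.
        return sorted((i for i, v in votes.items() if v >= min_votes),
                      key=lambda i: -votes[i])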


Figure 1. Hamming distance limit of 7. Gray dots: key points found. Green: matching points. Red: points matched by standard ORB brute-force search.

What is the result?


As a result, we managed to solve several problems:
- find a way to quickly compute the Hamming distance over a large data set
- get rid of the big and unwieldy SURF
- increase the speed of extracting key points and their prints
- and do all this without losing much accuracy.

This made it possible to find images by a fragment, as well as fuzzy semi-duplicates, without large computational resources.


Figure 2. A treat for Friday

Depending on the settings, the described algorithm produces about 1,000 hashes per image using ORB binary descriptors.
For a base of 1,000 images that makes 1,000,000 hashes in the database. Searching for and clustering all duplicates takes about a minute and a half, including the brute-force search and hash matching.

Links

[1] Ethan Rublee, Vincent Rabaud, Kurt Konolige, Gary R. Bradski: ORB: An efficient alternative to SIFT or SURF. ICCV 2011: 2564-2571.
[2] en.wikipedia.org/wiki/SURF
[3] en.wikipedia.org/wiki/Hamming_distance
[4] phash.org
[5] habrahabr.ru/post/143667 Clustering duplicates in Yandex.Images search
[6] habrahabr.ru/post/211264 HEngine: an algorithm for searching hashes within a specified Hamming distance on a large data set
[7] github.com/valbok/img.chk My prototype of search by image fragments

Source: https://habr.com/ru/post/211773/

