📜 ⬆️ ⬇️

How to deal with repost or a couple of words about perceptual hashes

This publication focuses on perceptual image hashes and their use (for example, searching for duplicates).

Perceptual hash algorithms describe a class of functions for generating comparable hashes. They use various properties of the image to build an individual “print”. In the future, these "prints" can be compared with each other.

If the hashes are different, then the data is different. If the hashes match, then the data is most likely the same (since there is a possibility of collisions, then the same hashes do not guarantee the coincidence of the data). This article focuses on several popular methods for constructing perceptual image hashes, as well as a simple way to deal with collisions. Anyone interested, please under the cat.

Overview


There are many different approaches to the formation of perceptual image hashes. All of them are united by 3 main stages:
')


In this publication, we will not deal with the implementation of the above algorithms. It is an overview and tells about different approaches to the construction of hashes.

Perceptual Hash Algorithms


We will consider 4 different hash algorithms: Simple Hash [4], DCT Based Hash [1], [11], Radial Variance Based Hash [1], [5] and Marr-Hildreth Operator Based Hash [1], [6] .

Simple Hash (aka Average Hash)

The essence of this algorithm is to display the average value of low frequencies. In images, high frequencies provide detail, while low ones show structure. Therefore, to build such a hash function, which for similar images will produce a close hash, you need to get rid of high frequencies. Principle of operation:



The resulting hash is resistant to scaling, compressing or stretching the image, changing the brightness, contrast, color manipulation. But the main advantage of the algorithm is its speed. To compare hashes of this type, the function of the normalized Hamming distance is used.

image
Source image

image
The resulting "imprint"

Discrete Cosine Transform Based Hash (aka pHash)

The discrete cosine transform (DCT, DCT) [7] is one of the orthogonal transforms, which is closely related to the discrete Fourier transform (DFT) and is a homomorphism of its vector space. DCT, like any Fourier-oriented transform, expresses a function or signal (a sequence of a finite number of data points) as a sum of sinusoids with different frequencies and amplitudes. DCT uses only cosine functions, in contrast to the DFT, which uses both cosine and sine functions. There are 8 types of DCT [7]. The most common is the second type. We will use it to build a hash function.
Let's see what the second type of DCT is:

Let x [m], where m = 0, ..., N - 1 is the sequence of the signal of length N. We define the second type of DCT as

image

This expression can be rewritten as:

image

where c [n, m] is an element of the DCT matrix at the intersection of row n and column m.
DCT matrix is ​​defined as:

image

This matrix is ​​very convenient for calculating DCT. DCT can be calculated in advance for any necessary length. Thus, DCT can be represented as:
DCT = M × I × M '
Where M is the DCT matrix, I is the image of a square size, M 'is the inverse matrix.

Low-frequency DCT coefficients are most stable to image manipulation. This is because most of the signal information is usually concentrated in several low frequency coefficients. As a matrix I, an image is usually taken, compressed to a size of 32x32 and simplified using various filters, for example, bleaching. The result is a DCT matrix (I), in the upper left corner of which the low-frequency coefficients are located. To build a hash, the left upper block of frequencies 8x8 is taken. Then a 8 byte hash is built from this block by finding the average and building a chain of bits (as in Simple Based Hash).
Here are the steps to build a hash using this algorithm:

The main advantages of such a hash: resistance to small rotations, blurring and image compression, as well as the speed of comparison of hashes, due to their small size. To compare hashes of this type, the Hamming distance function is used.

Radial Variance Based Hash

The idea of ​​the Radial Variance Based Hash algorithm is to build a ray dispersion vector (LVD) based on the Radon transform [5]. Then DCT is applied to the LVD and a hash is calculated. The Radon transform is an integral transform of a multi-variable function along a straight line. It is resistant to image processing through various manipulations (for example, compression) and geometric transformations (for example, rotations). In the two-dimensional case, the Radon transform for the function f (x, y) looks like this:

image

The Radon transform has a simple geometric meaning: it is an integral of the function along a straight line perpendicular to the vector n = (cos a, sin a) and passing at a distance s (measured along the vector n, with the corresponding sign) from the origin.

image

To expand the Radon transform for discrete images, the linear integral along the straight line d = x ∙ cos α + y ∙ sin α can be approximated by summing the values ​​of all pixels lying in a line one pixel wide:

image

It was later found that it is better to use the variance instead of the sum of the pixel values ​​along the projection line [6]. Dispersion handles brightness discontinuities along the projection line much better. Such brightness discontinuities appear because of edges that are orthogonal to the projection line.
Now we define the ray vector of dispersion. Let Γ (α) be the set of pixels on the projection line corresponding to a given angle. Let (x ′, y ′) be the coordinates of the central pixel in the image. x, y belong to Γ (α) if and only if

image

Now we define the radial vector variance vector:
Let I (x, y) denote the pixel brightness (x, y), #Γ (α) is the power of the set, then the ray vector of dispersion R [α], where α = 0.1, ..., 179 is defined as

image

It is enough to construct a vector with 180 angle values, since the Radon transform is symmetric. The resulting vector can be used to build a hash, but this algorithm proposes some other improvement - to apply DCT to the received vector. The result is a vector that inherits all the important properties of DCT. The first 40 coefficients of the obtained vector, which correspond to low frequencies, are taken as a hash. Thus, the size of the resulting hash is 40 bytes.
So, we list the steps for constructing a hash using this algorithm:

For comparison, hashes of this type are used to search for the peak of the mutual correlation function.

Marr-Hildreth Operator Based Hash

The operator Marra-Hildreta [16] allows you to define the edges on the image. Generally speaking, the border in an image can be defined as an edge or contour that separates adjacent parts of the image, which have relatively different characteristics in accordance with certain features. These features can be color or texture, but the most commonly used is a gray gradation of the image color (brightness). The result of the definition of boundaries is a map of boundaries. The border map describes the classification of borders for each image pixel. If the boundaries are defined as an abrupt change in brightness, then derivatives can be used to find them or a gradient.
Let function image denotes the brightness level for the line (one-dimensional array of pixels). The first approach to determining the boundaries is to find the local extrema of the function, that is, the first derivatives. The second approach (the Laplace method) is to find the second derivatives image [one].
Both approaches can be adapted for the case of two-dimensional discrete images, but with some problems. To find derivatives in a discrete case, an approximation is required. In addition, noise in the image can significantly impair the process of searching for boundaries. Therefore, before determining the boundaries to the image, you must apply a filter that suppresses noise. To build a hash, you can choose an algorithm that uses the Laplace operator (2 approach) and the Gauss filter.
We define a continuous Laplacian (Laplace operator):
Let be image determines the brightness of the image. Then continuous Laplacian is defined as:

image

Zeroing image and there are points corresponding to the boundary of the function image , since these are the points at which the second derivative vanishes. Various filters (discrete Laplace operators) can be obtained from continuous Laplacian. Such a filter image , can be applied to a discrete image by using the convolution of functions. Laplace operator for image image can be rewritten as:

image

where * denotes the convolution of functions. To build a boundary map, you need to find the treatment of zero discrete operator image .
Now consider the Marra-Hildreth operator. It is also called the Laplacian from the Gaussian filter (LoG) - this is a special type of discrete Laplace operator. LoG is constructed by applying the Laplace operator to a Gauss filter (function). The peculiarity of this operator is that it can select borders at a certain scale. Scale variables can be varied in order to better define the boundaries.
The Gauss filter is defined as:

image

Convolution with Laplace operation can be swapped, because the derivative and convolution are linear operators:

image

This property allows you to pre-calculate the operator image , because it does not depend on the image ( image ).
(operator Marra-Hildreta, Laplacian from the Gauss filter, LoG). LoG hc (x, y) is defined as:

image

To use LoG in discrete form, we will make a discretization of this equation, substituting the desired scale variable. By default, its value is taken as 1.0. The filter can then be applied to the image using discrete convolution.
Define the discrete convolution:
Let x, y, z be the pixel width, length and depth of the image I.
The result of the R convolution of the image I with the mask M is defined as:

image

Now we list the steps of the algorithm that uses the Marr-Hildret operator to build the hash:

The size of the resulting hash is not small, however, the comparison of two hashes takes a rather short time compared with the Radial algorithm, since the function of the normalized Hamming distance is used. Also, this algorithm is sensitive to image rotation, but is resistant to scaling, dimming, and compression.

Perceptual hash value comparison functions


Hamming distance

Hamming distance determines the number of different positions between two binary sequences.
Definition:
Let A be an alphabet of finite length. image - binary sequences (vectors). Hamming distance Δ between x and y is defined as:

image

This method of comparing hash values ​​is used in the DCT Based Hash method. The hash is 8 bytes in size, so the Hamming distance lies in the [0, 64] segment. The smaller the Δ value, the more similar the images.
To facilitate comparison, the Hamming distance can be normalized using the length of the vectors:

image

The normalized Hamming distance is used in the Simple Hash and Marr-Hildreth Operator Based Hash algorithms. Hamming distance lies in the interval [0,1] and the closer Δ to 0, the more similar the images.

Peak of the mutual correlation function

The correlation between the two signals is defined as:

image

where x (t) and y (t) are two continuous functions of real numbers. The function rxy (t) describes the displacement of these two signals relative to time T. The variable T determines how far the signal is shifted to the left. If the signals x (t) and y (t) are different, the function rxy T is said to be mutually correlated.
Define the Normalized Mutual Correlation Function:
Let xi and yi, where i = 0, ... N - 1 are two sequences of real numbers, and N is the length of both sequences. IAF with delay d will be defined as:

image

where mx and my denote the average value for the corresponding sequence.
The peak of the mutual correlation function (PCC) is the maximum value of the function rd, which can be achieved on the interval d = 0, N.
PCC is used to compare hash values ​​in the Radial Variance Based Hash algorithm. PCC ∈ [0,1], the greater its value, the more similar the images.

A few words about the practice


We considered 4 different approaches in the implementation of hash algorithms. The perceptual hashes are wide in scope, from searching for similar images to detecting spam in pictures.

In our project, we use a perceptual hash to identify duplicate images. At the same time, it is often possible to successfully compare among themselves images containing watermarks (obtained from different sources) or images with cropped edges.
Radial Variance Based Hash can be considered the best algorithm for finding duplicates, but it is extremely difficult to use it on large amounts of data, due to the very time-consuming comparison of hashes.

For ourselves, we chose DCT based Hash. We use 64-bit hash, this is quite enough to find duplicates, but such a small hash size inevitably leads to collisions. With collisions, we struggle to construct a histogram (spectrum) for the image.
Each component of the spectrum means the relative number of colors of one or another range in the image, in other words, it shows the distribution of colors in the image. It looks like this:

image

Thus, we first look for images comparing the hash, and then compare the histograms, if the histograms are significantly different, then such an image is not considered to be similar. The histograms themselves can be compared by counting the cross-correlation as well as by comparing Radial Variance Based Hash using the Pearson cross-correlation:
image

If this topic is interesting to the community, in the next article I will talk specifically about our implementation of DCT hash, as well as how to find the Hamming distance.

Bibliography
[1] Egmont-Petersen, M., de Ridder, D., Handels, H. “Image processing with
neural networks - a review. ” Pattern Recognition 35 (10), pp. 2279–2301 (2002)
[2] Christoph Zauner. Implementation and Benchmarking of Perceptual Image
Hash Functions (2010)
[3] Locality-sensitive hashing.
en.wikipedia.org/wiki/Locality-sensitive_hashing
[4] Simple and DCT perceptual hash-algorithms.
www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-
It.html
[5] Standaert, FX, Lefebvre, F., Rouvroy, G., Macq, BM, Quisquater, JJ,
Legat, JD: Practical evaluation of a radial soft hash algorithm. In
Proceedings of the International Symposium on Information Technology:
Coding and Computing (ITCC), vol. 2, pp. 89-94. IEEE, Apr. 2005
[6] D. Marrand, E. Hildret. Theory of edge detection, pp. 187-215 (1979)
[7] en.wikipedia.org/wiki/Discrete_cosine_transform
[8] en.wikipedia.org/wiki/Median_filter
[9] en.wikipedia.org/wiki/Median
[10] en.wikipedia.org/wiki/Gaussian_blur
[11] Zeng Jie. A Novel Block-DCT and PCA Based Image Perceptual Hashing
Algorithm. IJCSI International Journal of Computer Science Issues, Vol. ten,
Issue 1, No 3, January 2013
[12] de.wikipedia.org/wiki/Marr-Hildreth-Operator

Source: https://habr.com/ru/post/237307/


All Articles