Hi, Habr! Here is a translation of the article
“Image Similarity using Deep Ranking” by Akarsh Zingade.
The Deep Ranking Algorithm
The concept of "similarity of two images" has not been defined yet, so let's define it, at least for the purposes of this article.
The similarity of two images is the result of comparing them according to certain criteria. Its quantitative measure determines the degree of similarity between the intensity patterns of the two images. A similarity measure compares features that describe the images. Commonly used measures include the Hamming distance, the Euclidean distance and the Manhattan distance.
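As an illustration of the distance measures named above, here is a minimal sketch in plain Python. It assumes each image has already been reduced to a feature vector of equal length; the function names and the sample vectors are made up for this example and do not come from the original article.

```python
import math


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical feature vectors of two images.
u = [0, 3, 4]
v = [0, 0, 0]
print(hamming_distance(u, v))    # 2
print(euclidean_distance(u, v))  # 5.0
print(manhattan_distance(u, v))  # 7
```

A smaller distance means the two feature vectors, and so the two images, are considered more similar.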
Deep Ranking learns fine-grained image similarity, characterizing the fine-grained similarity relationship between images with a set of triplets.
What is a triplet?
A triplet contains a request (query) image, a positive image and a negative image, where the positive image is more similar to the request image than the negative one is.
An example of a set of triplets:
The first, second and third rows correspond to the request, positive and negative images, respectively. The images in the second row (positive images) look more like the request images than those in the third row (negative images).
Deep Ranking Network Architecture
The network consists of 3 parts: triplet sampling, ConvNet and the ranking layer.
The network accepts triplets of images as input. One image triplet contains a request image $p_i$, a positive image $p_i^+$ and a negative image $p_i^-$, which are fed independently into three identical deep neural networks.
The topmost ranking layer evaluates the triplet loss function. The resulting error is propagated back to the lower layers in order to minimize the loss function.

Let's now take a closer look at the middle layer:

ConvNet can be any deep neural network (this article uses an implementation of the VGG16 convolutional neural network). ConvNet contains convolutional layers, max pooling layers, local normalization layers and fully connected layers.
The other two parts receive down-sampled versions of the image and apply a convolution stage followed by max pooling. The outputs of the three parts are then normalized and, finally, merged by a linear layer followed by another normalization.
Triplet formation
There are several ways to build the triplet file, for example by using expert judgment. This article uses the following algorithm:
- Each image in a class serves as a request image.
- Every other image in the same class can serve as a positive image for that request image. The number of positive images per request image can be limited.
- A negative image is randomly selected from any class other than the request image's class.
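The three rules above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `generate_triplets`, the `max_positives` parameter, the dictionary layout and the file names are hypothetical, image identifiers are assumed to be unique, and the negative is drawn uniformly from the pooled images of all other classes, which is one possible reading of "randomly selected".

```python
import random


def generate_triplets(images_by_class, max_positives=None, seed=0):
    """Return a list of (request, positive, negative) triplets.

    images_by_class: dict mapping a class label to a list of image identifiers.
    max_positives: optional cap on the number of positives per request image.
    """
    rng = random.Random(seed)
    triplets = []
    for label, images in images_by_class.items():
        # Every image that belongs to a different class is a negative candidate.
        negatives = [img
                     for other, imgs in images_by_class.items()
                     if other != label
                     for img in imgs]
        if not negatives:
            continue  # a negative requires at least one other class
        for request in images:
            # All other images of the same class are positive candidates.
            positives = [img for img in images if img != request]
            if max_positives is not None:
                # Simplest possible cap: keep the first few candidates.
                positives = positives[:max_positives]
            for positive in positives:
                triplets.append((request, positive, rng.choice(negatives)))
    return triplets


# Hypothetical data set with two classes.
data = {
    "cat": ["cat1.jpg", "cat2.jpg", "cat3.jpg"],
    "dog": ["dog1.jpg", "dog2.jpg"],
}
triplets = generate_triplets(data, max_positives=1)
print(len(triplets))  # 5
for request, positive, negative in triplets:
    print(request, positive, negative)
```

With the cap set to one positive per request image, each of the five images yields exactly one triplet; without the cap the same data produces eight.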
Triplet loss function
The goal is to train a function that assigns a small distance to similar images and a large distance to dissimilar ones. This can be expressed as:
$$ l(p_i, p_i^+, p_i^-) = \max\{0,\; g + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))\} $$
Where
l is the loss for a triplet,
g is the gap parameter that regulates the margin between the distances of the two image pairs ($p_i$, $p_i^+$) and ($p_i$, $p_i^-$),
f is the embedding function, which maps an image to a vector,
$p_i$ is the request image,
$p_i^+$ is the positive image,
$p_i^-$ is the negative image, and
D is the Euclidean distance between two points in Euclidean space.
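To make the symbols concrete, here is a minimal sketch of a hinge-style triplet loss, max(0, g + D(f(p_i), f(p_i^+)) - D(f(p_i), f(p_i^-))), in plain Python. The embeddings are assumed to be already computed by f and are passed in as lists of numbers; the function names, the sample vectors and the default g = 1.0 are assumptions made for this example.

```python
import math


def euclidean_distance(a, b):
    """D: Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_hinge_loss(f_request, f_positive, f_negative, g=1.0):
    """Loss for one triplet of embeddings.

    The loss is zero once the negative is at least g farther from the
    request than the positive is; otherwise it grows linearly.
    """
    d_pos = euclidean_distance(f_request, f_positive)
    d_neg = euclidean_distance(f_request, f_negative)
    return max(0.0, g + d_pos - d_neg)


request = [0.0, 0.0]
near = [0.0, 1.0]   # distance 1 from the request
far = [3.0, 4.0]    # distance 5 from the request

# Well-ordered triplet: the positive is much closer than the negative.
print(triplet_hinge_loss(request, near, far))  # 0.0
# Badly ordered triplet: the "positive" is farther away than the "negative".
print(triplet_hinge_loss(request, far, near))  # 5.0
```

The first call returns zero because 1 + 1 - 5 is negative, so the triplet already satisfies the margin and contributes nothing to training.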
Deep Ranking Algorithm Implementation
The implementation uses Keras.
Three parallel networks are used for the request, positive and negative images.
There are three main parts to the implementation:
- Implementation of three parallel multiscale neural networks
- Implementation of the loss function
- Triplet generation
Training three parallel deep networks would consume a lot of memory. Instead of three parallel deep networks that receive the request, positive and negative images, the images are fed sequentially into a single deep neural network. The tensor passed to the loss layer contains one image embedding per row, and each row corresponds to one input image in the batch. Since the request, positive and negative images are fed in that order, the first row corresponds to the request image, the second to the positive image and the third to the negative image, and this pattern repeats until the end of the batch. In this way the ranking layer receives the embeddings of all images, after which the loss function is computed.
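The row layout described above can be illustrated with a small sketch in plain Python. The helper names `triplets_to_batch` and `batch_to_triplets` and the file names are hypothetical; the point is only to show how triplets are flattened into one sequential batch and how the rows map back to triplets.

```python
def triplets_to_batch(triplets):
    """Flatten (request, positive, negative) triplets into one sequence.

    The result is ordered request, positive, negative, request, ... so that
    a single network can process the items one after another.
    """
    batch = []
    for request, positive, negative in triplets:
        batch.extend([request, positive, negative])
    return batch


def batch_to_triplets(rows):
    """Group consecutive rows of a batch back into triplets."""
    if len(rows) % 3 != 0:
        raise ValueError("batch size must be a multiple of 3")
    return [tuple(rows[i:i + 3]) for i in range(0, len(rows), 3)]


triplets = [
    ("cat1.jpg", "cat2.jpg", "dog1.jpg"),
    ("dog1.jpg", "dog2.jpg", "cat3.jpg"),
]
batch = triplets_to_batch(triplets)
print(batch[0], batch[1], batch[2])  # cat1.jpg cat2.jpg dog1.jpg
print(batch_to_triplets(batch) == triplets)  # True
```

Row i of the batch belongs to triplet i // 3, and i % 3 tells whether it is the request (0), positive (1) or negative (2) image.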
To implement the ranking layer, we need to write our own loss function, which will calculate the Euclidean distance between the request image and the positive image, as well as the Euclidean distance between the request image and the negative image.
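Implementation of the loss calculation function:

    import tensorflow as tf
    from keras import backend as K

    _EPSILON = K.epsilon()

    def _loss_tensor(y_true, y_pred):
        # Clip every predicted value to the range [_EPSILON, 1 - _EPSILON].
        y_pred = K.clip(y_pred, _EPSILON, 1.0 - _EPSILON)
        loss = tf.convert_to_tensor(0, dtype=tf.float32)
        # ... (the rest of the function is omitted here)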
The batch size must always be a multiple of 3, because each triplet contains 3 images and the images of a triplet are fed sequentially (each image is sent through the same deep neural network in turn).
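For intuition, here is a framework-free sketch in plain Python of a loss computed over such a batch. It is not the article's Keras code: the function name `batch_triplet_loss`, the default g = 1.0 and the choice to apply the hinge to each triplet before averaging are assumptions made for this example.

```python
import math


def batch_triplet_loss(embeddings, g=1.0):
    """Mean hinge triplet loss over a batch of embedding rows.

    embeddings: list of equal-length vectors ordered as
    request, positive, negative, request, positive, negative, ...
    """
    if len(embeddings) == 0 or len(embeddings) % 3 != 0:
        raise ValueError("batch size must be a non-zero multiple of 3")

    def distance(a, b):
        # Euclidean distance between two embedding rows.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    losses = []
    for i in range(0, len(embeddings), 3):
        request, positive, negative = embeddings[i:i + 3]
        d_pos = distance(request, positive)
        d_neg = distance(request, negative)
        losses.append(max(0.0, g + d_pos - d_neg))
    return sum(losses) / len(losses)


# Two hypothetical triplets: the first is well ordered, the second is not.
batch = [
    [0.0, 0.0], [0.0, 1.0], [3.0, 4.0],
    [0.0, 0.0], [3.0, 4.0], [0.0, 1.0],
]
print(batch_triplet_loss(batch))  # 2.5
```

The two triplets contribute 0.0 and 5.0, so the mean is 2.5. A batch whose size is not a multiple of 3 is rejected, since the trailing rows could not form a complete triplet.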
The rest of the code link