
Building features and comparing images: global features. Lectures from Yandex

We continue to publish lectures by Natalia Vasilyeva, Senior Researcher at HP Labs and Head of HP Labs Russia. Natalya Sergeevna taught a course on image analysis at the St. Petersburg Computer Science Center, which was created on the joint initiative of the Yandex Data Analysis School, JetBrains, and the CS Club.



There are nine lectures in the program; the previous ones have already been published.

Under the cut you will find the plan of this lecture, slides and a detailed transcript.


Why compare images

Image features

Popular distance functions for histograms

Spatial arrangement of colors

Adjacency matrices

Texture: spectral features

Comparing texture features

Shape of objects: regions

Conclusion
Full text transcript of the lecture
Today we will begin to talk about how to describe images. Until now we have mostly dealt with tasks where we took an image and produced an image in return: what can be done with a picture to improve or modify it in some way. From today we will start talking about how an image can be described by a small set of numerical characteristics, i.e. represented as a numerical vector. Moreover, it is desirable that this vector be as short as possible while describing what is shown in the picture as well as possible.

So why might you need to describe an image with such a short vector? (From the audience: compression.) Compression is a separate topic. Indexing and searching: we want to compare images with each other instead of comparing them pixel by pixel, which, first, is not very convenient and, second, does not capture similarity of content. We will always need some carefully chosen similarity measures, some metrics, under which images that are similar in content turn out to be close to each other.

It is clear that the human eye cannot notice a difference of a single pixel. Nevertheless, such descriptions would help us if, for example, we were solving tasks such as image classification. Suppose we have a set of labels: either these are interior scenes (indoor, as it is called in the English literature), or outdoor scenes, or some kind of landscape, a sunset, or whatever. We want to put into one pile, and mark with one label, even pictures that do not coincide exactly but have something in common. For example, look at these pictures here (I tried to arrange them in pairs). Intuitively, these two pictures are clearly more similar to each other than, say, these ones with the trams; this pair is also less similar, and this pair still has something in common. If we were solving a task such as image search, the query could be a picture or even a set of words: if we typed into the search bar that we want to find images with trams, we would need to return all the pictures that depict a tram of one kind or another, i.e. pictures that somehow match the annotation "tram". How we can do this is what we will talk about a little today.

And there is also a separate task that is sometimes called image parsing. It is somewhat similar to the annotation task in the sense that it also assigns various tags to a picture. But annotation sets itself the task of assigning tags that describe the image as a whole: for example, a picture may be annotated with "sky", "road", "ocean shore", and all these annotations refer simply to the picture as a whole. Image parsing, in contrast, combines this with object recognition: we have to point out that this region is a car, and at the same time we can describe not only object-level things but also say, for instance, that it is a sunny day.

Let us talk in a little more detail about each of these tasks. If the task is to search for images by content, i.e. when no information about the picture is available to us except the values of its pixels (the date is unknown, the name is unknown, no one has ever written any annotations for this picture), then the query to such a system is usually also a picture: either simply some sample for which you want to find something similar, or, if someone is capable and talented, a sketch of the picture they would ideally like to find. There is a certain collection of images. Then, in a very rough outline, a pairwise comparison of our query with each of the pictures in the database takes place, and the search result is returned: only those images are selected which, according to our similarity measure, turned out to be close to the given query image. In real life it does not work quite like that: no one compares the query with the millions of photos stored in the database, but we will talk about that a little later, when we discuss how indexing actually happens. Today we will be interested in how to represent a picture as a feature vector, i.e. as a set of numbers, and in various similarity measures, i.e. how we can compare these vectors.
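As a rough illustration of this pipeline, here is a minimal sketch (not the system from the lecture); extract_features here is a stand-in for any of the descriptors discussed below:

```python
import numpy as np

def extract_features(image):
    """Placeholder descriptor: a normalized 64-bin gray-level histogram.
    Any of the color/texture/shape features discussed below could be used instead."""
    hist, _ = np.histogram(image, bins=64, range=(0, 256))
    return hist / max(hist.sum(), 1)

def search(query, collection, top_k=5):
    """Naive content-based search: pairwise comparison of the query
    with every image in the collection, using the L1 distance."""
    q = extract_features(query)
    distances = [np.abs(q - extract_features(img)).sum() for img in collection]
    order = np.argsort(distances)            # indices of the most similar images first
    return [(int(i), float(distances[i])) for i in order[:top_k]]
```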

During classification, images are also compared. I hope all of you know what the classification task is. In general terms, we have a certain training set consisting of samples for which we know labels, i.e. which class each of them belongs to. For example, the classes could be an open landscape, a closed landscape, a city view and an interior; it could be any categories at all. Then a classifier model is trained: speaking very roughly, during training the features of the objects within each group (they do not have to be images, it is just that today we are talking about them) are matched against the labels. That is, we need to understand which attributes of an object are responsible for it being assigned to a particular class. The output is a classifier model, and this training stage usually happens offline. During testing, when new objects arrive whose class we do not know and we want the machine to decide for us, the object is compared against the model we built, and the classifier predicts the value: based on the data on which it was trained, it predicts a class for the output.
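As a minimal sketch of this train/predict split, assuming the feature vectors have already been extracted (the class names below are just the ones from the example, and scikit-learn's k-nearest-neighbours classifier is only one possible choice):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Training set: feature vectors with known class labels (the offline stage).
X_train = np.random.rand(40, 64)          # 40 images described by 64-dimensional features
y_train = np.random.choice(
    ["open landscape", "closed landscape", "city view", "interior"], size=40)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Testing: a new object whose class is unknown; the classifier predicts it.
x_new = np.random.rand(1, 64)
print(model.predict(x_new))
```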

To detect objects, we generally also need to compare pictures. True, here it is usually not the whole picture that is compared, but only some fragments. This is usually also solved as a classification problem: the training set consists not of whole images but of fragments that depict a particular object, and the task is then to find, in the images from the database, a fragment that contains this object. So we compare not the whole picture but some fragment of it. An object detection task usually requires localizing an object of a certain class in the picture. Among the most common such tasks are, for example, finding pedestrians or finding vehicles in order to, say, build an unmanned vehicle.
Annotation, as I said, is the task of assigning a set of tags. Again, in general this is solved through a classifier: we have a training set of pictures to which tags have already been assigned, the tags act as class names, and then we try to classify each incoming picture into one of these classes. We get a multi-label classification: one image can be assigned to several classes at once, i.e. several labels can be assigned simultaneously.
How to compare pictures?

In all these tasks we need to compare images, and in fact the mechanisms by which images are compared are the same for all the tasks above, with only slight variations. We will mainly talk about how to describe an image as a set of features, both the whole image and a fragment, and how to find those points and fragments that are the most interesting and informative. Then we will talk about how to compare these sets of features with each other.

In general, image features can be divided into textual and visual. Textual features are everything we can extract not from the picture itself but from what surrounds it: tags, annotations, the creation date, the file name, and so on. Until recently, image search in most search engines, including Google and Yandex, worked purely on this surrounding information, and they still rely on it quite heavily: they index not the content of the picture but the text around it. If there is an image on a page, they index the text around this image and return images whose context and surroundings are relevant to the query.

The creation date can also be very useful, especially in combination with features built from the content, the so-called visual features. For example, suppose we have photographs taken on vacation, which usually carry a date, and the task is to identify groups of near-duplicates. If you like to experiment when photographing and shoot the same place several times with different settings in order to pick the best shot later, then this grouping can be done automatically by combining the date label with the visual content: if two photos with completely different dates end up in the same group, then most likely the group should be split.

Today we will talk only about visual features, namely color, texture and shape, and about how they are laid out spatially. You can look at a picture from several perspectives: you can describe the color, texture and shape of the entire image and how all of this is arranged in space within the picture, or you can describe only the color of a fragment and the texture of a fragment. People usually speak of global and local features: global features are those that describe the entire image, local features are those that describe only a fragment.

For similarity search, global features are most often used, or a fairly large number of local ones, computed not at all points of the image but at some predefined ones. For tasks such as searching for near-duplicates or searching by fragment, local features are used more often, because there we are trying to match a fragment to a fragment rather than the whole image to the whole image.

Today we will mainly talk about global features. It must be said, though, that the mechanism used to describe the entire picture, i.e. how to build a numerical vector over the whole image, can of course also be applied to a fragment: if we know which fragment we want to describe, we can use the same method to describe not the whole picture but only that fragment.

The set of numbers that describes an image is usually called a feature vector: a certain set of parameters that reflects the properties of our image. The task is to make this vector as small as possible, so that as few numbers as possible describe our image, while the main information about what is shown in the picture is still contained in this description. Unlike compression algorithms, here (for the tasks I am talking about) subsequent recovery of the image is not required, so its representation really can be compressed quite strongly. We only need that, taking two different vectors and comparing them with each other, if they turn out to be similar, the pictures from which they were built are also similar.
If we also specify a certain metric (which is in fact not always a metric: the triangle inequality quite often fails for the functions used to compare these vectors, so I prefer to call them similarity or distance functions), then we get a feature space: a certain way to build a set of numbers from a picture, plus a certain function with which these vectors can be compared.

Most often, especially in tasks such as search and classification, several completely heterogeneous feature sets are used at once: if we are talking about color, there are separate features that describe color, separate features that describe texture, shape, and so on. A separate similarity function is then defined on each of these spaces, and the question arises of how to combine all of this. The simplest solution is to just concatenate all the vectors and define some new similarity function that computes the distance between such long vectors, but this is rarely done. Most often the distance in each of the spaces is computed separately, and then either a linear combination is built or even cleverer ways of fusing it all are used. If we have time at the end, we will talk about that too.
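A minimal sketch of the simplest fusion mentioned here, a weighted linear combination of per-space distances; the weights and the individual distance functions are placeholders, not values from the lecture:

```python
import numpy as np

def l1(a, b):
    return np.abs(a - b).sum()

def combined_distance(features_a, features_b, weights, dists):
    """Linear combination of distances computed in heterogeneous feature spaces
    (e.g. one space for color, another for texture)."""
    return sum(weights[name] * dists[name](features_a[name], features_b[name])
               for name in weights)

# Usage: two images, each described by a color histogram and a texture vector.
img_a = {"color": np.array([0.2, 0.5, 0.3]), "texture": np.array([1.0, 0.1])}
img_b = {"color": np.array([0.3, 0.4, 0.3]), "texture": np.array([0.8, 0.2])}
d = combined_distance(img_a, img_b,
                      weights={"color": 0.7, "texture": 0.3},
                      dists={"color": l1, "texture": l1})
print(d)
```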

Everything I am going to talk about today is basically features of color, texture and shape. The spatial features are drawn with a dotted line because, again, color can be described both as the color of the entire image and tied to how it is located within the image; the same goes for texture and for shape. So we can consider spatial features not as something forming a separate class, but as an addition to the three main classes.

We start with color. The easiest way to describe color is with a histogram, i.e. simply looking at the distribution for each of the color channels. It must be said that color histograms are still used to solve a very large number of problems, because they are very simple to compute and very easy to understand. They cannot solve everything; they are not capable of handling very complex tasks. But, say, the simplest similarity search, when the database is not very large and the images differ from each other in color composition, is solved with them. Or, on the contrary, when the database is very large and you have a chance of finding near-copies of the picture, then a histogram search will also work reasonably well.

How a histogram is computed we have already discussed. Further, a very large set of metrics is used to compare histograms. This can be the histogram intersection, or its equivalent, the histogram difference: we take the histogram representing one image and the histogram representing the second image and simply look at the difference of these vectors. It can be the Euclidean distance, or the maximum of the component-wise differences. You can also use the so-called chi-square, or sometimes even the so-called Earth mover's distance (I do not even know how to say it in Russian); that distance generally works well if your histogram is constructed so that adjacent bins always correspond to close colors. But most often either the simple intersection, L1, or L2 is used, because they are the easiest to compute. Chi-square is also used, of course, but less often. Chi-square is really a measure for comparing two distributions: roughly speaking, it measures what needs to be done with one distribution in order to bring it to the form of the second (if we explain it on our fingers). The formula will come later.
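For reference, a minimal sketch of several of these comparison functions, assuming the histograms are already normalized to sum to one (the small epsilon in the chi-square denominator only guards against division by zero):

```python
import numpy as np

def l1(h1, h2):
    """Sum of absolute bin-wise differences."""
    return np.abs(h1 - h2).sum()

def l2(h1, h2):
    """Euclidean distance between the histograms."""
    return np.sqrt(((h1 - h2) ** 2).sum())

def intersection(h1, h2):
    """Histogram intersection: equals 1 for identical normalized histograms."""
    return np.minimum(h1, h2).sum()

def chi_square(h1, h2, eps=1e-10):
    """Chi-square distance between two distributions."""
    return 0.5 * (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()

h1 = np.array([0.1, 0.4, 0.5])
h2 = np.array([0.2, 0.3, 0.5])
print(l1(h1, h2), intersection(h1, h2), chi_square(h1, h2))
```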

Another way to describe color, not very widely adopted, I must say (I do not know why), is the so-called statistical model, which suggests describing the distribution of color in the picture not with a histogram but with moments, i.e. statistical moments. We imagine the color as the distribution of a random variable and compute the expectation, the variance, covariances and higher-order moments, and then build the descriptor from them. (An indistinct question from the audience. Answer: no, it is simply discrete. We have the set of observations that we see in our image, so we get a discrete distribution.)
Accordingly, the feature vector is constructed simply as the set of these moments, and in particular in the article in which this model was proposed (and by those who have used it since), L1 was used to compare the vectors.
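A minimal sketch of such a descriptor, assuming an RGB image and keeping only the first three moments (mean, standard deviation, skewness) of each channel, which is one common variant of a color-moments feature; the exact set of moments used in the original article may differ:

```python
import numpy as np

def color_moments(image):
    """image: H x W x 3 array. Returns a 9-dimensional feature vector:
    mean, standard deviation and skewness of each color channel."""
    feats = []
    for c in range(image.shape[2]):
        channel = image[:, :, c].astype(np.float64).ravel()
        mean = channel.mean()
        std = channel.std()
        third = ((channel - mean) ** 3).mean()
        skew = np.sign(third) * abs(third) ** (1.0 / 3.0)   # cube root, keeping the sign
        feats.extend([mean, std, skew])
    return np.array(feats)

# Vectors built this way are typically compared with L1, as mentioned above.
```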

Among the popular distance functions for histograms, as I have already said, are the histogram intersection, which is defined like this (it is in fact equivalent to L1: we simply compute the difference of the values and get roughly the same thing), and chi-square, which is computed as follows. It is probably not worth memorizing; it is just here so you have it at hand.
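The slide with the formulas is not reproduced here, so, for reference, the standard definitions for normalized histograms H and K with bins indexed by i (a common convention; the slide may normalize slightly differently):

$$d_{\cap}(H, K) = 1 - \sum_i \min(H_i, K_i), \qquad \chi^2(H, K) = \frac{1}{2} \sum_i \frac{(H_i - K_i)^2}{H_i + K_i}.$$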

What difficulties can be encountered when computing a histogram, and what are its disadvantages? So far I have only told you how good histograms are. First, you need to decide how to partition the color space into bins. Even though the image is discrete, and color is discrete in the digital representation anyway, if we take an ordinary full-color image with 16 million colors, then the histogram has 16 million bins, i.e. it is a vector 16 million entries long. This is not very convenient: storing such feature vectors is hard and requires a large amount of space, and comparing them is expensive.
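One common remedy (a sketch of the general idea, not necessarily the exact quantization used further in the lecture) is to quantize each channel into a small number of levels, so that the histogram has, say, 8 x 8 x 8 = 512 bins instead of 16 million:

```python
import numpy as np

def quantized_color_histogram(image, levels=8):
    """image: H x W x 3 uint8 array. Quantizes each channel into `levels` values
    and returns a normalized joint histogram of length levels**3 (512 by default)."""
    q = (image.astype(np.int64) * levels) // 256          # per-channel bin index, 0..levels-1
    bin_index = (q[:, :, 0] * levels + q[:, :, 1]) * levels + q[:, :, 2]
    hist = np.bincount(bin_index.ravel(), minlength=levels ** 3)
    return hist / hist.sum()
```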

[The remainder of the transcript, covering the other drawbacks of histograms, Earth mover's distance, the spatial arrangement of colors (including grid-based approaches), adjacency matrices, texture features, shape features, and the conclusion, is not preserved in this copy.]
Source: https://habr.com/ru/post/255627/

