
Make3D from one photo, part 1



The "Make3D" project from Stanford University (now at Cornell University) is remarkable for having set itself a task that was not yet typical: reconstructing a three-dimensional model of a scene from just one photograph. Until now, to achieve a similar result, developers recovered three-dimensional information by combining several (two or more) images of the same object. Here it was demonstrated that a significant amount of information is contained in the monocular cues of the image itself, which had often been ignored before. In practice, the implementation already achieves satisfactory results for more than 60% of arbitrary photographs submitted and evaluated by third-party users during tests of the system.

The publication consists of: Part 1, Part 2
It is published to satisfy curiosity and to expose the magic, so that it becomes clear how it works.


Content:


Part One (current); Part Two (here)

Brief description of the algorithm


The image is divided into superpixels: small segments of the image that are uniform in texture. Homogeneous objects and even flat surfaces are, as a rule, split into several superpixels; nevertheless, a superpixel lies entirely within a single object or surface, provided the surface has a pronounced boundary.

For each superpixel, the algorithm tries to determine its position (depth) and orientation in three-dimensional space relative to the point from which the photo was taken, i.e. to find the surface (plane) on which it lies. This surface can have any placement in space, not necessarily horizontal or vertical, as was assumed in similar earlier studies.

The project uses prior training on the relationships between sets of superpixel (image patch) features and the distances to them. The learning algorithm uses the MRF model (Markov random field), taking into account constraints on the relative distances between adjacent superpixels: with some probability, two neighboring patches are more likely to lie at approximately the same distance from the observation point, or even on the same plane, than to belong to different objects far apart in space (such as a fence and the background behind it).

After processing an image segmented into superpixels with a pre-trained MRF, the algorithm outputs the position and orientation of each superpixel. This is enough to build a three-dimensional model of the scene, textured with the photograph itself.

Pic.1 Result of photo processing in Make3D



List of used theories and algorithms


In the method of restoring a three-dimensional model of the scene, the following theories and algorithms are used:


The Ising model, or where the formula in the MRF came from


You can skim through this section quickly; it is just an explanation of why the formula in the MRF, and in "Make3D", looks the way it does. In short, it happened historically ...

The Ising model served as the prototype of the Markov random field, so let us start with it. It is a mathematical model of statistical physics designed to describe the magnetization of a material.

In the physical Ising model, each vertex of the crystal lattice is assigned a number called spin, taking the value "+1" or "−1", which determines the spin direction: "up" or "down". Spin is the intrinsic angular momentum of an elementary particle; it has a quantum nature and is not associated with the motion of the particle as a whole.

Pic.2 One-dimensional Ising model



Let us consider the simplest, one-dimensional case, when the lattice sites lie along one straight line. The model describes a probability distribution on the space Ω of all possible configurations of the substance ω = (ω_0, ω_1, ..., ω_n), where n is the number of elementary particles and ω_i ∈ {+1, −1} is the spin of the i-th particle. Such a distribution is called a random field: a random function defined on a set of points of a multidimensional space. P(ω) is the probability that the substance is in the state where the nodes have exactly the spins ω = (ω_0, ω_1, ..., ω_n).

Each of the 2^n possible spin configurations ω is assigned an energy resulting from the pairwise interaction of the spins of neighboring atoms and from the external field:

The total energy of the system in the Ising model:

U(ω) = −J ∑_(neighbors i,j) ω_i ω_j − mH ∑_i ω_i


In the formula, the first sum is taken over all pairs of neighboring nodes and equals the interaction energy of their spins. Ising simplified the model by taking into account the interaction only between neighboring nodes located in close proximity. The exchange interaction constant J is a characteristic of the material under consideration and describes the interaction energy of spins. If J > 0, it describes the attraction energy of a ferromagnet (the attractive case), when neighboring lattice sites tend to line up in one direction, i.e. collinearly; values J < 0 describe the repulsion energy of an antiferromagnet (the repulsive case), when neighboring lattice sites tend to line up in opposite, antiparallel directions. The second sum describes the effect of an external magnetic field of intensity H, where m characterizes the magnetic properties of the substance.

When J > 0, the first sum is minimal when all the nodes of the crystal lattice are aligned collinearly. The second sum reaches its minimum when the direction of the spins of the lattice nodes coincides with the direction of the external field H.
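To make the formula concrete, here is a small Python sketch (our illustration, not part of the original project) that evaluates the total energy of a one-dimensional spin chain with nearest-neighbor interaction and an external field:

```python
import numpy as np

def ising_energy(spins, J=1.0, m=1.0, H=0.0):
    """Total energy of a 1-D Ising chain.

    First term:  pairwise interaction of neighboring spins, -J * s_i * s_{i+1}.
    Second term: coupling to the external field H, -m * H * s_i.
    """
    spins = np.asarray(spins)
    interaction = -J * np.sum(spins[:-1] * spins[1:])
    field = -m * H * np.sum(spins)
    return interaction + field

# With J > 0 the aligned configuration has the lowest interaction energy:
print(ising_energy([+1, +1, +1, +1]))  # -3.0
print(ising_energy([+1, -1, +1, -1]))  # 3.0
```

As expected, the aligned chain sits at the energy minimum, while alternating spins pay the full interaction penalty.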

We know from physics that any system brought out of equilibrium tends toward the state with the least energy. Therefore, sooner or later, the model will "calm down" and settle into a position in which the total energy U(ω) takes its smallest value (min). This will be its most likely state (P(ω) is max).

The probability distribution on the space Ω of all possible configurations of the substance in the Ising model is given by the distribution function known as the Gibbs measure:

Distribution in the Ising model:

P(ω) = e^(−U(ω)/kT) / Z


where k is the Boltzmann constant, T is the temperature of the substance, and Z = ∑_ω e^(−U(ω)/kT) is the normalization constant.
We do not need these particular values; what matters is that the formula has this form and has been in use since around the turn of the 20th century.

For a more convenient form of the probability distribution in the model, we associate with each i-th lattice node its own (personal) total energy:

Own energy of the Ising model node:



The Gibbs distribution function (Gibbs measure) is interesting from two standpoints. The first is related to entropy, which probably led to its widespread use in statistical research.

Pic.3 Two-dimensional Ising model



The second property of a distribution of this form, important from the standpoint of probability theory, is the following. Let N_j be the set of all neighbors of node j, i.e. the nodes in its vicinity. Then:

Markov-type property:

P(ω_j = a | ω_k, k ≠ j) = P(ω_j = a | ω_k, k ∈ N_j)

This property is called the Markov-type property. The probability that the spin at node ω_j takes the value a, given the spins at all other nodes of the crystal lattice, equals the probability in which only the spins at the neighboring nodes ω_k, k ∈ N_j are taken into account. This property is called a local characteristic, and a distribution with such a property is called a Markov random field.

The Markov network was introduced as a generalization of the Ising model, and in it the probability distribution function of ω can be written in multiplicative form, as a product over the individual energies of the lattice nodes:

Model in multiplicative form:

P(ω) = (1/Z) ∏_i e^(−U_i(ω)/kT)

In practice, the main purpose of applying a Markov random field is to find the most likely configuration ω* of the random field ω: ω* = argmax_ω P(ω).
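For a tiny chain, ω* can be found by brute force. The sketch below (ours, with arbitrary J, m, H values) enumerates all 2^n configurations and picks the one with the highest Gibbs probability, which is the same as minimizing U(ω):

```python
import itertools
import math

def gibbs_argmax(n, J=1.0, m=1.0, H=0.5, kT=1.0):
    """Find the most likely configuration w* = argmax_w P(w) by brute force.

    P(w) = exp(-U(w)/kT) / Z, so maximizing P(w) is equivalent to
    minimizing the total energy U(w).  Feasible only for small n (2^n states).
    """
    def energy(w):
        pair = -J * sum(w[i] * w[i + 1] for i in range(len(w) - 1))
        field = -m * H * sum(w)
        return pair + field

    states = list(itertools.product([+1, -1], repeat=n))
    Z = sum(math.exp(-energy(w) / kT) for w in states)          # normalization
    return max(states, key=lambda w: math.exp(-energy(w) / kT) / Z)

print(gibbs_argmax(4))  # spins align with the positive field: (1, 1, 1, 1)
```

Of course, real MRF inference cannot enumerate states; approximate or convex optimization is used instead, which is exactly what part 2 discusses for "Make3D".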

The main difficulty associated with using this method is the correct choice of the parameters controlling the strength of the spatial interaction.

Now we have a formula; the MRF model for "Make3D" will be built on its basis ...




Effective image segmentation on graphs


A picture is worth a thousand words:

Segmentation example with various parameters



These large monochrome patches are superpixels, in the terminology of the authors of "Make3D". More details are available, in accessible form, in the article "Efficient Graph-Based Image Segmentation".


We want a 3D model and the placement of polygons


We want to end up with a 3D model. The recovered three-dimensional model is represented by a set of polygons and a texture, i.e. parts of the original photograph mapped onto these polygons. A polygon can have an arbitrary position and orientation in space.

Pic.4 Superpixel position and orientation



More formally, the three-dimensional placement of a polygon is determined by a plane with parameters α∈R^3. For any point q∈R^3 belonging to this plane α, the equality α^T q = 1 holds. The shortest distance from the observation point (the point from which the photograph was taken) to the plane, along the perpendicular dropped onto it, is 1/|α|. The unit normal vector α̂ = α/|α| uniquely determines the orientation of the plane in space.

If we are given a unit vector R_i (called the ray R_i) connecting the observation point (camera center) with an arbitrary pixel i of the image belonging to the plane α, the distance from the observation point to this pixel i is given by the formula d_i = 1/(R_i^T α). The distance to a superpixel is determined in the same way.
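These two relations are easy to check numerically. The helper below is an illustrative sketch (the name `depth_along_ray` is ours, not Make3D's API):

```python
import numpy as np

def depth_along_ray(alpha, ray):
    """Distance d_i = 1 / (R_i^T alpha) from the camera center to the
    intersection of the unit ray R_i with the plane {q : alpha^T q = 1}."""
    ray = np.asarray(ray, dtype=float)
    ray = ray / np.linalg.norm(ray)          # rays are unit vectors
    return 1.0 / (ray @ np.asarray(alpha))

# The plane z = 2 has alpha = (0, 0, 0.5): alpha^T q = 1 for any q with q_z = 2.
alpha = np.array([0.0, 0.0, 0.5])
print(1.0 / np.linalg.norm(alpha))           # perpendicular distance 1/|alpha|: 2.0
print(depth_along_ray(alpha, [0, 0, 1]))     # looking straight along the normal: 2.0
print(depth_along_ray(alpha, [0, 1, 1]))     # an oblique ray travels farther: ~2.83
```

An oblique ray reaches the plane later than the perpendicular one, which matches the geometric picture in Pic.4.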



You have to look wider ...


A person distinguishes the monocular cues of an image well; they help to reconstruct a three-dimensional model of the scene and to estimate distances to objects. At a single glance at a photograph, a person can notice differences in image texture, color gradients, variations in color, fuzzy contours, and the defocus of distant objects.

However, local monocular cues alone are still not enough, because the blue sky and a blue object on the ground have similar local monocular cues. Therefore, when reconstructing a scene, a person takes into account not only local features but also the relative relationships between various parts of the image. Thus, a person can determine the distance to an object with weakly expressed features, say a uniformly lit large gray area, relative to other objects in the photograph, and determine that it is a wall, and even roughly estimate its length.

Therefore, the MRF model takes into account not only local monocular cues but also the relative relationships between different parts of the image:
These properties are taken into account by the model in aggregate, with a "confidence level" (confidence) imposed on each of them, i.e. a measure of how much we can trust the computed result. The confidence level varies depending on the local features of the image region.

But before turning to the MRF model ... what did the authors find so interesting in just one photo?



Image features


For each superpixel, the algorithm computes a tuple of features in order to capture the monocular cues mentioned above, as well as features for detecting the boundaries and folds of superpixels (objects in the image). To make the algorithm more robust and flexible, so that it can reconstruct the scene from an arbitrary photograph, not necessarily similar to those the model was trained on, a relatively large number of features is computed.

Monocular image features


For each superpixel i, statistical features related to image texture, shape, and position are computed. The "Make3D" project uses 17 filters F_n(x,y), where n = 1, ..., 17:

Pic.5 Masks used in Make3D


Pic.6 Two channels of color Cb and Cr



As a result of applying the filters to superpixel i of the image I(x,y), a 34-dimensional feature vector is formed, computed as E_i(n) = ∑_((x,y)∈S_i) |I(x,y)*F_n(x,y)|^k, where k = {2,4}, corresponding to the energy and the kurtosis. This gives 34 values for each superpixel. Additionally, for superpixel i, 14 features are computed that relate to its position in the photograph and its shape, such as:
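As a rough illustration of how such energy features could be computed (this is our sketch with a single stand-in smoothing mask instead of the real 17 filters; the helper name is ours):

```python
import numpy as np

def filter_energy(image, filt, mask, k=2):
    """E_i(n) = sum over the superpixel's pixels of |I * F_n|^k, k in {2, 4}.

    `mask` is a boolean array marking the pixels of superpixel S_i;
    the filter response is a simple 'same'-size correlation with edge padding.
    """
    h, w = filt.shape
    pad = np.pad(image, ((h // 2, h // 2), (w // 2, w // 2)), mode="edge")
    response = np.zeros_like(image, dtype=float)
    for y in range(image.shape[0]):
        for x in range(image.shape[1]):
            response[y, x] = np.sum(pad[y:y + h, x:x + w] * filt)
    return np.sum(np.abs(response[mask]) ** k)

# Toy example: one averaging mask on a tiny image.
image = np.arange(16, dtype=float).reshape(4, 4)
filt = np.full((3, 3), 1.0 / 9.0)        # smoothing mask (stand-in for F_n)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                    # a 2x2 "superpixel" S_i
features = [filter_energy(image, filt, mask, k) for k in (2, 4)]
print(features)                          # [energy term, kurtosis term]
```

Per superpixel, the real system collects such a pair for each of the 17 filters, giving the 34 values mentioned above.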

Pic.7 Features superpixel location



The "Make3D" project takes into account not only the features of an individual superpixel but also its "context", in the form of the features of its 4 largest neighboring superpixels, and this operation is repeated at 3 different scales. Thus, each superpixel additionally carries information about the image around it, which leads to better results than restricting ourselves to the local features of the superpixel. In the end, each superpixel is associated with a vector of 34 * (4 + 1) * 3 + 14 = 524 elements, i.e. x∈R^524.

Pic.8 The features of each superpixel are considered in context.


Pic.9 Highlighting image features at different scales.


Pic.10 At one level, each superpixel is connected to its 4 largest neighbors




Borders and refractions


To define boundaries, parameters y_ij∈{0,1} are used, describing the border between two superpixels (an "edgel"). Since boundary detection is usually inaccurate, the model uses not discrete values but "softer" conditions with y_ij∈[0,1]. More formally, the equality y_ij = 0 means the presence of an occlusion boundary or a fold between superpixels i and j, and y_ij = 1 means there is none, i.e. the superpixels lie on one flat surface.

Pic.11 Edge detection: edgels


Pic.12 Edge detection


In some cases, a large gradient or a difference in lighting, which edge-detection algorithms respond to, is not necessarily a sign that these parts of the image belong to different objects. For example, the shadow a building casts on a lawn can produce such a difference, and edge-detection algorithms may mistakenly detect a boundary and split the lawn into 2 objects. Nevertheless, if, to find out whether there are boundaries between objects and whether they are connected or coplanar, we take more visual cues of the image into consideration, and not just gradients, then we can get better results.

In that work, an approach to edge detection was proposed that uses training and takes into account the lighting, color, and texture of image regions. Building on their result, in "Make3D" the value y_ij, which determines the presence of a border between superpixels i and j, is drawn from a model with the following distribution:


If two adjacent superpixels in the image have drastically different features, then a person will most likely also perceive them as two different objects. Thus, the boundary between two superpixels with distinctly different features most likely represents an occlusion boundary or a fold (unfold). To find out how much the features of two superpixels i and j differ, "Make3D" carries out 14 different image segmentations: at 2 different scales and with 7 different sets of segmentation parameters, to capture texture, color, and boundary features.

Pic.13 An example of a single image segmentation with different parameters.



Pic.14 A 14-dimensional vector of superpixel differences for the different segmentations.


In the 14-dimensional vector ϵ_ij obtained after segmentation, each element ϵ_ij,k is a flag: did superpixels i and j fall into the same segment in the k-th segmentation. After all, two superpixels that ended up in the same segment in all 14 cases are most likely coplanar or connected. Conclusions about the presence of boundaries between superpixels, made on the basis of multiple segmentations, are more reliable than those based on a single segmentation. This vector ϵ_ij is an input parameter for determining the "soft" boundary y_ij∈[0,1] between superpixels i and j.
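The construction of ϵ_ij can be sketched as follows (a toy with 3 segmentations instead of 14; `cosegmentation_vector` is a hypothetical helper of ours, not Make3D code):

```python
import numpy as np

def cosegmentation_vector(seg_labels, i_pixel, j_pixel):
    """Build the vector eps_ij: eps_ij[k] = 1 if superpixels i and j
    fell into the same segment in the k-th segmentation, else 0.

    `seg_labels` is a list of label maps (one per segmentation run);
    `i_pixel` / `j_pixel` are representative (row, col) pixels of the
    two superpixels.
    """
    eps = []
    for labels in seg_labels:
        same = labels[i_pixel] == labels[j_pixel]
        eps.append(1 if same else 0)
    return eps

# Three toy "segmentations" of a 2x2 image instead of the real 14:
segs = [np.array([[0, 0], [1, 1]]),     # i and j land in the same segment
        np.array([[0, 1], [0, 1]]),     # i and j are split apart
        np.array([[0, 0], [0, 0]])]     # everything in one segment
print(cosegmentation_vector(segs, (0, 0), (0, 1)))  # [1, 0, 1]
```

A vector of all ones would suggest the pair is coplanar or connected; many zeros suggest a genuine boundary.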

Pic.15 Parameter y (right) indicating the presence or absence of a border





MRF Input and Output


Now we can sum up the intermediate results: 5 types of parameters are involved in the MRF model. Some are inputs; the calculation also takes conditional (conditioned) parameters into account; the remaining parameters are computed in the MRF.
Pic.16 The MRF model combines all the features of the image to find the 3D structure





What will the MRF do? Objective: min(distance error)


When constructing a three-dimensional model of the scene, the relative error in determining distance is the most significant. For the true (ground-truth) depth d and the model's estimate d̂, the relative error is defined as (d̂ − d)/d = d̂/d − 1. Thus, the MRF model will be built so as to minimize this value.
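In code, the error measure is one line; a minimal sketch (our illustration, with a hypothetical helper name):

```python
def relative_depth_error(d_true, d_est):
    """Fractional depth error (d_est - d_true) / d_true = d_est/d_true - 1,
    the quantity the MRF is trained to minimize."""
    return d_est / d_true - 1.0

print(relative_depth_error(10.0, 12.0))   # overestimating 10 m as 12 m: error ~0.2
print(relative_depth_error(10.0, 8.0))    # underestimating 10 m as 8 m: error ~-0.2
```

Dividing by the true depth makes a 2 m mistake on a nearby wall count far more than the same 2 m on a distant hill, which matches how depth errors are perceived.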

Continued: here

Source: https://habr.com/ru/post/95541/

