
Upgrading Viola-Jones

In my previous post I tried to show how the Viola-Jones method works and which technologies and internal algorithms it relies on. To keep the chain unbroken, this post also contains a lot of theory: it shows how this already elegant method can be improved. If the software implementation were described here as well, the result would be a huge wall of text that would be very inconvenient to read, so it was decided to split the material into two separate posts. Below is theory: few pictures, but plenty of useful things.


Problems at the stages of emotion recognition


A person does not even notice how easily they cope with detecting faces and emotions by sight alone. Looking at the surrounding faces, objects, and scenery, one does not consciously feel how much work the brain does to process the entire flow of visual information. It takes no effort to find a familiar person in a photograph or to tell a snide grimace from a smile.
When we try to recreate this ability and build a computer system for detecting faces and emotions, we partly succeed, but every time we run into serious recognition problems. Computers today can easily store huge amounts of information: pictures, video, and audio files. But finding, with the same ease, the right photo with a certain emotion of the right person in a personal photo gallery is a difficult task for a computing system. Several factors stand in the way of solving it:

The list could be continued for a long time, but the focus here is on the most important points, so there is no sense in enumerating every interfering factor.

Criteria for choosing a method for the emotion recognition task


Comparing the recognition quality of different methods is complicated for many reasons. The most significant one is that in most cases you can only rely on the test data reported by the authors themselves, since large-scale research that implements most of the known methods and compares them to each other on a single set of images is not feasible:

The main points that influence the choice of method and the conditions of the criteria turned out to be the following (the italicized options are the ones considered optimal during development):
• restrictions on individuals:
- a limited, predetermined set of people;
- restrictions on the possible race of people or on characteristic facial differences (for example, facial hair, glasses);
- no restrictions .

• face position on the image:
- frontal;
- tilt at a known angle;
- any slope .

• image type:
- color;
- black and white;
- any .

• face scale:
- fully in the "main frame";
- localized in a certain place with a certain size;
- it doesn't matter .

• resolution and image quality:
- poor / poor;
- poor / good;
- good only;
- any .

• noise, interference, moire:
- weak;
- strong;
- it doesn't matter .

• number of faces in the image:
- known;
- approximately known;
- unknown .

• lighting:
- known;
- approximately known;
- any .

• background:
- fixed;
- contrasting and uniform;
- low-contrast and noisy;
- unknown;
- no difference .

• importance of research:
- it is more important to find all faces and features ;
- it is more important to minimize false detections, even at the cost of fewer detections in the image.

Suggested improvements


The Viola-Jones method sometimes produces startling errors, as several Habr users have noted. The most logical step in this situation is therefore to modify the Viola-Jones algorithm and improve its recognition characteristics where possible.
The method proposed below makes it possible to get rid of:

Taking the above into account, this way of working should also reduce the time needed to determine emotions.
The scheme of the new method is presented in the figure below:


Additional Haar primitives


So, to extend the algorithm's capabilities and improve the quality of finding tilted facial features, I propose the new feature types shown in the figure. These features were not chosen at random but to match the silhouettes of the features being sought. With them the program will search much faster for the nose, eyebrows, and mouth, while spending slightly more time on other facial features than the standard algorithm; still, the additional features are worth implementing: overall only milliseconds of recognition time are at stake, but accuracy increases.
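The figure with the new primitives is not reproduced here, but the way any rectangular Haar-like feature is evaluated stays the same: with an integral image, the sum over any rectangle costs four array lookups, so each feature is computed in constant time. Below is a minimal NumPy sketch of that bookkeeping; the two-rectangle layout and all names are illustrative, not the proposed tilted primitives (those could be handled analogously, for example with a rotated integral image).

```python
import numpy as np

def integral_image(gray):
    """Integral image with a zero guard row/column: ii[y, x] = sum of gray[:y, :x]."""
    ii = gray.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the rectangle (x, y, w, h) via four integral-image lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, x, y, w, h):
    """Classic horizontal two-rectangle Haar feature: left half minus right half."""
    left = rect_sum(ii, x, y, w // 2, h)
    right = rect_sum(ii, x + w // 2, y, w // 2, h)
    return left - right

# Illustrative usage on a random 24x24 detection window
window = np.random.randint(0, 256, (24, 24)).astype(np.float64)
ii = integral_image(window)
print(two_rect_feature(ii, 4, 4, 12, 8))
```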

Preliminary image processing


The algorithms the Viola-Jones method uses are quite widely applied. The method itself interacts well with other algorithms and can be adapted to specific needs and requirements. It also works not only with static images: data can freely be processed in real time. Even so, it does not provide perfect emotion recognition. Let us try to push the recognition rate even higher while staying with Viola-Jones. For this purpose, to increase performance, it was decided to perform image pre-processing.
To create a fast and reliable way of determining the likely areas of a person's face and thereby speed up the later detection stages, an algorithm for detecting and emphasizing boundary contours is proposed. The idea of this approach is to highlight the areas in which a face and its features are most likely to be found. This not only accelerates the algorithm but also reduces the probability of false face detections. The steps will be illustrated on the original image:


Grayscale image

First the image has to be converted to grayscale. For this it is convenient to represent the image in the YUV color model, where the brightness component is obtained as
Y = 0.299R + 0.587G + 0.114B,
where R, G, and B are the intensities of the corresponding colors (strictly speaking, the matrices describing the components of the (R, G, B) model). Y in the formula is the brightness component, and U and V are the color-difference components, the so-called chroma signals (these three parameters are likewise matrices describing the components of the (Y, U, V) model). The conversion coefficients are constants determined by the characteristics of human perception. For a halftone (grayscale) image only the value of the first component matters.
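As a sketch, the conversion to the Y component could look like this in NumPy (assuming an 8-bit image with channels ordered R, G, B):

```python
import numpy as np

def to_grayscale(rgb):
    """Weighted sum of the R, G, B planes with the standard luma coefficients."""
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    return np.clip(y, 0, 255).astype(np.uint8)
```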


Spatial Filtering

The converted halftone image is smoothed and spatially differentiated: the gradient of the intensity function is computed at each point of the image. The assumption is that regions correspond to real objects or their parts, and region boundaries correspond to the boundaries of real objects. Brightness discontinuities are detected with a sliding mask (sliding window). In different textbooks and articles it goes by different names, usually a filter or a kernel: a square matrix applied to the Y component of the original image. The spatial filtering scheme is shown below:


Sobel operator

The applied method uses special kernels known as "Sobel operators", which operate on a 3×3 image neighborhood and use a weighting factor of 2 for the middle elements. The kernel coefficients are chosen so that applying the kernel simultaneously smooths in one direction and computes the spatial derivative in the other. The masks used by the Sobel operator are shown below:

From them we obtain the gradient components Gx and Gy:

Gx = (z7 + 2z8 + z9) − (z1 + 2z2 + z3), (1.2)

Gy = (z3 + 2z6 + z9) − (z1 + 2z4 + z7), (1.3)

where z1…z9 are the pixels of the 3×3 neighborhood, read row by row. To calculate the magnitude of the gradient, the two components are combined:

G = sqrt(Gx² + Gy²), (1.4)

or, as a computationally cheaper approximation,

G ≈ |Gx| + |Gy|. (1.5)
The result shows how "sharply" or "smoothly" the brightness changes at each point of the image, and hence the probability that the point lies on a face, as well as the orientation of the border. At a point inside a region of constant brightness the Sobel operator returns the zero vector, and at a point on the boundary between regions of different brightness it returns a vector crossing the boundary in the direction of increasing brightness. The result of this step is also the conversion of the photo into boundary contours (visible in the transformed picture):
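A possible NumPy/SciPy sketch of formulas 1.2–1.5 (the masks and the magnitude approximation are the standard Sobel ones; all names are illustrative):

```python
import numpy as np
from scipy.ndimage import correlate

# Masks matching formulas 1.2 and 1.3 (z1..z9 read row by row)
GX_MASK = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=np.float64)   # bottom row minus top row
GY_MASK = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)     # right column minus left column

def sobel_gradient(gray):
    """Gradient components Gx, Gy (formulas 1.2, 1.3) and the magnitude (formula 1.4)."""
    gray = gray.astype(np.float64)
    gx = correlate(gray, GX_MASK, mode="nearest")
    gy = correlate(gray, GY_MASK, mode="nearest")
    magnitude = np.sqrt(gx ** 2 + gy ** 2)          # formula 1.4
    # cheaper approximation (formula 1.5): np.abs(gx) + np.abs(gy)
    return gx, gy, magnitude
```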

Threshold classification

Next, the same halftone image is subjected to threshold classification (thresholding), i.e., a brightness threshold is chosen. The point of the threshold is to separate the desired light object (foreground) from the dark background, where the object is the set of pixels whose brightness exceeds the threshold (I > T) and the background is the set of the remaining pixels whose brightness is below the threshold (I < T). An approximate algorithm for working with a global threshold in automatic mode looks like this:
  1. An initial estimate of the threshold T is chosen; this may be the average brightness level of the image (half the sum of the minimum and maximum brightness, (min + max)/2) or any other threshold criterion;
  2. The image is segmented using the threshold T; as a result two groups of pixels are formed: G1, consisting of pixels with brightness greater than T, and G2, consisting of pixels with brightness less than or equal to T;
  3. The average brightness values μ1 and μ2 of the pixels in groups G1 and G2 are calculated;
  4. A new threshold value T = 0.5 * (μ1 + μ2) is calculated;
  5. Steps two through four are repeated until the difference between the previous threshold T and the T computed in the last iteration is less than a pre-specified parameter ε.
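A direct transcription of these five steps might look as follows (the stopping parameter ε and the use of (min + max)/2 as the initial estimate are assumptions):

```python
import numpy as np

def global_threshold(gray, eps=0.5):
    """Iterative global threshold: split into two groups, average their means, repeat."""
    gray = gray.astype(np.float64)
    t = (gray.min() + gray.max()) / 2.0           # step 1: initial estimate
    while True:
        g1 = gray[gray > t]                       # step 2: pixels brighter than T
        g2 = gray[gray <= t]                      # ... and the rest
        mu1 = g1.mean() if g1.size else t         # step 3: group means
        mu2 = g2.mean() if g2.size else t
        t_new = 0.5 * (mu1 + mu2)                 # step 4: new threshold
        if abs(t_new - t) < eps:                  # step 5: stop when the change is small
            return t_new
        t = t_new
```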


Otsu binarization

The Sobel operator alone cannot do this, since it produces a lot of noise and interference, so the halftone image must also be binarized. There are dozens of methods for choosing the threshold, but the one used here is the method invented by the Japanese scientist Nobuyuki Otsu in 1979 [1], because it is fast and effective. The brightness values of the image pixels can be viewed as random variables and their histogram as an estimate of the probability density distribution. If the probability density distributions are known, then the optimal (in the sense of minimum error) threshold for segmenting the image into two classes c0 and c1 (objects and background) can be determined. A histogram is a set of bins (columns), each of which counts the number of sample elements that fall into it; in the case of a halftone image these are pixels of different brightness, which take integer values from 0 to 255. Thanks to the histogram a person can see two clearly separated classes. The essence of the Otsu method is to place the threshold between the classes so that each of them is as "dense" as possible. In mathematical terms this comes down to minimizing the within-class variance, defined as the weighted sum of the variances of the two classes (that is, of the deviations from the mathematical expectations of these classes):
σw² = w1 * σ1² + w2 * σ2², (1.6)
where σw² is the within-class variance, σ1² and σ2² are the variances, and w1 and w2 are the probabilities of the first and second classes respectively. In the Otsu algorithm, minimizing the within-class variance is equivalent to maximizing the between-class variance, which equals:
σb² = w1 * w2 * (μ1 − μ2)², (1.7)
where σb² is the between-class variance, w1 and w2 are again the class probabilities, and μ1 and μ2 are the mean brightness values of the two classes. The total variance is expressed by the formula
σT² = σw² + σb². (1.8)
The general scheme of the fast algorithm is as follows:
  1. Compute the histogram (one pass through the pixel array). After that only the histogram is needed; no further passes through the whole image are required.
  2. Starting from the threshold t = 1, walk through the whole histogram, recomputing the between-class variance σb(t) at each step. If at some step the variance exceeds the current maximum, the maximum is updated and T = t.
  3. The required threshold is T.
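A compact sketch of this histogram-only variant is shown below; in practice the same result can also be obtained with OpenCV's cv2.threshold using the THRESH_OTSU flag.

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold that maximizes the between-class variance (formula 1.7)."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist.astype(np.float64) / hist.sum()      # probability of each brightness level
    best_t, best_sigma_b = 0, 0.0
    for t in range(1, 256):
        w1, w2 = p[:t].sum(), p[t:].sum()         # class probabilities
        if w1 == 0 or w2 == 0:
            continue
        mu1 = (np.arange(t) * p[:t]).sum() / w1   # class means
        mu2 = (np.arange(t, 256) * p[t:]).sum() / w2
        sigma_b = w1 * w2 * (mu1 - mu2) ** 2      # between-class variance, formula 1.7
        if sigma_b > best_sigma_b:
            best_sigma_b, best_t = sigma_b, t
    return best_t
```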

In a more careful implementation there are parameters that speed up the algorithm; for example, the pass through the histogram runs not from 1 to 254 but from the minimum to the (maximum − 1) brightness. The "classic" binarization is shown in the photo below:


The advantages of the Otsu method are:
Disadvantages of the method:

Canny Boundary Detector

So, on the basis of the previous image-processing steps we have the image edges and a split by threshold according to the brightness values. Now, to consolidate the result, the Canny edge detector has to be applied in order to find the edges of the objects in the image and finally link them into contours. In his research [2], John Canny constructed a filter that is optimal with respect to detection, localization, and minimizing multiple responses to a single edge.
The Canny algorithm solves the following problems:
  1. Suppression of "false" maxima: only some of the maxima are marked as boundaries;
  2. Subsequent double threshold filtering: potential boundaries are determined by thresholds, and the edges are thinned (edge thinning);
  3. Suppression of all ambiguous edges that do not belong to any region boundary, and linking the edges into contours (edge linking).

Canny's work uses the concept of suppression of "false" maxima (non-maximum suppression): a pixel is assigned to a border only if a local gradient maximum is reached at that point in the direction of the gradient vector. Using the Sobel operator we obtain the direction angle of the boundary vector (computed from the gradient components Gx and Gy of formulas 1.2 and 1.3):
β(x, y) = arctg(Gx / Gy), (1.9)
where β(x, y) is the angle between the direction of the vector ∇f at the point (x, y) and the y axis, and Gx and Gy are the gradient components. The direction of the vector must be a multiple of 45°, so the angle of the boundary vector is rounded to one of four values representing the vertical, the horizontal, and the two diagonals (for example 0, 45, 90, and 135 degrees). Then it is checked whether the gradient magnitude reaches a local maximum in the corresponding direction. Below is an example for a 3×3 mask:

After suppressing the local ambiguities, the edges become more accurate and thin. The result is a binary image containing the boundaries (the so-called "thin edges"). They can be seen in the converted image:

Edge selection in Canny's method uses two filtering thresholds: if a pixel's gradient value is above the upper threshold, it is accepted as an edge pixel; if it is below the lower threshold, it is rejected; pixels in between are accepted only if they are connected to an already accepted edge.
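A minimal sketch using OpenCV's ready-made implementation, with assumed threshold values (a lower-to-upper ratio of roughly 1:2 to 1:3 is a common rule of thumb; the file name is hypothetical):

```python
import cv2

gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
# 50 and 150 are the lower and upper thresholds of the double-threshold filtering
edges = cv2.Canny(gray, 50, 150)
cv2.imwrite("face_edges.png", edges)
```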

Combining the images

After the Canny detector, the image undergoes the morphological operation of dilation: the found boundaries are expanded and thickened as the structuring element is swept over them.
And finally, the last step is to subtract the resulting binarized image from the original. Roughly this is the image that should come out at the end for the sly cat and the tearful, startled child on which emotion recognition is being trained:
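A rough sketch of these last two steps with OpenCV (the 3×3 structuring element, the single dilation pass, the Canny thresholds, and the file names are all assumptions):

```python
import cv2

gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input image
edges = cv2.Canny(gray, 50, 150)                       # boundary map from the previous step

# Dilation: thicken the found boundaries by sweeping a small structuring element over them
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
thick_edges = cv2.dilate(edges, kernel, iterations=1)

# Subtract the thickened, binarized boundary map from the original image
result = cv2.subtract(gray, thick_edges)
cv2.imwrite("face_preprocessed.png", result)
```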


Conclusions


This approach is universal and extensible; if a sufficient number of cascade classifiers have been trained, the resulting detection method works very quickly and almost without errors.

Summing up:
The mechanism of the Viola-Jones algorithm was described in detail in the previous post; in this post I have described my proposed approaches to improving the efficiency of solving the problem of detecting human emotions. Naturally, this is not the only way to modify the method; you can try your own. With upgrades like these a theoretical improvement of the method is possible.
What remains is to show that this algorithm is consistent and applicable in practice. I will describe its implementation in my next post and illustrate it with examples. Thanks for your attention! I hope you found it interesting!

Bibliography


1. Otsu, N., "A Threshold Selection Method from Gray-Level Histograms," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 9, No. 1, 1979, pp. 62-66.
2. Canny, J., "A Computational Approach to Edge Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 8, No. 6, 1986, pp. 679-698.
3. Algorithms for image contour selection

Source: https://habr.com/ru/post/133909/

