
Looking for familiar faces

Hello

In this article I want to acquaint the reader with the identification task: to walk from the basic definitions to an implementation of one of the recent papers in the area. The result should be an application that can find the same people in photographs and, most importantly, an understanding of how it works.

The Bourne Identification (and not only the Bourne)



The identification task is similar to the classification task and historically grew out of it, when instead of determining the class of an object it became necessary to determine whether an object possesses a certain property or not. In the identification task the training set is a set of objects M = \{x_i\}_{i=1}^n, each of which either possesses the property A (A(x_i) = 1, \; i = \overline{1, n}) or does not. At the same time all objects belong to one class, and it is impossible to assemble a representative sample of objects that do not possess the property A.

For example, if you want to separate human faces from monkey faces, that is a classification task: there are two classes, each object can be assigned a class, and a representative sample of both classes can be collected. If you need to determine which person a face image belongs to, and these people form a finite fixed set, that is also a classification task.

Now imagine that you are developing an application that must identify a person from a photo of their face, that the set of people stored in its database is constantly changing, and that during use the application will naturally encounter people who were not in the training set. This is a real-world task that surprises no one today, but it can no longer be reduced to classification. So how do we solve it?

To be able to recognize a person, we must have seen them at least once; more precisely, we must have at least one of their photos or have memorized them as some representation. Then, when we are shown a new, previously unseen photo, we compare it with all the memorized representations and can answer: either we have already seen this person and can identify them, or this person has never met us before and we can only memorize them. The task stated above is thus reduced to the following: given two photos (p_1, p_2), determine whether they belong to the same person. In other words, whether they possess the property of belonging to one person:
A(p_1, p_2) = \begin{cases} 1 & \text{if } p_1 \text{ and } p_2 \text{ belong to one person} \\ 0 & \text{otherwise} \end{cases}



This is the definition of the identification task: given a training sample (in our example, a set of pairs of faces M = \{(p_i, p_j)\}_{i,j=1}^n), build an identifier F: R^n \to \{0, 1\} that can determine whether an object possesses the required attribute or not. But a discrete answer is boring; it is much more interesting to know the degree to which the attribute A is expressed, i.e. the probability p\big(A(x) = 1 \,\vert\, x\big).

Let us state the last problem from the example a bit more formally and outline a heroic plan for solving it, calling arxiv.org, Python, and Keras to our aid. A face photo is a matrix from R^{m \times n}. For two face photos we want to know the probability that they belong to one person. Probability is a real number from 0 to 1. So we are looking for a function F: R^{m \times n} \times R^{m \times n} \to [0, 1]. Since it is already 2017, we will look for it using machine learning methods. Our training set, however, will not be the set of pairs from the definition, but a set of photos of faces of different people, each labeled with the person it belongs to, exactly as for a classification task. These two sets are equivalent, but the second one is easier to work with.
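To see the equivalence concretely, here is a small sketch that turns a labeled photo collection into pairs with A = 1 and A = 0 (the names are illustrative):

from itertools import combinations

def make_pairs(photos, labels):
    # photos: list of face images, labels: subject id for each photo
    same, different = [], []
    for i, j in combinations(range(len(photos)), 2):
        pair = (photos[i], photos[j])
        if labels[i] == labels[j]:
            same.append(pair)        # A(p1, p2) = 1: one person
        else:
            different.append(pair)   # A(p1, p2) = 0: different people
    return same, different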

The Bourne Supremacy



What is the most important thing in solving a machine learning problem? You might think it is the ability to find an answer. No, the main thing is the ability to verify that answer. The function F(x, y) = \xi \sim U[0, 1], which returns random numbers from the interval [0, 1], is a perfectly valid solution to the problem, but in practice it is useless. How do we establish that without resorting to the "by eye" method? In the classification task we can look at the accuracy on the validation set, the proportion of correctly classified examples. For the identification task this metric is not applicable. So how do we objectively assess how well a solution fits our problem?

Let's introduce the notions of a target attempt and an imposter attempt. The first is an object x \in X for which it is known that p\big(A(x) = 1 \,\vert\, x\big) = 1, i.e. the object possesses the required property with probability 1 (in our problem, a pair of faces (p_1, p_2) belonging to one person), and the second, accordingly, is an object x \in X such that p\big(A(x) = 1 \,\vert\, x\big) = 0. Consider the sets T = \{x \in X \,\vert\, p\big(A(x) = 1 \,\vert\, x\big) = 1\} and I = \{x \in X \,\vert\, p\big(A(x) = 1 \,\vert\, x\big) = 0\}, which together form our validation set: T \cup I = \overline{M}. The considerations for choosing it are completely standard, as for any machine learning task: it must be representative, i.e. reflect all the required variability, and be of sufficient size. For example, if your face recognition system is meant to work under different lighting conditions, those conditions must be represented in the validation set (as in the training set, of course).

Now let's take the function F we have built and compute F(T) = \{F(t) \,\vert\, t \in T\}, the result of applying F to all target attempts. We get a set of real numbers from the interval [0, 1], called scores. These values are a measure of how well our solution works when nobody is trying to deceive it and a positive answer is expected. F(T) is a sample that we consider representative, so let's construct its empirical density, a histogram.
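A minimal sketch of how such an empirical density could be plotted (matplotlib; scores is an array of values such as F(T)):

import numpy as np
import matplotlib.pyplot as plt

def plot_score_density(scores, label, n_bins=50):
    # empirical density of a sample of scores
    density, edges = np.histogram(scores, bins=n_bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    plt.plot(centers, density, label=label)
    plt.xlabel('score')
    plt.ylabel('density')
    plt.legend()

Calling it for the target scores and, later, for the imposter scores puts both densities on one plot.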

Here is what it looks like for F(x, y) = \xi \sim U[0, 1]:

Distribution of target attempts for uniform f

You will agree there is little use in such an identifier: in half of the cases it guesses the answer, and in half it is wrong.

This distribution suits us better:

Distribution of target-attempts for good f

For target attempts, such a function is confident in the correct answer more often than not.

But considering the target distribution without the imposter distribution is meaningless. Let's perform the same operation for the imposters: build the density of F(I) and display it on the same plot. We get a picture like this:

target distribution and imposter attempts

It becomes clear that for imposter attempts our function will in most cases lean towards the correct answer. But these are still only visual observations; they do not give us any objective assessment.

Suppose a pair of images is fed to the input of our system. It can compute the probability that this pair is a target attempt. But an unambiguous answer is required of it: is this the same person or not, should he be let into the secure facility or not. Let's set some threshold d \in [0, 1]: if the score is below d, we answer negatively, otherwise positively. If d = 0, the system will never recognize anyone, and if d = 1, any two people will be considered the same person. The graph shows that our distributions are not separable, and d cannot be chosen so as to achieve perfect behavior in both cases. What happens, for example, if in the example above we set d = 0.5?

distribution of target and imposter attempts with a middle splitter

In how many cases will our system be wrong on target attempts? That is easy to count: \big\vert \{x \in F(T) \,\vert\, x < d\} \big\vert is the number of target attempts that will be incorrectly classified as imposter attempts. Similarly for the imposters. Now let's give these quantities names, FRR (False Rejection Rate) and FAR (False Acceptance Rate), and make them relative rather than absolute:

FRR = \frac{\big\vert \{x \in F(T) \,\vert\, x < d\} \big\vert}{\big\vert F(T) \big\vert}

FAR = \frac{\big\vert \{x \in F(I) \,\vert\, x > d\} \big\vert}{\big\vert F(I) \big\vert}


Now let's take some step \Delta d = \frac{1}{N}, compute the values of FRR and FAR at N points of the interval [0, 1] with this step, and display them on one plot:

far and frr

Now, for any chosen threshold, we can say what proportion of target attempts will be rejected and what proportion of imposter attempts will be accepted. Conversely, we can choose d based on the task. For example, for verification at a secure facility, where it is important not to let an outsider in, you obviously need a d that provides the lowest possible FAR. If you want your computer to recognize you and wish you good morning, and usually only you walk around the apartment, you can settle for a low FRR and a fairly high FAR: nothing bad happens if the computer greets someone else by your name.

Pay attention to the point of intersection of the graphs. The value in it is called EER (Equal Error Rate).

EER = FAR(d_{eer}), \quad \text{where } d_{eer} = \arg\min_d \big\vert FAR(d) - FRR(d) \big\vert

If we choose d = d_{eer}, the FAR and FRR values will be equal. The EER is the objective criterion we were after: it allows us to assess the quality of identification as a whole, being the average error on our validation set. One can also look at FAR at a fixed FRR and vice versa, but most often it is the EER that is used as the metric.

In the example above, EER = 0.067. This means that on average 6.7% of all target attempts will be rejected and 6.7% of all imposter attempts will be accepted.

Another important concept is the DET curve: FRR as a function of FAR, plotted on logarithmic axes. From its shape it is easy to judge the quality of the system as a whole, to estimate what value of one criterion can be achieved with the other fixed, and, most importantly, to compare systems.

det curve

The EER here corresponds to the intersection of the DET curve with the line y = x.

A naive implementation in Python (it could be made more efficient by noting that FAR and FRR change only at points of F(T) \cup F(I)):

import numpy as np

def calc_metrics(targets_scores, imposter_scores):
    min_score = np.minimum(np.min(targets_scores), np.min(imposter_scores))
    max_score = np.maximum(np.max(targets_scores), np.max(imposter_scores))

    n_tars = len(targets_scores)
    n_imps = len(imposter_scores)

    N = 100
    fars = np.zeros((N,))
    frrs = np.zeros((N,))
    dists = np.zeros((N,))

    mink = float('inf')
    eer = 0

    for i, dist in enumerate(np.linspace(min_score, max_score, N)):
        far = len(np.where(imposter_scores > dist)[0]) / n_imps
        frr = len(np.where(targets_scores < dist)[0]) / n_tars
        fars[i] = far
        frrs[i] = frr
        dists[i] = dist

        k = np.abs(far - frr)
        if k < mink:
            mink = k
            eer = (far + frr) / 2

    return eer, fars, frrs, dists
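And a sketch of drawing the DET curve described above from the fars and frrs arrays returned by calc_metrics (the clipping is only there so that zero rates survive the log scale):

import numpy as np
import matplotlib.pyplot as plt

def plot_det_curve(fars, frrs, floor=1e-4):
    fars = np.clip(fars, floor, 1.0)
    frrs = np.clip(frrs, floor, 1.0)
    plt.plot(fars, frrs, label='DET')
    plt.plot([floor, 1.0], [floor, 1.0], '--', label='y = x')  # the EER lies on this line
    plt.xscale('log')
    plt.yscale('log')
    plt.xlabel('FAR')
    plt.ylabel('FRR')
    plt.legend()
    plt.show()

Called with the fars and frrs computed on the development scores, this reproduces the kind of plot shown below.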

With the evaluation sorted out: whichever function F we choose, we can now compute FAR, FRR, and EER for it on the validation set and draw a few illustrative plots.

Important: in identification tasks, what we called the validation set above is usually called the development set (devset). We will stick to this terminology from now on.

Important: since any interval of the real axis R can be mapped onto the segment [0, 1], the values of our function do not have to lie between zero and one. Even without mapping into the unit segment, any range of values can be used without affecting the results.

Preparing the dataset


There are many face recognition datasets. Some are paid, some are available on request. Some contain more variability in lighting, others in head pose. Some were collected in laboratory conditions, others from photos taken in the wild. Having clearly formulated the data requirements, you can easily pick a suitable dataset or assemble one from several. For this educational task my requirements were as follows: the datasets should be freely available for download, contain not too much data, and contain variability in head pose. Three datasets satisfied these requirements, and I combined them into one:


All of them are long outdated and do not allow building a high-quality modern face recognition system, but they are ideal for learning.

The resulting base contains 277 subjects and ~4000 images, about 14 images per person on average. We take 5-10% of the subjects for the development set and use the rest for training. During training the system must see only examples from the second set, while we will evaluate it (compute the EER) on the first.
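A minimal sketch of such a subject-level split, assuming subject_images maps a subject id to the list of its image paths (the actual code linked below differs in details):

import random

def split_by_subject(subject_images, dev_fraction=0.1, seed=42):
    subjects = sorted(subject_images.keys())
    random.Random(seed).shuffle(subjects)
    n_dev = max(1, int(len(subjects) * dev_fraction))
    dev_subjects = set(subjects[:n_dev])
    # no subject appears in both sets
    dev = {s: subject_images[s] for s in dev_subjects}
    train = {s: subject_images[s] for s in subjects if s not in dev_subjects}
    return train, dev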

The code for data separation is available here . You only need to specify the path to the unpacked datasets listed above.

Now the data needs to be processed. To begin with, let's extract the faces. We will not do this ourselves but will use the dlib library.


import dlib
import numpy as np
from skimage import io

image = io.imread(image_path)
detector = dlib.get_frontal_face_detector()
face_rects = list(detector(image, 1))
face_rect = face_rects[0]


As you can see, with this library you get the bounding rectangle of a face in a couple of lines of code. And the dlib detector, unlike the one in OpenCV, works very well: out of the entire base it failed to detect only a dozen faces and did not produce a single false positive.

The formal statement of our task implies that all faces must be the same size. Let's satisfy this requirement and at the same time align all the faces so that the key points (eyes, nose, lips) always land in the same place in the image. Clearly, such a measure can help us regardless of the chosen training method and certainly does no harm. The algorithm is simple:

  1. Take the a priori positions of the key points in the unit square.
  2. Knowing the chosen image size, compute the coordinates of these points in our image by simple scaling.
  3. Detect the key points of the current face.
  4. Construct the affine transformation that maps the second set of points onto the first.
  5. Apply the affine transformation to the image and crop it.

The reference positions of the key points can be found in the examples for dlib (face_template.npy, download here ).

 face_template = np.load(face_template_path) 

To find the key points in the face image we again use dlib, with an already trained model that can be found in the same examples (shape_predictor_68_face_landmarks.dat, download here ).

predictor = dlib.shape_predictor(dlib_predictor_path)
points = predictor(image, face_rect)
landmarks = np.array(list(map(lambda p: [p.x, p.y], points.parts())))

An affine transformation is uniquely defined by three points:

 INNER_EYES_AND_BOTTOM_LIP = [39, 42, 57] 


Let (x_1^0, y_1^0), (x_2^0, y_2^0), (x_3^0, y_3^0) be the starting points that we want to map to (x_1^1, y_1^1), (x_2^1, y_2^1), (x_3^1, y_3^1). Then the affine transformation expressed by the matrix T can be found from the relation

\begin{bmatrix} x_1^1 & x_2^1 & x_3^1 \\ y_1^1 & y_2^1 & y_3^1 \\ 1 & 1 & 1 \end{bmatrix} = T \begin{bmatrix} x_1^0 & x_2^0 & x_3^0 \\ y_1^0 & y_2^0 & y_3^0 \\ 1 & 1 & 1 \end{bmatrix}.

Find it:

proper_landmarks = 227 * face_template[INNER_EYES_AND_BOTTOM_LIP]
current_landmarks = landmarks[INNER_EYES_AND_BOTTOM_LIP]

A = np.hstack([current_landmarks, np.ones((3, 1))]).astype(np.float64)
B = np.hstack([proper_landmarks, np.ones((3, 1))]).astype(np.float64)
T = np.linalg.solve(A, B).T

And apply it to the image using the scikit-image library:

import skimage.transform as tr

warped = tr.warp(
    image,
    tr.AffineTransform(T).inverse,
    output_shape=(227, 227),
    order=3,
    mode='constant',
    cval=0,
    clip=True,
    preserve_range=True
)
warped /= 255.0


The full preprocessing code, wrapped in a convenient api, can be found in the preprocessing.py file.

The final chord of data preparation is normalization: compute the mean and standard deviation over the training set and normalize each image with them. Do not forget about the development set. See the code here .
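A rough sketch of this step, assuming per-pixel statistics (the linked code may differ in details):

import numpy as np

def normalize(x_train, x_dev):
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0) + 1e-8   # avoid division by zero
    x_train = (x_train - mean) / std
    x_dev = (x_dev - mean) / std       # the dev set uses the training statistics
    return x_train, x_dev, mean, std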



Collected, divided, aligned and normalized data can be downloaded here .

The Bourne Ultimatum


The data is found and prepared, the evaluation methodology is sorted out. Half the battle is done; all that remains is the easiest part: finding an F that solves the problem with a good EER. Let's agree right away that an EER of 10% will fully satisfy us for such an educational task. In fact, such a system could even be used in simple applications like searching for identical faces in two photographs.

Coin


Let's start the search with that very example of a bad function, F(x, y) = \xi \sim U[0, 1]. For each pair of photos from the development set we get a random value, build a DET curve from these scores, and find the EER.
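A sketch of this baseline using calc_metrics from above, where n_target and n_imposter stand for the numbers of target and imposter pairs in the development set:

import numpy as np

rng = np.random.RandomState(0)
targets_scores = rng.uniform(0.0, 1.0, size=n_target)     # scores for target pairs
imposter_scores = rng.uniform(0.0, 1.0, size=n_imposter)  # scores for imposter pairs
eer, fars, frrs, dists = calc_metrics(targets_scores, imposter_scores)
print('EER = {:.1%}'.format(eer))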

bad random function det-curve


With an EER of 49.5%, such an identifier is no better than a coin we could toss for every decision. Of course, this is clear without any plots, but our goal is to learn to solve identification tasks and to objectively evaluate any solution, even an obviously bad one. Besides, now we have something to improve upon.

Distance


What function of two vectors from R^{m \times n} that returns a real number comes to mind first? I am sure most of you would answer: a distance. Indeed, R^{m \times n} is a metric space, which means that for any two of its elements x and y a distance d(x, y) \in R is defined, and it can be introduced in different ways. We only need to take it with a minus sign, since a distance ranges from zero to plus infinity, while in our formalization it should be the other way around.

Take, for example, the cosine distance:

-d(x, y) = \cos(x, y) = \frac{x \cdot y}{\Vert x \Vert \cdot \Vert y \Vert}


And let's perform all the same operations on the development set:

dev_x = np.load('data/dev_x.npy')
protocol = np.load('data/dev_protocol.npy')

# average over the color channels and flatten each image into a vector
dev_x = dev_x.mean(axis=3).reshape(dev_x.shape[0], -1)
# L2-normalize so that the dot product below is the cosine similarity
dev_x /= np.linalg.norm(dev_x, axis=1)[:, np.newaxis]

scores = dev_x @ dev_x.T
tsc, isc = scores[protocol], scores[np.logical_not(protocol)]
eer, fars, frrs, dists = calc_metrics(tsc, isc)

We get this DET curve:

cosine det-curve

The EER decreased by about 16 points and became 34.18%. Better, but still not usable. That is to be expected: so far we have only picked a function, without using the training set or machine learning methods. Nevertheless, the idea of a distance is sound; let's keep it and represent our function F as

F(x, y) = d\big(f(x), f(y)\big),

where d is the cosine distance, and f: R^{m \times n} \to R^k is some function that we will call the embedder; the results of its work, points of the space R^k, are embeddings. It "embeds" our images into a space of a different (not necessarily smaller) dimension, taking into account the a posteriori experience obtained from the training set.

CNN


Great, we have just made the task even simpler: it only remains to find a good function f, all the other parts of the system are already in place. We will not beat around the bush: we all know that at the moment there is no model that copes with images better than a CNN (Convolutional Neural Network). Let's build a convolutional network of some not-too-complex architecture, for example:

cnn architecture

Model in keras
from keras.layers import Flatten, Dense, Dropout
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.layers.advanced_activations import PReLU
from keras.models import Sequential

model = Sequential()

model.add(Convolution2D(96, 11, 11, subsample=(4, 4), input_shape=(dim, dim, 3),
                        init='glorot_uniform', border_mode='same'))
model.add(PReLU())
model.add(MaxPooling2D((3, 3), strides=(2, 2)))

model.add(Convolution2D(256, 5, 5, subsample=(1, 1),
                        init='glorot_uniform', border_mode='same'))
model.add(PReLU())
model.add(MaxPooling2D((3, 3), strides=(2, 2)))

model.add(Convolution2D(384, 3, 3, subsample=(1, 1),
                        init='glorot_uniform', border_mode='same'))
model.add(PReLU())

model.add(Convolution2D(384, 3, 3, subsample=(1, 1),
                        init='glorot_uniform', border_mode='same'))
model.add(PReLU())

model.add(Convolution2D(256, 3, 3, subsample=(1, 1),
                        init='glorot_uniform', border_mode='same'))
model.add(PReLU())
model.add(MaxPooling2D((3, 3), strides=(2, 2)))

model.add(Flatten())
model.add(Dropout(0.5))

model.add(Dense(2048, init='glorot_uniform'))
model.add(PReLU())
model.add(Dropout(0.5))

model.add(Dense(256, init='glorot_uniform'))
model.add(PReLU())

model.add(Dense(n_classes, init='glorot_uniform', activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])



And let's train it to solve the classical classification problem on our training set: determine which of the 250 subjects a face photo belongs to. Everyone knows how to solve such a simple task; in Keras, besides the code above, it takes another 5-6 lines (a sketch follows below). Let me just note that for the training base described in this article it is vital to apply augmentation, otherwise there is not enough data to achieve good results.
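Roughly, those few lines could look like this (Keras 1-style API to match the model above; the augmentation parameters, batch size, and number of epochs are purely illustrative, and y_train is assumed to be the one-hot matrix of subject labels):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.1,
                             horizontal_flip=True)

model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                    samples_per_epoch=len(x_train),
                    nb_epoch=50)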

You may ask, what does the classification task have to do with anything, and how will solving it help us? It will! For the actions described below to make sense, we have to make one very important assumption: if the network learns to solve the classification problem on a closed set well, then its penultimate layer of dimension 256 concentrates all the important information about the face image, even if the subject was not in the training set.

This technique of extracting low-dimensional features from the last layers of a trained network is widespread, and the features are known as the bottleneck. By the way, the code for working with the bottleneck in Keras is available at the link.
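As an illustration, a minimal sketch of how such features could be pulled out of the Sequential model above using the Keras backend (this is only a sketch; the linked code may do it differently):

from keras import backend as K

# model.layers[-2] is the PReLU that follows the 256-dimensional Dense layer
get_bottleneck = K.function([model.layers[0].input, K.learning_phase()],
                            [model.layers[-2].output])

bottleneck_train = get_bottleneck([x_train, 0])[0]  # 0 = test mode, disables Dropout
bottleneck_dev = get_bottleneck([x_dev, 0])[0]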

The network has been trained, and the 256-dimensional features have been extracted for the development set. Let's look at the DET curve:

det curve cnn + cos

The assumption turned out to be true: the EER dropped by another 13 points, reaching 21.6%. Twice as good as tossing a coin. Can we do better? Of course: one can collect a bigger and more diverse base, build a deeper CNN, apply various regularization methods... But we are considering conceptual, qualitative approaches, and quantity can always be added later. I still have one more trump card up my sleeve, but before putting it on the table I will have to digress a little.

The Bourne Evolution



The key to improving our result lies in realizing that f can be optimized using not only the information from the training set, but also the knowledge of the function d. Indeed, let's fix d and train f with a priori knowledge of how the embeddings will later be used to produce a score. This approach was first suggested by the folks from Google in FaceNet: A Unified Embedding for Face Recognition and Clustering .

Their approach was called TDE (Triplet Distance Embedding) and consists of the following: let's build f as a network that maps the source space R^{m \times n} straight into the embedding space R^k without solving an intermediate classification problem, fix d as the Euclidean distance, and take it into account in the loss function. How? Quite intuitively: we want the vectors of one subject to lie as close as possible to each other in the target space and, at the same time, far from the vectors of other subjects.



Such a network is trained on triplets (x_a, x_p, x_n), where x_a (anchor) and x_p (positive) belong to the same subject, and x_n (negative) to another. For all three vectors we compute the embeddings f(x_a), f(x_p), and f(x_n). Let's also fix some parameter \alpha in advance. We will consider a triplet good if the following relation holds:

\Vert f(x_a) - f(x_n) \Vert_2^2 - \Vert f(x_a) - f(x_p) \Vert_2^2 > \alpha,


which means that for the given anchor there is a gap of width \alpha between the spheres on which the positive and the negative lie. If this relation holds for all possible triplets from the training set, we have separated the data perfectly, so it makes sense to train the network only on the triplets for which the inequality is violated. Based on this inequality, we can build a loss function for the network f:

L(x_a, x_p, x_n, f) = \frac{1}{N} \sum_{i=1}^N \Big[ \Vert f(x_a^i) - f(x_p^i) \Vert_2^2 - \Vert f(x_a^i) - f(x_n^i) \Vert_2^2 + \alpha \Big].
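For illustration, a small numpy sketch of this criterion for a batch of precomputed embeddings; the max(·, 0) implements the restriction to violating triplets mentioned above, and the value of alpha is just an example:

import numpy as np

def tde_loss(emb_a, emb_p, emb_n, alpha=0.2):
    # squared Euclidean distances anchor-positive and anchor-negative
    d_ap = np.sum((emb_a - emb_p) ** 2, axis=1)
    d_an = np.sum((emb_a - emb_n) ** 2, axis=1)
    # triplets that already satisfy the margin contribute nothing
    return np.mean(np.maximum(d_ap - d_an + alpha, 0))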

Using this approach, the authors reduced the error by 30% on the Labeled Faces in the Wild and YouTube Faces DB datasets, which is undoubtedly very cool. However, the approach also has its problems:


This is where TPE (Triplet Probabilistic Embedding) enters the scene, the approach described in Triplet Probabilistic Embedding for Face Verification and Clustering .

Why introduce an extra parameter \alpha when we can demand that a much simpler inequality hold? Here it is:

d(f(x_a), f(x_n)) > d(f(x_a), f(x_p)).


It is simpler than the original and easy to interpret: we want the negative example to lie farther from the anchor than the positive one, but there does not have to be any gap between them. And because we do not stop updating the network once a margin of \alpha is reached, the groups of embeddings can be spread apart in R^k even better.

We can compute the probability that a triplet satisfies this inequality. Written in terms of the similarity score (the cosine similarity from the previous section, i.e. the negative of the distance, which we will also denote d below), it is:

p = \frac{e^{d(f(x_a), f(x_p))}}{e^{d(f(x_a), f(x_p))} + e^{d(f(x_a), f(x_n))}}.

Dividing the numerator and the denominator by e^{d(f(x_a), f(x_p))}:

p = \frac{1}{1 + e^{d(f(x_a), f(x_n)) - d(f(x_a), f(x_p))}} = \sigma\big( d(f(x_a), f(x_p)) - d(f(x_a), f(x_n)) \big)

We will maximize the logarithm of probability, so the loss function will look like this:

L(x_a, x_p, x_n, f) = -\frac{1}{N} \sum_{i=1}^N \log \sigma\big( d(f(x_a^i), f(x_p^i)) - d(f(x_a^i), f(x_n^i)) \big).

And as the function f the authors propose to use not a giant CNN but a simple matrix: f(x) = Wx, and to train it on the bottleneck features we have already obtained. Here are the results achieved by the authors of the paper:

original cnn-tpe results

As you can see, this approach works better than the original one and has many advantages:


Let's use this approach. All we need is about 20 lines of code:

from keras import backend as K
from keras.layers import Input, Dense, Lambda, merge
from keras.models import Model, Sequential

def triplet_loss(y_true, y_pred):
    return -K.mean(K.log(K.sigmoid(y_pred)))

def triplet_merge(inputs):
    a, p, n = inputs
    return K.sum(a * (p - n), axis=1)

def triplet_merge_shape(input_shapes):
    return (input_shapes[0][0], 1)

a = Input(shape=(n_in,))
p = Input(shape=(n_in,))
n = Input(shape=(n_in,))

base_model = Sequential()
base_model.add(Dense(n_out, input_dim=n_in, bias=False, weights=[W_pca], activation='linear'))
base_model.add(Lambda(lambda x: K.l2_normalize(x, axis=1)))

a_emb = base_model(a)
p_emb = base_model(p)
n_emb = base_model(n)

e = merge([a_emb, p_emb, n_emb], mode=triplet_merge, output_shape=triplet_merge_shape)

model = Model(input=[a, p, n], output=e)
predict = Model(input=a, output=a_emb)

model.compile(loss=triplet_loss, optimizer='rmsprop')

If you want to use TPE in your own projects, do not be lazy: read the original paper, since I have not covered the main issue of training with triplets, namely their selection. Random selection is enough for our small task (a sketch follows below), but that is the exception rather than the rule.
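For completeness, a sketch of that random selection (features are the bottleneck vectors, labels the subject ids; all names are illustrative):

import numpy as np

def sample_triplets(features, labels, n_triplets, seed=0):
    rng = np.random.RandomState(seed)
    labels = np.asarray(labels)
    anchors, positives, negatives = [], [], []
    for _ in range(n_triplets):
        subject = rng.choice(np.unique(labels))
        same = np.where(labels == subject)[0]
        other = np.where(labels != subject)[0]
        if len(same) < 2:
            continue  # need at least two images of the subject
        a, p = rng.choice(same, 2, replace=False)
        n = rng.choice(other)
        anchors.append(features[a])
        positives.append(features[p])
        negatives.append(features[n])
    return np.array(anchors), np.array(positives), np.array(negatives)

The resulting arrays are then fed to the network as model.fit([anchors, positives, negatives], dummy_targets, ...), where dummy_targets can be anything of shape (N, 1), since triplet_loss ignores y_true.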

Let's train TPE on our bottleneck features and look at the DET curve one last time:

det curve after tpe

An EER of 12% is very close to what we wanted. It is almost twice as good as the plain CNN and about four times better than random. The result could of course be improved by a deeper architecture and a bigger base, but for understanding the principle this result is satisfactory.

Comparison of DET curves for all considered methods:

det-curves of all considered methods

It remains to attach whatever interfaces we like to our system, be it a web interface or a Qt application, and the program for finding identical faces in photos is ready.
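Under the hood, comparing two photos boils down to a few lines. A rough sketch, where preprocess is a hypothetical wrapper around the detection, alignment, and normalization steps above, get_bottleneck is the feature extractor sketched earlier, and predict is the trained TPE model:

import numpy as np

def same_person(image_1, image_2, threshold):
    x = np.stack([preprocess(image_1), preprocess(image_2)])  # detect, align, normalize
    emb = predict.predict(get_bottleneck([x, 0])[0])          # CNN bottleneck -> TPE embedding
    score = float(emb[0] @ emb[1])                            # cosine similarity (embeddings are L2-normalized)
    return score, score >= threshold                          # threshold chosen from the DET curve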

sample application

The application can be found on GitHub .

Thanks for reading! Like, subscribe, leave comments, and teach your machines well. Additions are welcome.



Source: https://habr.com/ru/post/317798/

