Have you noticed that Facebook has developed an uncanny ability to recognize your friends in your photos? In the old days, Facebook only tagged your friends in photos after you clicked on the face and typed in your friend's name. Now, as soon as you upload a photo, Facebook tags everyone for you, and it
looks like magic:

Facebook automatically tags people in your photos whom you have tagged before. I can't decide whether this is useful or creepy!
This technology is called face recognition. Facebook's algorithms can recognize your friends' faces after you have tagged them only a couple of times. It is amazing technology: Facebook recognizes faces
with an accuracy of 98% - almost as well as a human!
Let's look at how modern face recognition works! Simply recognizing your friends would be too easy, though. We can push this technology to its limits to solve a harder problem - telling
Will Ferrell (a famous actor) apart from
Chad Smith (a famous rock musician)!

One of these people is Will Ferrell. The other is Chad Smith. I swear - these are different people!
How to use machine learning for a very difficult problem
So far, in
parts 1,
2 and
3, we used machine learning to solve isolated problems that have only one step -
estimating the price of a house,
generating new data based on existing data, and
determining whether an image contains an object. All of these problems can be solved by choosing one machine learning algorithm, feeding in the data, and getting the result.
But face recognition is actually a sequence of several related problems:
1. First, look at the picture and find all the faces in it.
2. Second, focus on each face and understand that, even if the face is turned in a strange direction or is in bad lighting, it is still the same person.
3. Third, pick out unique features of the face that can be used to tell it apart from other people - for example, the size of the eyes, the length of the face, and so on.
4. Finally, compare these unique features of the face with the features of all the people you already know to determine the person's name.
The human brain does all of this automatically and instantly. In fact, humans are so good at recognizing faces
that they end up seeing faces in everyday objects:

Computers are not capable of this kind of high-level generalization (
at least not yet...), so we have to teach them each step in the process separately.
We need to build a
pipeline in which we solve each step of face recognition separately and pass the result of the current step to the next one. In other words, we will chain several machine learning algorithms together:

How a basic face recognition pipeline can work
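To make the idea of a pipeline concrete, here is a minimal sketch of how the steps could be chained together in Python. The function names are hypothetical placeholders for the real algorithms described below.

# A hypothetical face recognition pipeline: each step hands its result to the next.
def recognize_faces(image):
    faces = detect_faces(image)                 # Step 1: find all faces (e.g. a HOG detector)
    names = []
    for face in faces:
        aligned = align_face(image, face)       # Step 2: find landmarks and center eyes/mouth
        embedding = encode_face(aligned)        # Step 3: compute 128 measurements for the face
        names.append(match_person(embedding))   # Step 4: compare to the people we already know
    return names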
Face detection - step by step
Let's tackle this problem one step at a time. At each step we will learn about a new machine learning algorithm. I'm not going to explain every algorithm in full detail, so that this article doesn't turn into a book, but you will learn the main ideas behind each one and learn how to build your own face recognition system in Python using
OpenFace and
dlib.
Step 1. Finding all the faces.
The first step in our pipeline is
face detection. Obviously, we need to locate all the faces in the photo before we can try to tell them apart!
If you have used any camera in the last 10 years, you have probably seen face detection in action:

Face detection is a great feature for cameras. When the camera can automatically pick out faces, it can make sure all the faces are in focus before taking the picture. But we will use it for a different purpose: finding the areas of the image that we want to pass on to the next stage of our pipeline.
Face detection went mainstream in the early 2000s, when Paul Viola and Michael Jones invented a
way to detect faces that was fast enough to run on cheap cameras. However, much more reliable solutions exist now. We are going to use a
method invented in 2005 called Histogram of Oriented Gradients (
HOG for short).
To find faces in an image, we first make the image black and white, because color data is not needed for face detection:

Then we look at every single pixel in the image, one at a time. For each pixel, we also look at the pixels directly surrounding it:

Our goal is to figure out how dark the current pixel is compared to the pixels directly surrounding it. Then we draw an arrow showing the direction in which the image gets darker:

Looking at just this one pixel and its nearest neighbors, we can see that the image gets darker towards the upper right.
If you repeat this process for
every single pixel in the image, you end up with every pixel replaced by an arrow. These arrows are called
gradients, and they show the flow from light to dark across the whole image:

This might seem like a random thing to do, but there is a very good reason for replacing pixels with gradients. If we analyze pixels directly, really dark images and really light images of the same person will have completely different pixel values. But if we only consider the
direction in which brightness changes, both the dark and the light image end up with exactly the same representation. That makes the problem much easier to solve!
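If you want to see what these gradients look like numerically, here is a minimal sketch using NumPy and scikit-image. It computes the per-pixel gradient direction and strength for a grayscale image; the file name is just a placeholder.

import numpy as np
from skimage import io, color

# "photo.jpg" is a placeholder; substitute any image you like
image = color.rgb2gray(io.imread("photo.jpg"))

# np.gradient returns the rate of change along the vertical and horizontal axes
gy, gx = np.gradient(image)

# Direction (in degrees) and strength of the change in brightness at every pixel
direction = np.degrees(np.arctan2(gy, gx))
magnitude = np.hypot(gx, gy)

print(direction.shape, magnitude.shape)  # same shape as the input image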
But saving the gradient for every single pixel gives us far too much detail. We end up
not seeing the forest for the trees. It would be better if we could just see the basic flow of light and dark at a higher level, so that we capture the basic structure of the image.
To do this, we break the image up into small squares of 16x16 pixels each. In each square, we count how many gradient arrows point in each major direction (how many point up, up-right, right, and so on). Then we replace that square in the image with a single arrow in the direction that was strongest.
The end result is that we turn the original image into a very simple representation that captures the basic structure of a face in a simple way:

The original image is converted into a HOG representation that captures the major features of the image regardless of its brightness.
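If you would like to see a HOG representation like this for your own images, scikit-image can compute one directly. This is a minimal sketch under the assumption that scikit-image and matplotlib are installed (in very old versions of scikit-image the parameter is spelled visualise); the cell size roughly matches the 16x16 squares described above.

import matplotlib.pyplot as plt
from skimage import io, color
from skimage.feature import hog

image = color.rgb2gray(io.imread("photo.jpg"))  # placeholder file name

# Compute the HOG descriptor and an image visualizing the dominant gradients
features, hog_image = hog(image, orientations=9, pixels_per_cell=(16, 16),
                          cells_per_block=(1, 1), visualize=True)

plt.imshow(hog_image, cmap="gray")
plt.show()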
To find faces in this HOG image, all we have to do is find the part of our image that looks most similar to a known HOG pattern that was extracted from a large set of training faces:

Using this technique, we can easily find faces in any image:

If you want to try this step yourself using Python and dlib,
here is a program that shows how to generate and view HOG representations of images.
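dlib also ships with a ready-made HOG-based face detector, so finding the face bounding boxes takes only a few lines. This is a minimal sketch assuming dlib is installed; the image path is a placeholder.

import dlib

detector = dlib.get_frontal_face_detector()   # dlib's built-in HOG face detector
image = dlib.load_rgb_image("photo.jpg")      # placeholder file name

# The second argument upsamples the image once, which helps find smaller faces
faces = detector(image, 1)

for i, rect in enumerate(faces):
    print("Face {}: left={}, top={}, right={}, bottom={}".format(
        i, rect.left(), rect.top(), rect.right(), rect.bottom()))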
Step 2. Posing and projecting faces
So, we have isolated the faces in our image. But now we have a problem: the same face turned in different directions looks completely different to a computer:

Humans can easily see that both images are of Will Ferrell, but a computer would treat them as two completely different people.
To account for this, we try to warp each picture so that the eyes and lips are always in the same place in the image. This will make comparing faces in the next steps much easier.
To do this, we use an algorithm called
face landmark estimation. There are many ways to do this, but we are going to use the
approach proposed in 2014 by Vahid Kazemi and Josephine Sullivan.
The basic idea is that there are 68 specific points (
landmarks) that exist on every face: the tip of the chin, the outside edge of each eye, the inner edge of each eyebrow, and so on. We then train a machine learning algorithm to find these 68 specific points on any face:

The 68 landmarks we will locate on every face
Here is the result of locating the 68 face landmarks on our test image:

PRO TIP: the same technique can be used to implement your own version of Snapchat's real-time 3D face filters!
Now that we know where the eyes and mouth are, we simply rotate, scale and
shear the image so that the eyes and mouth are centered as well as possible. We won't do any fancy 3D warps, because they could introduce distortions. We will only use basic image transformations such as rotation and scaling that preserve parallel lines (so-called
affine transformations):

Now, no matter how the face is turned, we can center the eyes and mouth so that they are in roughly the same position in the image. This will make our next step a lot more accurate.
If you want to try this step yourself using Python and dlib, here is a
program for finding face landmarks and a
program for transforming the image based on those landmarks.
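Here is a minimal sketch of what landmark detection and a basic affine alignment could look like with dlib and OpenCV. It assumes you have downloaded dlib's pretrained shape_predictor_68_face_landmarks.dat model; the file names are placeholders, and the alignment shown here is a simple rotation rather than OpenFace's full alignment.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Pretrained 68-landmark model, downloadable from the dlib website
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

image = dlib.load_rgb_image("photo.jpg")  # placeholder file name
face_rect = detector(image, 1)[0]         # assume at least one face was found
landmarks = predictor(image, face_rect)

# Use the outer eye corners (points 36 and 45) to estimate the face's tilt
left_eye = np.array([landmarks.part(36).x, landmarks.part(36).y], dtype=float)
right_eye = np.array([landmarks.part(45).x, landmarks.part(45).y], dtype=float)
angle = np.degrees(np.arctan2(right_eye[1] - left_eye[1], right_eye[0] - left_eye[0]))

# Rotate around the midpoint between the eyes so they end up level (an affine transform)
center = tuple((left_eye + right_eye) / 2)
rotation = cv2.getRotationMatrix2D(center, angle, 1.0)
aligned = cv2.warpAffine(image, rotation, (image.shape[1], image.shape[0]))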
Step 3. Encoding faces
Now we get to the heart of the problem: actually telling faces apart. This is where things get really interesting!
The simplest approach to face recognition would be to directly compare the unknown face we found in step 2 with all the faces that have already been tagged. If we find an already-tagged face that looks very similar to our unknown one, it must be the same person. Sounds like a pretty good idea, doesn't it?
There is actually a huge problem with that approach. A site like Facebook, with billions of users and trillions of photos, cannot possibly loop through every previously tagged face and compare it to every newly uploaded picture. That would take far too long. Faces need to be recognized in milliseconds, not hours.
What we need is a way to extract a few basic measurements from each face. Then we could measure an unknown face the same way and compare it to the measurements of the known faces. For example, we might measure each ear, the distance between the eyes, the length of the nose, and so on. If you have ever watched the TV show about the Las Vegas crime lab (
CSI: Crime Scene Investigation), you know what I mean:

Just like on TV! It sounds so scientific!
The most reliable way to measure a face
OK, so which measurements should we collect from each face to build our database of known faces? Ear size? Nose length? Eye color? Something else?
It turns out that measurements that seem obvious to us humans (for example, eye color) don't really make sense to a computer looking at individual pixels of an image. Researchers have found that the best approach is to let the computer figure out which measurements to collect itself. Deep learning is better than humans at working out which parts of a face matter for recognizing it.
The solution is to train a deep convolutional neural network (
just like we did in part 3). But instead of training the network to recognize objects in pictures like last time, we are going to train it to generate 128 measurements for each face.
The training process works by looking at 3 face images at a time:
1. Load a training face image of a known person.
2. Load another image of the same known person.
3. Load an image of a completely different person.
Then the algorithm looks at the measurements it is currently generating for each of those three images. It tweaks the neural network slightly so that the measurements it generates for images 1 and 2 move a little closer together, while the measurements for images 2 and 3 move a little further apart.
A single “triplet” training step:
After repeating this step millions of times for millions of images of thousands of different people, the neural network learns to reliably generate 128 measurements for each person. Any ten different pictures of the same person should give roughly the same measurements.
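The adjustment described above is usually expressed as a "triplet" loss. Here is a minimal NumPy sketch of that idea, not the exact objective used by any particular library; the margin value is an illustrative assumption.

import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Penalize the network when the anchor is not at least `margin`
    # closer to the positive (same person) than to the negative (different person).
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(0.0, pos_dist - neg_dist + margin)

# Toy example with random 128-dimensional "measurements"
a, p, n = (np.random.rand(128) for _ in range(3))
print(triplet_loss(a, p, n))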
Machine learning people call these 128 measurements of each face an
embedding (a set of features). The idea of reducing complicated raw data, such as an image, to a list of computer-generated numbers has turned out to be extremely useful in machine learning (especially for language translation). The particular approach to faces that we are using
was proposed in 2015 by researchers at Google, but there are many similar approaches.
Encoding our face image
Training a convolutional neural network to output face embeddings requires a lot of data and a lot of computing power. Even on an expensive
NVIDIA Tesla graphics card, it takes
about 24 hours of continuous training to get good accuracy.
But once the network has been trained, it can generate measurements for any face, even one it has never seen before! So this step only needs to be done once. Lucky for us, the fine folks at
OpenFace have already done it and
published several trained networks that we can use right away. Thanks to
Brandon Amos and the team!
So all we have to do is run our face images through their pre-trained network and get out 128 measurements for each face. Here are the measurements for our test image:

So which parts of the face do these 128 numbers describe? It turns out we have no idea. And it doesn't really matter to us. All we care about is that the network produces nearly the same numbers when it looks at two different pictures of the same person.
If you want to try this step yourself, OpenFace
provides a Lua script that generates embeddings for all the images in a folder and writes them to a csv file. You can run it
as shown here.
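As an alternative to OpenFace's network, dlib also bundles a pretrained model that outputs a 128-dimensional measurement vector per face. Here is a minimal sketch assuming you have downloaded dlib's shape_predictor_68_face_landmarks.dat and dlib_face_recognition_resnet_model_v1.dat files; the image names are placeholders.

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")

def face_embedding(path):
    image = dlib.load_rgb_image(path)
    rect = detector(image, 1)[0]                 # assume one face per image
    landmarks = predictor(image, rect)
    # Returns 128 numbers describing the face
    return np.array(encoder.compute_face_descriptor(image, landmarks))

# Two pictures of the same person should produce nearly identical numbers
a = face_embedding("will-1.jpg")   # placeholder file names
b = face_embedding("will-2.jpg")
print(np.linalg.norm(a - b))       # a small distance means "probably the same person"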
Step 4. Finding the person's name from the encoding
This last step is actually the easiest one in the whole process. All we have to do is find the person in our database of known people whose measurements are closest to those of our test image.
You can do that with any basic machine learning classification algorithm. No fancy deep learning tricks are needed. We will use a simple linear
SVM classifier, but many other classification algorithms would work too.
All we need to do is train a classifier that takes the measurements of a new test image and tells us which known person is the closest match. Running this classifier takes milliseconds. The output of the classifier is the person's name!
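Here is a minimal sketch of what training such a classifier could look like with scikit-learn, assuming you already have the 128-number embeddings and the matching names saved to files. The variable and file names are illustrative.

import numpy as np
from sklearn.svm import SVC

# embeddings: one row of 128 numbers per training image; names: the matching person names
embeddings = np.load("embeddings.npy")              # placeholder files
names = np.load("names.npy", allow_pickle=True)

# A linear SVM with probability estimates, so we can check confidence later
classifier = SVC(kernel="linear", probability=True)
classifier.fit(embeddings, names)

# Classify the embedding of a new, unknown face
unknown = np.load("unknown_embedding.npy")
probabilities = classifier.predict_proba([unknown])[0]
best = np.argmax(probabilities)
print(classifier.classes_[best], probabilities[best])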
Let's test our system. As a first step, I trained the classifier with the embeddings of about 20 pictures each of Will Ferrell, Chad Smith and Jimmy Fallon:

Ah, those lovely training images!
Then I ran the classifier on every frame of the famous YouTube video of
Will Ferrell and Chad Smith pretending to be each other on the Jimmy Fallon show:

It worked! And look how well it works on faces from many different angles, even in profile!
Running the whole process yourself
Let's review the steps involved:
1. Process the image with the HOG algorithm to create a simplified version of it. In this simplified image, find the part that looks most like a generic HOG representation of a face.
2. Work out the pose of the face by finding its main landmarks. Once you have those landmarks, use them to warp the image so that the eyes and mouth are centered.
3. Pass the centered face image through a neural network that has been trained to produce face measurements. Save the resulting 128 measurements.
4. Looking at all the faces you have measured before, find the person whose measurements are closest to the ones you just obtained. That's your match!
Now that you know how it all works, here are step-by-step instructions for running the whole face recognition pipeline on your own computer using
OpenFace:
Before you start
Make sure you have Python, OpenFace and dlib installed. You can
install them manually or use a pre-configured Docker image that has everything already installed:
docker pull bamos/openface
docker run -p 9000:9000 -p 8000:8000 -t -i bamos/openface /bin/bash
cd /root/openface
PRO TIP: if you are running Docker on OSX, here is how to make your OSX /Users/ folder visible inside the Docker image:
docker run -v /Users:/host/Users -p 9000:9000 -p 8000:8000 -t -i bamos/openface /bin/bash
cd /root/openface
Then you can access all your OSX files inside the Docker image at /host/Users/...
ls /host/Users/
Step 1
Create a folder called
./training-images/
in the openface folder.
mkdir training-images
Step 2
Create a subfolder for each person you want to recognize. For example:
mkdir ./training-images/will-ferrell/
mkdir ./training-images/chad-smith/
mkdir ./training-images/jimmy-fallon/
Step 3
Copy all the images of each person into the appropriate subfolder. Make sure each image contains only one face. There is no need to crop the images around the face; OpenFace will do that automatically.
Step 4
Run the OpenFace scripts from the openface root directory.
First, perform pose detection and alignment:
./util/align-dlib.py ./training-images/ align outerEyesAndNose ./aligned-images/ --size 96
As a result, a new subfolder
./aligned-images/
will be created with a cropped and aligned version of each of your test images.
Then generate the representations (embeddings) from the aligned images:
./batch-represent/main.lua -outDir ./generated-embeddings/ -data ./aligned-images/
The subfolder
./generated-embeddings/
will contain a csv file with the embeddings for each image.
Train your face recognition model:
./demos/classifier.py train ./generated-embeddings/
A new file will be created with the name
./generated-embeddings/classifier.pkl
. This file contains the SVM model that will be used to recognize new faces.
At this point, you have a working face recognizer!
Step 5. Recognize faces!
Take a new picture with an unknown face. Pass it to the classifier script like this:
./demos/classifier.py infer ./generated-embeddings/classifier.pkl your_test_image.jpg
You should get a prediction that looks like this:
=== /test-images/will-ferrel-1.jpg === Predict will-ferrell with 0.73 confidence.
From here, you can tweak the python script
./demos/classifier.py
.
Important notes:
• If you are not getting good results, try adding a few more images of each person in Step 3 (especially images from different angles).
• This script will always make a prediction, even if the face is someone it doesn't know. In a real application, you would check the confidence score and discard predictions with low confidence, since they are most likely wrong.
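If you wrap the prediction step in your own Python code, filtering out low-confidence guesses is straightforward. This is a minimal sketch assuming a scikit-learn classifier trained with probability estimates, as in the earlier example; the threshold of 0.5 is an arbitrary illustrative choice.

import numpy as np

THRESHOLD = 0.5  # illustrative cut-off; tune it for your own data

def predict_name(classifier, embedding, threshold=THRESHOLD):
    # Return the predicted name, or None if the classifier isn't confident enough.
    probabilities = classifier.predict_proba([embedding])[0]
    best = np.argmax(probabilities)
    if probabilities[best] < threshold:
        return None  # probably a face the classifier has never seen
    return classifier.classes_[best]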