
With a beard, in sunglasses and in profile: difficult situations for computer vision



The technologies and models behind our computer vision system were created and refined gradually across various projects of our company: Mail, Cloud, and Search. They matured like a good cheese or cognac. When we realized that our neural networks were showing excellent recognition results, we decided to combine them into a single B2B product, Vision, which we now use ourselves and offer to you.

Today, our computer vision technology on the Mail.Ru Cloud Solutions platform works successfully and solves very complex practical problems. It is based on a series of neural networks trained on our data sets, each specializing in a particular applied task. All services run on our own server capacity. You can integrate the public Vision API into your applications; through it, all the features of the service are available. The API is fast: thanks to server-side GPUs, the average response time inside our network is 100 ms.
Below the cut is a detailed story and many examples of Vision at work.

As an example of a service where we use these face recognition technologies ourselves, we can cite Events. One of its components is the Vision photo stands that we set up at various conferences. If you walk up to such a stand, take a photo with the built-in camera, and enter your e-mail, the system immediately finds you in the array of photos taken by the official conference photographers and, if you wish, e-mails those photos to you. And we are not talking about staged portrait shots: Vision recognizes you even in the background of a crowd of visitors. Of course, the recognition is not done by the photo stands themselves; they are just tablets in nice enclosures that photograph guests with their built-in cameras and send the data to our servers, where all the recognition magic happens. We have repeatedly seen how surprising the technology's effectiveness is, even to image recognition specialists. Some examples are described below.


1. Our Face Recognition Model


1.1. Neural network and processing speed


For recognition, we use a modified ResNet-101 neural network. At the end, the average pooling is replaced by a fully connected layer, similar to how it is done in ArcFace. However, our vector representations have a size of 128 rather than 512. Our training set contains about 10 million photos of 273,593 people.
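For illustration, here is a minimal sketch (an assumption about a plausible implementation, not our production code) of the model just described: torchvision's ResNet-101 with the final average pooling and classifier replaced by an ArcFace-style fully connected head that outputs a 128-dimensional face embedding. The input size and head details are illustrative.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FaceEmbedder(nn.Module):
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        backbone = models.resnet101(weights=None)
        # Keep the convolutional trunk; drop the original avgpool and fc.
        self.trunk = nn.Sequential(*list(backbone.children())[:-2])
        # ArcFace-style head: BN -> flatten -> FC -> BN instead of pooling.
        self.head = nn.Sequential(
            nn.BatchNorm2d(2048),
            nn.Flatten(),
            nn.Linear(2048 * 7 * 7, embedding_dim),  # assumes 224x224 inputs (7x7 feature map)
            nn.BatchNorm1d(embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        emb = self.head(self.trunk(x))
        # L2-normalize so faces can be compared by cosine similarity.
        return nn.functional.normalize(emb, dim=1)

model = FaceEmbedder().eval()
with torch.no_grad():
    vectors = model(torch.randn(2, 3, 224, 224))  # shape: (2, 128)
```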

The model works very fast thanks to a carefully chosen server architecture and GPU computing. Getting a response from the API over our internal network takes from 100 ms, which includes face detection (locating faces in the photo), recognition, and returning the PersonID in the API response. With large volumes of incoming data (photos and video), far more time is spent transferring data to the service and receiving the response.

1.2. Model Evaluation


Measuring the effectiveness of neural networks is an ambiguous task. The quality of their work depends on which data sets the models were trained on and whether they were optimized for that specific data.

We began evaluating the accuracy of our model with the popular LFW verification test, but it is too small and simple: once 99.8% accuracy is reached, it stops being informative. There is a good competition for evaluating recognition models, MegaFace, on which we gradually worked our way up to 82% rank-1. The MegaFace test consists of a million distractor photographs, and the model must reliably distinguish several thousand photographs of celebrities from FaceScrub from those distractors. However, after cleaning the MegaFace test of labeling errors, we found that on the cleaned version we reach 98% rank-1 accuracy (celebrity photos are, on the whole, rather specific). So we created our own identification test, similar to MegaFace but with photos of "ordinary" people, kept improving recognition accuracy on our own data sets, and moved far ahead. In addition, we use a clustering quality test built from several thousand photographs; it simulates grouping faces in a user's cloud. Here a cluster is a group of similar faces, one group per recognized person, and we measured clustering quality against these manually verified (ground truth) groups.
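As a rough illustration of what a rank-1 identification test measures, here is a minimal sketch with toy random data standing in for real embeddings and distractors: for each probe face, the closest gallery embedding is found, and we count how often it belongs to the correct person.

```python
import numpy as np

def rank1_accuracy(probes, probe_ids, gallery, gallery_ids):
    """probes: (P, 128) and gallery: (G, 128) L2-normalized embeddings."""
    sims = probes @ gallery.T   # cosine similarity via dot product
    top1 = sims.argmax(axis=1)  # index of the closest gallery face per probe
    return float((gallery_ids[top1] == probe_ids).mean())

# Toy usage: random unit vectors stand in for real face embeddings.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 128))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
probes = gallery[:50] + 0.05 * rng.standard_normal((50, 128))  # noisy re-shots
probes /= np.linalg.norm(probes, axis=1, keepdims=True)
print(rank1_accuracy(probes, np.arange(50), gallery, np.arange(1000)))
```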

Of course, any model makes recognition errors. But such situations are often resolved by fine-tuning thresholds for specific conditions (for all conferences we use the same thresholds, while for access control systems, for example, the thresholds have to be raised considerably to reduce false positives). The vast majority of conference attendees were recognized correctly by our Vision photo stands. Occasionally someone would look at a cropped thumbnail and say: "Your system made a mistake, that's not me." Then we would open the full photo and find that the visitor really was in it; the photo had simply been taken of someone else, and the visitor happened to be in the blurred background. Moreover, the neural network often recognizes correctly even when part of the face is not visible, the person is standing in profile, or even half turned away. The system can recognize a person even when the face falls into an area of optical distortion, say, when shot with a wide-angle lens.
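A minimal sketch of the threshold logic described above (an assumed matching scheme for illustration, not Vision's actual internals): two faces are considered the same person when the cosine similarity of their embeddings exceeds a threshold, and stricter deployments such as ACS simply raise that threshold.

```python
import numpy as np

def same_person(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float) -> bool:
    # Embeddings are assumed L2-normalized, so a dot product is cosine similarity.
    return float(emb_a @ emb_b) >= threshold

CONFERENCE_THRESHOLD = 0.55  # hypothetical value: permissive, maximizes matches
ACS_THRESHOLD = 0.75         # hypothetical value: strict, fewer false positives

rng = np.random.default_rng(1)
a, b = rng.standard_normal(128), rng.standard_normal(128)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(same_person(a, b, CONFERENCE_THRESHOLD), same_person(a, b, ACS_THRESHOLD))
```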

1.3. Examples of testing in difficult situations


Below are examples of our neural network at work. The input is photos, which it must label with PersonIDs, unique identifiers of persons. If two or more images have the same ID, then, according to the model, they depict the same person.

Note right away that during testing we have various model parameters and thresholds available to us, which we can tune to achieve one result or another. The public API is optimized for maximum accuracy on common cases.

Let's start with the simplest case: recognizing full-face, frontal photos.



Well, that was too easy. Let's complicate the task and add a beard and a handful of years.



Someone will say this wasn't too difficult either, because in both cases the face is visible in its entirety and the algorithm has plenty of information about it. Okay, let's turn Tom Hardy in profile. This task is much harder, and we spent a lot of effort on solving it while keeping the error rate low: we curated the training set, thought through the neural network architecture, polished the loss functions, and improved the pre-processing of photographs.



Let's put a hat on him:



By the way, this is an example of a particularly difficult situation, since the face is heavily covered here, and in the lower picture there is also a deep shadow hiding the eyes. In real life, people very often change their appearance with dark glasses. Let's do the same with Tom.



Well, let's try photos from different ages, and this time run the experiment on another actor. Let's take a much more complicated example, where age-related changes are especially pronounced. The situation is not far-fetched; it arises quite often when you need to compare a passport photo with the face of its bearer. After all, the first photo is glued into a passport when its owner is 20, and by 45 a person can change dramatically:



Do you think the top specialist in impossible missions hasn't changed much with age? I suspect few people would even pair the top and bottom photos, so much has the boy changed over the years.



Neural networks encounter changes in appearance far more often. For example, women can sometimes change their look dramatically with cosmetics:



Now let's complicate the task further: let different parts of the face be covered in different photos. In such cases the algorithm cannot compare whole faces against each other. Nevertheless, Vision copes well with such situations.



By the way, a photograph can contain a great many faces; for example, a general shot of a hall can include more than 100 people. This is a difficult situation for neural networks, since many faces may be lit differently or fall outside the zone of sharpness. However, if the photo is taken with sufficient resolution and quality (at least 75 pixels on the side of the square covering the face), Vision can detect and recognize it.
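The size requirement is easy to check programmatically. Here is a tiny illustrative helper (the 75-pixel figure is from the text above; the function itself is hypothetical) that tests whether a face bounding box in [x1, y1, x2, y2] form, as in the API responses shown later, is large enough for recognition:

```python
def is_recognizable(coord, min_side: int = 75) -> bool:
    """Return True if the face box is at least min_side pixels on each side."""
    x1, y1, x2, y2 = coord
    return min(x2 - x1, y2 - y1) >= min_side

print(is_recognizable([149, 60, 234, 181]))  # True: an 85 x 121 px face
print(is_recognizable([10, 10, 60, 80]))     # False: only 50 px wide
```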



A peculiarity of reportage photos and surveillance camera footage is that people are often blurred because they were out of focus or moving at that moment:



Also, the illumination intensity can vary greatly from image to image. This, too, often becomes a stumbling block: many algorithms have great difficulty correctly processing images that are too dark or too light, let alone matching them accurately. Let me remind you that achieving this kind of result requires tuning the thresholds in a certain way; this capability is not yet publicly available. For all clients we use the same neural network, with thresholds suitable for most practical tasks.



Recently, we rolled out a new version of the model that accurately recognizes Asian faces. This used to be a big problem, one that was even dubbed the "racism of machine learning" (or "of neural networks"). European and American neural networks recognized Caucasian faces well, but did much worse with Mongoloid and Negroid faces. In China, presumably, the situation was exactly the opposite. It all comes down to the training data sets, which reflect the dominant types of people in a particular country. However, the situation is changing, and today the problem is far less acute. Vision has no difficulty with people of different races.



Face recognition is just one of many applications of our technology; Vision can be taught to recognize anything. For example, license plates, including in conditions difficult for algorithms: at acute angles, dirty, or hard to read.



2. Practical examples of use


2.1. Physical access control: when two people go through on one pass


With the help of Vision, you can build a system for tracking employees' arrivals and departures. A traditional system based on electronic passes has obvious drawbacks; for example, two people can get through on one badge. If the access control system (ACS) is supplemented with Vision, it will honestly record who came and left, and when.

2.2. Time tracking


This usage scenario for Vision is closely related to the previous one. If we supplement the access control system with our face recognition service, it can not only notice access violations but also register the actual presence of employees in the building or at a site. In other words, Vision helps honestly account for who came to work and for how long, who left, and who skipped work entirely, even if colleagues covered for them in front of management.

2.3. Video analytics: tracking people and security


By tracking people with Vision, you can accurately estimate the real foot traffic of shopping areas, railway stations, crossings, streets, and many other public places. Our tracking can also be a great help in controlling access, for example, to a warehouse or other important premises. And, of course, tracking people and faces helps solve security problems. Caught someone stealing in your store? Add the PersonID that Vision returned to the black list of your video analytics software, and the next time this character appears, the system will immediately alert security.

2.4. In retail


Retail and various service businesses are interested in queue detection. With Vision, you can recognize that a group of people is not a random gathering but a queue, and determine its length. The system then informs the person in charge so they can sort out the situation: either it is an influx of visitors and extra staff need to be called in, or someone is slacking off on their duties.

Another interesting task is distinguishing the company's employees on the floor from visitors. Usually the system is trained to pick out people in certain clothing (a dress code) or with some distinctive feature (a branded scarf, a badge on the chest, and so on). This helps estimate attendance more accurately, so that employees do not inflate the visitor statistics by their mere presence.

With face recognition, you can also evaluate your audience: how loyal your visitors are, that is, how many people return to your establishment and how often, and how many unique visitors come per month. To optimize acquisition and retention costs, you can learn how attendance changes depending on the day of the week and even the time of day.

Franchisors and chain companies can order an assessment of the branding quality of their retail outlets from photographs: the presence of logos, signs, posters, banners, and so on.

2.5. In transportation


Another example of security through video analytics is detecting abandoned items in the halls of airports or train stations. Vision can be trained to recognize objects of hundreds of classes: furniture, bags, suitcases, umbrellas, various types of clothing, bottles, and so on. If your video analytics system detects an ownerless object and recognizes it with Vision, it signals the security service. A related task is the automatic detection of non-standard situations in public places: someone has fallen ill, someone is smoking in the wrong place, a person has fallen onto the rails, and so on. All these video analytics patterns can be recognized through the Vision API.

2.6. Document flow


Another interesting future application of Vision, which we are currently developing, is recognition of documents and their automatic parsing into databases. Instead of manually typing in (or, worse, retyping) endless series and numbers, issue dates, account numbers, bank details, dates and places of birth, and much other formalized data, you can scan documents and automatically send them over a secure channel through the API to the cloud, where the system recognizes the documents on the fly, parses them, and returns a response with the data in the correct format for automatic entry into a database. Today Vision can already classify documents (including PDFs): it distinguishes passports, SNILS, TIN, birth certificates, marriage certificates, and others.

Of course, no neural network can handle all these situations out of the box. In each case, a new model is built for a specific customer; many factors, nuances, and requirements are taken into account, data sets are selected, and iterations of training, testing, and tuning are carried out.

3. API operation diagram


Users interact with Vision through a REST API gateway. As input, it accepts photos, video files, and broadcasts from network cameras (RTSP streams).

To use Vision, you need to register with the Mail.ru Cloud Solutions service and obtain access tokens (client_id + client_secret). User authentication is performed using the OAuth protocol. The source data is sent to the API in the bodies of POST requests. In response, the client receives a recognition result from the API in JSON format, and the response is structured: it contains information about the objects found and their coordinates.
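For illustration, here is a minimal sketch of such a request in Python. The endpoint path, query parameters, and field names below are assumptions made for the example; the real contract is in the Vision documentation linked at the end of this article.

```python
import json
import requests

VISION_URL = "https://smarty.mail.ru/api/v1/persons/recognize"  # hypothetical endpoint
OAUTH_TOKEN = "..."  # obtained via client_id + client_secret

with open("conference_hall.jpg", "rb") as f:
    resp = requests.post(
        VISION_URL,
        params={"oauth_token": OAUTH_TOKEN},
        files={"file_0": f},
        # Meta block describing the uploaded images (assumed format).
        data={"meta": json.dumps({"space": "1", "images": [{"name": "file_0"}]})},
        timeout=30,
    )
resp.raise_for_status()
print(resp.json())  # structured result: objects found and their coordinates
```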



Response example
{ "status":200, "body":{ "objects":[ { "status":0, "name":"file_0" }, { "status":0, "name":"file_2", "persons":[ { "tag":"person9" "coord":[149,60,234,181], "confidence":0.9999, "awesomeness":0.45 }, { "tag":"person10" "coord":[159,70,224,171], "confidence":0.9998, "awesomeness":0.32 } ] } { "status":0, "name":"file_3", "persons":[ { "tag":"person11", "coord":[157,60,232,111], "aliases":["person12", "person13"] "confidence":0.9998, "awesomeness":0.32 } ] }, { "status":0, "name":"file_4", "persons":[ { "tag":"undefined" "coord":[147,50,222,121], "confidence":0.9997, "awesomeness":0.26 } ] } ], "aliases_changed":false }, "htmlencoded":false, "last_modified":0 } 


The response contains an interesting parameter, awesomeness: the conditional "coolness" of the face in the photo, which we use to pick the best shot of a face from a sequence. We trained the neural network to predict the likelihood that a picture will be liked on social networks. The better the shot and the more smiling the face, the greater the awesomeness.
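As a usage sketch, here is one way the awesomeness field could be applied to the response format shown above: keep, for each PersonID, the file where that face scored highest. The helper is illustrative and not part of the API.

```python
from collections import defaultdict

def best_shots(response: dict) -> dict:
    """Map each person tag to the (file name, awesomeness) of its best shot."""
    best = defaultdict(lambda: (None, -1.0))
    for obj in response["body"]["objects"]:
        for person in obj.get("persons", []):
            tag, score = person["tag"], person["awesomeness"]
            if score > best[tag][1]:
                best[tag] = (obj["name"], score)
    return dict(best)

# With the example response above, this yields e.g.:
# {'person9': ('file_2', 0.45), 'person10': ('file_2', 0.32), ...}
```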

The Vision API uses the notion of a space: a tool for creating different sets of faces. Examples of spaces are black and white lists and lists of visitors, employees, clients, and so on. For each token in Vision you can create up to 10 spaces, and each space can hold up to 50 thousand PersonIDs, that is, up to 500 thousand per token. The number of tokens per account is unlimited.

Today, the API supports the following detection and recognition methods:


We will also soon finish work on methods for OCR, determining gender, age, and emotions, and solving merchandising tasks, that is, automatically monitoring the display of goods in stores. Full API documentation can be found here: https://mcs.mail.ru/help/vision-api

4. Conclusion


Now, through the public API, you can get access to face recognition in photos and videos; detection of various objects, license plates, landmarks, documents, and entire scenes is also supported. There is a sea of application scenarios. Come test our service and set it the trickiest tasks you can. The first 5,000 transactions are free. Perhaps it will be the "missing ingredient" for your projects.

You can get access to the API instantly by registering and connecting Vision. All Habr users get a promo code for additional transactions: send me a private message with the e-mail address you used to register your account!

Source: https://habr.com/ru/post/449120/

