Computer vision: how AI is watching us

Recently we wrote about how audiences in cinemas are analyzed with computer vision: emotions, gestures, and so on. Today we are publishing a conversation with a colleague from Microsoft Research who builds this very kind of vision. Below the cut: how the technology has developed, a little about the GDPR, and where it is applied. Read on!

Gang Hua, Principal Researcher and Research Manager. Photo courtesy of Maryatt Photography.

Interview

If we look back ten or fifteen years, we see that there was more diversity in the computer vision community. To examine a problem from different sides and find a solution, people used a variety of machine learning methods and knowledge from various fields, such as physics and optics. We stress the importance of diversity in every area of activity, so I think the scientific community will benefit if we have more points of view.

We introduce you to advanced technology research and the scientists behind it.

From a technical point of view, computer vision experts "create algorithms and systems for automatically analyzing images and extracting information from the visible world." From the layperson's point of view, they create machines that can see. That is what Principal Researcher and Research Manager Dr. Gang Hua and his team of computer vision experts do. For devices such as personal robots, self-driving vehicles and drones, which we encounter more and more often in everyday life, vision is very important.

Today, Dr. Hua will tell us how recent advances in AI and machine learning have improved image recognition and video "understanding", and how they have contributed to the arts. He will also explain the distributed ensemble approach to active learning, in which people and machines work together in the lab to build computer vision systems that can see and recognize the open world. This and much more is in the new episode of the Microsoft Research podcast.

You are a principal researcher and research manager at MSR (Microsoft Research), and your specialty is computer vision.

Yes.

In general terms, why does a computer vision specialist get up in the morning? What is their main goal?

Computer vision is a relatively young area of research. In short, we are trying to build machines that can see the world and perceive it the way a person does. In more technical terms, the information that enters a computer as images and video can be represented as a sequence of numbers. We want to extract from these numbers structures that describe the world, some semantic information. For example, I can say that one part of the image corresponds to a cat and another part corresponds to a car; that is the kind of interpretation I mean. That is the goal of computer vision. It seems simple to people, but teaching computers to do it has taken a lot of work over the last 10 years. As a field of research, though, computer vision has existed for about 50 years, and we still have many problems to solve.

Yes. Five years ago you said the following, and I paraphrase: "Why, after 30 years of research, are we still working on the problem of facial recognition?" Tell us how you answered this question then and what has changed since.

If I answer from the perspective of five years ago, I would say that in the 30 years since research in computer vision and facial recognition began, we have achieved a lot. But for the most part that was in controlled environments, where you can adjust the lighting, the camera, the setting, and so on when capturing faces. Five years ago, when we began to work more in natural conditions, in uncontrolled environments, it turned out that there was a huge gap in recognition accuracy. Over the past five years, however, our community has made great progress by using more advanced deep learning methods. Even in face recognition in natural conditions, we have made progress and really have reached the point where these technologies can be used for various commercial purposes.

It turns out that deep learning has really allowed us to achieve great success in the areas of computer vision and image recognition over the past few years.

Right.

When we started talking about the difference between fully controlled and unpredictable environments, I remembered several scientists, guests of the podcast, who noted that computers fail when there is not enough data. For example, given the sequence "dog, dog, dog, dog with three legs", the computer begins to doubt whether the last one is also a dog.

Yes.

Isn't that right? So what exactly do deep learning methods let you do in recognition today that was out of reach before?

This is a great question. From a research perspective, deep learning offers several possibilities. First, it makes it possible to learn, end to end, the right representation for a semantic pattern. Take the dog again. Suppose we look at different photos of dogs, say images of 64 × 64 pixels, where each pixel can take about two hundred and fifty different values. If you think about it, that is a huge number of combinations. But if we treat a dog as a pattern in which the pixels correlate with each other, then the number of combinations that correspond to "dog" is much smaller.
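To get a feel for the "huge number of combinations" mentioned above, here is a minimal arithmetic sketch. It assumes 256 intensity levels per pixel (a common 8-bit grayscale convention, close to the "about two hundred and fifty" in the interview) and a single channel; the function name is made up for illustration.

```python
import math


def digits_in_image_space(width: int, height: int, levels: int) -> int:
    """Return how many decimal digits the number levels ** (width * height) has.

    Each of the width * height pixels independently takes one of `levels`
    values, so the number of distinct images is levels ** (width * height).
    """
    pixels = width * height
    return math.floor(pixels * math.log10(levels)) + 1


# Assumption: 64 x 64 grayscale image with 256 levels per pixel.
digits = digits_in_image_space(64, 64, 256)
print(f"The number of distinct 64x64 images has {digits} decimal digits.")
```

The count of all possible images runs to thousands of decimal digits, while only a tiny fraction of those images look like a dog. That gap is why learning the structure of the pattern matters more than enumerating pixels.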

With deep learning trained end to end, you can teach the system to find the right numerical representation of "dog". Because the structures are deep, we can build really complex models that can absorb a large amount of training data. So if my training data covers all the possible variants and appearances of a pattern, then in the end I will be able to recognize it in a wider context, because I have considered almost all possible combinations. That is the first point.

Another thing deep learning offers is a kind of compositional behavior. Deep networks have a layered structure and a layered representation, so when an image enters the network, the model starts by extracting low-level primitives and gradually assembles semantic structures of higher and higher complexity from them. Deep learning algorithms identify smaller patterns that belong to larger patterns and put them together to form the final pattern. That makes it a very powerful tool, especially for visual recognition tasks.

So the main topic of the CVPR conference is exactly this: pattern recognition in computer vision.

Yeah, right.

And pattern recognition is what technology really aspires to.

Yes, of course. In fact, the goal of computer vision is to find the meaning in pixels. From a technical point of view, the computer needs to understand what the image is, and we get a numerical or symbolic result from it. For example, a numerical result can be a 3D point cloud that describes the structure of a space or the shape of an object. The result can also be tied to semantic labels, such as "dog" or "cat", as I said earlier.

I see. So let's talk a little about labels. An interesting and important feature of the machine learning process is that the computer has to be given both pixels and labels.

Yes, of course.

You have named three things that interest you most in computer vision: video, faces, and art and multimedia. Let's talk about each of them separately, starting with your current research, what you call video "understanding".

Yes. The expression "video understanding" speaks for itself. We use video instead of images as input. It is important not only to recognize the pixels, but also to consider how they move. For computer vision, image recognition is a spatial problem. With video it becomes a spatio-temporal problem, because a third, temporal dimension appears. And if you look at many real-world tasks that involve streaming video, whether indoor surveillance cameras or traffic cameras on a highway, the point is that objects move through a continuous stream of frames. We need to extract information from that stream.
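As a rough illustration of the temporal dimension described above, the sketch below treats a video as a list of frames and flags pixels that change between consecutive frames. Frame differencing is a classic baseline chosen here for illustration only; it is not claimed to be the method Dr. Hua's team uses, and the function names and threshold are assumptions.

```python
Frame = list[list[int]]  # one grayscale frame: rows of pixel intensities


def motion_mask(prev: Frame, curr: Frame, threshold: int = 30) -> list[list[bool]]:
    """Mark pixels whose intensity changed by more than `threshold`."""
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def moving_pixel_counts(video: list[Frame], threshold: int = 30) -> list[int]:
    """Count changed pixels for every pair of consecutive frames."""
    counts = []
    for prev, curr in zip(video, video[1:]):
        mask = motion_mask(prev, curr, threshold)
        counts.append(sum(sum(row) for row in mask))
    return counts


# Toy video: a bright pixel moves one step to the right, then stays put.
toy_video = [
    [[0, 200, 0, 0]],
    [[0, 0, 200, 0]],
    [[0, 0, 200, 0]],
]
print(moving_pixel_counts(toy_video))
```

A single image gives only the spatial layout; the per-pair counts only exist because the frames are ordered in time.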

These cameras produce a huge amount of video: security cameras recording around the clock in supermarkets and the like. How can people benefit from these recordings?

My team is working on an incubation project in which we are building a foundational technology. In this project we are trying to analyze road traffic. Cities have installed a huge number of traffic cameras, but most of the video they record goes unused. These cameras could be helpful, though. Take one example: you want to control traffic lights more efficiently. Usually the switch between red and green follows a fixed schedule. But if I could see that far fewer vehicles are moving in one direction than in the others, then to optimize the flow I could keep the light green longer for the congested directions. This is just one of the applications.
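The traffic light idea above can be sketched as a simple proportional rule: each direction gets a guaranteed minimum green time, and the rest of the cycle is shared according to how many vehicles the cameras counted. This is only an illustrative assumption, not the project's actual controller; the function name, the 90-second cycle and the 10-second minimum are hypothetical.

```python
def allocate_green_seconds(
    vehicle_counts: dict[str, int],
    cycle_seconds: float = 90.0,
    min_green: float = 10.0,
) -> dict[str, float]:
    """Split one signal cycle among directions in proportion to traffic.

    Every direction receives `min_green` seconds; the remaining time is
    divided according to each direction's share of the counted vehicles.
    """
    if not vehicle_counts:
        return {}
    spare = cycle_seconds - min_green * len(vehicle_counts)
    if spare < 0:
        raise ValueError("cycle is too short for the minimum green times")
    total = sum(vehicle_counts.values())
    if total == 0:
        # No traffic observed: fall back to an even split.
        even = cycle_seconds / len(vehicle_counts)
        return {direction: even for direction in vehicle_counts}
    return {
        direction: min_green + spare * count / total
        for direction, count in vehicle_counts.items()
    }


# Hypothetical counts from one camera over the last cycle.
plan = allocate_green_seconds({"north_south": 30, "east_west": 10})
print(plan)
```

The minimum green time keeps a quiet direction from being starved entirely, which a purely proportional split would allow.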

Please make this happen!

We will try!

Who among us has not waited at a red light while almost nobody drove through the green in the other direction?

That's it!

And you ask yourself: why do I have to wait?

I agree. This technology can also be applied in other cases, for example, when we have accumulated large archives of video. Suppose citizens asked for additional bike lanes. We could use video footage, analyze traffic data, and then decide whether to make a bike lane in this place. By implementing this technology, we could significantly affect traffic flows and help cities make such decisions.

I think this is a great idea, because in most cases we make such decisions based on our own impressions rather than on data that would let us say: "Hey, you know, a bike lane would be just right here. And over there it would only get in the way of traffic."

Exactly. Sometimes other sensors are used for this. Cities hire a company that installs special equipment on the roads, but that is not cost-effective. The traffic cameras, meanwhile, are already installed and just sitting there. The video streams are already available, right? So why not take advantage of them?

I agree. This is a great example of how machine learning and video "understanding" can be applied.

Exactly.

So, another important application is face recognition. We again return to the question "Why are we still working on the problem of facial recognition?".

Exactly.

By the way, such technologies in some cases can be applied in a very interesting way. Tell us what is happening in the field of facial recognition. Who does this and what's new?

Looking back, Microsoft was already studying facial recognition when I was working at Live Labs Research. We created the first face recognition library that various product teams could use. The technology was first used in the Xbox, where the developers tried to use facial recognition for automatic login. I think that was the first time. Over time, the center of facial recognition research shifted to Microsoft Research Asia, where we still have a group of researchers I work with.

We are constantly trying to push the boundaries of what is possible. We now work together with engineering teams that help us collect more data, and we train more advanced models on that data. Recently we have focused on a research direction that we call identity-preserving face synthesis. The deep learning community has also had great success here. They use deep networks to train generative models that capture the distribution of images so that you can sample from it, that is, actually synthesize an image. So you can build deep networks that create images.

But we want to go one step further. We want to synthesize faces and, at the same time, preserve the identity of those faces. Our algorithms should not just create an arbitrary set of faces without any semantic meaning. Suppose we want to recreate the face of Brad Pitt. We need to create a face that really looks like him. If I need to recreate the face of a person I know, the result must be accurate.

So you want to preserve the identity of the face you are trying to recreate?

Right.

By the way, I wonder whether this technology will keep working over time as a person ages, or will you have to keep updating the database of faces?

This is a very good question. We are currently doing research to solve this problem. At the current level of technology, the database still has to be updated from time to time, especially if the face has changed a lot. For example, if someone has had plastic surgery, current systems will not produce the correct result.

Wait, it's not you.

Yes, nothing alike. This question can be approached from several sides. Human faces do not actually change very much between the ages of 17 or 18 and about 50. But what about right after birth? Children's faces change a great deal because the bones grow and the shape of the face and the skin change. Once a person grows up and reaches maturity, changes happen very slowly. We are now doing research in which we develop models of the aging process. They will help build face recognition systems that cope better with aging. This is in fact a very useful technology that could be applied in law enforcement, for example, to recognize children who were abducted many years ago and who ...

Look very different.

Yes, they look different. If clever face recognition algorithms could look at the original photo ...

And say what they would look like at the age of 14 if they had been abducted much earlier, or something like that?

Yes, yes, exactly.

That is a great application. Let's talk about another area you are actively exploring: multimedia and art. Tell us how science intersects with art, and especially about your work on deep artistic style transfer.

Good. Take a look at people's needs. First of all, we need food, water and sleep, right? After the basic needs are satisfied, the person manifests a strong desire for art ...

And the desire to create.

And to create art. In this area of research, we want to connect computer vision with multimedia and art. We can use computer vision to bring people artistic enjoyment. In a research project we have been working on for the last two years, we created a series of algorithms that can render an image in any artistic style, provided samples of that style are available. For example, we can create an image in the style of Van Gogh.

Van Gogh?

Yes, or any other artist ...

Renoir or Monet ... or Picasso.

Yes, any of them. Anyone you can remember ...

Interesting. Using pixels?

Yes, using pixels. This is also created by deep networks using some of the deep learning technologies that we have developed.

It seems that this study requires knowledge from a variety of areas. Where do you find professionals who can ...

I would say that in a sense, our goal is to ... You know, works of art are not always accessible to everyone. Some of the artwork is really very expensive. With the help of such digital technologies, we are trying to make such works accessible to ordinary people.

Democratize them.

Yes, democratize art, as you say.

It is impressive.

Our algorithm allows you to create a clear numerical model of each style. And we can even mix them if we want to create new styles. This is reminiscent of the creation of an artistic space where we can explore intermediate options and see how the techniques change when moving from one artist to another. And we can even take a deeper look and try to understand what exactly determines the style of an artist.
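Mixing styles, as described above, can be pictured as interpolating between two numerical style descriptors. The sketch below assumes each style has already been reduced to a fixed-length vector of numbers. How such a vector is obtained is not specified in the interview, so the vectors and the function name here are purely hypothetical.

```python
def blend_styles(style_a: list[float], style_b: list[float], weight: float) -> list[float]:
    """Return a convex combination of two style vectors.

    weight = 0.0 gives style_a, weight = 1.0 gives style_b, and values
    in between give intermediate styles along the line joining them.
    """
    if len(style_a) != len(style_b):
        raise ValueError("style vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return [(1.0 - weight) * a + weight * b for a, b in zip(style_a, style_b)]


# Hypothetical two-number descriptors for two artists' styles.
style_one = [0.0, 2.0]
style_two = [4.0, 0.0]
print(blend_styles(style_one, style_two, 0.25))
```

Sweeping the weight from 0 to 1 walks through the "intermediate options" between the two styles. A real system would then feed each blended descriptor to the network that renders the image.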

What particularly interests me is that, on the one hand, we are talking about working with numbers: computer science, algorithms, and mathematics. On the other hand, art is a much more metaphysical category. And yet you have combined them, which shows that a scientist's brain can have an artistic side.

Exactly. I think that the most important tool we use that helped put everything together is statistics.

Interesting.

All kinds of machine learning algorithms really just gather statistics about pixels.

[The rest of the conversation did not survive in this copy. The remaining fragments mention Amazon Mechanical Turk, the NIH and co-robots, Live Labs (2006-2009), Nokia Research, IBM Research, Microsoft Research (2015, 2017), the GDPR, and Microsoft.com/research.]

Source: https://habr.com/ru/post/418251/
