
Results of the development of computer vision in one year

Part one. Classification / localization, object detection and object tracking

This excerpt comes from a recent publication compiled by our research team in the field of computer vision. In the coming months we will publish papers on various research topics in Artificial Intelligence, covering its economic, technological, and social applications, with the goal of providing educational resources for those who want to learn more about this amazing technology and its current state. Our project hopes to contribute to the growing body of work that keeps all researchers informed about the latest AI developments.

Introduction


Computer vision is usually defined as the scientific discipline that gives machines the ability to see, or more colorfully, that allows machines to visually analyze their environment and the stimuli within it. This process usually involves the evaluation of one or more images or videos. The British Machine Vision Association (BMVA) defines computer vision as "the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images."

The word understanding stands in interesting contrast to the otherwise mechanical definition of vision, and it also demonstrates both the significance and the complexity of the field. True understanding of our environment is not achieved through visual representation alone. In fact, visual signals travel through the optic nerve to the primary visual cortex and are interpreted by the brain in a highly stylised way. The interpretation of this sensory information draws on almost the entire sum of our innate programming and subjective experience, that is, on how evolution has wired us to survive and on what we have learned about the world over our lifetimes.

In this respect, vision refers only to the transmission of images for interpretation, while computing over those images is closer to thought or cognition, drawing on a multitude of the brain's faculties. For this reason, many believe that computer vision, as a true understanding of visual environments and their context, paves the way for future iterations of Strong Artificial Intelligence through its mastery of cross-domain tasks.
But put the pitchforks away, because we have barely left the embryonic stage of this fascinating field. This article is simply meant to shed some light on the most significant computer vision achievements of 2016. Where possible, we also try to place these achievements within a sensible mix of expected near-term social applications and, where appropriate, hypothetical predictions of the end of life as we know it.

Although we always try to write as accessibly as possible, some sections of this particular article may seem unclear because of the subject matter. We offer basic definitions throughout, but these convey only a surface-level understanding of the key concepts. In focusing on the work of 2016, we also frequently omit material for the sake of brevity.

One such obvious omission concerns the workings of convolutional neural networks (CNNs), which are used throughout computer vision. The 2012 success of AlexNet, a CNN architecture that stunned competitors in the ImageNet competition, marked the de facto revolution in the field. Numerous researchers subsequently adopted CNN-based systems, and convolutional neural networks have become an established technology in computer vision.

More than four years later, CNN variants still make up the bulk of new neural network architectures for computer vision tasks, and researchers recombine them like building blocks. This is real proof of the power of both open-source publishing and deep learning. However, explaining convolutional neural networks could easily fill several articles of its own, so it is better left to those with deeper expertise in the subject and a desire to make complex things understandable.

For casual readers who want a quick grounding in the topic before continuing with this article, we recommend the first two sources below. For those who want to dive deeper, we also list additional sources:


For a more complete understanding of neural networks and deep learning in general, we recommend:


This article ranges broadly from topic to topic, reflecting the authors' own fascinations and the way the piece is meant to be consumed, section by section. The information is partitioned according to our own heuristics and judgments, a necessary compromise given the cross-domain influence of so much of the work covered.

We hope that readers benefit from our summary of this information and that it helps them advance their own knowledge, regardless of prior background.

On behalf of all participants
The M Tank

Classification / localization


Image classification usually refers to assigning a label to an entire image, for example "cat". Building on this, localization means determining where the object is in the image, usually indicated by a bounding box drawn around it. Current classification methods on ImageNet already surpass groups of specially trained humans in object classification accuracy.
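Localization quality is commonly scored by intersection over union (IoU) between the predicted and ground-truth bounding boxes; in benchmarks like ILSVRC, a predicted box typically counts as correct when its IoU with the ground truth exceeds 0.5. A minimal sketch in plain Python (the corner-coordinate box format `(x1, y1, x2, y2)` is our assumption for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For instance, two 10x10 boxes shifted by half their width overlap on a 5x10 strip, giving an IoU of 50 / 150, or one third.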

Fig. 1: Computer vision tasks

Source: Fei-Fei Li, Andrej Karpathy & Justin Johnson (2016) cs231n, lecture 8, slide 8, spatial localization and detection (01/02/2016), pdf

However, increasing the number of classes is likely to provide new metrics for measuring progress in the near future. In particular, Francois Chollet, the creator of Keras, has applied new techniques, including the popular Xception architecture, to Google's internal dataset of more than 350 million multi-label images spanning 17,000 classes.

Fig. 2: Classification / localization results from the ILSVRC competition (2010–2016)


Note : ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The improvement in results after 2011–2012 is due to the advent of AlexNet. See a review of competition requirements regarding classification and localization.
Source: Jia Deng (2016). ILSVRC2016 object localization: introduction, results. Slide 2, pdf

Interesting excerpts from ImageNet LSVRC (2016):


Object detection


As the name suggests, object detection does exactly that: it detects objects in images. The ILSVRC 2016 definition of object detection requires outputting bounding boxes and labels for individual objects. This differs from the classification / localization task in that classification and localization are applied to many objects rather than to a single dominant one.

Fig. 3: Object detection where face is the only class

Note: The picture shows face detection as an example of single-class object detection. The authors cite the detection of small objects as one of the persistent problems in the field. Using small faces as a test case, they explored the roles of scale invariance, image resolution, and contextual reasoning.
Source: Hu, Ramanan (2016, p. 1)
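In practice, a detector proposes many overlapping candidate boxes around each object, so virtually all detection pipelines apply non-maximum suppression (NMS) to keep only the highest-scoring box in each overlapping cluster. A simplified greedy sketch (the box format and the 0.5 overlap threshold are illustrative assumptions, not taken from any specific paper):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression.

    boxes:  list of (x1, y1, x2, y2) tuples
    scores: list of confidence scores, one per box
    Returns indices of the boxes that survive suppression.
    """
    def iou(a, b):
        # Intersection over union of two axis-aligned boxes.
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes that overlap the kept one too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Given two nearly coincident boxes and one far-away box, only the higher-scoring of the overlapping pair survives along with the distant box.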

One of the main trends in object detection in 2016 was the shift toward faster and more efficient detection systems. This is visible in approaches such as YOLO, SSD and R-FCN, which move toward performing computation jointly over the entire image. In this they differ from the resource-intensive subnetworks of the Fast/Faster R-CNN techniques. This approach is commonly referred to as end-to-end training / learning.

The idea, in essence, is to avoid dedicating separate algorithms to each sub-problem in isolation, since this usually increases training time and reduces network accuracy. Such end-to-end adaptation of networks is said to usually take place after the initial subnetworks have been trained, making it a retrospective optimization. Nevertheless, the Fast/Faster R-CNN techniques remain highly effective and are still widely used for object detection.
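The single-pass idea behind YOLO-style detectors can be illustrated with a toy decoder: the network's output is a fixed grid in which each cell predicts one box (as offsets relative to the cell) plus class scores, and the whole map is decoded in a single sweep with no separate proposal stage. This is a heavily simplified sketch of the general scheme, with a made-up parameterization rather than any particular paper's:

```python
import numpy as np

def decode_grid(preds, img_size=448, conf_threshold=0.5):
    """Decode a toy single-shot detector output map in one pass.

    preds: array of shape (S, S, 5 + C) where each grid cell stores
           (x_offset, y_offset, w, h, objectness, class scores...),
           with offsets and sizes normalized to [0, 1].
    Returns a list of (x_center, y_center, w, h, class_id, confidence).
    """
    S = preds.shape[0]
    cell = img_size / S  # pixel size of one grid cell
    detections = []
    for row in range(S):
        for col in range(S):
            x_off, y_off, w, h, obj = preds[row, col, :5]
            class_scores = preds[row, col, 5:]
            cls = int(np.argmax(class_scores))
            # Confidence = objectness times the best class probability.
            conf = obj * class_scores[cls]
            if conf >= conf_threshold:
                # Box center is an offset within its cell; width and
                # height are fractions of the whole image.
                x = (col + x_off) * cell
                y = (row + y_off) * cell
                detections.append(
                    (x, y, w * img_size, h * img_size, cls, float(conf)))
    return detections
```

Because every cell is decoded independently from one forward pass, detection cost is fixed by the grid size rather than by the number of region proposals, which is the efficiency the end-to-end detectors above exploit.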

Source: https://habr.com/ru/post/346140/

