The ImageNet competition, held in October 2012, was devoted to the classification of objects in photographs: competitors had to recognize images across 1,000 categories.
Hinton's team won using deep learning and convolutional neural networks, along with the infrastructure built at Google under the guidance of Jeff Dean and Andrew Ng. In March 2013, Google invested in Hinton's startup, based at the University of Toronto, thereby acquiring all rights to the technology. Within six months, the photo search service photos.google.com was launched.
The service uses convolutional neural networks, originally developed by Professor Yann LeCun in the late 1990s. Even then, the technology could reliably solve handwriting recognition problems. Since then, computing power has grown substantially, and new algorithms for large-scale training of neural networks have emerged.
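To make the key idea concrete, here is a minimal sketch of the convolution operation that gives these networks their local connectivity and weight sharing. The filter and sizes are purely illustrative assumptions, not taken from LeCun's original architecture.

```python
# A minimal sketch of the core idea behind convolutional networks:
# a small filter with shared weights slides across the image, so each
# output unit is connected only to a local patch of the input.
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation) of a grayscale image."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output depends only on a local kH x kW patch,
            # and the same kernel weights are reused at every position.
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.random.rand(28, 28)          # e.g. a handwritten digit
edge_filter = np.array([[1.0, -1.0],    # toy vertical-edge detector
                        [1.0, -1.0]])
feature_map = conv2d(image, edge_filter)
print(feature_map.shape)                # (27, 27)
```

This weight sharing is exactly what makes such networks both data-efficient for images and, as described below, amenable to parallelization.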
As for the technical infrastructure, I partially described it in the article Building High-Level Features Using Large-Scale Unsupervised Learning. For a detailed description, see the article (pdf); here I will limit myself to a few numbers. Thanks to the locally connected networks characteristic of two-dimensional image processing, up to 32 machines with 16 cores each, 512 cores in total, can be used effectively to train a single large neural network. And thanks to distributed algorithms for optimizing and replicating the model parameters, the number of effectively cooperating parallel processor cores can be increased to tens of thousands!
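As a rough illustration of that distributed scheme, here is a single-process sketch of asynchronous SGD with a central parameter server, in the spirit of the "Downpour SGD" approach from Google's distributed training work. All class and function names here are hypothetical, and a real deployment would run the workers on separate machines communicating over a network.

```python
# Simplified sketch of asynchronous SGD with a parameter server.
# Workers pull a (possibly stale) replica of the weights, compute a
# gradient on their own data shard, and push it back asynchronously.
import numpy as np

class ParameterServer:
    """Holds the master copy of the weights; workers push gradients."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self):
        return self.w.copy()            # worker fetches a replica

    def push(self, grad):
        self.w -= self.lr * grad        # apply the update immediately

def worker_step(server, x_batch, y_batch):
    # Each worker trains on its own shard with a stale replica;
    # a toy linear model stands in for the real network.
    w = server.pull()
    pred = x_batch @ w
    grad = x_batch.T @ (pred - y_batch) / len(y_batch)
    server.push(grad)

server = ParameterServer(dim=10)
rng = np.random.default_rng(0)
for _ in range(100):                    # sequential stand-in for many
    x = rng.normal(size=(32, 10))       # workers running in parallel
    y = x @ np.ones(10)
    worker_step(server, x, y)
print(server.w.round(2))                # converges toward all-ones
```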
In particular, 16 million images of 100x100 pixels were used to train the network that won the ImageNet competition. The output layer of the network consisted of 21,000 "one-vs-all" logistic classifiers. The total number of optimized parameters (the network's weights) was 1.7 billion. Training ran on 81 machines, nearly 1,300 cores in total.
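For intuition, here is a toy sketch of such an output layer: each class gets its own independent logistic (sigmoid) unit rather than competing in a softmax, so scores need not sum to one and an image can activate several classes at once. The feature dimension below is an assumption for illustration only.

```python
# One-vs-all logistic output layer: 21,000 independent sigmoid
# classifiers applied to the feature vector from the last hidden layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features, n_classes = 512, 21000       # feature size is illustrative
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (n_features, n_classes))
b = np.zeros(n_classes)

features = rng.normal(size=n_features)   # activations for one image
scores = sigmoid(features @ W + b)       # 21,000 independent probabilities
top5 = np.argsort(scores)[-5:][::-1]     # most likely class labels
print(top5, scores[top5].round(3))
```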
Putting into production the academic technology Google acquired less than a year ago made it possible, in a remarkably short time, to build an unmatched search service for unlabeled images. Here are some interesting results:
Generalization
Despite significant differences between the training and test images, the search engine generalizes well. For example, the concept of "flower" might be learned from macro photographs with ideal composition, a single flower centered in the frame; the trained network then finds flowers in amateur photos of arbitrary composition and scale.
A flower image from the training set
An image in which the system found flowers
Multimodal Classes
The network can recognize classes whose instances differ significantly in appearance. For example, it assigns both exterior shots and interior shots of a car to the class "car". This is all the more surprising given that the output layer consists, in effect, of linear classifiers separating the multidimensional feature space.
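A toy sketch of why this works: the final decision is just a hyperplane in feature space, so a class with visually dissimilar modes is separable only because the learned nonlinear feature map pulls those modes close together. The feature extractor below is a made-up stand-in for the trained network, not its actual computation.

```python
# A linear "car" classifier operating on nonlinear features: two inputs
# that look nothing alike land on the same side of the hyperplane
# because the feature map brings them together.
import numpy as np

def deep_features(x):
    # Hypothetical nonlinear feature extractor standing in for the
    # trained convolutional network.
    return np.tanh(x @ np.ones((2, 2)))

w = np.array([1.0, 1.0])                 # hyperplane normal
b = -1.0                                 # bias

exterior = np.array([3.0, 3.0])          # two very different-looking
interior = np.array([-0.2, 2.5])         # inputs of the same class

for x in (exterior, interior):
    score = deep_features(x) @ w + b     # linear in feature space
    print("car" if score > 0 else "not car", round(float(score), 2))
```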
Classification of abstract concepts
The system also handles abstract or highly general classes such as "dance", "kiss", and "food". This is interesting because, for such concepts, simple visual features like color, texture, or shape are not obvious.
Food was found in these images.
Meaningful mistakes
Unlike many computer vision systems, when this one errs, its mistakes seem quite reasonable, the kind a human could well make; see, for example, the misclassification of a mollusk as a snake and a donkey as a dog.
Banana slug mistakenly recognized as a snake
Donkey mistakenly recognized as a dog
Recognition of highly specialized classes
The system can recognize very specific classes, such as species of flowers (hibiscus, etc.). For a system capable of recognizing concepts as broad as "dawn", classifying such subtle attributes is astonishing.
The system determined that this is a polar bear ...
... and this is a grizzly bear