This is a level-2 article (see the scale proposed below). It is a logical continuation of my story about convolutional neural networks and their applications to image recognition.
Before proceeding, I want to explain what people in the field of Machine Learning are doing and what their global goal is. The global goal is the enslavement of all humans by machines, and also the creation of methods and algorithms capable of building complex, nonlinear models of the external world through training. As an illustration, I suggest looking at the picture, gratefully borrowed from [1]. Humanity can already create algorithms that learn simple operations, but what about a transformation like this: we have an image of a seated person, which is essentially a raw vector of brightness values at each point of the picture, and we need to gradually raise the level of abstraction of this raw data until we can conclude that "the person is sitting". Hence the main question itself:
How do we create a system capable not only of understanding simple (albeit nonlinear) dependencies, but also of learning complex, multidimensional, multi-level hierarchies of representations of the real world? It is worth noting here that some hard-to-formalize tasks are successfully solved by so-called hard-wired methods. But if the same problem is solved at the same level of quality, and with comparable resources, by a system that acquired the knowledge itself during training, that result is valued more.
Returning to the topic of CNNs, let me remind you that they are successfully used in various recognition tasks, not only on images but also on time sequences (speech), and with certain modifications they can not only classify but also generate and evaluate. The main features that distinguish convolutional neural networks from all the others are the artificial restriction imposed on the weights, the layer-by-layer downsampling of the input, and local receptive fields. The layer-by-layer downsampling of the feature maps is just a trick to provide scale invariance, and local perception is a good but rather old idea (the neocognitron, dating back to the 1970s), but the artificial restriction, in other words the mechanism of shared weights, is the interesting part. A number of the founding fathers of neural networks (Hinton and his followers) have acknowledged that training a network with more than 3 layers by ordinary back-propagation still rarely succeeds [1], with CNNs being the exception. And the reason cited for this exclusivity is precisely the weight-sharing property.
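To make the weight-sharing idea concrete, here is a minimal sketch in NumPy (not any particular library's implementation): one small kernel slides over the whole image, so every output position reuses the same handful of weights, instead of each output having its own full set of connections as in a fully connected layer. The image and kernel sizes are arbitrary illustrative choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image (valid convolution).

    Every output position is computed with the SAME weights, so a
    k x k kernel contributes only k*k parameters regardless of the
    image size.
    """
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)
kernel = np.random.rand(3, 3)           # only 9 shared weights
feature_map = conv2d(image, kernel)     # 26 x 26 feature map

shared_params = kernel.size                          # 9
dense_params = image.size * feature_map.size         # 784 * 676 = 529984
print(shared_params, dense_params)
```

The parameter count is the whole point: a dense layer mapping the same input to the same output would need over half a million weights, while the convolutional version gets by with nine.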
The idea of shared weights is very close to the idea of sparse features [2], whose meaning is as follows. We have some learning system that receives raw data at the input and outputs a representation of it, i.e., generates feature maps. The idea of sparse features is that during training the system should be forced to use as few outputs as possible when representing the input data. This pressure is usually applied by introducing a so-called sparsity penalty. Here I personally see a great analogy with how our brain works. Amid all the vast diversity of the world, a developing person (a child) would go mad trying to memorize everything, so naturally one needs to memorize the features that are most common across the phenomena and objects around us. Thus a kind of dictionary of images is built in the head, which we then use to understand the rest of the world, and each level of the hierarchy has its own dictionary. For example: you are shown a green apple and told that it is a green apple, and then shown two apples and told that these are two apples. From these two situations the brain extracts what was common to both and understands what an apple is. In the same way, introducing a penalty for using too many outputs to encode the data leads the system to learn images that are as general as possible across all inputs.
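As a hedged sketch of what a sparsity penalty looks like in practice (one common choice is an L1 penalty on the activations; the toy encoder, its sizes, and the weight `lam` here are illustrative assumptions, not taken from [2]):

```python
import numpy as np

def sparsity_penalty(activations, lam=0.1):
    """L1 sparsity penalty: the cost grows with every non-zero
    output the encoder uses, pushing most activations toward zero."""
    return lam * np.abs(activations).sum()

# Toy encoder: one linear layer followed by ReLU (illustrative only).
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8)) * 0.1   # 8 inputs -> 16 features
x = rng.standard_normal(8)               # one raw input vector
h = np.maximum(0.0, W @ x)               # feature vector (the "outputs")

reconstruction_error = 0.0               # stands in for the usual data term
total_loss = reconstruction_error + sparsity_penalty(h)
print(total_loss)
```

Minimizing `total_loss` trades reconstruction quality against the number and magnitude of active outputs, which is exactly the "use as few outputs as possible" pressure described above.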
Now, I think, it is clearer why convolutional neural networks artificially limit the number of weights used to process the input data.
I think this is enough for now; I will continue this topic in the following articles.
I also propose a small innovation. Artificial intelligence is a rather interesting topic for a wide range of people, read by everyone from housewives to scientists working in the field professionally, so let us introduce a 1-to-5 scale of article accessibility. It could also be added as a tag: for example, AI1 for the beginner level and AI5 for the professional level. Then it would be easier for beginners to move from overviews to details. Naturally, this does not negate the fact that even level-5 articles should be written "so that my grandmother would understand" (c).
[1] Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, vol. 2, no. 1 (2009), pp. 1-127.
[2] M. Ranzato, Y. LeCun, "A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images", International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, 2007.