
Meta-clustering with error minimization, and why I think the brain works that way

Hello everyone! I want to share my idea about machine learning.

The great achievements in machine learning are impressive. Convolutional networks and LSTMs are cool. But almost all modern techniques are based on backpropagation of error, and it is unlikely that a thinking machine can be built on that method alone. Such neural networks are something like a frozen brain: trained once and for all, and unable to change afterwards.

So I thought: why not try to create something like a living brain? A sort of reverse engineering. Since in all animals, despite differences in intelligence, the brain consists of roughly identical neurons, some basic principle must lie at the heart of its work.

What I do not know about neurons


There are several questions to which I have not found definite answers in the popular literature:

Just clustering


The answer that seems most plausible to me is that the brain works like a multitude of simple clusterizers. Can such an algorithm, for example K-means, be carried out by a group of neurons? Quite possibly; you just need to simplify it a bit. In the classical algorithm the centers are recomputed iteratively as the average of all the examples considered so far, but here we will shift the center immediately after each example.

Let's see what we need to implement the clustering algorithm.


Let's check the resulting algorithm in practice. I scribbled a few lines of Python. This is what happens with two dimensions of random numbers:
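Roughly, such an online clusterizer might look like this in Python (a minimal sketch; the class name, the learning rate, and the 2D random-point demo are my assumptions, not the author's original code):

import numpy as np

class OnlineKMeans:
    """Simplified K-means: the winning center shifts right after each example."""

    def __init__(self, n_clusters, n_features, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = rng.random((n_clusters, n_features))
        self.lr = lr  # how far the center moves toward each example

    def step(self, x):
        # Find the nearest center ...
        distances = np.linalg.norm(self.centers - x, axis=1)
        winner = int(np.argmin(distances))
        # ... and shift it toward the example immediately.
        self.centers[winner] += self.lr * (x - self.centers[winner])
        return winner

# Two dimensions, random input points:
rng = np.random.default_rng(1)
km = OnlineKMeans(n_clusters=4, n_features=2)
for _ in range(10000):
    km.step(rng.random(2))
print(km.centers)  # four centers spread over the unit square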


And here is MNIST:


At first glance it seems that none of the above has changed anything: we had some data at the input, we transformed it somehow, and we got other data.

But in fact there is a difference. Before the transformation we had a bunch of analog parameters; after it we have just one parameter, encoded as a one-hot vector. Each neuron in the group can now be associated with a specific action.
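In code terms, the output of a clustering group is simply the index of the winning neuron turned into a one-hot vector (a hypothetical helper, just for illustration):

import numpy as np

def one_hot(winner, n_clusters):
    # The whole analog input collapses into a single active "neuron".
    out = np.zeros(n_clusters)
    out[winner] = 1.0
    return out

print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]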

An example: suppose there are only two neurons in a clustering group. Let's call them "TASTY" and "SCARY". For the brain to make a decision, it only needs to connect the neuron "EAT" to the first one and "RUN" to the second. For that we need a teacher, but supervised learning is a topic for another article.

If you increase the number of clusters, the accuracy gradually increases; the extreme case is a number of clusters equal to the number of examples. But there is a problem: the number of neurons in the brain is limited, so you constantly have to compromise between accuracy and brain size.

Meta-clustering


Suppose we have not one clustering group but two, and we feed the same values to both inputs. Obviously, we get the same result.

Now let's introduce a small random error: sometimes each clusterizer selects not the closest cluster center but some other one. The values will begin to differ, and over time the difference accumulates.


Now let's calculate the error for each clusterizer: the difference between the input example and the center of the selected cluster. If one clusterizer chooses the nearest center and the other a random one, the error of the second will be greater.

Going further, let's add a mask to the input of each clusterizer. A mask is a set of coefficients, one per input: not just zero or one, as masks usually are, but a real number between zero and one.

Before feeding an example to a clusterizer, we multiply it by the mask. If the mask is applied to a picture, then a pixel whose mask value is one passes through as if fully transparent, a pixel whose mask value is zero is always black, and a pixel whose mask value is 1/2 is darkened by half.

And now the main step: we reduce the mask values in proportion to the clusterizer's error. If the error is large, the value is reduced strongly; if the error is zero, it is not reduced at all.

To keep the mask values from gradually going to zero, we normalize them: the sum of the mask values for each input parameter is always one. Whatever is taken away from one mask is added to another.
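Putting the last few paragraphs together, one possible reading of the whole meta-clustering step looks like this (a sketch under my own assumptions: the per-parameter form of the error used to shrink the mask, the rates, and the class name are mine, not taken from the author's code):

import numpy as np

class MetaClustering:
    """Several clusterizers, each with its own mask over the input parameters."""

    def __init__(self, n_groups, n_clusters, n_features,
                 lr=0.05, mask_lr=0.01, p_random=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.rng = rng
        self.centers = rng.random((n_groups, n_clusters, n_features))
        # Masks start equal and always sum to one per input parameter.
        self.masks = np.full((n_groups, n_features), 1.0 / n_groups)
        self.lr, self.mask_lr, self.p_random = lr, mask_lr, p_random

    def step(self, x):
        winners = []
        for g in range(len(self.masks)):
            xm = x * self.masks[g]                     # multiply the example by the mask
            d = np.linalg.norm(self.centers[g] - xm, axis=1)
            if self.rng.random() < self.p_random:      # occasionally pick a random cluster
                w = int(self.rng.integers(len(d)))
            else:
                w = int(np.argmin(d))
            err = np.abs(xm - self.centers[g, w])      # per-parameter error
            self.centers[g, w] += self.lr * (xm - self.centers[g, w])
            self.masks[g] *= 1.0 - self.mask_lr * err  # shrink the mask where the error is big
            winners.append(w)
        # Normalize: the masks for each input parameter sum to one,
        # so whatever one clusterizer gives up, another picks up.
        self.masks /= self.masks.sum(axis=0, keepdims=True)
        return winners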

Let's see what happens on MNIST. The masks gradually divide the pixels into two parts.


The right side of the picture shows the resulting masks. At the end of the process the top clusterizer looks only at the bottom-right part of the submitted examples, while the bottom clusterizer considers the rest. Interestingly, if we rerun the process we get a different separation, but the groups of parameters are never arbitrary: they form so as to reduce the prediction error. Each clusterizer, as it were, tries its mask on each pixel, and at the same time each pixel goes to the clusterizer it fits better.

Let's try feeding the input pairs of digits, not superimposed on each other but placed side by side; here are two ones (this is one example, not two):


Now we see that the separation is the same every time. That is, if there is a single clearly best way to separate the masks, it will be selected.


Only one thing remains random: whether the first mask picks the left digit or the right one.

I call the resulting masks meta-clusters, and the process of forming them meta-clustering. Why meta? Because what is clustered is not the input examples but the inputs (parameters) themselves.

A more complicated example: let's try to divide 25 parameters into 5 meta-clusters.

To do this, take five groups of five parameters, encoded with a one-hot code.

That is, each group has one and only one 1, in a random position; every example therefore contains exactly five 1s.
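Such input data can be generated, for example, like this (my own sketch of a data generator, not the author's):

import numpy as np

def make_example(n_groups=5, group_size=5, rng=None):
    # One randomly placed 1 per group, so every example has exactly five 1s.
    if rng is None:
        rng = np.random.default_rng()
    x = np.zeros(n_groups * group_size)
    for g in range(n_groups):
        x[g * group_size + rng.integers(group_size)] = 1.0
    return x

print(make_example().reshape(5, 5))  # one 1 per row (per group)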

In the pictures below, each column is an input parameter, and each row is a meta-cluster mask. The clusters themselves are not shown.


100 parameters and 10 meta-clusters:


It works! In places it even looks a bit like the digital rain from The Matrix.

Meta-clustering can drastically reduce the number of clusters needed.

For example, take ten groups of ten parameters, with one 1 in each group.

If we have a single clusterizer (no meta-clusters), then we need 10^10 = 10,000,000,000 clusters to get zero error, one for each possible input.

And if we have ten clusterizers, we need only 10 * 10 = 100 clusters. This is like the decimal number system: there is no need to invent a symbol for every possible number, ten digits are enough.
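The arithmetic behind this comparison (my own back-of-the-envelope count, following the text):

n_groups, group_size = 10, 10

distinct_inputs = group_size ** n_groups   # 10**10 possible examples
one_clusterizer = distinct_inputs          # one cluster per distinct input
ten_clusterizers = n_groups * group_size   # ten clusters in each of ten groups

print(one_clusterizer)    # 10000000000
print(ten_clusterizers)   # 100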

Meta-clustering parallelizes very well. The most costly computation (comparing the example with a cluster center) can be performed independently for each cluster. Note: for each cluster, not for each clusterizer.

How it works in the brain


Up to now I have been talking only about dendrites, but neurons also have axons, and they learn too. It seems to me that the axons are the masks of the meta-clusters.

Let's add one more function to the description of how dendrites work, given above.

Suppose that when a neuron spikes, each of its dendrites releases some substance into the synapse, going not from axon to dendrite but the other way back. The amount of this substance depends on the comparison error: the smaller the error, the more is released. The axon reacts to this substance and grows, and if there is little of it, which means a big error, the axon gradually shrinks.

And if axons change in this way from the very birth of the brain, then over time they will lead only to those groups of neurons where their spikes are needed, that is, where they do not cause big errors.

An example: suppose we need to memorize human faces, each represented by a megapixel image. Then each face would require a neuron with a million dendrites, which is unrealistic. Now divide all the pixels into meta-clusters such as eyes, nose, ears, and so on, only ten of them, with ten clusters in each meta-cluster: ten variants of nose, ten variants of ears, and so on. To memorize a face, a neuron with ten dendrites is now enough. This reduces the required memory (that is, brain volume) by five orders of magnitude.

Conclusion


Now, if we assume that the brain consists of meta-clusters, we can look at some concepts of the living brain from this point of view:

Clusters must be trained constantly, otherwise new data will not be processed correctly. To learn, the clusters in the brain need a balanced sample. If it is winter now, the brain will learn only from winter examples, the clusters will gradually become relevant only to winter, and in summer things will go badly for this brain. What to do? Periodically feed all the clusterizers not only new examples but also old important ones (memories of both winter and summer), and so that these memories do not interfere with current sensations, temporarily switch off the senses. In animals, this is called sleep.

Imagine the brain sees something small, GRAY, and running. After meta-clustering we have three active neurons in three meta-clusters, and thanks to memory the brain knows it is tasty. Then the brain sees something small, BLUE, and running, and does not know whether it is tasty or scary. It is enough to temporarily switch off the meta-cluster holding the colors, leaving only "small" and "running", and the brain knows it is tasty. This is called an analogy.

Suppose the brain recalls something and then replaces the active cluster neuron in one group with some other one, while the remaining meta-clusters keep the real memory. The brain has now pictured something it has never seen before. This is imagination.

Thank you for your attention, the code is here.

Source: https://habr.com/ru/post/427407/

