
Neural network data compression

In this article I want to talk about another class of problems solved by neural networks: data compression. The algorithm described here does not claim to be suitable for real production use, since more efficient algorithms exist. Let me say right away that only lossless compression will be discussed.
Most sources on the Internet say there are three popular neural network architectures that address the data compression problem.

1. The Kohonen network and its variations. It is often used for lossy image compression and is not a lossless compression algorithm.

2. Associative memory (the Hopfield network, bidirectional associative memory, etc.). For this class of networks, "data compression" is a side effect rather than the goal, since their main purpose is to restore the original signal or image from noisy or damaged input. Most often such a network receives a noisy image of the same dimension as its output, so no compression takes place.

3. The "bottleneck" method.

The last method will be discussed in the article.

The essence of the method


The network architecture and training algorithm are such that a high-dimensional vector must be passed from the inputs of the neural network to its outputs through a relatively narrow channel. As an analogy, think of an hourglass, a funnel, or a bottleneck, whichever you prefer. To implement this kind of compression, you can use a multilayer perceptron with the following architecture: the input and output layers have the same number of neurons, and between them lie several hidden layers of smaller size.
The network is trained with the backpropagation algorithm, and a sigmoid (the logistic sigmoid or the bipolar sigmoid function) is used as the activation function.
The network consists of two parts: one part performs compression, the other recovery. The time spent on both procedures is about the same. In practical use, the trained network is split in two: the output of the first network is transmitted over the communication channel and fed into the input of the second one, which performs decompression.
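AForge.NET, which is used in the implementation below, has no built-in way to split a trained network in two, but the split can be sketched by calling Layer.Compute on the relevant layers directly. The helper below is my own sketch, not the author's code; it assumes the 10-3-3-10 topology from the example at the end of the article and the Layers / Neurons properties of AForge.NET 2.x, and the Compress / Decompress names are mine.

using AForge.Neuro;

static class BottleneckCodec
{
    // For a 10-3-3-10 ActivationNetwork, net.Layers holds only the 3-3-10 layers
    // (the 10 inputs are not a layer object): Layers[0] maps 10 -> 3,
    // Layers[1] maps 3 -> 3, Layers[2] maps 3 -> 10.

    // The 3-element output of the first layer is the compressed signal
    // that would be sent over the communication channel.
    public static double[] Compress(ActivationNetwork net, double[] vector)
    {
        return net.Layers[0].Compute(vector);
    }

    // The remaining layers restore the original 10-element vector;
    // Decompress(Compress(v)) gives the same result as net.Compute(v).
    public static double[] Decompress(ActivationNetwork net, double[] compressed)
    {
        return net.Layers[2].Compute(net.Layers[1].Compute(compressed));
    }
}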

What influences learning?


An important lever for influencing the training speed is skillful adjustment of the learning rate. The simplest option: when the network error changes by a relatively small amount (say, less than 0.0000001) compared to the error of the previous epoch, the learning rate should be decreased. There are also more intelligent algorithms for adjusting the learning rate.
Another very important element in training is the all-powerful rand(). When a neuron is created, its weights are set to small random values. Sometimes these random values turn out to be such that the network becomes untrainable, or the error converges very slowly.
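The article does not show how to deal with such an unlucky initialization. One possible remedy (my own sketch, not the author's code) is to give every neuron fresh small random weights and restart training when the error stops improving; it relies on Neuron.Randomize() from AForge.NET, which re-initializes a neuron's weights (and, for activation neurons, the threshold) with random values.

using AForge.Neuro;

static class TrainingHelpers
{
    // Re-initialize every neuron with fresh random weights: a crude way to
    // escape an unlucky initial rand() when the network refuses to learn.
    public static void Reinitialize(ActivationNetwork net)
    {
        foreach (Layer layer in net.Layers)
            foreach (Neuron neuron in layer.Neurons)
                neuron.Randomize();
    }
}

For example, if RunEpoch shows no noticeable improvement for several hundred epochs in a row, one could call Reinitialize(net), restore the learning rate to its initial value, and start the training loop again.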

Implementation


The implementation uses the AForge.NET library. I ran the basic experiments on vectors of length 10, then on vectors of length 20; this did not change the essence, so I did not increase the dimension further, but there is one more reason. Why didn't I experiment with longer vectors and generate a large training sample? It is not only laziness: such a sample brings certain problems and inconveniences, because it may contain contradictions. For example, the input set may contain two or more identical or similar vectors whose desired outputs are completely different. A network cannot be trained on such a training set, and finding the "conflicting" vectors is rather difficult.

Results and conclusions


For me, the most interesting questions were: how should this network be built to be effective? How many hidden layers should it have? How fast is the training? How fast are compression and recovery?
Experiments were conducted on the two main options for building such a network.

“Fast Compression”

I will call a network a “fast compression” network if the number of neurons in its hidden layers is several times (3, 5, etc.) smaller than in the input/output layer and the network consists of 4 layers: input, output, and two hidden ones (for example, 10-3-3-10). This topology has proven very effective in some cases. First, such a network learns very quickly, because the learning algorithm only has to adjust the weights of a relatively small number of connections between adjacent layers. Second, for the same reason, the network also produces results very quickly. The downside is that the network can only be trained on a small sample relative to the number of neurons in the input (and, accordingly, output) layer: an acceptable number of vectors is roughly half the number of neurons in the first/last layer. Also, do not overdo the difference between the layers; the best option is for the hidden layer to have about 3 times fewer neurons than the input/output layer.

"Sequential compression"

I will call a network a “sequential” or “unhurried” compression network if the number of neurons in adjacent layers does not differ much (up to 25%) and the network contains 4 or more hidden layers (for example, 10-8-5-3-3-5-8-10). Such networks did not meet my expectations. Even with a small training set and short vectors they take very long to train, and of course they also run slower. They are said to be more stable in real production conditions, but because of the very long training time I was not able to verify this.
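To make the difference concrete, here is a small fragment that builds both topologies with the same AForge.NET constructor used in the example at the end of the article; the variable names fastNet and slowNet are mine.

using AForge.Neuro;

// “fast compression”: sharp narrowing to the bottleneck, only two hidden layers
ActivationNetwork fastNet = new ActivationNetwork(new SigmoidFunction(), 10, 3, 3, 10);

// “sequential compression”: gradual narrowing over many hidden layers
ActivationNetwork slowNet = new ActivationNetwork(new SigmoidFunction(), 10, 8, 5, 3, 3, 5, 8, 10);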

On efficiency and running time

From a practical point of view, two facts are interesting: the compression ratio and the compression/recovery time. The network trains successfully in almost 100% of cases and works flawlessly, compressing the data by a factor of 3. The speed is also pleasing: for the 10-3-3-10 architecture, a full cycle (compression plus recovery) takes 00:00:00.0000050; for 20-6-6-20, 00:00:00.0000065. We can safely divide by 2 and get 00:00:00.0000025 and 00:00:00.0000033 respectively for compressing or restoring a single vector.
The tests were carried out on an AMD Athlon II X2 240 2.80 GHz processor.
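The article does not show the measurement code. Below is a minimal sketch of how such timings could be obtained with System.Diagnostics.Stopwatch; the Benchmark class, the MeasureCycle helper, and its parameters are my own, and the cycle is averaged over many runs because a single Compute call is far below the timer's resolution.

using System;
using System.Diagnostics;
using AForge.Neuro;

static class Benchmark
{
    // Average the full compress-and-restore cycle (one Compute call through the
    // whole network) over many runs and return the time of a single cycle.
    public static TimeSpan MeasureCycle(ActivationNetwork net, double[] vector, int runs)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < runs; i++)
        {
            net.Compute(vector);
        }
        watch.Stop();
        return TimeSpan.FromTicks(watch.Elapsed.Ticks / runs);
    }
}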

Example


As an example, here is a small commented piece of code (in C#) showing a primitive case of creating, training, and running the network. The network compresses the vector by a factor of 3 and trains successfully in 98% of cases. To run it, you need to reference the AForge.NET library.

using System;
using AForge.Neuro;
using AForge.Neuro.Learning;

namespace FANN
{
    class Program
    {
        static void Main(string[] args)
        {
            // a 4-layer network with the 10-3-3-10 topology and a sigmoid activation function
            ActivationNetwork net = new ActivationNetwork(new SigmoidFunction(), 10, 3, 3, 10);
            // backpropagation trainer for this network
            BackPropagationLearning trainer = new BackPropagationLearning(net);

            // training input vectors
            double[][] input = new double[][] {
                new double[] { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 },
                new double[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
                new double[] { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 },
                new double[] { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 },
            };

            // desired outputs: the same vectors, since the network must restore its input
            double[][] output = new double[][] {
                new double[] { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 },
                new double[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
                new double[] { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 },
                new double[] { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 },
            };

            // error of the previous epoch
            double prErr = 10000000;
            // current network error
            double error = 100;
            // initial learning rate
            trainer.LearningRate = 1;
            // train until the error becomes small enough
            while (error > 0.001)
            {
                // run one training epoch and get the new error
                error = trainer.RunEpoch(input, output);
                // if the error has almost stopped changing,
                if (Math.Abs(error - prErr) < 0.000000001)
                {
                    // halve the learning rate, but do not let it drop below 0.001
                    trainer.LearningRate /= 2;
                    if (trainer.LearningRate < 0.001)
                        trainer.LearningRate = 0.001;
                }

                prErr = error;
            }

            // run the trained network on the training vectors
            double[] result;
            result = net.Compute(new double[] { 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 });
            result = net.Compute(new double[] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 });
            result = net.Compute(new double[] { 1, 1, 1, 1, 1, 0, 0, 0, 0, 0 });
            result = net.Compute(new double[] { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 });
        }
    }
}
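The example above discards the result of each Compute call. One possible way to check recovery quality (not part of the original code) is to round every sigmoid output to 0 or 1 and compare it with the input vector; the IsRestored helper below is my own sketch of such a check and could be added to the Program class above.

// A possible check of recovery quality: round each sigmoid output to 0 or 1
// and compare it with the corresponding element of the input vector.
static bool IsRestored(double[] input, double[] result)
{
    for (int i = 0; i < input.Length; i++)
    {
        if (Math.Round(result[i]) != input[i])
            return false;
    }
    return true;
}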




Source: https://habr.com/ru/post/126497/

