How to simplify and speed up the computation of a feedforward neural network
Hello, dear readers. A lot has been written and said about neural networks, mostly about how and where to apply them. At the same time, two important questions receive surprisingly little attention: a) how to simplify a neural network and compute it quickly (a single evaluation of the exponent, as implemented by the library functions of programming languages, usually takes at least 15-20 processor instructions); b) how to explain, at least in part, the logic of the constructed network: the huge matrices of weights and biases obtained after training do little to reveal the patterns the network has found (they remain hidden, and identifying them is sometimes very important). I will describe one of my approaches to both questions for ordinary feedforward networks, and I will try to get by with a minimum of mathematics.
Some theory
From a mathematical point of view, a feedforward network is one very large function of the network inputs and of the weights and biases of its neurons. In each layer, the layer's input values (a vector X) are multiplied by the weights of each neuron (a vector W), added to the neuron's bias b, and the resulting sums s = W·X + b are passed through the activation functions, forming the outputs of the neuron layer.
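For concreteness, here is a minimal sketch of one such layer (in Python with NumPy; the weights, sizes and input values are arbitrary illustrative numbers, not taken from any real network):

```python
import numpy as np

def sigmoid(s):
    """Classical exponential sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-s))

def layer_forward(X, W, b):
    """One feedforward layer: multiply the inputs by the weights,
    add the biases, and pass the sums through the activation."""
    return sigmoid(W @ X + b)

# A toy layer with 3 inputs and 2 neurons (values are arbitrary).
W = np.array([[0.5, -1.2, 0.3],
              [0.8,  0.1, -0.7]])
b = np.array([0.1, -0.2])
X = np.array([1.0, 2.0, 0.5])
print(layer_forward(X, W, b))
```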
Activation functions are not always cheap to compute; for example, they often contain exponents (the exponential sigmoid, the hyperbolic tangent). If you look at the assembler code that implements the exponent, you will find, firstly, a number of checks that are not always needed and, secondly, that the exponent itself is usually computed in at least two operations, e.g. e^s = 2^(s·log2 e): a multiplication followed by a base-2 exponentiation.
Therefore, if we want to speed up the computation of the network, the first task is to simplify the computation of the activation function. We can sacrifice a little quality for a gain in speed by approximately replacing the classical activation function with a simpler one that (on the available input data) gives roughly the same results. This is, generally speaking, a classical approximation problem: we have a set of values computed by the original function A(s), and we pick a simpler function a(s) that gives very similar values. Such a simple function may be an ordinary polynomial, a polynomial with negative powers, or something else of that kind; I used four families of such replacement functions.
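As a hedged sketch of such a replacement: one can sample the range of pre-activation values a neuron actually sees and fit, say, a cubic polynomial to the exponential sigmoid by least squares (numpy.polyfit is used here purely as an illustrative stand-in; the article does not specify the fitting code):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Sample the range of pre-activation values the neuron actually receives.
s = np.linspace(-3.0, 3.0, 200)
A = sigmoid(s)

# Least-squares fit of a cubic polynomial a(s) ~ A(s).
coeffs = np.polyfit(s, A, deg=3)
a = np.polyval(coeffs, s)

print("max |A(s) - a(s)| on the sample:", np.max(np.abs(A - a)))
```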
Suppose that for each neuron we have managed to replace the activation function with a slightly simpler one (this can be done, for example, by the least squares method). By itself such a replacement will probably not give a very large gain. But here we can try another trick:
Write down analytically the huge function NET(X) computed by the network as a whole;
Replace the original functions A(s) in NET(X) with the replacement functions a(s) obtained for them;
Simplify the resulting NET(X) algebraically (or rather, feed it to some ready-made code for symbolic simplification of expressions). This is now feasible; at least, it is much easier than if we had tried to simplify the network with the original functions, for example with the exponents.
As a result, we get something simpler and perhaps a bit more mathematically transparent, so one can already try to understand what kind of function the network computes.
This is also a way to explain the logic of the constructed network.
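To make the substitute-and-simplify step concrete, here is a small sketch with SymPy (used here as a stand-in for the author's own symbolic-simplification code) on a toy network with a two-neuron hidden layer whose sigmoids have already been replaced by a quadratic a(s); all weights and coefficients are placeholders:

```python
import sympy as sp

a_, b_, c_ = sp.symbols('a b c')        # network inputs
c0, c1, c2 = sp.symbols('c0 c1 c2')     # coefficients of the replacement activation

def act(s):
    """Replacement activation: a quadratic polynomial instead of the sigmoid."""
    return c0 + c1*s + c2*s**2

# Hidden layer: two neurons with arbitrary weights and biases.
h1 = act(0.5*a_ - 1.2*b_ + 0.3*c_ + 0.1)
h2 = act(0.8*a_ + 0.1*b_ - 0.7*c_ - 0.2)

# Output neuron, also with a replaced activation.
NET = act(1.5*h1 - 0.9*h2 + 0.05)

# Expand the whole network into a single polynomial in the inputs a, b, c.
NET_simplified = sp.expand(NET)
print(NET_simplified)
```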
The task described above, of course, only looks simple in words. To use it in my programs I had to write my own code for symbolic simplification of expressions. In addition, I solved a more complicated problem, allowing each neuron with activation function A(s) to have several alternative replacement functions. The overall task therefore reduced to enumerating the variants of such functions and symbolically simplifying the network for each variant. Only parallelizing the computations helped here.
Result
The result pleased me. I accelerated a three-layer network with three inputs and eight neurons (with their input weights and biases), using the "exponential sigmoid" activation function. As the time measurements showed, it was possible to gain about 40% in time without a significant loss in quality.
To illustrate: the source network is specified by the weights and biases of its neurons in the first two layers and in the third, output layer. If the inputs are denoted a, b and c, then, after the replacements and simplifications, the network function NET becomes a single, relatively compact algebraic expression in a, b and c.
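A rough sketch of how such a before-and-after timing comparison can be made (random weights and a made-up "simplified" polynomial stand in for the real network here; the actual numbers depend on the hardware):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def net_original(x):
    """Forward pass with the exponential sigmoid in every layer."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

def net_simplified(x):
    """Placeholder for the symbolically simplified network:
    in practice, the polynomial obtained after substitution and expansion."""
    a, b, c = x
    return 0.5 + 0.11*a - 0.07*b + 0.03*c + 0.02*a*b - 0.01*c*c

x = np.array([0.3, -1.2, 0.7])
print("original:  ", timeit.timeit(lambda: net_original(x), number=100_000))
print("simplified:", timeit.timeit(lambda: net_simplified(x), number=100_000))
```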
The gain, again, is about 40% of the time, without much damage to quality. I think this approach can be applied in cases where the speed of the neural network is critical, for example when it is evaluated repeatedly inside a double or triple loop. An example of such a problem: the numerical solution of an aerodynamic problem on a grid, where at each node the neural network computes some useful prediction, say a more accurate estimate of the turbulent viscosity. Then we have an outer loop in time, a double or triple loop over the coordinates nested in it, and, already inside, the evaluation of the neural network. In this case the simplification is more than appropriate and useful.
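A minimal sketch of that loop structure, with placeholder grid sizes, feature extraction and simplified-network expression (none of these come from the original problem):

```python
import numpy as np

nx, ny, n_time_steps = 64, 64, 10   # placeholder grid and time-step counts

def flow_features(i, j):
    """Placeholder: extract the three network inputs at grid node (i, j)."""
    return np.sin(0.1 * i), np.cos(0.1 * j), 0.5

def net_simplified(a, b, c):
    """Placeholder for the symbolically simplified network expression."""
    return 0.5 + 0.11*a - 0.07*b + 0.03*c

nu_t = np.zeros((nx, ny))
for t in range(n_time_steps):        # outer loop in time
    for i in range(nx):              # double loop over the grid
        for j in range(ny):
            a, b, c = flow_features(i, j)
            nu_t[i, j] = net_simplified(a, b, c)   # cheap inlined network
```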