When solving problems with machine learning methods, we usually choose the algorithm best suited to the problem, along with a way of tuning its parameters.
Let's consider a slightly different approach: instead of choosing the algorithm ourselves, we will develop a program that can automatically generate algorithms for solving problems.
A stochastic method of program design based on a genetic algorithm is called genetic programming.
A few words about the genetic algorithm
A genetic algorithm is an optimization method based on the idea of natural selection as a means of achieving the best result. Its generalized formulation makes it applicable to a wide variety of problems. All you need is to:
- encode candidate solutions to the problem as chromosomes that are able to mutate and interbreed
- define a fitness function that measures how good the solution encoded by a particular chromosome is
Given an initial population of chromosomes and a fitness function, the genetic algorithm can find chromosomes that correspond to optimal solutions of the problem.
Of course, the genetic algorithm has certain disadvantages stemming from its stochastic nature, but in my own experience it finds acceptable solutions within reasonable resource and time constraints.
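To make the three requirements above concrete, here is a minimal sketch of the generic genetic-algorithm loop. This is a toy example, not the article's actual code: the `Chromosome` interface, the `evolve` method, and the `Num` demo chromosome are all illustrative names.

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

public class GaSketch {

    // A chromosome only needs to know how to mutate and interbreed.
    interface Chromosome<T extends Chromosome<T>> {
        T mutate(Random rnd);
        T crossover(T other, Random rnd);
    }

    // One generation: breed offspring, then keep the fittest half
    // (lower fitness value = better solution).
    static <T extends Chromosome<T>> T evolve(List<T> population,
            ToDoubleFunction<T> fitness, int generations, Random rnd) {
        for (int g = 0; g < generations; g++) {
            List<T> next = new ArrayList<>(population);
            for (int i = 0; i < population.size(); i++) {
                T a = population.get(rnd.nextInt(population.size()));
                T b = population.get(rnd.nextInt(population.size()));
                next.add(a.crossover(b, rnd).mutate(rnd));
            }
            next.sort(Comparator.comparingDouble(fitness));
            population = new ArrayList<>(next.subList(0, population.size()));
        }
        return population.get(0);
    }

    // Toy chromosome: a single number; the fitness below is the distance to 42.
    static final class Num implements Chromosome<Num> {
        final double v;
        Num(double v) { this.v = v; }
        public Num mutate(Random rnd) { return new Num(v + rnd.nextGaussian()); }
        public Num crossover(Num o, Random rnd) { return new Num((v + o.v) / 2); }
    }

    public static void main(String[] args) {
        Random rnd = new Random(1);
        List<Num> pop = new ArrayList<>();
        for (int i = 0; i < 20; i++) pop.add(new Num(rnd.nextDouble() * 100));
        Num best = evolve(pop, n -> Math.abs(n.v - 42.0), 200, rnd);
        System.out.println(best.v); // converges close to 42
    }
}
```

The rest of the article swaps the toy `Num` chromosome for syntax trees while keeping exactly this loop structure.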
Genetic programming and symbolic regression
I propose to examine the concept of genetic programming in more detail using the example of symbolic regression: the task of finding a formula that describes a given relationship. More formally, it can be stated as the construction of a regression model in the form of a superposition of given functions.
A fairly convenient way to represent a program is a syntax tree. In our case, the role of "programs" is played by algebraic formulas: leaf nodes correspond to variables or numeric constants, while non-leaf nodes contain an operation that is applied to the child nodes.
Example syntax tree:

It should be noted that for each syntax tree there are infinitely many semantically equivalent trees, for example:

We will use syntax trees as chromosomes.
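In code, such a chromosome can be represented as a recursive node structure with an evaluation routine. A minimal sketch, with illustrative names rather than the article's actual source:

```java
// A syntax-tree chromosome: leaves hold a variable index or a constant,
// inner nodes hold an operation applied to the child nodes.
public class Expr {
    enum Op { ADD, SUB, MUL, VAR, CONST }

    final Op op;
    final double constant;  // used when op == CONST
    final int varIndex;     // used when op == VAR
    final Expr left, right; // used for binary operations

    Expr(Op op, double constant, int varIndex, Expr left, Expr right) {
        this.op = op; this.constant = constant; this.varIndex = varIndex;
        this.left = left; this.right = right;
    }

    static Expr constant(double c) { return new Expr(Op.CONST, c, -1, null, null); }
    static Expr variable(int i)    { return new Expr(Op.VAR, 0, i, null, null); }
    static Expr of(Op op, Expr l, Expr r) { return new Expr(op, 0, -1, l, r); }

    // Recursively evaluates the tree for concrete variable values.
    double eval(double[] vars) {
        switch (op) {
            case CONST: return constant;
            case VAR:   return vars[varIndex];
            case ADD:   return left.eval(vars) + right.eval(vars);
            case SUB:   return left.eval(vars) - right.eval(vars);
            default:    return left.eval(vars) * right.eval(vars); // MUL
        }
    }
}
```

For example, `(x * (2 + x))` is `Expr.of(Op.MUL, variable(0), Expr.of(Op.ADD, constant(2), variable(0)))`.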
Genetic operations on syntax trees
Crossover can be done by exchanging randomly selected subtrees:

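A sketch of such a subtree exchange on a mutable tree structure. The names and structure are illustrative, not the article's library; note that a real GA would first copy the parents and modify the copies:

```java
import java.util.*;

public class Crossover {

    // A mutable syntax-tree node; the label stands for an operation,
    // a variable, or a constant.
    static final class Node {
        String label;
        List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            children.addAll(Arrays.asList(kids));
        }
        void collect(List<Node> out) {
            out.add(this);
            for (Node k : children) k.collect(out);
        }
    }

    // Exchanges randomly selected subtrees between two trees (in place).
    static void crossover(Node a, Node b, Random rnd) {
        Node parentA = pickInner(a, rnd);
        Node parentB = pickInner(b, rnd);
        int ia = rnd.nextInt(parentA.children.size());
        int ib = rnd.nextInt(parentB.children.size());
        Node tmp = parentA.children.get(ia);
        parentA.children.set(ia, parentB.children.get(ib));
        parentB.children.set(ib, tmp);
    }

    // Picks a random non-leaf node (the owner of the subtree to swap out).
    private static Node pickInner(Node root, Random rnd) {
        List<Node> all = new ArrayList<>();
        root.collect(all);
        List<Node> inner = new ArrayList<>();
        for (Node n : all) if (!n.children.isEmpty()) inner.add(n);
        return inner.get(rnd.nextInt(inner.size()));
    }
}
```

Crossover rearranges material between the two trees but conserves the total set of nodes.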
Many different mutations are possible; in my implementation I ultimately settled on the following:
- Replacing a function with another randomly selected one:

- Replacing a subtree with a randomly generated subtree:

- Deleting a non-leaf intermediate node:

- Creating a random node that becomes the new root of the tree:

- For nodes with non-commutative operations, swapping the child subtrees:

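The first mutation in the list above, replacing a function with another randomly selected one, can be sketched like this (the node structure and names are illustrative):

```java
import java.util.*;

public class Mutation {
    static final List<String> OPS = Arrays.asList("+", "-", "*");

    static final class Node {
        String label;
        Node left, right; // both null for leaves
        Node(String label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
    }

    // Walks the tree and, with probability p per inner node, swaps the
    // node's operation for a different randomly chosen one.
    static void mutateOps(Node n, double p, Random rnd) {
        if (n == null || n.left == null) return; // leaves carry no operation
        if (rnd.nextDouble() < p) {
            String other;
            do {
                other = OPS.get(rnd.nextInt(OPS.size()));
            } while (other.equals(n.label));
            n.label = other;
        }
        mutateOps(n.left, p, rnd);
        mutateOps(n.right, p, rnd);
    }
}
```

The other mutations (replacing or deleting whole subtrees, growing a new root) follow the same pattern of local, random edits to the tree.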
The fitness function of a syntax tree measures how well the corresponding algebraic formula approximates the data; it is simply the sum of the squared deviations of the values of the generated function from the training data.
The initial population is a collection of randomly generated syntax trees.
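The fitness computation can be sketched as follows; the candidate formula is passed in as a function, and the names here are illustrative, not the actual API of the article's utility. The usage in `main` uses the article's own training set, and `x² + 3x + 2y + z + 5` is one formula that fits that set exactly:

```java
import java.util.function.ToDoubleFunction;

public class Fitness {
    // Lower is better: 0 means the formula reproduces the training set exactly.
    static double sumOfSquaredDeviations(ToDoubleFunction<double[]> formula,
                                         double[][] inputs, double[] targets) {
        double sum = 0;
        for (int i = 0; i < inputs.length; i++) {
            double d = formula.applyAsDouble(inputs[i]) - targets[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] xs = {{26, 35, 1}, {8, 24, -11}, {20, 1, 10}, {33, 11, 2}, {37, 16, 7}};
        double[] ys = {830, 130, 477, 1217, 1524};
        // x^2 + 3x + 2y + z + 5 fits this training set exactly:
        System.out.println(sumOfSquaredDeviations(
                v -> v[0] * v[0] + 3 * v[0] + 2 * v[1] + v[2] + 5, xs, ys)); // prints 0.0
    }
}
```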
Optimizations
In general, the crossover and mutation operations described above should be sufficient to find a solution. But, in my experience, reaching even a more or less acceptable solution this way takes quite a long time. So I will describe several optimizations that speed up the search for solutions.
- Consider two formulas created during the search (the red crosses are the training data):

It is easy to see that the blue straight line is, on the whole, quite close to the training data, unlike the green parabola, which means it has a smaller sum of squared deviations with respect to the training sample.
On the other hand, it is obvious to the naked eye that the parabola is exactly what we need.
The problem is the coefficients. As you may have guessed, the coefficients of each tree are optimized separately, and to optimize the coefficients, of course, a genetic algorithm is used as well!
In this case, the chromosome is an array of numbers:

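Genetic operations on such numeric chromosomes are much simpler than on trees. A sketch assuming Gaussian mutation and uniform crossover (the article does not specify the exact operators, so these are illustrative choices):

```java
import java.util.*;

// The chromosome for coefficient optimization is simply the array of the
// tree's numeric constants.
public class Coefficients {

    // Gaussian perturbation of one randomly chosen coefficient.
    static double[] mutate(double[] c, Random rnd) {
        double[] m = c.clone();
        m[rnd.nextInt(m.length)] += rnd.nextGaussian();
        return m;
    }

    // Uniform crossover: each position is taken from one of the two parents.
    static double[] crossover(double[] a, double[] b, Random rnd) {
        double[] child = new double[a.length];
        for (int i = 0; i < child.length; i++) {
            child[i] = rnd.nextBoolean() ? a[i] : b[i];
        }
        return child;
    }
}
```

The fitness of such a chromosome is the same sum of squared deviations, computed after writing the coefficients back into the tree.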
- Since we work only with real numbers, it is convenient to operate with functions whose range is a subset of their domain, i.e. functions that are defined for every real argument, for example:
+, -, sin(x), cos(x), ln(abs(x) + 0.00001), sqrt(abs(x)), etc.
Thus, genetic modifications are guaranteed to produce syntax trees that evaluate without errors.
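For illustration, here are protected versions of the partial functions mentioned above. The protected division is a common convention in the genetic-programming literature and is shown here as an assumption, since the article does not describe how its DIV function is guarded:

```java
public class SafeOps {
    // ln and sqrt are undefined for parts of the real line, so the argument
    // is made safe first, exactly as in the function list above.
    static double safeLn(double x)   { return Math.log(Math.abs(x) + 0.00001); }
    static double safeSqrt(double x) { return Math.sqrt(Math.abs(x)); }

    // Assumed convention for protected division: return 1 on division by zero.
    static double safeDiv(double a, double b) { return b == 0 ? 1 : a / b; }
}
```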
- Some subtrees contain only numeric constants in their leaf nodes. Such a subtree can be evaluated once and the syntax tree simplified:

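A sketch of this constant-folding simplification: if both children of an operation fold to constants, the whole subtree is replaced by a single constant node (the structure and names are illustrative):

```java
public class Folding {
    static final class Node {
        Character op;  // '+', '-', '*', or null for a leaf
        Double value;  // constant value, null for a variable leaf
        String var;    // variable name, null otherwise
        Node left, right;

        static Node constant(double v) { Node n = new Node(); n.value = v; return n; }
        static Node variable(String s) { Node n = new Node(); n.var = s; return n; }
        static Node op(char c, Node l, Node r) {
            Node n = new Node(); n.op = c; n.left = l; n.right = r; return n;
        }
    }

    // Returns a tree in which every purely constant subtree is pre-evaluated.
    static Node fold(Node n) {
        if (n.op == null) return n; // leaf: nothing to fold
        Node l = fold(n.left), r = fold(n.right);
        if (l.value != null && r.value != null) {
            double v = n.op == '+' ? l.value + r.value
                     : n.op == '-' ? l.value - r.value
                     : l.value * r.value;
            return Node.constant(v);
        }
        return Node.op(n.op, l, r);
    }
}
```

For example, `(2 + 3) * x` folds to `5 * x`.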
- Another optimization is limiting the maximum depth of the tree (of course, this restricts the search space of solutions, but the need for it is dictated by practical considerations):

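Enforcing the depth limit only requires measuring a tree's depth and rejecting (or truncating) offspring that exceed the maximum; a sketch with an illustrative node structure:

```java
public class Depth {
    static final class Node {
        final Node left, right; // both null for a leaf
        Node(Node left, Node right) { this.left = left; this.right = right; }
    }

    // Depth of a tree: a single leaf has depth 1.
    static int depth(Node n) {
        if (n == null) return 0;
        return 1 + Math.max(depth(n.left), depth(n.right));
    }

    static boolean withinLimit(Node root, int maxDepth) {
        return depth(root) <= maxDepth;
    }
}
```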
- To avoid a situation where all trees in the population resemble one another (i.e. the genetic algorithm is stuck in a local minimum), there is a non-zero probability of completely replacing one of the trees with a randomly generated one.
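This diversity-preserving step can be sketched in a few lines (the method name and signature are illustrative):

```java
import java.util.*;
import java.util.function.Supplier;

public class Diversity {
    // With probability p, replace a random individual in the population
    // with a freshly generated one, to help escape local minima.
    static <T> void inject(List<T> population, double p,
                           Supplier<T> randomIndividual, Random rnd) {
        if (rnd.nextDouble() < p) {
            population.set(rnd.nextInt(population.size()), randomIndividual.get());
        }
    }
}
```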
Enough theory, let's see how it works in practice.
Let's try to find a formula that satisfies the following data:
| x | y | z | f(x, y, z) |
|---|---|---|---|
| 26 | 35 | 1 | 830 |
| 8 | 24 | -11 | 130 |
| 20 | 1 | 10 | 477 |
| 33 | 11 | 2 | 1217 |
| 37 | 16 | 7 | 1524 |
After 111 iterations we get the formula:

After transformations, it turns into a formula that almost coincides with the required one:

The dynamics of the solution's formation are shown below.

The entire log of the calculation results:

| Population no. | Sum of squared deviations | Best formula |
|---|---|---|
| 0 | 1669012.26766007 | f(x, y, z) = 544.6859000804278 |
| 1 | 169681.53834698 | f(x, y, z) = (x * (28.961135980737247 + z)) |
| 2 | 148288.46850717001 | f(x, y, z) = (x * (31.37500136808971 + z)) |
| 3 | 132767.659539771 | f(x, y, z) = (x * (52.43484580560726 + (-16.667350145606648))) |
| 4 | 2985.09234959275 | f(x, y, z) = (x * (4.433746792798393 + x)) |
| 5 | 2983.6495099344102 | f(x, y, z) = (x * (4.4641060233894825 + x)) |
| 6 | 2983.6495099344102 | |
| 7 | 2983.3906931086399 | f(x, y, z) = (x * (4.454265913806434 + x)) |
| 8 | 2983.3906931086399 | |
| 9 | 2983.3904005863701 | f(x, y, z) = (x * (4.454298469272653 + x)) |
| 10 | 597.81897366597798 | f(x, y, z) = ((x * (3.844350383063788 + x)) + y) |
| 11 | 594.94169424631298 | f(x, y, z) = ((x * (3.8969889609817003 + x)) + y) |
| 12 | 594.94169424631298 | |
| 13 | 593.73658852479002 | f(x, y, z) = ((x * (3.882218261498 + x)) + y) |
| 14 | 465.83708126862598 | f(x, y, z) = (((x * (4.005216664324722 + x)) + y) - z) |
| 15 | 465.83708126862598 | |
| 16 | 465.83708126862598 | |
| 17 | 465.83708126862598 | |
| 18 | 465.83015734565402 | f(x, y, z) = (((x * (4.005911869833458 + x)) + y) - z) |
| 19 | 465.83015734565402 | |
| 20 | 465.83015734565402 | |
| 21 | 415.16018053718898 | f(x, y, z) = (((x * (3.5125714437530116 + x)) + y) - (-11.692166173605955)) |
| 22 | 415.16018053718898 | |
| 23 | 395.52399773897002 | f(x, y, z) = (((x * (3.382514048684854 + x)) + y) - (-14.647402701410202)) |
| 24 | 395.07066142434297 | f(x, y, z) = (((x * (3.3745415764367164 + x)) + y) - (-14.718709944856545)) |
| 25 | 394.26327346841998 | f(x, y, z) = (((x * (3.3648511983617673 + x)) + y) - (-15.239602399729787)) |
| 26 | 392.88638158772301 | f(x, y, z) = (((x * (3.337209019532033 + x)) + y) - (-15.802043204356028)) |
| 27 | 392.32411284414297 | f(x, y, z) = (((x * (3.3064294470317237 + x)) + y) - (-16.587523294477112)) |
| 28 | 392.32411284414297 | |
| 29 | 392.32411284414297 | |
| 30 | 201.31809082052899 | f(x, y, z) = ((((x * (3.1436791342095485 + x)) + y) - (-3.592869973372564)) + y) |
| 31 | 164.61977693577799 | f(x, y, z) = ((((x * (3.284782293612155 + x)) + y) - 0.2814995918777923) + y) |
| 32 | 149.279581721206 | f(x, y, z) = ((((x * (3.428687042834285 + x)) + y) - 3.8492278595932117) + y) |
| 33 | 149.278346415192 | f(x, y, z) = ((((x * (3.428687042834285 + x)) + y) - 3.8397689886934776) + y) |
| 34 | 149.278346415192 | |
| 35 | 148.94429071530399 | f(x, y, z) = ((((x * (3.4733961096640367 + x)) + y) - 4.871582132520638) + y) |
| 36 | 148.94429071530399 | |
| 37 | 148.92149057538799 | f(x, y, z) = ((((x * (3.4733961096640367 + x)) + y) - 4.927910435311452) + y) |
| 38 | 148.842647569717 | f(x, y, z) = ((((x * (3.4667603760518504 + x)) + y) - 4.761715096087028) + y) |
| 39 | 148.842126194348 | f(x, y, z) = ((((x * (3.4667603760518504 + x)) + y) - 4.766164660321495) + y) |
| 40 | 148.83482158482201 | f(x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.761715096087028) + y) |
| 41 | 148.83482158482201 | |
| 42 | 148.83482158482201 | |
| 43 | 148.83482158482201 | |
| 44 | 148.83482158482201 | |
| 45 | 148.824714357071 | f(x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.723957086860974) + y) |
| 46 | 148.824474980844 | f(x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.719858152218609) + y) |
| 47 | 148.824474980844 | |
| 48 | 148.824474980844 | |
| 49 | 148.824474980844 | |
| 50 | 148.82441313535 | f(x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.717481429398491) + y) |
| 51 | 148.82441154759999 | f(x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.717364268937347) + y) |
| 52 | 148.82441154759999 | |
| 53 | 148.82441154759999 | |
| 54 | 148.82441154759999 | |
| 55 | 148.82441154759999 | |
| 56 | 148.82441154759999 | |
| 57 | 148.82441154759999 | |
| 58 | 148.824403242995 | f(x, y, z) = ((((x * (3.464357566836493 + x)) + y) - 4.715925249764239) + y) |
| 59 | 148.824403242995 | |
| 60 | 148.824403242995 | |
| 61 | 148.824403242995 | |
| 62 | 148.824403242995 | |
| 63 | 148.824403242995 | |
| 64 | 148.824403242995 | |
| 65 | 148.824403242995 | |
| 66 | 148.824403242995 | |
| 67 | 148.824403242995 | |
| 68 | 148.824403242995 | |
| 69 | 148.824403242995 | |
| 70 | 148.824403242995 | |
| 71 | 148.824403242995 | |
| 72 | 148.824403242995 | |
| 73 | 148.824403242995 | |
| 74 | 148.824403242995 | |
| 75 | 148.824403242995 | |
| 76 | 148.824403242995 | |
| 77 | 148.824403242995 | |
| 78 | 148.824403242995 | |
| 79 | 148.824403242995 | |
| 80 | 148.824403242995 | |
| 81 | 148.824403242995 | |
| 82 | 148.824403242995 | |
| 83 | 148.824403242995 | |
| 84 | 148.824403242995 | |
| 85 | 148.824403242995 | |
| 86 | 148.824403242995 | |
| 87 | 148.824403242995 | |
| 88 | 148.824403242995 | |
| 89 | 148.824403242995 | |
| 90 | 148.824403242995 | |
| 91 | 148.824403242995 | |
| 92 | 148.824403242995 | |
| 93 | 148.824403242995 | |
| 94 | 148.824403242995 | |
| 95 | 148.824403242995 | |
| 96 | 148.824403242995 | |
| 97 | 148.824403242995 | |
| 98 | 148.824403242995 | |
| 99 | 148.824403242995 | |
| 100 | 109.738448314378 | f(x, y, z) = ((((x * (3.6304822654527666 + x)) + y) - ((y * 0.26890083188968594) - (-4.125304601214528))) + y) |
| 101 | 109.738448314378 | |
| 102 | 86.7213316804214 | f(x, y, z) = ((((x * (3.454146360987013 + x)) + y) - ((y * 0.26890083188968594) - 0.31706243101113074)) + y) |
| 103 | 86.077603952275396 | f(x, y, z) = ((((x * (3.4532441079663054 + x)) + y) - ((y * 0.2821822285682245) - 0.5330637639727107)) + y) |
| 104 | 84.787024859776594 | f(x, y, z) = ((((x * (3.418583927986026 + x)) + y) - ((y * 0.30109799837668216) - 1.6913597460963947)) + y) |
| 105 | 84.787024859776594 | |
| 106 | 19.436528477129301 | f(x, y, z) = ((((x * (3.1298089197766688 + x)) + y) - ((y * (-1.1430183282239135)) - (-0.7601011336523038))) + z) |
| 107 | 9.2899337994931397 | f(x, y, z) = ((((x * (3.1298089197766688 + x)) + y) - ((y * (-0.9847117571207198)) - 2.0456442176615)) + z) |
| 108 | 9.2899337994931397 | |
| 109 | 8.5880221046539802 | f(x, y, z) = ((((x * (3.1002105039045555 + x)) + y) - ((y * (-0.9847117571207198)) - 2.0456442176615)) + z) |
| 110 | 8.5880221046539802 | |
| 111 | 0.38510898634568402 | f(x, y, z) = ((((x * (3.0172293562767245 + x)) + y) - ((y * (-0.9842629202499209)) - 4.912061414835644)) + z) |
It is interesting to follow the stages of solution formation:
- At first, a dependence on the variable x is found (this stage is not shown in the figure, since the error value is still very large: approximately 130,000)
- then the formula evolves into a quadratic dependence on x (the sum of squared deviations is ~3000)
- then the dependence on the variable y is identified (the sum of squared deviations is ~500)
- and finally, after optimizing the numeric coefficients and identifying the dependence on the variable z, the optimal result is achieved.
One more example
Sometimes the resulting formulas are not so obvious. For example, once, when trying to recover the formula

I got the following result:

But if you take advantage of the identity:

it can be shown that the resulting formula is correct:

If you want to experiment
Console utility
- You must have a Java runtime installed
- Download the file symbolic_regression_1.0.jar from here.
- In the same directory as symbolic_regression_1.0.jar, create a file (for example, data.txt) with the following content:
```
# allowed functions are: ADD SUB MUL DIV SQRT POW LN SIN COS
# set which functions to use:
ADD MUL SUB
# looking for:
f (x, y, z) - ?
# define training set:
f (26, 35, 1) = 830
f (8, 24, -11) = 130
f (20, 1, 10) = 477
f (33, 11, 2) = 1217
f (37, 16, 7) = 1524
```
- Run the search for a solution:
```
java -jar symbolic_regression_1.0.jar data.txt
```
A solution is considered found when the sum of squared deviations is less than 10. If a solution has not been found yet, then after every 50 iterations you will be asked whether you want to continue.
Java programmers
You can find the source code here:
github.com/lagodiuk/genetic-programming
Instead of conclusion
On the rather limited set of formulas I managed to test, good results were obtained; there were also successful experiments in finding antiderivatives and derivatives.
Genetic programming can also be applied quite successfully to the automatic construction of neural networks (here is a small demo application in which I implemented this approach).
Another idea is to apply a similar approach (together with QSAR techniques) to automatically construct the structures of chemical compounds with desired properties.
Literature
There is a lot of material on genetic programming available on the web.
I would like to highlight the following sources separately:
- Toby Segaran, Programming Collective Intelligence
- The website of the creator of the concept of genetic programming