
Self-learning chess program

Hello, Habr!

In an article published last year, we solved the problem of determining mathematically justified values of the chess pieces. Using regression analysis of games played by computers and by humans, we obtained a scale of piece values that largely coincides with the traditional values known from books and practical experience.

Unfortunately, directly substituting the corrected piece values did not strengthen the author's program, at least not beyond statistical error. Applying the original method head-on to other parameters of the evaluation function produced somewhat absurd results; the optimization algorithm clearly needed refinement. Meanwhile, the author decided that the next release of his engine would be the last in a long series of versions originating in code a decade old. GreKo 2015 was released, and no further changes were planned in the near future.
Picture to attract attention

Everyone interested in what happened next is invited, after viewing the picture to attract attention, to continue under the cut.

The motivation for the sudden continuation of this work, and ultimately for this article, came from two events. One of them made headlines around the world: the Go match between the top Korean player Lee Sedol and Google's AlphaGo program.



The developers at Google DeepMind managed to combine two powerful techniques very effectively: Monte Carlo tree search and deep learning with neural networks. The resulting symbiosis led to phenomenal results in the form of victories over two professional Go players (Lee Sedol and Fan Hui) with a total score of 9 - 1. The implementation details of AlphaGo have been widely discussed, including on Habr, so we will not dwell on them here.

The second event, not nearly as widely publicized and noticed mainly by chess-programming enthusiasts, was the appearance of the Giraffe program. Its author, Matthew Lai, made active use of machine-learning ideas, in particular the same deep neural networks. Unlike traditional engines, where the evaluation function contains a number of predefined characteristics of the position, Giraffe extracts these features on its own from the training material during the learning stage. In effect, the stated goal was the automatic derivation of "chess knowledge" in the form in which it is presented in textbooks.

In addition to the evaluation function, neural networks in Giraffe were also used to parameterize the tree search, which again suggests parallels with AlphaGo.

The program demonstrated some success, reaching the strength of an international master from scratch after a few days of training. Unfortunately, this most interesting research project was cut short prematurely... because Matthew Lai joined the Google DeepMind team!

One way or another, the wave of publicity around AlphaGo and Giraffe prompted the author of this article to return once more to the code of his engine and try to strengthen its play with the machine-learning methods that are so popular today.

Algorithm



It may disappoint some readers, but the project described here contains no multilayer neural networks, no automatic discovery of key position features, and no Monte Carlo method. Randomized tree search is hardly needed in chess given the limited nature of the task, and well-performing factors for evaluating a chess position have been known since the days of Kaissa. Besides, the author was interested in how much the program's play could be improved within the fairly minimalist set of features implemented in GreKo.

The basic method was the evaluation-function tuning algorithm proposed by the Swedish researcher and developer Peter Österlund, author of the strong engine Texel. According to its creator, the method has several strong points. It works as follows.



Let θ = (θ_1, ..., θ_K) be the vector of parameters of the evaluation function (the weights of material and positional features).
For each position p_i, i = 1 ... N from the position set, we compute its static evaluation E_θ(p_i), a scalar value. Traditionally the evaluation is normalized so that it expresses the advantage of one side or the other in units of chess material, for example in hundredths of a pawn. We will always take the evaluation from White's point of view.

We now pass from the material representation of the evaluation to a probabilistic one. Using the logistic function, we perform the following conversion:

R_{pred}(\theta, p_i) = \frac{1}{1 + e^{-E_{\theta}(p_i)/K}}

The value R_pred has the meaning of the expected result of the game for White in the given position (0 is a loss, 0.5 a draw, 1 a win). The normalization constant K can be interpreted as the material advantage at which "everything becomes clear". In this study we used K = 150, that is, one and a half pawns. Of course, things "become clear" only in a statistical sense; in real chess games one can find plenty of counterexamples where an even larger material advantage does not lead to a win.
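To make this concrete, here is a minimal C++ sketch of the two ingredients just described: an evaluation assumed, purely for illustration, to be a linear combination of position features with weights θ, and the logistic conversion with K = 150. The function and variable names are illustrative and are not GreKo's actual interface (GreKo's real evaluation lives in eval.cpp, in Evaluate()).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical linear evaluation: E_theta(p) = sum_j theta[j] * features[j],
// where features[j] measures how strongly feature j is present in the position
// (positive for White, negative for Black), in centipawn units.
double StaticEval(const std::vector<double>& theta, const std::vector<double>& features)
{
    double e = 0.0;
    for (std::size_t j = 0; j < theta.size(); ++j)
        e += theta[j] * features[j];
    return e;
}

// Logistic conversion of a centipawn score into the expected game result
// for White: 0 = loss, 0.5 = draw, 1 = win. K = 150 as in the text.
double ExpectedResult(double evalCp, double K = 150.0)
{
    return 1.0 / (1.0 + std::exp(-evalCp / K));
}
```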

In the original algorithm, the result of a quiescence search (a "forced variation" search, as it is called in Russian-language literature) rather than the static evaluation was used to compute R_pred. This is an alpha-beta search from the given position in which only captures, pawn promotions, and sometimes checks and check evasions are considered. Although such search trees are small, the computation is tens or hundreds of times slower than a static evaluation. It was therefore decided to use the faster scheme and instead filter out dynamic positions at the data-preparation stage.

Now, knowing for each position the predicted result R_pred and the actual result R_fact of the game in which it occurred, we can compute the mean squared prediction error:

Err(\theta) = \frac{1}{N} \sum_{i=1}^{N} (R_{pred}(\theta, p_i) - R_{fact}(p_i))^2

In fact, this mean squared error can already be treated as the objective functional to be minimized; this is the approach described in the original formulation of the method.

We will make one more small modification: we take into account the number of moves remaining until the end of the game. Obviously, a positional feature that was present on the board at the very beginning of a game may have no effect on its outcome if the main events happened much later. For example, White may have a proudly centralized knight in the opening and yet lose in a deep endgame, when that knight has long since been traded off, because an enemy rook invaded his camp. In this case the feature "knight in the center of the board" should not be penalized as heavily as the feature "rook on the second rank": the knight is not to blame for anything!

Accordingly, we add a correction to our objective function based on the number of moves n_i remaining until the end of the game. For each position it takes the form of an exponential decay factor with parameter λ. The "physical meaning" of this parameter is the number of moves during which a particular positional feature influences the game, on average. In the experiments described below, λ takes values of several dozen half-moves.
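For example, with λ = 40 half-moves, a position that occurred 40 half-moves before the end of the game enters the objective with weight e^{-1} ≈ 0.37, while a position 120 half-moves before the end gets only e^{-3} ≈ 0.05 and therefore contributes almost nothing to the error.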

In the original description of Texel's tuning method, the first moves of each game were discarded from the training set. Introducing this "λ-forgetting" lets us avoid an explicit cutoff for opening-book moves: their influence turns out to be small anyway.

The final form of our objective function:

J(\theta) = \frac{1}{N} \sum_{i=1}^{N} (R_{pred}(\theta, p_i) - R_{fact}(p_i))^2 \, e^{-n_i / \lambda}

The task of learning the evaluation function is now reduced to minimizing J in the space of values of the vector θ.
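As a minimal illustration (assuming, as above, a linear evaluation and illustrative data-structure names rather than GreKo's internal ones), the objective J(θ) over a set of training positions could be computed like this:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One training record: the position's feature vector, the actual game result
// for White (0, 0.5 or 1) and the number of half-moves n_i that remained
// until the end of that game.
struct Sample
{
    std::vector<double> features;
    double result;
    int movesToEnd;
};

// J(theta): mean squared prediction error weighted by the exponential
// "lambda-forgetting" factor e^{-n_i/lambda}.
double Objective(const std::vector<double>& theta,
                 const std::vector<Sample>& samples,
                 double K = 150.0, double lambda = 40.0)
{
    double sum = 0.0;
    for (const Sample& s : samples)
    {
        double e = 0.0;
        for (std::size_t j = 0; j < theta.size(); ++j)
            e += theta[j] * s.features[j];

        double rPred = 1.0 / (1.0 + std::exp(-e / K));
        double err = rPred - s.result;
        sum += err * err * std::exp(-s.movesToEnd / lambda);
    }
    return sum / samples.size();
}
```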

Why does this method work? Over a large set of games, most of the features occurring in the positions cancel each other out through averaging. Only those that actually influenced the result retain their significance and receive higher weights. The earlier the evaluation function starts noticing them during a game, the better the prediction, the more accurate the evaluation, and the stronger the program becomes.

Training and Results



As training data, positions from 20 thousand games played by the program against itself were used. Positions arising immediately after a capture or a check were excluded. This is necessary so that the training examples match as closely as possible the actual positions in the search tree to which the static evaluation is applied.

The result was about 2.27 million positions. Not all of them are unique, but this is not critical for the method used. The positions were randomly divided into training and test sets in the ratio 80/20, i.e. 1.81 million and 460 thousand positions respectively.
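The split itself is straightforward. A minimal sketch, assuming one position record per line of an intermediate FEN file and illustrative file names (this is not GreKo's actual code), could look like this:

```cpp
#include <fstream>
#include <random>
#include <string>

// Randomly distribute position records (one per line) between a training
// file and a test file in the ratio 80/20.
void SplitPositions(const std::string& fenFile)
{
    std::ifstream in(fenFile);
    std::ofstream train("train.fen");
    std::ofstream test("test.fen");

    std::mt19937 rng(20160721);               // fixed seed for reproducibility
    std::bernoulli_distribution isTrain(0.8); // 80% go to the training set

    std::string line;
    while (std::getline(in, line))
        (isTrain(rng) ? train : test) << line << '\n';
}
```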

The functional was minimized on the training set by coordinate descent. It is well known that for multidimensional optimization problems this method is usually not the best choice, but the simplicity of its implementation and an acceptable running time spoke in its favor. On a typical modern PC, optimization over 20 thousand games takes from one to several hours, depending on which subset of the 27 possible weights is selected for tuning.
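A minimal sketch of such a coordinate-descent loop, reusing the illustrative Objective() and Sample from the sketch above (and including, for completeness, the validation-based stopping rule described in the next paragraph), might look like this. None of the names are GreKo's internal API.

```cpp
#include <cstddef>
#include <initializer_list>
#include <vector>

// Coordinate descent over theta[first..last]: each weight is nudged in unit
// steps (up, then down) for as long as the objective on the training set keeps
// decreasing. Full cycles over all selected parameters are repeated until the
// objective on the validation set stops improving.
void TuneWeights(std::vector<double>& theta,
                 const std::vector<Sample>& trainSet,
                 const std::vector<Sample>& validSet,
                 std::size_t first, std::size_t last)
{
    double bestValid = Objective(theta, validSet);
    for (;;)
    {
        for (std::size_t j = first; j <= last; ++j)
        {
            for (double step : {1.0, -1.0})
            {
                for (;;)
                {
                    double before = Objective(theta, trainSet);
                    theta[j] += step;
                    if (Objective(theta, trainSet) >= before)
                    {
                        theta[j] -= step;   // the step did not help, undo it
                        break;
                    }
                }
            }
        }
        double valid = Objective(theta, validSet);
        if (valid >= bestValid)
            break;                          // no improvement on the validation set
        bestValid = valid;
    }
}
```

Here the [first, last] range plays the same role as the bounds of the learn command described below. The sketch deliberately recomputes the objective in full at every step, which is wasteful but keeps the logic transparent.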

Below is a graph of the objective function over time. The stopping criterion for training is the absence of improvement on the test subset after another full cycle of descent over all parameters.



The evolution of a group of pawn-related parameters is shown in the following graph. It can be seen that the process converges quickly enough, at least to a local minimum. We have not yet set ourselves the task of finding a global minimum; our goal for now is simply to strengthen the program by at least some amount...



Similar graphs for other features.








The following graph presents data from another training session in which the material weights were adjusted as well. It can be seen that, starting from the "computer" values used in GreKo, they gradually converge toward more classical values.




Below is the complete list of evaluation parameters with their initial and final values. The meaning of most of them is clear without further comment; those who wish to know their exact purpose are invited to look at the program's source code, the Evaluate() function in the eval.cpp file.

| No. | Feature | Description | Before training | After training |
|-----|---------|-------------|-----------------|----------------|
| 1 | VAL_P | pawn value | 100 | 100 |
| 2 | VAL_N | knight value | 400 | 400 |
| 3 | VAL_B | bishop value | 400 | 400 |
| 4 | VAL_R | rook value | 600 | 600 |
| 5 | VAL_Q | queen value | 1200 | 1200 |
| 6 | PawnDoubled | doubled pawn | -10 | -10 |
| 7 | PawnIsolated | isolated pawn | -10 | -19 |
| 8 | PawnBackwards | backward pawn | -10 | -5 |
| 9 | PawnCenter | pawn in the center of the board | 10 | 9 |
| 10 | PawnPassedFreeMax | unblocked passed pawn | 120 | 128 |
| 11 | PawnPassedBlockedMax | blocked passed pawn | 100 | 101 |
| 12 | PawnPassedKingDist | passed pawn's distance from the opponent's king in the endgame | 5 | 9 |
| 13 | PawnPassedSquare | passed pawn that cannot be caught by the "rule of the square" | 50 | 200 |
| 14 | KnightCenter | knight centralization | 10 | 27 |
| 15 | KnightOutpost | protected outpost for a knight | 10 | 7 |
| 16 | KnightMobility | knight mobility | 20 | 19 |
| 17 | BishopPairMidgame | bishop pair in the middlegame | 20 | 20 |
| 18 | BishopPairEndgame | bishop pair in the endgame | 100 | 95 |
| 19 | BishopCenter | bishop centralization | 10 | 9 |
| 20 | BishopMobility | bishop mobility | 60 | 72 |
| 21 | Rook7th | rook on the 7th rank | 20 | 24 |
| 22 | RookOpen | rook on an open file | 10 | 17 |
| 23 | RookMobility | rook mobility | 40 | 40 |
| 24 | QueenKingTropism | proximity of the queen to the opponent's king | 40 | 99 |
| 25 | KingCenterMid | king centralization in the middlegame | -40 | -41 |
| 26 | KingCenterEnd | king centralization in the endgame | 40 | 33 |
| 27 | KingPawnShield | king's pawn shield | 120 | 120 |


The table shows an example of one training session in which only the positional parameters of the evaluation were optimized, while the piece values remained unchanged. This is not a fundamental restriction: in the tests described below, full training over all 27 parameters was also used. However, the best results in practical play were shown by a version with a fixed material scale.

What conclusions can be drawn from the resulting weights? Some of them remained almost unchanged compared to the original version of the program; one may assume that over years of debugging the engine they had been chosen intuitively rather well. In a few places, however, cold mathematics corrected human intuition. Thus, the penalty for backward pawns had been overestimated by the author, while several other parameters turned out to be more important to the algorithm and their weights nearly doubled (see the table above).



Special mention should be made of the feature for a passed pawn that cannot be caught according to the "rule of the square". Its optimized value hit the upper limit of the interval allowed by the algorithm; evidently it could have been even higher. The likely reason is that in the training file, games containing such a passed pawn were won 100% of the time. The value 200 was kept as the weight, as quite sufficient: increasing it further has practically no effect on the strength of play.

Testing at the chessboard



So, we have trained the evaluation function to predict the outcome of a game from the position on the board. The main test, however, lies ahead: how useful will this skill be in practical play? For this purpose several versions of the engine with different settings were prepared, each obtained in its own training mode.

| Version | Training file | Number of games | Evaluation coefficients | Scaling constant λ |
|---------|---------------|-----------------|-------------------------|--------------------|
| A | 20000.pgn | 20,000 | 6...27 | 40 |
| B | 20000.pgn | 20,000 | 1...27 | 40 |
| C | 20000.pgn | 20,000 | 6...27 | 20 |
| D | 20000.pgn | 20,000 | 1...27 | 20 |
| E | 20000.pgn | 20,000 | 6...27 | 60 |
| F | 20000.pgn | 20,000 | 1...27 | 60 |
| G | gm2600.pgn | 27,202 | 6...27 | 20 |
| H1 | large.pgn | 47,202 | 6...27 | 20 |
| H2 | large.pgn | 47,202 | 1...27 | 20 |


20000.pgn - games of the program against itself (super-blitz)
gm2600.pgn - grandmaster games from Robert Hyatt's Crafty FTP site (classical time control)
large.pgn - a merge of the two files above

Each version played 100 games against the original GreKo 2015, as well as against a set of other engines, at the time control "1 second + 0.1 second per move". The results are shown in the table below. Relative ratings of the versions were computed with the bayeselo program; the strength of GreKo 2015 was fixed as a reference point at 2600. The LOS value (likelihood of superiority) was also determined: the probability that a given version plays stronger than GreKo 2015, taking into account the confidence interval of the rating calculation.

| Version | GreKo 2015 | Fruit 2.1 | Delfi 5.4 | Crafty 23.4 | Kiwi 0.6d | Rating | LOS |
|---------|------------|-----------|-----------|-------------|-----------|--------|-----|
| GreKo 2015 | - | 33 | 40.5 | 39.5 | 73.5 | 2600 | - |
| A | 53.5 | 38 | 49.5 | 46.5 | 76 | 2637 | 97% |
| B | 55 | 43.5 | 71 | 36.5 | 78.5 | 2667 | 99% |
| C | 52.5 | 39.5 | 81 | 42.5 | 75 | 2672 | 99% |
| D | 42 | 23.5 | 58 | 33.5 | 68 | 2574 | 7% |
| E | 53.5 | 37 | 51.5 | 46 | 81.5 | 2646 | 99% |
| F | 59 | 36.5 | 63 | 31.5 | 79.5 | 2648 | 98% |
| G | 48 | 24.5 | 59 | 43.5 | 65.5 | 2602 | 54% |
| H1 | 45.5 | 40 | 51.5 | 40.5 | 75.5 | 2616 | 81% |
| H2 | 55 | 33.5 | 65 | 39 | 74 | 2646 | 99% |


It can be seen that play improved in every case but one (version D). It is also interesting that training on grandmaster games (version G) had only a small effect, while adding the program's own games to the grandmaster games, together with adjusting the piece values (version H2), proved a fairly successful combination.

The strongest overall was version C, with a rating gain of about 70 points. For the given number of games this margin is statistically significant; the error is plus or minus 30 points.

So far we have trained and tested the program at ultra-short time controls, where one game lasts a few seconds. Let us check how the improvements hold up in "serious" play, at longer time controls.

| Time control | Number of games | Result | Rating | LOS |
|--------------|-----------------|--------|--------|-----|
| 1 min + 1 sec/move | 200 | 116.5 - 83.5 | +56 | 99% |
| 3 min + 2 sec/move | 100 | 57.5 - 42.5 | +45 | 94% |
| 5 min per 40 moves | 100 | 53.5 - 46.5 | +21 | 77% |


So, despite a slight drop in the effect as game length increases, the trained version clearly plays stronger than the engine with the original settings. It was released as yet another "final" release of the program.

GreKo 2015 ML



The GreKo 2015 ML program can be downloaded freely together with its C++ source code. It is a console application for Windows or Linux. To play against it, analyze, or run matches against other engines, you may want a graphical interface such as Arena or WinBoard, but you can also play directly from the command line by entering moves in standard notation.

GreKo's self-learning feature is implemented as a built-in command of the console mode (the author is currently not aware of other engines supporting this functionality). The vector of 27 evaluation coefficients is stored in the file weights.txt. To tune it automatically from a PGN file, type the learn command, for example:

White(1): learn gm2600.pgn 


The program will read all games from the specified file, create an intermediate file with training positions, and split it into training and test subsets:

 Creating file 'gm2600.fen'
 Games: 27202
 Loading positions...
 Training set: 1269145
 Validation set: 317155


It will then save the original parameter values to the weights.old file and begin the optimization process. While it runs, intermediate values of the weights and of the objective function are printed to the screen and to the file learning.log.

 Old values saved in file 'weights.old'
 Start optimization...
 0   0.139618890118  0.140022159883  2016-07-21 17:01:16
 Parameter 6 of 27: PawnDoubled = -10
 Parameter 7 of 27: PawnIsolated = -19
 1   0.139602240177  0.140008376153  2016-07-21 17:01:50  [1.7] -20
 2   0.139585446564  0.139992945184  2016-07-21 17:01:58  [1.7] -21
 3   0.139571113698  0.139980624436  2016-07-21 17:02:07  [1.7] -22
 4   0.139559690029  0.139971803640  2016-07-21 17:02:15  [1.7] -23
 5   0.139552067028  0.139965861844  2016-07-21 17:02:23  [1.7] -24
 6   0.139547879916  0.139964477620  2016-07-21 17:02:32  [1.7] -25
 7   0.139543242843  0.139961056939  2016-07-21 17:02:40  [1.7] -26
 8   0.139542575174  0.139962314286  2016-07-21 17:02:48  [1.7] -27
 Parameter 8 of 27: PawnBackwards = -5
 9   0.139531995624  0.139953185941  2016-07-21 17:03:04  [1.8] -4
 10  0.139523642489  0.139947035972  2016-07-21 17:03:12  [1.8] -3
 11  0.139518695795  0.139943580937  2016-07-21 17:03:21  [1.8] -2
 12  0.139517501456  0.139943802704  2016-07-21 17:03:29  [1.8] -1
 Parameter 9 of 27: PawnCenter = 9
 Parameter 10 of 27: PawnPassedFreeMax = 128
 13  0.139515067927  0.139941456600  2016-07-21 17:04:00  [1.10] 129
 14  0.139500815202  0.139927669884  2016-07-21 17:04:08  [1.10] 130
 ...


When training completes, the weights.txt file will contain the new weight values, which take effect the next time the program is started.

The learn command can take two more arguments: the lower and upper bounds of the optimization range. By default they are 6 and 27, i.e. all features except the piece values are optimized. To enable full optimization, the bounds must be specified explicitly:

 White(1): learn gm2600.pgn 1 27 


The algorithm is randomized (in the way it splits the data into training and test samples), so different runs may produce different coefficient vectors.

Conclusions



To tune the evaluation function, we used self-play learning (reinforcement learning). The best results were achieved by analyzing the program's games against itself. In effect, the only external source of chess knowledge was the GUI shell's opening book, needed to randomize the games played.

We were able to improve the predictive ability of the evaluation, which led to a statistically significant increase in playing strength at various time controls, both against the previous version and against a set of independent opponents. The improvement was 50...70 Elo points.

It is worth noting that this result was achieved with fairly modest volumes of data: about 20 thousand games and roughly two million positions (for comparison, AlphaGo was trained on 30 million positions from games of strong amateur players on a Go server, not counting its subsequent games against itself). GreKo's evaluation function is also very simple, with only 27 independent parameters; in the strongest chess engines their number can run into the hundreds or thousands. Nevertheless, even in such a setting machine-learning methods led to success.

Further improvement of the program could include adding new terms to the evaluation function (in particular, taking the game stage into account for all of the parameters considered) and using more advanced methods of multidimensional optimization (for example, searching for global extrema). For the moment, however, the author's plans in this direction remain undecided.



Links



Source: https://habr.com/ru/post/305604/

