Mini AI Cup # 3: Writing a Top Bot

At the beginning of autumn, the contest for writing bots Mini AI Cup # 3 (aka Mad Cars) was completed, in which participants had to fight on typewriters. The participants argued a lot about what would work and what would not be, ideas from simple ifs to learning neural networks were expressed and tested, but the top places were taken by the guys with the so-called “simulation”. Let's try to figure out what it is, compare the solutions for the 1st, 3rd and 4th places and discuss other possible solutions.

Disclaimer

The article was written in collaboration with Alexey Dichkovsky (Commandos) and Vladimir Kiselev (Valdemar) .

For those who just want to read about the decisions of the winners, I advise you to immediately start from the point "Simulation".

Formulation of the problem

This time, the mechanics of the world looked a lot like the mobile game Drive Ahead: players were given a car with a button on it; the task is to push the button of the enemy faster than he does. If nobody wins over 600 game ticks, the card starts to sink into a pile of garbage, which can also press a button. In other words, you need to protect your button from enemies, the outside world and heaps of garbage (vitally, yes). Each player was given 5 lives, the game went from 5 to 9 rounds, until someone ended their lives. Each round was held on a random map and typewriters, the same for both participants. In total there were 6 different cards and 3 types of cars - a total of 18 different combinations.

Each round is divided into tiki. A tick is one move, like in chess. The only difference is that both players walk at the same time. There are competitions where everyone moves in turns, or you can do the action only once in several moves and select units with a frame .
Each tick bot comes a state of the world and is given the opportunity to perform 3 actions: , , . These actions force the car to go in one of the directions, and if it does not touch the wheels of the earth, it gives a slight rotation to the whole body (a bit of arcade physics). After both opponents have chosen an action, a simulation of the game world starts, a new state is considered and sent to the players. In case someone pushed the button, the round ends and the next one starts. It's simple, but there are nuances.

More complete rules can be found here . A game finals see here .

General description of the solution

Most of the bots writing competitions are very similar: there are a finite number of ticks (there is approximately 1500 maximum for one round), there are a finite number of possible actions, you need to choose a sequence of actions to be better than your opponents. A little later, let us return to what it means to be better, but for now let's deal with how to deal with the main problem - a huge number of options: at the start we have one initial state, then each machine can move in three different ways that gives us 9 different combinations for two machines, by the 1500 move it will be already 9 ^ 1500 different combinations ... Which is a little more than we would like if we plan to have time to sort through them during the Universe existence.

This is where we come to what simulation is . This is not some kind of algorithm, but simply the re-creation of the rules of the game with sufficient or complete accuracy so that you can sort out the solutions. Of course, we will go through not all the solutions, but only a part of them. The search algorithm will be used for this - in the tree of the game states we are looking for the best for us. There are a lot of algorithms (from minimax to MCTS), each has its own nuances. It is best to familiarize yourself with what decisions were written by participants in past competitions in AI. This will give a basic understanding of the conditions under which the algorithms work, and under which not. There are many references for this in a special repository .

When choosing an algorithm should be considered:

time limit for 1 tick (here I miscalculated strongly this year, but I was able to stay in 3rd place);
number of players. For example, if there are three players, it will be difficult to use a minimax;
simulation accuracy, because this may allow reuse of old calculations;
"branching" of the state tree (is it possible to calculate all possible states at least 10 moves ahead);
common sense - do not start writing MCTS if the competition lasts 4 hours.

In this competition, 1 tick was given about 10-13ms (2 minutes for the whole game). During this time, the bot had to read the data, make a decision and send a command to move. This was enough to simulate about 500-1000 moves. Enumerate all the states will not work. The simplest search algorithm can look like a comparison of three options for movement: "50 ticks go left", "50 ticks go right", "50 ticks press stop". And no matter how simple it may sound, it is not very far from the decision of the winner.

Since we count only 50 moves ahead, which in most cases does not count to the end of the game, then we need an evaluation function that tells how good and bad the state of the world is for us. Most often it is built on heuristics and understanding what is important for victory. For example, in the 2014 Russian AI Cup competition there were races, but it was possible to win and be the last to arrive if you earn more bonus points. Hence, the evaluation function should stimulate the collection of points at the same time as fast movement on the highway. The estimate can be calculated only for the last simulation state (after 50 ticks) or as the sum of the estimates of intermediate states. Often, the estimate "fades out" in time, so that states that occur earlier are more affected. Since we can not predict the enemy for sure, and future options are less likely to happen, we will not rely heavily on them. Also, this technique causes the bot to quickly perform its tasks, rather than postpone everything until later. But it is worth noting that the bot will be less risky for the sake of subsequent benefits.

Since we are going to predict the state of the world in response to our actions, then we need to somehow simulate the behavior of enemies. There is nothing difficult and there are a couple of common options:

Stub or heuristics
It is written simple logic of behavior, where the enemy simply does nothing, or chooses actions based on simple heuristics (for example, you can use your first versions of the strategy or simply repeat the opponent’s previous move).
Use the same algorithm as for yourself.
First we try to find the best actions for the enemy (against our best set of actions from the last move, or against the stub), and then we are looking for the best action for ourselves, using the behavior that we found for the enemy. Here the bot will try to resist cunning enemies. Such logic does not work well at the start of the competition, since Many bots are still very weak, and your decision will be too cautious with them.
Other
The same minimax enumerates all the moves of the players at the same time, and he simply does not need heuristics.

If you implement all the above steps, you will most likely get a very good bot, especially if you manage to find a good evaluation function. But, looking through the fights, it is possible to notice that in certain situations he behaves strangely. Correcting the evaluation function for these situations can be difficult, or there is a high risk of breaking another logic. Here crutches and ifs come to the rescue. Yes, the last days of the competition often boil down to writing crutches and ifs to correct the flaws in some specific conditions. Personally, I don’t like this part very much, but I’ve noticed many times that crutches in the finals can affect the location of places in the top ten, which means one unwritten if can cost you a prize (my heart hurts when I write these words, I I also love beautiful algorithms and solutions).

Q: Is it possible to do without simulation at all?
A: Yes, you can use solutions on heuristics (decision trees, lots of ifs, etc.). There is a good article with AI architectures on heuristics.

Q: How much better is the use of simulation approaches on heuristics?
A: It all depends on the task. For example, here some combinations of cards and machines could be hard-coded with ifs and always win (or a draw). However, often the simulation finds solutions that are difficult to think of yourself, or it is difficult to implement heuristics. In this competition, when you turn over another machine, the solutions on the simulations put their wheels on the enemy's wheel, which turned off the flag "in the air", which means the enemy could not use body rotation and turn back on the wheels. But the decision did not reflect on the meaning of this; it simply found options where the enemy would quickly fall onto the roof and press its button.

Q: Neural networks and RL?
A: No matter how popular it is, in solutions of bots such solutions rarely show themselves well. Although neural networks do not need to be simulated, because they can simply produce an action based on the input parameters of the current state, they still need to somehow learn, and for this they often have to write a simulator to drive games by the thousands locally. Personally, I believe that they have the potential. Perhaps they can solve part of the problem, or use it in conditions of very limited response time.

Note
Regarding the finite number of possible actions, it is worth clarifying that sometimes it is allowed to "smoothly" adjust a parameter. For example, not just go ahead, but by some percentage of power. In this case, the “limbs” of the number of outputs is easy to achieve simply by using several values, for example, 0%, 25%, 50%, 75% and 100%. Most often just two are enough: "fully on" and "completely off."

Simulation

In this competition, the ready chipmunk physics engine was used. The expectations of the organizers were that he is old, time-tested and has many wrappers so that everyone can incorporate it into their decision ...

In the harsh reality, the engine produced different values each time, which made it difficult for it to restart to miscalculate the move options. The problem was solved “in the forehead” - my memory allocator was written in C and the piece of memory with the state of the world was copied entirely. Such an allocator put an end to the possibility of writing solutions in languages other than C ++ (in fact, this was possible, but it would be very laborious and the allocator would still have to write in C). In addition, the accuracy of prediction was influenced by the order of adding elements to the game world, which required a very accurate copy of the code used by the organizers for rendering games. But he was already in Python. The final nail in the coffin of other programming languages was that the engine is old and contains many optimizations that cannot be exactly recreated during the competition to get your trimmed version of the physics simulation.

As a result, the engine, which was supposed to give all participants equal conditions for the simulation of moves, became the most difficult obstacle for this. Man overcame it. The first 7 leaderboard places were taken exclusively by the guys who made an accurate simulation, which may serve as some proof of its importance in such competitions.

With the exception of a couple of participants who were able to climb inside the chipmunk and optimize the copying of its state, the others had a simulation of approximately equal performance (which made the competition a bit more interesting, because you know that the struggle is for the solution algorithm, and not "who will count the moves more").

Algorithm of search and prediction of the enemy

From this point begins a separate description of the solutions. The description of the algorithms will be conducted on behalf of its author.

Vladimir Kiselev (Valdemar) 4 place

A random search (Monte Carlo) was used to search the solution space. The algorithm is as follows:

We initialize the genome - a sequence of actions (left, right, stop) for 60 ticks with random data.
We take the best genome found
Randomly change one of the actions
Using the evaluation function, we get a number - an indicator of how good the new genome is.
If you have a better solution, then update the best solution.
Repeat again with paragraph 2

My simulator gave out ~ 100k simulations of the world in 1 second, considering that there is an average of ~ 12ms per tick, we get 1200 actions per tick. That is, for 1 tick we manage to complete a full cycle about 20 times.

To find the optimal solution of such a number of iterations is clearly not enough. Therefore, the idea of “stretching” actions was implemented: instead of the genome of 60 moves, we will operate on a chain of 12 “stretched” moves - we believe that each action lasts 5 ticks in a row.
Plus: Improving the quality of mutations by reducing the length of the genome, a simulation can also be run every 5 ticks and check 100 genomes instead of 20 (to avoid falls over the time limit, eventually stopped at 70).
Minus: Stretching actions can lead to non-optimal solutions (for example, swinging on the bumper, instead of a stable rack)

It should be noted techniques that have significantly improved the quality of the algorithm:

We carry out random initialization only on the first tick, all the rest of the time we reuse the best found solution with a shift of 1 turn (the action on the 2nd tick is shifted to the 1st, etc., the random action is added to the end). This greatly improves the quality of the search, because otherwise the algorithm “forgets” what it was going to do on the last tick and meaninglessly twitches in different directions.
At the beginning of the turn, we make more intensive changes (changing the genome 2 or 3 times instead of one) in the hope of getting out of a local maximum (similarity of temperature in the method of simulating annealing).
The intensity was selected manually: the first 30 iterations make 3 mutations, the next 10 by 2, then by 1.
It is very important to predict the actions of the enemy. To the detriment of time to search for your own solution, we launch a random search by the enemy, for 20 iterations, then 50 for ourselves, using the information on the opponent’s optimal moves.
The best decision of the opponent is also reused on the next move with an offset. At the same time, during the search for the solution of the enemy, the genome from the last move is used as my intended actions.

During the competition, I actively used tools for local development, which allowed us to quickly find bugs and focus on the weak points of the strategy:

local arena - the launch of many matches against the previous version;
visualizer for debugging behavior;
script for collecting statistics on matches from the site - allows you to understand on which maps and machines most often occur defeats.

mortido:
It is risky to count every 5 ticks, especially if the enemy moves away from the options you predicted. On the other hand, in this game world for 5 ticks, not much happened.
Also, in my decision, I, nevertheless, added random combinations of each tick, but I won’t say exactly how this affected the decision.

Commandos:
Changing a couple of actions with such a number of simulations does not seem to be very meaningful, because very little changes occur in one action. But when stretching one action by 5 ticks, the meaning seems to be getting bigger.
I also do not like the idea itself - we take the best set and try to rule it somewhere in the beginning. It seems to me illogical that the change of the first tics will leave the subsequent ones at least relatively adequate.

Alexander Kiselev (mortido) 3 place

Armed with articles from the winners of other contests, I decided to use a genetic algorithm. It turned out, however, something similar to a random search or even an imitation of annealing, but more on that later.

Encode the solution with an array of 40 numbers, where -1, 0 and 1 correspond to the movement of , and .

At the beginning of each round, I considered how much time I had spent for the entire game, considered a new time limit based on how many more rounds there would be, and I considered each round a duration of 1200 ticks. So Initially, I tried to spend no more than 11ms per move, but I could “roam” a little at the end if the previous rounds were faster than 1200 ticks.

Valdemar:
Interestingly, this chip has worsened my game. It turned out that it is always better to spend 20-30ms than first 11, and at the end 60

A third of this time I was looking for the best move of the enemy, the rest was spent on calculating my own decision. When searching for a move for an enemy, my behavior was modeled as the best from the last move, shifted by 1 tick. Those. as if I continue to act according to the plan made in the last tick, and he is trying to resist me.

The search for the solution itself was the same for both itself and the opponent:

We take the decision from the last move and move it by 1 move (which we have already done)
Throwing in the population of random solutions until we fill it all
We simulate all the solutions and expose the fitness using the evaluation function. We remember the best.
While there is time for calculations
1. Hint, always add 1 mutation of the current best solution to the population, remember it if it is better
2. As long as there is a place in the new population and the time for calculations is not exceeded (you can get out on the floor of the populated population)
  1. We take two different individuals and leave with the best fitness - mom
  2. We take two different individuals and leave with the best fitness - dad (should not coincide with my mother)
  3. We cross them
  4. Mutated if RND <
  5. We simulate the decision and remember it if it is the best.

As a result, we will return a sequence of actions that is considered optimal. The first move in it is sent as a bot action. Unfortunately, in my plan there was a serious drawback, since the number of simulations that can be done for a tick was very small (including due to the long evaluation function), then on the competition server 4 the item was performed only 1 time, and for the enemy it wasn’t completely performed at all. This made the algorithm look more like a random search or annealing imitation (since we had time to mutate one time solution from the last move). It was already too late to change something, and we managed to keep 3rd place.

The implementation of the crossing, mutation, and generation of initial random solutions is important, because depends on what decisions will be checked, and the full random is not as good as it might seem at first glance (it will work, but it will take much more options to sort through).

In the final version, the generation of random solutions occurred in segments, which excluded the solutions "jerking" in one place:

A random team was selected.
For the entire solution length (40 moves)
1. Write the current command to the cell.
2. With a probability of 10%, we change the current command to a random one.

A similar technology also involved a mutation — a random segment of the solution was replaced with a random command. Crossing occurred by choosing the point to which the decision was taken from 1 parent, and then from the 2nd.

I liked that we use all the time available to us to find the best solution. And it’s not scary if the solution is not the best - we can improve it on the next tick, because optimization turns out "smeared" in time. , . , - , . ,

Valdemar:
1 , , .

Commandos:
— - .
— , . , … , . " ”. -.

(Commandos) 1

( ), n m . 3^2=9 . m + n 40 .

 |----------- n  -----------|---------- m  --------| |   ...   |   ...   |

: , , . ( ).

n m , . , .

, ( , ):
- , , , .
- , , . . . , , , .
- . ; ( ).
n m . , .1, , . - ( ) , — , ;
. , — . , ( ).

Valdemar:
, 2 . . , .
')
mortido:
Great! , . . , 2 , 40-60 . , 3 .
n + m == const ?

. n + m != const , . , . - .

(Valdemar) 4

, . , ( , , ..) [0..1].
. : , .
, , : , .

, :

— 70 180 ( : ).
, .
0..500
— [2pi, pi/4] [0, 1]
— , ( ), ( , , )
— , , , .
, , .
— . .
Y — .

, 2 , .

“” ,
“ ” , , .

mortido:
, .. , .

Commandos:
, . -

(mortido) 3

, chipmunk. . , , , , . .

3 .

, ( , , ):

. , , ( , );
, — , ; , 1 ;
;
( , );
( “+”, “-”);
- ( “+”, “-”); , , , ;
30 , , ( );
, .

, , (, , )

Valdemar:
. , “ , , ” , ( ..) .

, , . .

Commandos:
, , “”… ? , “” .

(Commandos) 1

SquaredWheelsBuggy , .. , . Buggy , , ( /).
:

;
; — , , 1 0; .. ;
. ; 10 ( );
( , );
(, );
— - , ;
/ ; , — ; .

1-5 , . 2 “ ”.

Valdemar:
, . , .

mortido:
, 10 .

IF'

(Valdemar) 4

, if'. 3 , , . , , -.
: , “” — , - , ( , ) — .

. , .
— , .
. “ ” .
, , .
, , .

, : . , , if' .

mortido:
, . .

Commandos:
if'. , , … , , .

(mortido) 3

- .

3 . . . “”, . , , .

, “” . . , , , - . . , , .. .

, , , , , . … . - — , ( , ).

Valdemar:
, . . “” , if'. , — .

, + . , .

Commandos:
… , - — , , . , , .

(Commandos) 1

. (, , ). ( ) /.

pill carcass map , , ( ). island map, , .

island hole buggy. / , , ( ). — . , , . SquaredWheelBuggy . , , , . , … , , .

(Pill map, Bus) , ( / 100% ).

pill hubble map. , ( ), . .

— , ...

, . , . ( ).

Valdemar:
, — . , .

mortido:
, “” .

Valdemar:
. , . ( ) .
. “”, , , , :)
, mailru , .

mortido:
: , … , , ( ). , 3 , , … .

Commandos:
- , . , , , . … . — , .
— ++. . , . 1 -.

, . , . , , .

Mail.Ru Group .