
Breadth-first search optimization: how to handle a graph with 10 billion states


A couple of months ago I finally had to admit that I was not smart enough to finish some of the levels in the Snakebird puzzle game. The only way to regain some self-esteem was to write a solver; then I could pretend that writing a program to solve a puzzle is almost the same as solving it myself. The code for the resulting C++ program is on GitHub. The bulk of the code discussed in this article lives in search.h and compress.h. In this post I will mainly talk about optimizing a breadth-first search that would nominally need 50-100 GB of memory so that it fits in 4 GB.

The specifics of the game will be covered in a later post. For this one, all you need to know is that I could not find any good alternatives to brute force, because none of the usual tricks worked. The game has a huge number of states, since there are many objects that can be moved or pushed, and the shape of some of them, which changes over time, matters. There was no suitable conservative (admissible) heuristic for A*-style algorithms that would narrow the search space. The search graph is directed and given implicitly, so searching forward and backward simultaneously was not possible. And a single move can change the state in many unrelated ways, so nothing like Zobrist hashing would help.

Back-of-the-envelope calculations showed that the biggest puzzle would have about 10 billion states after eliminating all symmetric positions. Even with the state representation packed as densely as possible, a state took 8-10 bytes. With 100 GB of memory the task would be trivial, but not on my home machine with 16 GB of RAM. And since Chrome needs 12 GB of that, my real budget is closer to 4 GB. Anything above that would have to be spilled to disk (an old and rusty hard drive).

How do you fit 100 GB of data into 4 GB of RAM? Either a) the states need to be compressed to 1/20th of their already optimized size, or b) the algorithm needs to be able to efficiently spill states to disk and read them back, or c) some combination of the above, or d) I need to buy more RAM or rent a powerful virtual machine for a few days. Option D I did not consider, because it is too boring. Options A and C were ruled out by a proof of concept with gzip: a 50 MB chunk of state descriptions compressed to only 35 MB. That is about 7 bytes per state, while my memory budget is about 0.4 bytes per state. So option B was the one left, even though a breadth-first search looked rather awkward to run out of secondary storage.

Content


This is a fairly long post, so here is a quick overview of the sections to come:

- Breadth-first search, by the textbook
- BFS with sorting and merging
- Compression
- Uh-oh, I cheated!
- Sort + merge with multiple outputs
- Swapping
- Compressing new states before merging
- Saving space on parent states
- What did not work or may not work
- Conclusion


Breadth-first search, by the textbook


What is breadth-first search, and why shouldn't it use the disk? Before this little project I had only ever considered textbook formulations, something like this:

def bfs(graph, start, end):
    visited = {start}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return True
        for kid in adjacent(node):
            if kid not in visited:
                visited.add(kid)
                todo.push_back(kid)
    return False

As the program generates new candidate nodes, each node is checked against a hash table of already visited nodes. If it is already in the hash table, the node is ignored. Otherwise it is added both to the queue and to the hash table. Sometimes implementations store the "visited" flag in the nodes themselves rather than in a side table; but that is a risky optimization, and it is outright impossible when the graph is given only implicitly.

Why is using a hash table a problem? Because hash tables tend to produce a completely random memory access pattern. If they do not, it is a bad hash function, and the hash table will probably perform poorly due to collisions. This random access pattern can cause performance problems even when the data fits in memory: an access to a huge hash table is likely to cause both cache misses and TLB (translation lookaside buffer) misses. But what if a significant portion of the data is on disk rather than in memory? The consequences are catastrophic: something on the order of 10 ms per lookup.

With 10 billion unique states, that means roughly four months of waiting just for the disk I/O of the hash table accesses. That will not do; the task absolutely has to be transformed so that the program can process large batches of data in a single pass.

BFS with sorting and merging


If we want to batch data accesses as aggressively as possible, what is the coarsest achievable granularity? Since the program cannot know which nodes to process at depth layer N+1 before layer N has been fully processed, it seems obvious that state deduplication has to happen at least once per depth layer.

If we work with a whole layer at a time, we can drop the hash tables and represent the visited set and the new states as sorted streams of some kind (file streams, arrays, lists, and so on). We can trivially find the new visited set as the union of the two streams, and just as trivially find the todo set as their set difference.

The two set operations can be combined so that they run in a single pass over both streams. In effect, we look at the heads of both streams, process the smaller element, and then advance the stream the element came from (or both streams, if the heads are equal). In either case we add the element to the new visited set. Whenever we advance only the stream of new states, we also add the element to the new todo set:

def bfs(graph, start, end):
    visited = Stream()
    todo = Stream()
    visited.add(start)
    todo.add(start)
    while True:
        new = []
        for node in todo:
            if node == end:
                return True
            for kid in adjacent(node):
                new.push_back(kid)
        new_stream = Stream()
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited)
    return False

# Merges sorted streams new and visited. Returns a sorted stream of
# the elements that were only present in new, and another sorted
# stream containing the elements that were present in either or
# both of new and visited.
def merge_sorted_streams(new, visited):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            if visited.peek() == new.peek():
                out_visited.add(visited.pop())
                new.pop()
            elif visited.peek() < new.peek():
                out_visited.add(visited.pop())
            elif visited.peek() > new.peek():
                out_todo.add(new.peek())
                out_visited.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek())
            out_visited.add(new.pop())
    return out_todo, out_visited

The data access pattern is now completely linear and predictable; there is no random access anywhere in the merge. Disk latency therefore becomes irrelevant, and the only thing that matters is throughput.

What would the theoretical performance look like, with the simplified assumption of the data being spread evenly across 100 depth layers of 100 million states each? The average state gets read and written 50 times. That is 10 bytes/state * 5 billion states * 50 = 2.5 TB. My hard drive can supposedly read and write at an average of 100 MB/s, which means (2 * 2.5 TB) / (100 MB/s) = ~50,000 seconds = ~13 hours spent on I/O. That is a couple of orders of magnitude better than the earlier estimate (four months)!

It is also worth noting that this simplified model ignores the size of the newly generated states, which have to be held in memory for sorting and deduplication before the merge stage. We will come back to that in the sections below.

Compression


In the introduction I said that in the initial experiments compressing the states did not look promising: the compression ratio was only about 30%. But after the change of algorithm, the states are now sorted, and sorted data should be much easier to compress.

To test this theory, I used zstd on a puzzle with 14.6 million states, each state 8 bytes in size. After sorting, the states compressed down to an average of 1.4 bytes per state. That is a serious step forward. It is not enough to run the whole program in memory, but it could cut the disk I/O time down to just a couple of hours.
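As a rough illustration of the experiment (not the solver's actual code, which is C++ and uses zstd), this sketch packs fixed-size state records, sorts them, and measures the compressed size per state; zlib from the standard library stands in for zstd here:

import zlib  # stand-in for zstd, which is what was actually used

def compressed_bytes_per_state(states, state_size=8):
    # states: iterable of fixed-size byte strings, one per game state
    records = sorted(states)          # sorting is what makes this work
    assert all(len(s) == state_size for s in records)
    blob = b"".join(records)
    return len(zlib.compress(blob, 9)) / len(records)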

Can we do better than a modern general-purpose compression algorithm if we know something about the structure of the data? Almost certainly. A good example is the PNG format. In principle its compression is just a standard Deflate pass, but instead of compressing the raw data, the image is first transformed with PNG filters. A PNG filter is essentially a formula that predicts the value of a raw byte from the value of the same byte on the previous row and/or the previous pixel. For example, the "up" filter transforms each byte by subtracting from it the value of the same byte on the previous row during compression, and doing the reverse during decompression. Given the kinds of images PNG is used for, the result almost always consists of zeros or numbers close to zero, which Deflate can compress far better than the raw data.

Can a similar principle be applied to BFS state records? It seems like it should be possible. As with PNG, we have a fixed row size and we can expect adjacent rows to be very similar. The first experiments with a subtraction/addition filter followed by zstd improved the compression ratio by a further 40%: 0.87 bytes per state. The filtering operations are trivial, so they are practically free in terms of CPU.
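A minimal sketch of such a subtraction/addition filter (illustrative only, not the solver's implementation): a PNG-style "up" filter for fixed-size records, whose output is then handed to the general-purpose compressor as before.

def up_filter(rows):
    # Replace each row by its byte-wise difference (mod 256) from the
    # previous row. Sorted rows differ only slightly, so the output is
    # mostly zeros and compresses much better.
    prev = bytes(len(rows[0]))
    out = bytearray()
    for row in rows:
        out += bytes((b - p) % 256 for b, p in zip(row, prev))
        prev = row
    return bytes(out)

def up_unfilter(data, row_size):
    # Inverse transform: a running byte-wise sum restores the original rows.
    prev = bytes(row_size)
    rows = []
    for i in range(0, len(data), row_size):
        row = bytes((d + p) % 256 for d, p in zip(data[i:i + row_size], prev))
        rows.append(row)
        prev = row
    return rows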

It was not clear to me whether further improvements were possible or whether this was the practical limit. With images it is reasonable to expect adjacent bytes within one row to be similar, but state records have no such property. Yet it turns out that slightly more elaborate filters can still improve the results. In the end I arrived at the following scheme:

Suppose we have adjacent rows R1 = [1, 2, 3, 4] and R2 = [1, 2, 6, 4]. When emitting R2, we compare each byte with the corresponding byte of the previous row, writing 0 for a match and 1 for a mismatch: diff = [0, 0, 1, 0]. We then emit that bitmap encoded as a VarInt, followed only by the bytes that differ from the previous row. In this example we emit two bytes, 0b00000100 and 6. On its own this filter compresses the reference data to 2.2 bytes/state, but filter + zstd together bring the data down to 0.42 bytes/state. Put another way, that is 3.36 bits per state, only slightly above the rough budget we calculated for fitting everything into RAM.
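Here is a small sketch of this filter for a single row (illustrative, not the solver's code; a one-byte mask is enough for the 8-byte states used here, whereas the real filter encodes the mask as a VarInt):

def bitmap_filter_row(prev, row):
    # Emit a bitmask of which bytes differ from the previous row, followed
    # by only the differing bytes.
    mask = 0
    changed = bytearray()
    for i, (p, b) in enumerate(zip(prev, row)):
        if p != b:
            mask |= 1 << i
            changed.append(b)
    return bytes([mask]) + bytes(changed)

def bitmap_unfilter_row(prev, data):
    # Inverse: rebuild the row from the previous row, the mask, and the
    # changed bytes.
    mask, payload = data[0], data[1:]
    row = bytearray(prev)
    j = 0
    for i in range(len(prev)):
        if mask & (1 << i):
            row[i] = payload[j]
            j += 1
    return bytes(row)

# With R1 and R2 from the example above:
# bitmap_filter_row(bytes([1, 2, 3, 4]), bytes([1, 2, 6, 4])) == bytes([0b00000100, 6])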

In practice, compression ratios improve as the search progresses, because the sorted sets become denser. Once the search reaches the point where memory starts being a problem, the compression ratios can get much better. In the biggest puzzle we end up with 4.6 billion visited states; after sorting, these states take 405 MB when compressed with the scheme above. That is 0.7 bits per state. In the end, compression and decompression take about 25% of the program's CPU time, but that is an excellent trade-off for a hundredfold reduction in memory use.

The filter above seems a bit wasteful because of the VarInt header on every row; it looks as if it should be easy to improve at the cost of a little extra CPU or complexity. I tried several options: transposing the data into column-major order, writing the bitmasks in larger blocks, and so on. These variants gave much better compression ratios on their own, but performed worse once the filter output was compressed with zstd. And this was not some quirk of zstd; the results with gzip and bzip2 were similar. I do not have any particularly brilliant theory for why this specific encoding compresses so much better than the alternatives.

Another mystery: the compression ratio turned out to be much better when the data was sorted as little-endian rather than big-endian. Initially I assumed this was because little-endian sorting produces more leading zeros in the VarInt-encoded bitmask. But the difference persists even with filters that have no such dependency.

(There is a lot of research on compressing sorted sets of integers, since they are a basic building block of search engines. I did not, however, find much about compressing sorted fixed-length records, and did not feel like guessing my way through treating the records as arbitrary-precision integers.)

Uh-oh, I cheated!


You may have noticed that the BFS implementations in the pseudocode above return only a boolean: solution found / not found. That is not very useful. In most cases we need the exact list of moves in the solution, not just a report that a solution exists.

At first glance the problem seems easy to fix. Instead of collecting sets of states, collect a mapping from each state to its parent state; then, once a solution is found, simply walk the parent links back from the end to the start. For the hash-table solution it would look something like this:

def bfs(graph, start, end):
    visited = {start: None}
    todo = [start]
    while todo:
        node = todo.pop_first()
        if node == end:
            return trace_solution(node, visited)
        for kid in adjacent(node):
            if kid not in visited:
                visited[kid] = node
                todo.push_back(kid)
    return None

def trace_solution(state, visited):
    if state is None:
        return []
    return trace_solution(visited[state], visited) + [state]

Unfortunately, this would destroy all the compression gains from the previous section; they rely on the assumption that adjacent rows are very similar. That is true when we look only at the states themselves, but there is no reason to expect it to hold for the parent states; they are effectively random data. Second, the sort + merge solution has to read and write the entire visited set on every iteration. To keep the state / parent-state pairs, we would have to read and write all of this poorly compressible data to disk on every iteration.

Sort + merge with multiple outputs


The state / parent-state pairs are only needed at the very end, when tracing the solution backward. We can therefore maintain two data structures in parallel. Visited is still the set of visited states, recomputed during each merge as before. Parents is a sorted list of state / parent-state pairs which is never rewritten; after each merge operation, the newly found states together with their parents are appended to parents.

def bfs(graph, start, end):
    parents = Stream()
    visited = Stream()
    todo = Stream()
    parents.add((start, None))
    visited.add(start)
    todo.add(start)
    while True:
        new = []
        for node in todo:
            if node == end:
                return trace_solution(node, parents)
            for kid in adjacent(node):
                new.push_back(kid)
        new_stream = Stream()
        for node in new.sorted().uniq():
            new_stream.add(node)
        todo, visited = merge_sorted_streams(new_stream, visited, parents)
    return None

# Merges sorted streams new and visited. New contains pairs of
# key + value (only the keys are compared), visited contains just
# keys.
#
# Returns a sorted stream of the keys that were only present in new,
# and another sorted stream containing the keys that were present in
# either or both of new and visited. Also appends the key + value
# pairs for keys only present in new to the parents stream.
def merge_sorted_streams(new, visited, parents):
    out_todo, out_visited = Stream(), Stream()
    while visited or new:
        if visited and new:
            visited_head = visited.peek()
            new_head = new.peek()[0]
            if visited_head == new_head:
                out_visited.add(visited.pop())
                new.pop()
            elif visited_head < new_head:
                out_visited.add(visited.pop())
            elif visited_head > new_head:
                out_todo.add(new_head)
                out_visited.add(new_head)
                parents.add(new.pop())
        elif visited:
            out_visited.add(visited.pop())
        elif new:
            out_todo.add(new.peek()[0])
            out_visited.add(new.peek()[0])
            parents.add(new.pop())
    return out_todo, out_visited

This gives us the best of both approaches in terms of run time and working-set size, at the cost of more secondary storage. It also turns out later that a separate copy of the visited states, grouped by depth, will come in handy for other reasons.

Swapping


The pseudocode also ignores one more detail: there is no explicit disk I/O code, only the abstract Stream interface. A stream might be a file stream or an in-memory array, but we have ignored that implementation detail. Instead, the pseudocode just produces a memory access pattern that would make optimal use of a disk. In an ideal world that would be enough, and the OS virtual memory subsystem would take care of the rest.

But that is not what happens, at least on Linux. At one point (before the working set had been compressed down to fit in memory) I had the program running in about 11 hours with the data stored mostly in files on disk. I then switched the program to anonymous pages instead of file-backed ones, and set up a sufficiently large swap file on the same disk. Three days later the program was a quarter of the way through, and still getting slower over time. By my optimistic estimate it would have finished in 20 days.

To be clear, this was the same code and exactly the same access pattern. The only thing that changed was whether the memory was backed by an explicit file or by swap. It is close to folklore that swapping wrecks performance on Linux while ordinary file I/O does not. I had always assumed this was because programs tend to treat RAM as randomly accessible memory. But that was not the case here.

It turns out that file-backed pages and anonymous pages are treated differently by the virtual memory subsystem. They are kept in separate LRU caches with different expiration policies, and they also appear to have different readahead / prefetch behavior.

So now I know: swapping on Linux will most likely perform badly even under optimal conditions. If parts of the address space are likely to be paged out to disk for a while, it is better to back them with explicit files than to trust swap. I did this by implementing a custom vector class that starts out purely in memory and, once a size threshold is exceeded, switches to an mmap of a separate temporary file.
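A rough sketch of that idea (the real solver uses a custom C++ vector class; the class name, threshold, and growth strategy below are made up for illustration):

import mmap
import os
import tempfile

class FileBackedVector:
    # Fixed-size records live in an ordinary in-memory buffer until a size
    # threshold is crossed, after which the data is moved into an explicit
    # temporary file and accessed through mmap instead of relying on swap.

    def __init__(self, record_size, spill_threshold_bytes=1 << 30):
        self.record_size = record_size
        self.spill_threshold = spill_threshold_bytes  # hypothetical default
        self.buf = bytearray()
        self.file = None
        self.map = None

    def append(self, record):
        assert len(record) == self.record_size
        if self.map is not None:
            # Already spilled: append to the file and refresh the mapping.
            # (A real implementation would grow the file in large chunks.)
            self.map.close()
            self.file.seek(0, os.SEEK_END)
            self.file.write(record)
            self.file.flush()
            self.map = mmap.mmap(self.file.fileno(), 0)
        else:
            self.buf += record
            if len(self.buf) >= self.spill_threshold:
                self._spill_to_file()

    def _spill_to_file(self):
        # Switch from anonymous memory to an explicit temporary file.
        self.file = tempfile.TemporaryFile()
        self.file.write(self.buf)
        self.file.flush()
        self.map = mmap.mmap(self.file.fileno(), 0)
        self.buf = bytearray()

    def __len__(self):
        data = self.map if self.map is not None else self.buf
        return len(data) // self.record_size

    def __getitem__(self, i):
        data = self.map if self.map is not None else self.buf
        start = i * self.record_size
        return bytes(data[start:start + self.record_size])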

Compressing new states before merging


In the simplified performance model we assumed 100 million new states per depth layer. That turned out to be not far from reality (the hardest puzzle peaked at just over 150 million unique new states in a single depth layer). But unique states are not the right thing to measure: the working set before the merge consists not of the unique states, but of all the states generated during that iteration. That figure peaks at 880 million output states for a single depth layer. These 880 million states a) have to be processed with a random access pattern for sorting, b) cannot be compressed efficiently because they are not yet sorted, and c) have to be stored together with their parent state. That working set is about 16 GB.

The obvious solution is some kind of external sorting: write all the states to disk, do an external sort, deduplicate, and then merge as usual. I used this at first, and although it mostly solved problem A, it did nothing for B and C.

In the end I took an alternative approach: collect the states into an in-memory array. When the array grows too large (say, more than 100 million elements), it is sorted, deduplicated, and compressed. This leaves us with a batch of sorted runs of states; there are no duplicates within a run, but there can be duplicates across runs. The code that merges new and visited states stays fundamentally the same: it is still based on walking through streams in lockstep. The only difference is that instead of walking just two streams, there is a separate stream for each sorted run of new states.

Of course, the compression ratios of these 100-million-state runs are not as good as those of the full visited set. But even so, this dramatically reduces both the working set and the disk I/O requirements. There is a little extra CPU cost for maintaining a priority queue of streams, but it is still a great trade-off.
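For illustration (not the solver's actual C++ code), merging many sorted, individually deduplicated runs into one sorted, fully deduplicated stream is exactly what a priority queue over the stream heads gives you; in Python, heapq.merge does the bookkeeping:

import heapq
from itertools import groupby

def merge_runs_dedup(runs):
    # runs: iterables of states, each individually sorted and deduplicated.
    # heapq.merge keeps one head element per run in a priority queue, so the
    # runs themselves can stay compressed on disk and be decoded lazily.
    for state, _ in groupby(heapq.merge(*runs)):
        yield state

# Duplicates across runs (but not within a run) are collapsed:
# list(merge_runs_dedup([[1, 4, 7], [2, 4, 9], [1, 3, 7]])) == [1, 2, 3, 4, 7, 9]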

Saving space on parent states


At this point, the vast majority of the space used by the program goes to storing the parent states so that the solution path can be reconstructed once a solution is found. They are unlikely to compress well, but maybe there is a CPU/memory trade-off to be had?

We need to associate a state S' at depth D+1 with its parent state S at depth D. If we could enumerate all the possible parent states of S', we could simply check which of them appear at depth D in the visited set. (We already have the visited set grouped by depth as a convenient by-product of emitting the state / parent-state pairs during the merge.) Unfortunately that approach does not work for this problem: generating all the possible parents S of a given S' is simply too hard. For many other search problems, though, it might work.

If we can only generate state transitions forward, not backward, why not just do that? Iterate over all the states at depth D and compute their successor states. If some state's successor is S', we have found a suitable S. The catch is that this plan increases the program's total CPU consumption by 50%. (Not 100%, because on average we find S after looking at half of the states at depth D.)

So I do not like either of these extreme options, but at least a CPU/memory trade-off is possible here. Is there a more palatable option somewhere in the middle? In the end I decided to store not the pair (S', S) but the pair (S', H(S)), where H is an 8-bit hash function. To find S for a given S', we again iterate over all the states at depth D. But before doing anything else, we compute the same hash; if it does not match H(S), this cannot be the state we are looking for and we can skip it. This means the expensive recomputation only has to be done for about 1/256 of the states, a slight increase in CPU load, while the memory for storing parent states drops from 8-10 bytes to 1 byte.
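A small sketch of the idea (illustrative only; the actual 8-bit hash function is not specified, so blake2b truncated to one byte stands in here):

import hashlib

def hash8(state: bytes) -> int:
    # Stand-in 8-bit hash; any cheap, well-mixed hash would do.
    return hashlib.blake2b(state, digest_size=1).digest()[0]

def record_parent(parents, child: bytes, parent: bytes):
    # Store only one byte of information about the parent.
    parents.append((child, hash8(parent)))

def recover_parent(child, parent_hash, states_at_depth_d, successors):
    # Re-run the forward move generation over depth D to find the parent.
    # The hash filter rejects ~255/256 of the candidates without doing the
    # expensive successor generation.
    for s in states_at_depth_d:
        if hash8(s) != parent_hash:
            continue
        if child in successors(s):
            return s
    return None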

What did not work or may not work


The previous sections covered the sequence of high-level optimizations that worked. There were other things I tried that did not work, and things I found in the literature but decided would not work for this particular case. Here is an incomplete list.

For a while I did not recompute the entire visited set on every iteration. Instead, it was kept as a collection of sorted runs that were compacted from time to time. The advantage would be fewer disk writes and less CPU spent on compression; the downside is extra code complexity and a worse compression ratio. I originally thought such a scheme made sense because in my setup writes are more expensive than reads. But in the end the difference in compression ratio turned out to be about a factor of two, and the benefits of the trade-off were no longer obvious, so I went back to the simpler form.

There is a small body of research on large-scale breadth-first search of implicitly defined graphs in secondary storage; this 2008 article is a good starting point for exploring the topic. As you might guess, the idea of doing deduplication with sort + merge in secondary storage is not new. The surprising thing is that it was only discovered in 1993. Pretty late! There are also more recent proposals for search in secondary storage that do not require a sorting step.

One of them is to map states to integers and keep an in-memory bitmap of visited states. That is completely useless in my case, where the encodable state space is enormous compared to the number of actually reachable states. And I rather doubt there are interesting problems where this approach would work.

Another serious alternative is based on temporary hash tables. The visited states are stored unsorted in a file. The outputs of depth D are stored in a hash table. We then iterate over the visited states and look each one up in the hash table; if it is found there, it is deleted from the table. After the whole file has been scanned, only the non-duplicate elements remain in the table. They are then appended to the file and used to initialize the todo list for the next iteration. If the amount of output data is so large that the hash table does not fit in memory, both the file and the hash table can be partitioned by the same criterion (for example, the top bits of the state), and each partition processed separately.
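A sketch of that alternative (which, as noted below, I did not pursue); a plain Python set stands in for the hash table, and a single partition is assumed:

def dedup_against_file(visited_file_states, new_states):
    # visited_file_states: unsorted stream of already visited states,
    # e.g. read sequentially from a file.
    # new_states: the outputs of depth D.
    table = set(new_states)             # hash table of depth-D outputs
    for state in visited_file_states:   # one linear pass over the file
        table.discard(state)            # delete anything already visited
    # Whatever is left was never seen before: it becomes the next todo
    # layer and also gets appended to the visited file.
    return table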

There are benchmarks showing the hash-based approach to be roughly 30% faster than sort + merge, but they do not appear to account for compression. I could not see how giving up the gains from compression would pay for itself, so I did not even experiment with these approaches.

Another area worth a look is database query optimization. The deduplication problem seems closely related to a database join, which faces exactly the same sort-vs-hash dilemma, and some of that research obviously carries over to the search problem. One difference may be that a join's output is transient, whereas the output of BFS deduplication persists until the end of the computation. That seems to shift the trade-offs: it is no longer only about processing one iteration as efficiently as possible, but also about producing the most useful output format for the next iteration.

Conclusion


That concludes the overview of what I learned from this project that is applicable to other brute-force search problems. Together, these tricks brought the memory needed to solve the game's hardest puzzles down from 50-100 GB to 500 MB, and the solver degrades gracefully by spilling to disk if the problem exceeds available memory. It is also 50% faster than naive hash-table-based state deduplication, even for puzzles that fit in memory.

Snakebird can be purchased on Steam, Google Play, and the App Store. I recommend it to anyone who enjoys very hard but fair puzzles.

Source: https://habr.com/ru/post/455537/

