
Regular expression search with a suffix array




Back in January 2012, Russ Cox published a wonderful blog post explaining how Google Code Search worked, using a trigram index.



By that time, the first versions of my own source code search system, livegrep, which uses a different indexing method, had already been released; I built it independently of Google, with the help of several friends. In this article, I would like to offer a somewhat belated explanation of how it works.



Suffix Arrays



A suffix array is a data structure used for full-text search and for other applications, primarily in bioinformatics.


The concept of a suffix array is quite simple: it is a sorted array of all the suffixes of a string. So, for the Russian string «туда и сюда» ("to and fro"; below, underscores mark the spaces):



т  у  д  а  _  и  _  с  ю  д  а
0  1  2  3  4  5  6  7  8  9  10


we could build the following suffix array:





4   _и_сюда
6   _сюда
10  а
3   а_и_сюда
9   да
2   да_и_сюда
5   и_сюда
7   сюда
0   туда_и_сюда
1   уда_и_сюда
8   юда




Note that there is no need to store each suffix in full in every element of the array: you can keep only the indices (the left column) together with the original string, and read the suffix out of the string by its index when needed. So this array will be stored as follows:



 [4 6 10 3 9 2 5 7 0 1 8] 


There are algorithms that build suffix arrays quickly (in O(n) time), and even ones that work in place, using only a constant amount of additional memory beyond the array itself.
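To make the definition concrete, here is a deliberately naive construction in Go. This is only an illustration (livegrep itself uses libdivsufsort, as described below), and sorting the suffixes directly can cost O(n² log n):

    package main

    import (
        "fmt"
        "sort"
    )

    // buildSuffixArray returns the start positions of all suffixes of s,
    // ordered so that the corresponding suffixes are lexicographically sorted.
    func buildSuffixArray(s string) []int {
        sa := make([]int, len(s))
        for i := range sa {
            sa[i] = i
        }
        sort.Slice(sa, func(a, b int) bool {
            return s[sa[a]:] < s[sa[b]:]
        })
        return sa
    }

    func main() {
        fmt.Println(buildSuffixArray("to and fro")) // [2 6 3 5 7 4 9 1 8 0]
    }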



Full-text search by substring



One of the main applications of the suffix array is full-text search.



If we are looking for a substring in a corpus of text, then that substring, if present, is a prefix of some suffix of the corpus. That is, once we build a suffix array, every substring of the corpus must be the beginning of some element of the array. For example, if we search for «да» in «туда_и_сюда», we find that it occurs twice in the string, and accordingly two elements of the array above begin with it:



9   да
2   да_и_сюда


But since the suffix array is sorted, we can find these elements quickly with a binary search, and the stored indices tell us where the substring occurs in the source text.
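Go's standard library happens to ship a suffix-array-backed index, so the whole scheme is easy to try out; a minimal sketch (with an ASCII string, so the byte offsets are easy to read):

    package main

    import (
        "fmt"
        "index/suffixarray"
    )

    func main() {
        corpus := []byte("to and fro")
        idx := suffixarray.New(corpus) // builds the suffix array

        // Lookup binary-searches the array; n = -1 means "all occurrences".
        // The offsets come back in no particular order.
        fmt.Println(idx.Lookup([]byte("o"), -1)) // the occurrences at offsets 1 and 9
    }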



Towards a regular expression search



Before we get to searching source code, let's first figure out how to search for matches to a regular expression at all.



During indexing, livegrep reads all the sources and concatenates them into one huge buffer, over which it builds a suffix array using the open-source libdivsufsort library. (Older versions of livegrep used a radix sort, which had quadratic worst-case complexity on some inputs; switching to divsufsort sped up index construction significantly.) Alongside the array, a so-called "file content map" is kept in memory: a sorted table recording which file each range of bytes in the buffer came from.



(The mechanism is described here in simplified form: in reality, livegrep uses several suffix arrays rather than one giant one, and we also compress the input by deduplicating identical lines, which complicates the file content map. We may cover these details in a future post.)
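Here is a sketch of the simplified version of such a map in Go; the type, field, and function names are mine, for illustration only:

    import "sort"

    // fileSpan records that bytes [start, end) of the big concatenated
    // buffer came from the file at path. Spans are kept sorted by start.
    type fileSpan struct {
        start, end int
        path       string
    }

    // fileFor returns the file containing byte offset off of the buffer,
    // found with a binary search over the sorted spans.
    func fileFor(spans []fileSpan, off int) (string, bool) {
        i := sort.Search(len(spans), func(i int) bool { return spans[i].end > off })
        if i < len(spans) && spans[i].start <= off {
            return spans[i].path, true
        }
        return "", false
    }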



But how can we use this structure to quickly find matches for regular expressions?



The first idea that comes to mind is this: pull the literal substrings out of the regular expression, find all their occurrences with the suffix array, and then check the surrounding text in the corpus for a full match.



For example, take the regular expression /hello.*world/ . Every match obviously contains the word "hello", which means we can find all positions of that word and then check each one against the full regular expression.
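A toy version of this strategy, using Go's suffixarray index for the literal scan and its RE2-style regexp package for verification; the line-window logic is my simplification of what livegrep does:

    import (
        "bytes"
        "index/suffixarray"
        "regexp"
    )

    var helloWorld = regexp.MustCompile(`hello.*world`)

    // findMatches locates every "hello" via the suffix array and then runs
    // the full regex only over the line containing each candidate.
    // (A real implementation would deduplicate candidates on the same line.)
    func findMatches(corpus []byte) [][]int {
        idx := suffixarray.New(corpus)
        var out [][]int
        for _, off := range idx.Lookup([]byte("hello"), -1) {
            start := bytes.LastIndexByte(corpus[:off], '\n') + 1
            end := bytes.IndexByte(corpus[off:], '\n')
            if end < 0 {
                end = len(corpus)
            } else {
                end += off
            }
            if loc := helloWorld.FindIndex(corpus[start:end]); loc != nil {
                out = append(out, []int{start + loc[0], start + loc[1]})
            }
        }
        return out
    }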



More complex search



It turns out we can do better. The structure of the suffix array is such that, besides plain substring search, at least two more basic queries can be answered with it:

  1. Range search: find all suffixes whose first character lies in a given range, such as [a-f], using binary searches for the two endpoints.
  2. Refinement: given a block of array elements that share a common prefix, split it into sub-blocks by the character that follows that prefix, again with binary searches.





It is also possible to search for several things at once and combine the results. For example, for the regular expression /hello|world/ we find the matches for "hello" and for "world" separately, and then merge the two sets of positions in the text.
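In sketch form, with the same kind of index as before (merging the two candidate sets prior to verification):

    import (
        "index/suffixarray"
        "sort"
    )

    // candidatesForAlternation returns sorted candidate offsets for
    // /hello|world/: the occurrences of either literal, merged together.
    func candidatesForAlternation(idx *suffixarray.Index) []int {
        pos := idx.Lookup([]byte("hello"), -1)
        pos = append(pos, idx.Lookup([]byte("world"), -1)...)
        sort.Ints(pos)
        return pos
    }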



In addition, we can apply a combination of all of these strategies. For example, a search for the expression /[a-f][0-9]/ will proceed as follows:



  1. A binary search finds the block of suffixes whose first character is in [a-f]
  2. That block is divided into 6 sub-blocks, one each for a, b, c, d, e, and f
  3. Inside each sub-block, a binary search on the second character finds the range where it falls in [0-9]


[Figure: the suffix array at each step — (1) the contiguous block of suffixes beginning with a through f; (2) that block split into six sub-blocks, one per first letter; (3) within each sub-block, the sub-range whose second character is a digit: a[0-9]…, b[0-9]…, …, f[0-9]….]



As a result, we get a set of segments of the suffix array whose elements point at the substrings matching /[a-f][0-9]/ .
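Each narrowing step above is just a pair of binary searches over a block of the array. A sketch in Go, using the plain corpus-plus-array representation from earlier (the helper and its name are mine):

    import "sort"

    // narrow takes a block [lo, hi) of the suffix array sa whose suffixes
    // already agree on their first depth characters, and returns the
    // sub-block whose character at position depth lies in [min, max].
    func narrow(corpus []byte, sa []int, lo, hi, depth int, min, max byte) (int, int) {
        // at reports the character of suffix i at position depth;
        // -1 stands for "the suffix ends here", which sorts first.
        at := func(i int) int {
            if p := sa[i] + depth; p < len(corpus) {
                return int(corpus[p])
            }
            return -1
        }
        newLo := lo + sort.Search(hi-lo, func(k int) bool { return at(lo+k) >= int(min) })
        newHi := lo + sort.Search(hi-lo, func(k int) bool { return at(lo+k) > int(max) })
        return newLo, newHi
    }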



In essence, this means that the queries we run against the index have the following recursive structure (in Go syntax):



    type IndexKey struct {
        edges []struct {
            min  byte
            max  byte
            next *IndexKey
        }
    }


Livegrep uses almost exactly this structure, apart from a few extra fields that serve the analysis of regular expressions.



For each edge of the key, we find all suffixes starting with a character from its range, split that block into sub-blocks by concrete character, and then recursively evaluate next, descending one character deeper into the suffixes.
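Putting that together, a sketch of the recursive evaluation, reusing the narrow helper from above (an illustration of the idea, not livegrep's actual code):

    // walk evaluates key against the block [lo, hi) of the suffix array,
    // whose suffixes share their first depth characters, appending every
    // matching (lo, hi) segment to out.
    func walk(corpus []byte, sa []int, key *IndexKey, lo, hi, depth int, out *[][2]int) {
        if key == nil { // end of the key: the whole block matches
            *out = append(*out, [2]int{lo, hi})
            return
        }
        for _, e := range key.edges {
            l, h := narrow(corpus, sa, lo, hi, depth, e.min, e.max)
            for c := e.min; ; c++ { // split the range into single characters
                cl, ch := narrow(corpus, sa, l, h, depth, c, c)
                if cl < ch {
                    walk(corpus, sa, e.next, cl, ch, depth+1, out)
                }
                if c == e.max {
                    break // written this way so that max = 255 cannot overflow
                }
            }
        }
    }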



Livegrep analyzes the regular expression and computes an IndexKey with the following property: any string matching the regular expression must contain a substring matching that IndexKey .



In many cases this is straightforward: a character class converts directly into a set of ranges, a literal string becomes a linear chain of IndexKeys with single-character ranges, and so on. Things get more complicated once repetition operators or disjunction (|) appear. I hope to write about this in more detail in a future post; for now, if you're curious, you can read indexer.cc or experiment with analyze-re , which has a dot output mode showing the result of livegrep's analysis of a regular expression.



Application of results



Walking the suffix array in the manner described above, we get a (possibly very large) set of positions in the corpus that need checking. Instead of testing each one separately, livegrep collects all the candidates and sorts them in memory. Walking the sorted list, whenever several candidates lie close to one another, we run the regular expression once over the whole surrounding segment.
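A simple version of that grouping (the window size here is an arbitrary illustrative choice, not livegrep's):

    // coalesce merges sorted candidate offsets lying within window bytes
    // of one another into runs, so that a single regex pass covers each run.
    func coalesce(sorted []int, window int) [][2]int {
        var runs [][2]int
        for _, off := range sorted {
            if n := len(runs); n > 0 && off-runs[n-1][1] <= window {
                runs[n-1][1] = off // extend the current run
            } else {
                runs = append(runs, [2]int{off, off})
            }
        }
        return runs
    }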



Livegrep matches regular expressions using Russ Cox's RE2 library. RE2 is not only fast; unlike PCRE and most other regular expression libraries, it compiles a regular expression into a finite automaton, guaranteeing matching in linear time.



By grouping the candidate matches, we exploit RE2's speed to process large chunks of text at once, which spares us from orchestrating the matching at a low level and from keeping a lot of redundant bookkeeping state.



To determine how much text to search around a candidate position, recall that livegrep looks for matches within individual lines of source code: a simple memchr in each direction finds the nearest newline characters, delimiting exactly the line that could contain the match.



After we run RE2 over all the candidate segments, we have the final list of matches in the corpus. For each match, the file content map mentioned above tells us which file those bytes came from. To determine the line number, we take the file's contents preceding the match and count the newline characters.
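For instance, once the file content map has produced the file's byte range in the buffer (the fileSpan sketched earlier), the line number is one newline count away:

    import "bytes"

    // lineNumber computes the 1-based line number of a match at byte
    // offset off of the buffer, given the span of its containing file.
    func lineNumber(buf []byte, span fileSpan, off int) int {
        return 1 + bytes.Count(buf[span.start:off], []byte{'\n'})
    }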



If the search is limited to a specific file (for example, file:foo\.c ), we walk the file content map in step with the sorted result list after the index pass, dropping candidates whose containing file does not match the query.



An interesting property of this approach is that a filename restriction barely narrows the search: livegrep still walks the entire suffix array and still considers every candidate it finds (although it can reject candidates with a fast check against the file content map instead of calling RE2). But livegrep is fast enough that it doesn't need to exploit the filename restriction to return results quickly; it has to be, because it must also handle queries that don't name a file at all.



It follows that the slowest queries for livegrep are those that tightly restrict the file path while making poor use of the suffix array: . file:zzzz (a lone dot matches anything, so the index narrows nothing) is probably one of the slowest queries you can send to livegrep today.



To be continued



So far we have covered livegrep's workings only in broad strokes. Next time I'll describe in more detail how we transform a regular expression into an index query, and then, finally, how the suffix array and file content structures in livegrep actually work compared to the simplified versions described here. In particular, livegrep significantly compresses its input, which reduces memory consumption and speeds up search, at the cost of more complex index construction and result processing.




Source: https://habr.com/ru/post/325036/


