We study the algorithm of regular expressions in Ruby

According to Wikipedia, Oniguruma means "devil's chariot" in Japanese.

We are all familiar with regular expressions. They are the "Swiss army knife of the developer." Whatever you are looking for, whatever text you can make out, you can always do it using regular expressions. In fact, you probably started using them much earlier than you began to use Ruby - they have long been included in most popular programming languages: Perl, JavaScript, PHP, Java, and others. Ruby appeared in the mid-1990s, whereas regular expressions were still in the 1960s, that is, almost 30 years earlier!

But how do regular expressions actually work?

If you want to learn more about the information technologies behind the regular expression engine, you must read the fantastic series of articles from Russ Cox:
')

I do not want to repeat everything that Russ wrote. But I quickly note that Ruby uses the “Non-recursive Backtracking Implementation” (implementation with non-recursive backtracking), which is discussed in his second article, which leads to an exponential performance degradation, just like with regular expressions in Perl . In other words, Ruby does not use the most optimal Thompson NFA (Thompson's algorithm for non-deterministic finite automata), which Russ talks about in his first article.

Today, I'm going to take a close look at Oniguruma, a regular expression engine that is built into Ruby. Using some simple graphic schemes and a few examples of regular expressions, I will illustrate how it works. Read on to get an idea of what is happening inside Ruby every time you use regular expressions, probably there are a lot of things that you didn't even realize.

Oniguruma

Starting with version 1.9, MRI uses a slightly modified version of the open C library called "Oniguruma", which is also used in PHP, for regular expression processing. Along with support for all standard regular expression functions, it also supports multibyte characters, such as Japanese text.

At a very high level, this is what happens when a regular expression passes through Oniguruma:

In the first step, Oniguruma reads a regular expression pattern, breaks it into lexemes and translates it into a tree structure containing a set of syntax nodes - Abstract Syntax Tree (AST). This is very similar to how the interpreter itself processes your Ruby code. In fact, you can think of the Oniguruma regular expression mechanism as a second programming language built right into Ruby. Whenever you use regular expressions in code, you really create a second program in a completely different language. After parsing the regular expression pattern, Oniguruma compiles it into a set of instructions, which are then executed on the virtual machine. These instructions implement the state machine that Russ Cox describes in his articles.

Then, when you actually search for a regular expression, the Oniguruma virtual machine executes it by simply processing the instructions that were previously compiled, applying them to the specified string.

This week, to understand how Oniguruma works, and compare it with what Russ Cox is writing about, I rebuilt Ruby 2.0 with the ONIG_DEBUG_COMPILE and ONIG_DEBUG_MATCH flags set. Setting these flags allowed me to see a lot of debug messages when I ran a search through several regular expressions. In them, I saw what instructions the virtual machine had compiled into the templates, and how it performed them. That's what I found ...

Example 1: Matching String

Here is a very simple Ruby script that searches for the word “brown” in a given string:

str = "The quick brown fox jumps over the lazy dog." p str.match(/brown/)

When I run this code with Ruby's reassembled interpreter, I observe a lot of additional debugging output (much more than I show):

 $ ruby regex.rb PATTERN: /brown/ (US-ASCII) optimize: EXACT_BM exact: [brown]: length: 5 code length: 7 0:[exact5:brown] 6:[end] match_at: str: 140261368903056 (0x7f912511b590) etc ... size: 44, start offset: 10 10> "brown f..." [exact5:brown] 15> " fox ju..." [end] #<MatchData "brown">

The key point is the following " 0:[exact5:brown] 6:[end] " - this line describes two virtual machine commands compiled by Oniguruma when compiling the / brown / template. Here is what this program looks like:

You can think of this scheme as a finite automaton for searching / brown /:

exact5:brown checks the specified string at the current position for the presence of 5 letters "brown";
end means that the search is completed, returns matches and stops execution.

When performing a regular expression search, Oniguruma executes these instructions on the virtual machine with the specified string. Let's take a look at how this works in my example. First, " exact5:brown " is applied to the given string in the place where the word "brown" is located:

How does Oniguruma know where to start the line? It turns out that it contains an optimizer that decides where to start the search based on the specified string and the first instruction of the virtual machine. You can see it above: " optimize: EXACT_BM.. exact: [brown]: length: 5.. start offset: 10 ". In this case, since it was precisely known that the word “brown” should be searched for, Oniguruma jumped to the position where this word appears for the first time. Yes, I know that this sounds like a hack, but in reality it’s just a simple optimization of search acceleration in general regular expressions.

Then, Oniguruma executes the instruction " exact5:brown ", checking whether the next 5 characters match the pattern or not. Since they match, Oniguruma moves to the position after them and executes the following instruction:

Now the last instruction is executed, which simply means that everything is complete. Whenever the virtual machine executes " end ", it stops execution, declares success, and returns a matching string.

Example 2: Matching one of two strings

Let's take a more complex example and see what happens. In this template, I want to look for an entry in the string either “black” or “brown”:

 str = "The quick brown fox jumps over the lazy dog." p str.match(/black|brown/)

Run again:

 $ ruby regex.rb PATTERN: /black|brown/ (US-ASCII) optimize: EXACT exact: [b]: length: 1 code length: 23 0:[push:(11)] 5:[exact5:black] 11:[jump:(6)] 16:[exact5:brown] 22:[end] match_at: str: 140614855412048 (0x7fe37281c950), ... size: 44, start offset: 10 10> "brown f..." [push:(11)] 10> "brown f..." [exact5:black] 10> "brown f..." [exact5:brown] 15> " fox ju..." [end] #<MatchData "brown">

Again, the key here is " 0:[push:(11)] 5:[exact5:black] 11:[jump:(6)] 16:[exact5:brown] 22:[end] ". This is a virtual machine program that searches for our template / black | brown /:

It's all a bit more complicated! First of all, you may notice that the optimizer is now looking only for the letter “b”: " optimize: EXACT.. exact: [b]: length: 1 ". This is because the words “black” and “brown” both begin with this letter.

Now let's step by step analyze the execution of this program:

The " push " command is executed at the position of the first finding of the letter "b". It passes the following command and place in the source line to what is called the “Backtrack Stack” (backtrack):

We'll see later that the “Backtrack Stack” is a key place in Oniguruma’s work. It uses it to find alternative search paths in a given string, if the first path did not lead to a result. Let's continue with this example, and you will see what I mean.

So, we are going to proceed to the execution of the " exact5:black " command, but in the line there is only the word "brown". This means that no matches will be found, and the search will not be successful. However, before returning the result, Oniguruma will check the stack for alternative search paths. In this case, there is also " exact5.brown " - the second part of my regular expression / black | brown /. Now Oniguruma pulls out this command and continues execution:

Now there is a match here, so Oniguruma moves 5 characters and proceeds to the next instruction:

We again reached the " end " command, so we simply return the matched value.

Example 3: Construction *

Now my last example. Let's see what happens when I use the following regular expression:

 str = "The quick brown fox jumps over the lazy dog." p str.match(/brown.*/)

We want to find the entry "brown", and then any sequence of characters to the end of the line. Let's see the debug output:

 $ ruby regex.rb PATTERN: /brown.*/ (US-ASCII) optimize: EXACT_BM exact: [brown]: length: 5 code length: 8 0:[exact5:brown] 6:[anychar*] 7:[end] match_at: str: 140284579067040 (0x7f968c80b4a0), ... size: 44, start offset: 10 10> "brown f..." [exact5:brown] 15> " fox ju..." [anychar*] 44> "" [end] #<MatchData "brown fox jumps over the lazy dog.">

And here we have a new state machine:

This time you can see the new Oniguruma virtual machine instruction: " anychar* ". As you can guess, it represents the syntax pattern " .* ". Let's look again at what happens at each step:

We started again from the 10th position in the line, where “brown” appears for the first time, and again found a match, as a result, Oniguruma goes further and proceeds to the following instruction:

The following instruction is " anychar* ", and this is how it works:

First, it matches any character, so it always shifts execution by one character;
But, instead of moving to the next command, Oniguruma goes back and executes this instruction again, but for a new symbol;
It also writes the current character and the following command to the stack, in this case the " end ".

Now Oniguruma just goes through the whole part of the line, repeating these instructions for each "brown fox jumps over the lazy dog." As it goes through the rest of the line, it again and again writes the " end " instruction end stack:

And again:

Finally, it reaches the end of the original line:

This time, " anychar* " does not match, since there are no more characters in the string. What happens in such cases when the team cannot find a match? As in the previous example, Oniguruma will take a command from the stack and continue searching with it. Thus, in this case, the " end " command will be executed, indicating the last position of the match in the string. This means that Oniguruma will receive all the text to the end of the line "brown fox jumps over the lazy dog."

But why put the " anychar* " instruction on the stack after each character? The reason is that if there were some more patterns after the " .* ", Or " .* " Embedded in a more complex general expression, it would not be clear which of the segments of the original string would ultimately lead to complete coincidence . Perhaps Oniguruma would have to try all the options. In this simple example, the template is valid to the end of the line, so there is no need to get more than one command from the stack.

One interesting detail. If you add more commands after " .* ", For example, if you search for /.*brown/, then Oniguruma will not use the " anychar* " instruction. Instead, it will use another similar command, for example, " anychar*-peek-next:b ". It works just like " anychar* ", but instead of putting the next position on the stack on the stack each time, it puts on the stack only the coincidence position, in this case, with "b". This optimization works because Oniguruma knows that the next character must be “b”.

Pathology of regular expression patterns

I mentioned above that Ruby will show the same bad work as Perl if you give it a very complex regular expression pattern. In fact, it is very easy to reproduce it on your computer. Try a very simple regular expression search example:

 str = "aaa" p str.match(/a?a?a?aaa/)

It should execute very quickly:

 $ time ruby regex.rb #<MatchData "aaa"> ruby regex.rb 0.02s user 0.01s system 30% cpu 0.080 total

However, if you repeat it using 29 repetitions instead of 3, then the time to find an entry will have explosive growth, as Russ shows on the graph on the left in his first article:

 str = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" # 29  p str.match(/a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/)

Execution with 29 occurrences:

 $ time ruby regex.rb #<MatchData "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"> ruby regex.rb 17.09s user 0.01s system 99% cpu 17.098 total

Or with 30 entries:

 $ time ruby regex.rb #<MatchData "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"> ruby regex.rb 34.00s user 0.01s system 99% cpu 34.025 total

For 31 entries, the time is 67 seconds and increases exponentially. The reason for this is that the algorithm that Oniguruma uses goes through a list of possible combinations that grows exponentially relative to the length of the regular expression pattern. This would not have happened if Oniguruma and Ruby used the Thompson algorithm that Russ was talking about!

Superficial explanation

As I said, here is just a superficial explanation of what Oniguruma and Ruby can do. As you probably know, there are many, very many, operators and options for regular expressions that can be used, each of them has corresponding instructions inside the virtual machine. In addition, my examples were extremely simple and straightforward - in ordinary applications, you can have very complex regular expressions that are compiled into hundreds of different Oniguruma virtual machine instructions, which creates many different ways when searching for the right expression.

However, the basic idea will always be the same. Each regular expression is compiled into a series of virtual machine instructions, which is a finite state machine. When Oniguruma comes to a standstill in the search, it will choose another variant of the path to the target from the stack, each of which could have been left by the " anychar* " operator or another similar command.

Source: https://habr.com/ru/post/231017/

All Articles