
The true power of regular expressions

As a frequent visitor to the PHP tag on StackOverflow, I often encounter questions asking how to parse some specific aspect of HTML using regular expressions. The most common answer to these is:

“You cannot parse HTML with regular expressions, because HTML isn't regular. Use an XML parser, and you will be happy.”

This statement, in the context of such questions, lies somewhere between very misleading and outright wrong. What I want to demonstrate in this article is just how powerful modern regular expressions really are.

What does “regular” really mean?


In the context of formal language theory, something is called “regular” when it has a grammar in which every production rule takes one of the following forms:

B -> a
B -> aC
B -> ε

Here the symbol -> reads as “the left side can be replaced by the right side.” So the first rule says “B can be replaced by a”, the second “B can be replaced by aC”, and the third “B can be replaced by the empty string” (ε denotes the empty string).

But what are B, C and a? By convention, uppercase letters denote “non-terminals” (symbols that can be expanded further) and lowercase letters denote “terminals” (symbols that cannot be broken down any further).
All of this may sound somewhat abstract, so let's look at an example: the definition of natural numbers as a grammar.

N -> 0
N -> 1
N -> 2
N -> 3
N -> 4
N -> 5
N -> 6
N -> 7
N -> 8
N -> 9
N -> 0N
N -> 1N
N -> 2N
N -> 3N
N -> 4N
N -> 5N
N -> 6N
N -> 7N
N -> 8N
N -> 9N

This grammar says that a natural number (N) is either:

... a single digit from 0 to 9, or

... a digit from 0 to 9, followed by another natural number (N).

In this example, the digits 0 to 9 are terminals (since they cannot be broken into components), and N is the only non-terminal (since it can be expanded further).

If you look at the rules again and compare them with the definition of regular grammars above, you will see that they meet the criteria: the first ten rules have the form B -> a, and the remaining ten the form B -> aC. Consequently, the grammar for natural numbers is regular.

Another thing you may have noticed is that even though this grammar describes something very simple, it is already quite bloated. Wouldn't it be better if we could express the same concept in a more concise form?

And this is where regular expressions enter the scene: the grammar above is equivalent to the regex [0-9]+ (which is a good deal simpler). A conversion like this can be performed for any regular grammar: every one has a corresponding regular expression that describes all the strings it accepts.
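As a quick sanity check (a sketch in Python, whose built-in re module fully supports this much), [0-9]+ accepts exactly the strings this grammar generates:

```python
import re

# The regex equivalent of the natural-number grammar above.
natural_number = re.compile(r'[0-9]+')

# fullmatch() anchors the pattern to the whole string,
# like wrapping it in ^...$.
print(bool(natural_number.fullmatch('2023')))  # True
print(bool(natural_number.fullmatch('')))      # False: N -> ε is not a rule
print(bool(natural_number.fullmatch('12a3')))  # False: 'a' is not a terminal
```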

What regular expressions can describe


So a natural question arises: can regular expressions describe only regular grammars, or are they capable of more? The answer is both yes and no.

Regular expressions in the formal-grammar sense can, by definition, describe only regular grammars and nothing more.

But when programmers talk about “regular expressions,” they are not talking about formal grammars. They are talking about the regex flavors implemented in their programming languages, and those implementations are only very loosely related to the original concept of regularity.

Any modern regex flavor can describe much more than just regular languages. Exactly how much more is the subject of the rest of this article.

For simplicity, I will focus on the PCRE implementation of regexes, simply because I know it best (it is what PHP uses). Most other implementations are quite similar, so most of what follows applies to them as well.

Language hierarchy


To analyze what can and cannot be described with regular expressions, let us first look at what types of languages exist at all. A good starting point for this is the Chomsky hierarchy:



The hierarchy divides formal languages into four types: regular languages (type 3) are the least powerful, followed by context-free languages (type 2), context-sensitive languages (type 1) and, finally, the all-powerful unrestricted languages (type 0).

The Chomsky hierarchy is a containment hierarchy: every class fully contains the classes below it. E.g. any regular language is also context-free (but not the other way around!).

So let's climb one step up the hierarchy. We already know that a regular expression can describe any regular language. Can it also handle context-free languages?

(Reminder: when I say “regular expressions”, I mean them in the programmer's sense, not the formal language theory sense.)

Context-free language description


The answer to the question above: yes, it is possible!

Consider the classic example of a context-free language, {a^n b^n, n>0}, which means “a number of a characters followed by the same number of b characters”. A (PCRE) regex for this language is:

/^(a(?1)?b)$/

This regular expression is very simple: (?1) references the first subpattern, (a(?1)?b). So you can, in principle, substitute the subpattern in place of (?1), thus forming a recursive dependency:

/^(a(?1)?b)$/
/^(a(a(?1)?b)?b)$/
/^(a(a(a(?1)?b)?b)?b)$/
/^(a(a(a(a(?1)?b)?b)?b)?b)$/
# and so on

From this expansion it should be clear that the expression can describe any string consisting of some number of a's followed by the same number of b's.
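Python's built-in re module does not support recursive subpatterns like (?1), but we can emulate the expansion shown above by unrolling the recursion to a finite depth (a sketch; the helper function is my own):

```python
import re

def anbn_regex(depth: int) -> str:
    """Unroll the recursive PCRE /^(a(?1)?b)$/ 'depth' times,
    mirroring the manual expansion above."""
    pattern = 'ab'  # innermost level: the optional (?1) is absent
    for _ in range(depth - 1):
        pattern = 'a(?:' + pattern + ')?b'
    return '^' + pattern + '$'

# Unrolled to depth 10, the pattern matches a^n b^n for n <= 10.
pat = re.compile(anbn_regex(10))
print(bool(pat.fullmatch('aaabbb')))  # True  (n = 3)
print(bool(pat.fullmatch('aaabb')))   # False (unbalanced)
```

Of course, the unrolled pattern only works up to the chosen depth; the point of PCRE's (?1) is precisely that the engine performs this expansion for you, unboundedly.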

Therefore, regular expressions can describe at least some non-regular, context-free grammars. But can they describe all of them? To answer that, let us first look at the definition of context-free grammars.

In a context-free grammar, all production rules have the form:

A -> β

Here A is again a non-terminal symbol and β is an arbitrary string of terminals and non-terminals. Thus every rule of a context-free grammar has a single non-terminal on its left side and an arbitrary string of symbols on its right side.

As an example, consider the following grammar:

 function_declaration -> T_FUNCTION is_ref T_STRING '(' parameter_list ')' '{' inner_statement_list '}'
 is_ref -> '&'
 is_ref -> ε
 parameter_list -> non_empty_parameter_list
 parameter_list -> ε
 non_empty_parameter_list -> parameter
 non_empty_parameter_list -> non_empty_parameter_list ',' parameter
 // ...


This is an excerpt from the PHP grammar (just a few simple rules). Its syntax is slightly different from what we used before, but should be easy to understand. One aspect worth noting is that the T_SOMETHING names are terminal symbols too. Such symbols are usually called tokens; they encode more abstract concepts. For example, T_FUNCTION represents the function keyword, and T_STRING is a label token (something like getUserById or some_other_name).

I use this example to demonstrate one thing: context-free grammars are already powerful enough to encode fairly complex languages. That is why almost all programming languages have context-free grammars. This list also includes syntactically valid HTML.

Coming back to the actual question: can regular expressions match all context-free grammars? Again, the answer is yes!

This is extremely easy to show, since regular expressions (at least PCRE and similar flavors) provide a syntax for building grammars that is very close to the one we used above:

 /
 (?(DEFINE)
   (?<addr_spec> (?&local_part) @ (?&domain) )
   (?<local_part> (?&dot_atom) | (?&quoted_string) | (?&obs_local_part) )
   (?<domain> (?&dot_atom) | (?&domain_literal) | (?&obs_domain) )
   (?<domain_literal> (?&CFWS)? \[ (?: (?&FWS)? (?&dtext) )* (?&FWS)? \] (?&CFWS)? )
   (?<dtext> [\x21-\x5a] | [\x5e-\x7e] | (?&obs_dtext) )
   (?<quoted_pair> \\ (?: (?&VCHAR) | (?&WSP) ) | (?&obs_qp) )
   (?<dot_atom> (?&CFWS)? (?&dot_atom_text) (?&CFWS)? )
   (?<dot_atom_text> (?&atext) (?: \. (?&atext) )* )
   (?<atext> [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+ )
   (?<atom> (?&CFWS)? (?&atext) (?&CFWS)? )
   (?<word> (?&atom) | (?&quoted_string) )
   (?<quoted_string> (?&CFWS)? " (?: (?&FWS)? (?&qcontent) )* (?&FWS)? " (?&CFWS)? )
   (?<qcontent> (?&qtext) | (?&quoted_pair) )
   (?<qtext> \x21 | [\x23-\x5b] | [\x5d-\x7e] | (?&obs_qtext) )

   # comments and whitespace
   (?<FWS> (?: (?&WSP)* \r\n )? (?&WSP)+ | (?&obs_FWS) )
   (?<CFWS> (?: (?&FWS)? (?&comment) )+ (?&FWS)? | (?&FWS) )
   (?<comment> \( (?: (?&FWS)? (?&ccontent) )* (?&FWS)? \) )
   (?<ccontent> (?&ctext) | (?&quoted_pair) | (?&comment) )
   (?<ctext> [\x21-\x27] | [\x2a-\x5b] | [\x5d-\x7e] | (?&obs_ctext) )

   # obsolete tokens
   (?<obs_domain> (?&atom) (?: \. (?&atom) )* )
   (?<obs_local_part> (?&word) (?: \. (?&word) )* )
   (?<obs_dtext> (?&obs_NO_WS_CTL) | (?&quoted_pair) )
   (?<obs_qp> \\ (?: \x00 | (?&obs_NO_WS_CTL) | \n | \r ) )
   (?<obs_FWS> (?&WSP)+ (?: \r\n (?&WSP)+ )* )
   (?<obs_ctext> (?&obs_NO_WS_CTL) )
   (?<obs_qtext> (?&obs_NO_WS_CTL) )
   (?<obs_NO_WS_CTL> [\x01-\x08] | \x0b | \x0c | [\x0e-\x1f] | \x7f )

   # character class definitions
   (?<VCHAR> [\x21-\x7E] )
   (?<WSP> [ \t] )
 )
 ^(?&addr_spec)$
 /x


What you see above is a regular expression for matching an e-mail address as defined by RFC 5322. It was built by simply mapping the BNF rules from the RFC into the notation PCRE understands.

The syntax is quite simple: all rule definitions are wrapped in a DEFINE assertion, which basically means that these rules don't have to match anything directly, they only need to be defined. Only the ^(?&addr_spec)$ part at the very end determines what is actually matched.

The rule definitions are not really “rules” but rather named subpatterns. In the first example, the 1 in (a(?1)?b) referenced the first subpattern. With many subpatterns this kind of numeric reference is impractical, so they are given explicit names instead: (?<xyz> ...) defines a subpattern named xyz, and (?&xyz) is a call to it.

Also note that the regular expression above uses the x modifier. It instructs the engine to ignore whitespace and to support #-style comments. This way you can format the regex nicely, so that other people can actually understand it. (Unlike, say, this RFC 822 regex for an e-mail address...)
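Other engines offer the same facility. In Python, for example, it is the re.X (VERBOSE) flag; here is a small sketch with a made-up date pattern:

```python
import re

# A readable version of a simple date pattern, with ignored
# whitespace and #-style comments, like PCRE's /x modifier.
date = re.compile(r'''
    ^
    (?P<year>  [0-9]{4} ) -   # four-digit year
    (?P<month> [0-9]{2} ) -   # two-digit month
    (?P<day>   [0-9]{2} )     # two-digit day
    $
''', re.X)

m = date.match('2013-03-12')
print(m.group('year'), m.group('month'), m.group('day'))  # 2013 03 12
```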

Thus, the above syntax allows you to simply map the grammar to a regular expression:

A -> BC
A -> CD
// becomes
(?<A> (?&B) (?&C)
| (?&C) (?&D)
)

The only catch is that regular expressions do not support left recursion. That is, if you take the parameter-list definition from above:

 non_empty_parameter_list -> parameter
 non_empty_parameter_list -> non_empty_parameter_list ',' parameter


you cannot convert it one-to-one into a grammar-based regex. The following will not work:

(?<non_empty_parameter_list>
  (?&parameter)
| (?&non_empty_parameter_list) , (?&parameter)
)

The reason is that non_empty_parameter_list appears at the leftmost position of its own definition. This is called left recursion and occurs very often in grammar definitions, because the LALR(1) parsers most commonly used for parsing handle left recursion better than right recursion.

But don't worry, this does not affect the power of regular expressions in any way. Every left-recursive grammar can be converted into a right-recursive one. For the example above, it is enough to swap the two parts:

 non_empty_parameter_list -> parameter
 non_empty_parameter_list -> parameter ',' non_empty_parameter_list


Now it should be absolutely clear that regular expressions can describe any context-free language (and thus almost any language programmers encounter). The only problem: although regular expressions can match context-free grammars, they usually cannot parse them. Parsing means converting a string into an abstract syntax tree, and that cannot be done with regular expressions, at least not with PCRE (in Perl, where you can embed arbitrary code into a regex, you can do pretty much anything...).

Still, the DEFINE-based regexes above have proven very useful to me. After all, you usually don't need full parsing support: you just want to validate something (say, an e-mail address) or extract small pieces of data (rather than build a whole parse tree). Many complex string-processing tasks become much easier with grammar-based regexes :)

And now let me return to something I mentioned in passing earlier: syntactically valid HTML is context-free. So, contrary to popular opinion, you can describe it with regular expressions. Just don't forget two things: first, most HTML you encounter in the wild is not syntactically valid (often tags are not even closed); and second, just because you can doesn't mean you should. You could write your software in Brainfuck, but there are reasons why you don't, right?

My opinion on the subject: whenever you need general HTML processing, use a DOM library of your choice. It will correctly handle invalid HTML and take the whole burden of parsing off your shoulders. On the other hand, if you are dealing with specific situations, a quick regular expression is often the better solution. And I have to admit: even though I usually tell people not to parse HTML with regular expressions, I do it quite often myself, simply because I keep running into specific situations where a regex is just plain easier.

Context-sensitive grammars


Now that we have examined context-free languages in detail, let's climb one more step up the Chomsky hierarchy: to context-sensitive languages.

Their production rules have the following form:

αAβ → αγβ

This mixture of symbols looks complicated at first, but it is actually simple. At its core is still the A → γ pattern we had in context-free grammars; α and β have merely been added on both sides. These two form the context (which gives this grammar class its name): A can be replaced by γ only if it has α to its left and β to its right.

To make this more concrete, let's interpret the following rules:

ab A -> abc
a B c -> a QH c
HB -> HC

In plain English, they read:

`A` can be replaced by `c`, but only if `ab` stands before it.
`B` can be replaced by `QH`, but only if `a` stands before it and `c` after it.
`B` can be replaced by `C`, but only if `H` stands before it.

Context-sensitive languages are something you rarely meet in “normal” programming. They matter more in natural language processing (natural languages are definitely not context-free: words have different meanings depending on context). But even people working on natural language processing mostly use so-called “mildly context-sensitive languages”, since those are sufficient for modeling and can be analyzed much faster.

To understand how powerful context-sensitive grammars are, let's look at another class of grammars with exactly the same expressive power: non-contracting grammars.

In a non-contracting grammar, every production rule has the form α -> β, where α and β are arbitrary strings of symbols with a single restriction: the number of symbols on the right side must be no less than the number on the left. Formally, |α| <= |β|, where |x| denotes the length of a symbol string.

So non-contracting grammars allow any rules, as long as they do not shorten the input string. E.g. ABC -> HQ would be an invalid rule, since its left side contains three symbols and its right side only two: the rule contracts. The opposite rule HQ -> ABC is fine, as it has more symbols on the right than on the left and thus lengthens the string.

This equivalence (between context-sensitive and non-contracting grammars) makes it quite clear that context-sensitive grammars let you do almost anything. Just don't contract :)

To get a feel for why the two classes have the same expressive power, look at the following transformation example:

// a non-contracting rule
AB -> CD
// decomposed into context-sensitive rules
AB -> AX
AX -> YX
YX -> YD
YD -> CD

Okay, back to regular expressions. Can they describe context-sensitive languages too?

This time I can't give you a definite answer. They can certainly describe some context-sensitive languages, but I have no idea whether they can describe all of them.

As an example of a context-sensitive language that is easy to describe with regexes, take a modification of the context-free language {a^n b^n, n>0} discussed above. When you change it to {a^n b^n c^n, n>0}, i.e. some number of a's followed by the same number of b's and then the same number of c's, it becomes context-sensitive.

The PCRE regex for this language looks like this:

 /^
     (?=(a(?-1)?b)c)
      a+(b(?-1)?c)
 $/x


If we ignore the (?=...) assertion for a moment, we are left with a+(b(?-1)?c). This checks that an arbitrary number of a's is followed by the same number of b's and c's. (?-1) is a relative subpattern reference, meaning “the last defined subpattern”, which here is (b(?-1)?c).

The new element here is (?=...). It is called a “zero-width lookahead assertion”: it checks that the text which follows matches the subpattern, but does not itself consume that text. Thus both patterns are effectively matched against the same text. The a+(b(?-1)?c) part checks that the number of b's equals the number of c's, and the asserted (a(?-1)?b)c checks that the number of a's equals the number of b's. Together, the two patterns verify that all three characters occur the same number of times.
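Python's built-in re lacks (?-1) recursion but does support lookahead, so as a sketch we can verify this construction by unrolling the recursion to a finite depth (the helper function is my own):

```python
import re

def nested(left: str, right: str, depth: int) -> str:
    """Build a pattern for left^k right^k with k <= depth,
    e.g. a(?:a(?:ab)?b)?b, the unrolled (a(?-1)?b)."""
    pattern = left + right
    for _ in range(depth - 1):
        pattern = left + '(?:' + pattern + ')?' + right
    return pattern

depth = 10
# Lookahead (?=a^n b^n c) checks #a == #b without consuming text;
# then a+ (b^n c^n) checks #b == #c, exactly as in the PCRE version.
pat = re.compile(
    '^(?=' + nested('a', 'b', depth) + 'c)'
    + 'a+' + nested('b', 'c', depth) + '$'
)

print(bool(pat.fullmatch('aaabbbccc')))  # True
print(bool(pat.fullmatch('aaabbbcc')))   # False
```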

This regex also shows how the concept of context is realized in regular expressions: through assertions. Going back to the definition of a context-sensitive grammar, we can say that a production rule of the form

αAβ → αγβ

can be converted into the following DEFINE-style regex rule:

(?<A> (?<= α ) γ (?= β ) )

This is the same as saying: A is γ, but only if it has α to its left and β to its right.

Looking at this, you might think you can now easily convert any context-sensitive grammar into a regular expression. But that is not the case, because the lookbehind assertion ((?<= ... )) has one significant limitation: it must be of fixed length. This means the length of the text matched by the assertion has to be known in advance. E.g. you can write (?<= a(bc|cd) ), but you cannot write (?<= ab+ ). In the first case the assertion always matches exactly three characters and thus has fixed length. In the second case it could match ab, abb, abbbb and so on; these all have different lengths, and the engine would not know where to start matching them, so such assertions are simply forbidden.
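Python's re imposes exactly the same fixed-width restriction on lookbehind, which makes the limitation easy to demonstrate (a quick sketch):

```python
import re

# Fixed width: both alternatives are two characters long, so the
# total lookbehind width is a known three characters. This compiles.
re.compile(r'(?<=a(bc|cd))x')

# Variable width: ab, abb, abbb, ... have no single length,
# so the engine rejects the pattern at compile time.
try:
    re.compile(r'(?<=ab+)x')
except re.error as e:
    print('rejected:', e)
```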

This is a serious blow to the easy conversion of context-sensitive grammars into regexes, because virtually all such grammars require variable-width lookbehind assertions.

But the fact that a context-sensitive grammar cannot be converted directly into a regex does not by itself mean that regular expressions cannot describe them all. For example, the language {a^n b^n c^n, n>0} above also has a grammar that would require variable-width lookbehind, yet we still found another way to express it. Perhaps something similar is possible for other context-sensitive grammars. I honestly don't know.

So what can we say in the end? Regexes can describe at least some context-sensitive languages, but it is unknown whether they can describe all of them.

Unrestricted grammars


The next class of grammars in the Chomsky hierarchy is the unrestricted grammars. The set of languages they can generate is the set of all recursively enumerable languages.

There is little to say about unrestricted grammars, since they are, well, unrestricted. Their production rules have the form α -> β, where α and β are symbol strings with no restrictions whatsoever.

So unrestricted grammars simply drop the “non-contracting” part of non-contracting grammars. Thus ABC -> HQ is a valid rule for them, even though it wasn't one before.

How powerful are unrestricted grammars exactly? As powerful as anything can be: they are Turing-complete. There is even a “programming language” based on unrestricted grammars: Thue. Since it is Turing-complete, it can do everything other programming languages can do.

One consequence of Turing-completeness is that the problem of checking whether a given string belongs to a given grammar is, in the general case, undecidable.

Unfortunately, I cannot tell you anything more about how regular expressions relate to unrestricted grammars. Heck, I couldn't even find an example of a reasonable unrestricted grammar (one that is not also non-contracting).

But since we are talking about Turing-completeness anyway, let's move on to the next point:

Regular expressions with backreferences are NP-complete


Here is another very powerful regular expression feature that I have not mentioned yet: backreferences.

Consider this very simple regex:

/^(.+)\1$/

Here (.+) matches some arbitrary text, and \1 matches the same text again. In general, \n means “whatever the n-th subpattern matched”. I.e. if (.+) matched foo, then \1 will match foo and nothing else. Therefore, the expression (.+)\1 means “some text, followed by a copy of that text”.

The language this regex describes is called the “copy language” and is another typical example of a context-sensitive language.
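Backreferences are supported even by engines far less powerful than PCRE, so the copy language is easy to check; a quick sketch with Python's built-in re:

```python
import re

# "Some text, followed by a copy of that text."
copy_lang = re.compile(r'^(.+)\1$')

print(bool(copy_lang.fullmatch('foofoo')))  # True: 'foo' + copy
print(bool(copy_lang.fullmatch('abab')))    # True: 'ab' + copy
print(bool(copy_lang.fullmatch('foobar')))  # False: second half differs
```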

In the same way, the other example grammars listed above can be described using backreferences:

 # {a^n b^n, n>0} (context-free)
 /^ (?: a (?= a* (\1?+ b) ) )+ \1 $/x

 # {a^n b^n c^n, n>0} (context-sensitive)
 /^ (?: a (?= a* (\1?+ b) b* (\2?+ c) ) )+ \1 \2 $/x


Explaining how these work is beyond the scope of this article, but you can read about this wonderful topic on StackOverflow.

As you can see, merely adding backreferences (even without recursive subpattern support) already increases the power of regular expressions. It makes them so powerful, in fact, that matching a regular expression becomes an NP-complete problem.

What does NP-complete mean? NP-completeness is a class from computational complexity theory into which many “hard” problems fall. Examples of NP-complete problems are the traveling salesman problem (TSP), the Boolean satisfiability problem (SAT), and the knapsack problem.

Another defining property of an NP-complete problem is that every other problem in NP can be reduced to it. Thus all NP-complete problems are essentially interchangeable: if you find a fast solution to one of them, you have a fast solution to all of them.

So if someone found a fast solution to an NP-complete problem, practically all of humanity's computationally hard problems would be solved in one fell swoop. (Which, as we know, would also mean the end of civilization.)

To prove that regular expressions with backreferences are indeed NP-complete, we can simply take a known NP-complete problem and show that it can be solved using regular expressions. As my example I chose the 3-CNF SAT problem.

3-CNF SAT stands for “Boolean satisfiability for formulas in 3-conjunctive normal form” and is simple to understand. You are given a Boolean formula of the following form:

(!$a || $b || $d)
&& ( $a || !$c || $d)
&& ( $a || !$b || !$d)
&& ( $b || !$c || !$d)
&& (!$a || $c || !$d)
&& ( $a || $b || $c)
&& (!$a || !$b || !$c)

It consists of a number of clauses separated by AND operators. Each clause contains three variables (or their negations) separated by OR operators. The 3-CNF SAT question is: is there an assignment to the variables that satisfies this Boolean formula (i.e. makes it evaluate to true)?

The above Boolean formula can be converted into the following regular expression:

 <?php
 $regex = '/^
     (x?)(x?)(x?)(x?) .* ;
     (?: x\1 | \2  | \4  ),
     (?: \1  | x\3 | \4  ),
     (?: \1  | x\2 | x\4 ),
     (?: \2  | x\3 | x\4 ),
     (?: x\1 | \3  | x\4 ),
     (?: \1  | \2  | \3  ),
     (?: x\1 | x\2 | x\3 ),
 $/x';
 $string = 'xxxx;x,x,x,x,x,x,x,';

 var_dump(preg_match($regex, $string, $matches));
 var_dump($matches);


Running this prints the following $matches:

array(5) {
[0]=> string(19) "xxxx;x,x,x,x,x,x,x,"
[1]=> string(1) "x"
[2]=> string(1) "x"
[3]=> string(0) ""
[4]=> string(0) ""
}

This means the formula is satisfiable, and $a = true, $b = true, $c = false, $d = false is one possible solution.

How does it work? The (x?)(x?)(x?)(x?) groups capture the truth values of the four variables: x stands for true, the empty string for false. Each clause subpattern then has to match the text "x,". Take (?: \1 | x\3 | \4 ) as an example: it matches if \1 is x (i.e. $a is true), or if \3 is empty and the literal x fills the gap (i.e. $c is false), or if \4 is x (i.e. $d is true).

The regex engine then tries all possible variable assignments by backtracking until it finds one that satisfies every clause. If no satisfying assignment exists, the match simply fails.
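The construction is not PCRE-specific; here is a sketch porting it to Python's re, which checks programmatically that whatever assignment the engine finds really satisfies the formula (rather than assuming which solution it finds first):

```python
import re

# (!a|b|d) & (a|!c|d) & (a|!b|!d) & (b|!c|!d) & (!a|c|!d) & (a|b|c) & (!a|!b|!c)
sat = re.compile(r'''^
    (x?)(x?)(x?)(x?) .* ;
    (?: x\1 | \2  | \4  ),
    (?: \1  | x\3 | \4  ),
    (?: \1  | x\2 | x\4 ),
    (?: \2  | x\3 | x\4 ),
    (?: x\1 | \3  | x\4 ),
    (?: \1  | \2  | \3  ),
    (?: x\1 | x\2 | x\3 ),
$''', re.X)

m = sat.match('xxxx;x,x,x,x,x,x,x,')
# Group == 'x' means the variable is true, empty string means false.
a, b, c, d = (g == 'x' for g in m.groups())

# Verify that the assignment found by backtracking satisfies the formula.
satisfied = ((not a or b or d) and (a or not c or d) and (a or not b or not d)
             and (b or not c or not d) and (not a or c or not d)
             and (a or b or c) and (not a or not b or not c))
print(satisfied)  # True
```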

Summarizing


So, what have we learned? Modern regular expressions:

can describe all regular languages;
can describe all context-free languages (including syntactically valid HTML);
can describe at least some context-sensitive languages (such as the copy language);
and, thanks to backreferences, are powerful enough that matching them is an NP-complete problem.


But do not forget: from the fact that you can, it does not follow that you should. Processing HTML with regular expressions is a really bad idea in some cases. In others, it is just about the best thing to do.

Just consider what the simplest solution to your specific problem is, and use it. If you choose to solve a problem with regular expressions, remember the x modifier, which lets you format your regexes much more readably. For complex regexes, also remember the DEFINE assertion and named subpatterns, which keep your code clean and readable.

That's all.


Source: https://habr.com/ru/post/171667/

