Caution! Regexps!

Do you often use regular expressions? Do you ever think about how justified their use is? What are the alternatives, the possibilities, and the limitations? What is the price of a regexp?

I have noticed for a long time, and often, that people (especially those from the Perl world) tend to mystify regular expressions, endowing them (in their minds) with universal superpowers.

With this article, I urge you to think ~~again~~.

There are two misconceptions. Here is the first:

"Regular expressions are equally well suited for all tasks."

However, for simple tasks regexps are inefficient.

I will not dwell on solutions like:

  if (/./) { print "not empty\n"; }

which is obviously less efficient than comparing with an empty string:

  if ($_ ne "") { print "not empty\n"; }

(By the way, these two conditions are not quite equivalent, and the difference can hide nasty surprises that surface at the most inopportune moment.)
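The difference between the two conditions is easy to reproduce. Below is a minimal sketch in Python, where `.` also does not match a newline by default. The helper names `not_empty_regex` and `not_empty_compare` are made up for illustration and do not come from the article:

```python
import re

def not_empty_regex(s):
    # Analogue of Perl's /./ : true only if the string contains
    # at least one character other than a newline.
    return re.search(r'.', s) is not None

def not_empty_compare(s):
    # Analogue of Perl's $_ ne "" : true for any non-empty string.
    return s != ""

for sample in ["abc", "", "\n"]:
    print(repr(sample), not_empty_regex(sample), not_empty_compare(sample))
```

For a string consisting of a single newline the two checks disagree: the regex reports "empty", the comparison reports "not empty".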

But there are also flagrantly irrational solutions (I am not saying they are bad, but they are definitely not rational).

Here is a simple test: checking whether the ten characters that precede the last ten are all "a".

  use Benchmark qw(:all);
  my $a = 'a' x 8000;
  cmpthese(1_000_000, {
    'regex'   => sub { $a =~ /a{10}.{10}$/; },
    'noregex' => sub { substr($a, -20, 10) eq "aaaaaaaaaa"; },
  });

The result needs no comment:

               Rate    regex  noregex
  regex       414/s       --    -100%
  noregex 4413793/s 1065100%       --

<Lyrical digression 1> Strictly speaking, I did not pick the example for this test at random, but that does not change the point. For those interested in optimizing regexps, I recommend Friedl's book "Mastering Regular Expressions". </Lyrical digression 1>

<Lyrical digression 2> The fact that the appallingly slow solution looks more elegant in Perl should not push programmers toward an irrational approach to the problem. Perhaps it should instead make the programmer wonder whether to choose a language in which the optimal solution looks compact and beautiful. Here are two pieces of Python code:

  import re

  string = 'a' * 8000  # sample input, same shape as in the Perl test

  # variant with a regular expression (slow)
  cre = re.compile(r'a{10}.{10}$')
  if cre.search(string):
      pass  # do something

  # variant with an explicit substring comparison
  if string[-20:-10] == 'aaaaaaaaaa':
      pass  # do something

But that's another story. </Lyrical digression 2>
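For readers who want to repeat the comparison in Python, here is a rough sketch using the standard `timeit` module. The names `with_regex` and `with_slice` and the run count are arbitrary choices made for illustration. Absolute timings depend on the machine and interpreter, so none are quoted here:

```python
import re
import timeit

text = 'a' * 8000  # same input shape as in the Perl benchmark
pattern = re.compile(r'a{10}.{10}$')

def with_regex():
    # The engine retries the pattern from each start position in turn.
    return pattern.search(text) is not None

def with_slice():
    # A single slice and one comparison, independent of the string length.
    return text[-20:-10] == 'a' * 10

# Both variants must agree before their speed is compared.
assert with_regex() == with_slice()

runs = 200
print('regex:', timeit.timeit(with_regex, number=runs))
print('slice:', timeit.timeit(with_slice, number=runs))
```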

Now for the second misconception:

"Regular expressions are equally well suited for all tasks."

This time the subject is tasks that cannot be solved with regular expressions.

About four years ago I had an interview at a large company. The interview was dull almost from start to finish, but what finally finished me off was the question: "Write a regular expression that checks that brackets are placed correctly." (That is, that there are no situations like "{<}>".)

I immediately asked what maximum nesting depth was allowed. The bewildered answer was: "Any!"

Obviously, the questioner was absolutely sure that such an expression could be written, that I would write it now, and he would easily verify it.

What a bitter delusion!

In case someone has not figured it out yet, let me explain.

A regular expression describes a finite state machine. If you need to check brackets nested to an unbounded depth, a finite state machine will not help you.
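The usual tool for this check is a stack. It grows with the nesting depth, and that unbounded memory is exactly what a finite state machine lacks. A minimal sketch in Python follows; the function name `brackets_balanced` and the set of bracket types are assumptions chosen for illustration:

```python
# Map each closing bracket to the opening bracket it must match.
PAIRS = {')': '(', ']': '[', '}': '{', '>': '<'}
OPENERS = set(PAIRS.values())

def brackets_balanced(text):
    stack = []  # currently open brackets, innermost last
    for ch in text:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Every opened bracket must have been closed.
    return not stack

print(brackets_balanced("{<}>"))    # False
print(brackets_balanced("{<()>}"))  # True
```

Characters that are not brackets are simply skipped, so the function can be applied to ordinary text.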

The question is akin to asking: "What is the sum of the angles of a triangle whose sides are 1, 2 and 50 centimeters?" It betrays complete ignorance of the subject.

Although strictly speaking

Back in 2004 (if I am not mistaken), the Perl developers assured everyone that this problem had been solved. The construct "(??{...})" was invented for such things. But using this construct very often makes Perl crash in an ugly way, usually with this message:

  panic: regfree data code 'b' during global destruction.

(the letter "b" depends on your encoding :-))

It is not surprising that these additions still have not shed the "experimental" label. This feature is not included in PCRE.

The developers did not stop there.

Recently they introduced a new mechanism with the syntax "(?1)". This mechanism is free of much of the clumsiness inherent in the old version.

But in my opinion, recursive regular expressions should have been implemented as a separate feature and not called "regular", because

using recursion in regular expressions makes them irregular.

They no longer describe a finite state machine but a full-fledged Turing machine. With such regular expressions you can solve any computational problem.

So Perl has, in effect, lost its regular expressions. The programmer can no longer be sure that a "regular expression" will use a finite amount of memory, or that it will not hang. (Classic regular expressions satisfy both requirements: they never loop forever, and they need a finite amount of memory that is determined when the expression is compiled.) I am one of those who consider this extension harmful: it carries an inexhaustible supply of trouble and vulnerabilities, and it denies the programmer access to a robust regular expression mechanism.
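To illustrate the guarantee stated in parentheses: a classic regular expression can be turned into a fixed transition table, and matching then needs one state variable and at most one step per input character. Below is a hand-built sketch in Python for the expression `(ab)*`. The table and the name `dfa_match` are illustrative only and say nothing about how Perl's engine is actually implemented:

```python
# Transition table for the regular expression (ab)*.
# State 0: start, or just finished an "ab" pair (accepting).
# State 1: just read "a", waiting for "b".
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'b'): 0,
}
ACCEPTING = {0}

def dfa_match(text):
    state = 0  # the only memory the matcher uses
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            # No transition defined: the input cannot match.
            return False
    return state in ACCEPTING

print(dfa_match("abab"))  # True
print(dfa_match("aba"))   # False
```

The table is fixed before any input is seen, and the loop body runs at most once per character, so neither unbounded memory nor an endless loop is possible.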

But that, again, is another story. (And, by the way, the topic of recursion in regular expressions has already been covered on Habr.)

I just wanted to say: "Caution! Regexps!"

Thank you all, and good luck!

Source: https://habr.com/ru/post/63944/
