
General tips for creating optimal regular expressions

Regular expressions are an integral part of any data processing tool.
Naturally, different implementations support different syntaxes and feature sets.
Despite this, the way regular expressions work, the engines behind them, and the basic optimization techniques are almost the same everywhere.
Somewhere online I saw a completely stupid statement that “regular expressions are not suitable for processing irregular data” or something similar. Complete nonsense.

That said, the more intelligent the search pattern, the harder the regular expression engine has to “strain” in order to “explain” to the text what is ultimately required of it.
By the way, Friedl's classic book has a whole chapter on optimizing regular expressions. An important point is the structure of that chapter: first it covers the general principles of optimizing the expressions themselves, and then optimization as applied to specific programming languages. In other words, optimization has several stages; do not forget this.

For the expressions themselves there are a few obvious rules:
  1. A literal search is the fastest: /aa/ is usually faster than /a{2}/ or /[a]{2}/.
  2. Searching for data of indefinite length is costly (at least for NFA engines). If you know the size of the data, specify it, at least approximately: /\w{3,500}/ is faster than /\w+/.
  3. Parentheses (non-capturing ones too!) slow down the search. If you do not need the captured text, never enclose a segment in capturing parentheses, and try to rewrite the pattern so that it does without non-capturing ones as well. This is sometimes impossible, but it is worth keeping in mind. In addition, do not try to “help” the regular expression engine with constructs like /(?:.*)one/; this will only confuse the automaton.
  4. Executable code in the “body” of the expression itself or in the “body” of the replacement expression is ruinous. Use this technique only if you are absolutely convinced that there is no other way. For example, the idea of “riding” the implicit loop of an expression with the /g flag is often (in JavaScript) a plain waste of time. “Native” loops of the host language are likely to be an order of magnitude faster than passing data in and out of the regular expression engine while it runs.
  5. Replacement expressions are much slower than search expressions. When replacing large pieces of text, it makes sense to consider the alternative of building a new version of the text. It may be difficult to implement, but it is at least worth thinking about.
  6. Alternation slows down the search, and alternatives of unknown length slow it down so much that it is sometimes worth trying a different design that searches in several passes.
  7. Do not forget the “anchors” for the beginning /^/ and the end /$/ of a line, if they apply in your case.
  8. Lookahead and lookbehind constructs are usually bad for speed. Try to formulate the pattern more precisely instead.
  9. Do not use the $` and $' constructs (in JavaScript these are RegExp.leftContext and RegExp.rightContext); they can slow down matching considerably.
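
A few of the rules above are easy to check on your own engine. Below is a minimal sketch in JavaScript, not a benchmark from the original article: the sample text, the `candidates` table and the `timeRegex` helper are all invented for illustration. Each pair of patterns matches the same input, so only the timing should differ. Modern engines optimize many of these cases, so treat the numbers you get as a local observation, not as proof of the rule.

```javascript
// Hypothetical sample input, chosen only for illustration.
const text = "id=12345; name=aaron; ".repeat(1000) + "status=done";

// Pairs of patterns that succeed on the same input.
// The second pattern of each pair follows one of the rules above.
const candidates = {
  quantified: /[a]{2}/,        // rule 1: quantified character class...
  literal: /aa/,               // ...versus a plain literal
  open: /\w+/,                 // rule 2: unbounded length...
  bounded: /\w{3,500}/,        // ...versus an approximate bound
  capturing: /status=(done)/,  // rule 3: capturing group that nobody reads...
  plain: /status=done/,        // ...versus no group at all
  unanchored: /status=done/,   // rule 7: position not stated...
  anchored: /status=done$/,    // ...versus anchored to the end of the input
};

// Hypothetical helper: run the pattern `runs` times, report elapsed
// milliseconds and how many runs found a match.
function timeRegex(re, input, runs = 200) {
  const start = Date.now();
  let hits = 0;
  for (let i = 0; i < runs; i++) {
    if (re.test(input)) hits++;
  }
  return { ms: Date.now() - start, hits };
}

const results = {};
for (const [name, re] of Object.entries(candidates)) {
  results[name] = timeRegex(re, text);
  console.log(name, results[name].ms + " ms", results[name].hits + " hits");
}
```

None of the patterns carries the /g flag, so repeated .test calls do not depend on lastIndex and every run starts from the beginning of the text.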

The main idea: the more precisely you explain what exactly needs to be taken down from the top shelf, the less time you spend standing at the counter. Remember, it is better to hear “we don't sell beer” right away in a bakery than to pester the clerk for 20 minutes with “give me that thingy, I don't know how to explain...” and get the same result. The closer your pattern comes to giving a binary YES/NO answer when applied to the text as a whole, the better.
As for optimization specific to a particular programming language, things get obscenely complicated. Try to focus on minimizing the data returned. For example, it makes no sense to run a replacement just to find out whether the pattern matches. And the match itself is best checked with the least “talkative” operator. JavaScript, for instance, has the .test method, which reports only “matched / not matched”. If that is the only thing that interests you, use it rather than .match or .exec. I believe this advice can be especially valuable for PHP users, with its bestiary of regex-like functions.
This article does not claim to do anything more than make the reader (and the writer too) think about optimizing such a powerful tool as regular expressions.
Thinking about it? Reread Friedl's “Regular Expressions”, and if you do not have the book yet, buy it immediately!
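
To make the point about “talkative” operators concrete, here is a small JavaScript sketch. The log line and the pattern are made up for illustration. All four checks answer the same yes/no question, but they differ in how much extra data they build along the way.

```javascript
// Hypothetical input line and pattern, for illustration only.
const line = "2024-01-15 ERROR disk full";
const pattern = /ERROR/;

// Most wasteful: builds a whole new string just to learn whether anything matched.
const viaReplace = line.replace(pattern, "") !== line;

// Better, but both calls still build a match array that is thrown away at once.
const viaMatch = line.match(pattern) !== null;
const viaExec = pattern.exec(line) !== null;

// Least "talkative": returns only true or false.
const viaTest = pattern.test(line);

console.log(viaReplace, viaMatch, viaExec, viaTest);
```

The pattern has no /g flag here; with /g, .test and .exec remember lastIndex between calls, which is a separate trap to watch for.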


Source: https://habr.com/ru/post/67091/

