
The subtle virtues of regular expressions in Python


The standard Python library contains quite a few nightmarish modules, but this cannot be said about the re module. Despite its advanced age and the long lack of updates, I consider it one of the best regular expression modules among all dynamic languages.

Python is one of the few dynamic languages with no built-in regular expression support in the language itself, but this is compensated for by a well-developed core system (from an API point of view). At the same time, it is quite quirky. For example, you may be surprised that the parser is written in pure Python: if you profile Python during imports, you will most likely find that 90% of the time is spent in the re module.

Old but proven


The regular expression module in Python was developed long ago and ships in the standard library by default. Aside from the move to Python 3, it has barely evolved since its introduction, except for gaining Unicode support. To this day, member enumeration is broken in it (look at what the dir() function returns for a compiled regular expression pattern object).
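You can try this yourself in a REPL. The exact output varies by interpreter version; the sketch below shows the behavior on the old CPython 2.x builds the author is referring to, where documented methods were missing from the listing:

>>> import re
>>> p = re.compile('hello')
>>> p.match('hello')      # the method clearly exists and works
<_sre.SRE_Match object at 0x10...>
>>> 'match' in dir(p)     # yet the broken enumeration fails to list it
False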
The module's age actually works in its favor: it does not change from one Python version to the next and has proven its reliability. More than once I have had to rework code simply because a regular expression module changed. And considering how many of these expressions I have to write, the immutability of the module is something to be glad about.

An interesting detail: the parser and compiler are written in Python, while the matcher is written in C. This means that, when necessary, we can pass internal structures from the parser straight to the compiler, bypassing the full parsing of regular expressions. The documentation does not describe this, but it works.
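Here is a minimal sketch of that split, calling the undocumented internals directly (module and function names as they existed around the time of writing; they were moved and renamed in later CPython releases):

import sre_parse
import sre_compile

tree = sre_parse.parse(r'b[au]r')    # the pure-Python parser returns a SubPattern tree
pattern = sre_compile.compile(tree)  # the tree is handed to the C matcher as-is
print(pattern.match('bur'))          # behaves like any normally compiled pattern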

There are many other interesting aspects of regular expressions that the documentation does not mention or covers poorly. So here are a few examples that show the Python regular expression module from its best side.

Iterative matching


Undoubtedly, the best feature of regular expressions in Python is the clear distinction between matching and searching. Not every engine can boast of this. In particular, you can pass an index to offset the match, while the match object itself stays anchored to that exact position. For example, you can do something like this:

>>> import re
>>> pattern = re.compile('bar')
>>> string = 'foobar'
>>> pattern.match(string) is None
True
>>> pattern.match(string, 3)
<_sre.SRE_Match object at 0x103c9a510>

This is a very useful feature for writing lexers, because you can keep using the ^ symbol to mean the beginning of the entire string; you only need to increase the index for subsequent matches. It also means you can do without slicing the strings yourself, saving a lot of memory and sparing countless string copies. (Though in general Python does not particularly stand out for the better on these counts.)
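As a sketch of that lexing style (the two token rules here are made up for illustration), the position is advanced by index and the input string is never sliced:

import re

_ws = re.compile(r'\s+')
_word = re.compile(r'\w+')

def lex(source):
    pos = 0
    while pos < len(source):
        match = _ws.match(source, pos)    # anchored at pos, no source[pos:] copy
        if match is not None:
            pos = match.end()
            continue
        match = _word.match(source, pos)
        if match is None:
            raise SyntaxError('unexpected character at %d' % pos)
        yield match.group()
        pos = match.end()

print(list(lex('foo bar baz')))   # ['foo', 'bar', 'baz']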

Besides matching, Python supports searching, that is, skipping ahead until a match is found:

>>> pattern = re.compile('bar')
>>> pattern.search('foobar')
<_sre.SRE_Match object at 0x103c9a578>
>>> _.start()
3

Not matching is also a match


An important problem is the high cost of handling situations where there is no match. Suppose we need to write a tokenizer for a wiki-like language, say Markdown. Between the tokens that denote formatting there are large stretches of plain text, which also have to be processed, and between the pieces of wiki syntax we care about we end up handling far more text than actual tokens. How do we solve this problem?

One way is to combine a group of regular expressions into a list and try them one after another. If none matches, we skip one character:

import re

rules = [
    ('bold', re.compile(r'\*\*')),
    ('link', re.compile(r'\[\[(.*?)\]\]')),
]

def tokenize(string):
    pos = 0
    last_end = 0
    while 1:
        if pos >= len(string):
            break
        for tok, rule in rules:
            match = rule.match(string, pos)
            if match is not None:
                start, end = match.span()
                # emit any plain text accumulated before this token
                if start > last_end:
                    yield 'text', string[last_end:start]
                yield tok, match.group()
                last_end = pos = match.end()
                break
        else:
            # no rule matched here: advance by a single character
            pos += 1
    if last_end < len(string):
        yield 'text', string[last_end:]
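Assuming the rules above, a quick run of this tokenizer might look like this:

>>> list(tokenize('Hello **World**!'))
[('text', 'Hello '), ('bold', '**'), ('text', 'World'), ('bold', '**'), ('text', '!')]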

The solution is neither perfect nor the fastest. The fewer matches there are, the lower the performance, since we advance one character at a time and the loop runs as interpreted Python code. The method is also not very flexible: for each token we only get the matched text, and if rules use groups, the code has to be reworked.

Is there a more convenient solution? What if we could tell the engine to scan for any of several regular expressions at once?

This is an interesting question. In general, that is exactly what we do when we write regexes with alternation: (a|b). Such a pattern searches for either a or b. So we could build one huge regex out of all the available ones and match against that. But the downside of this approach is that sooner or later we get hopelessly confused by all the groups.
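To make that group bookkeeping problem concrete, here is a tiny hand-combined version of the two rules used above (a sketch for illustration only):

import re

combined = re.compile(r'(\*\*)|(\[\[(.*?)\]\])')
m = combined.search('see [[page]]')
print(m.group(2))   # '[[page]]' -- the whole link rule became group 2
print(m.group(3))   # 'page'     -- the link's own (.*?) group got pushed to 3
print(m.group(1))   # None       -- the bold rule's group, unmatched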

Scanner


For the last 15 years, the regular expression engine has had one undocumented tool: the scanner. It is a property of the SRE pattern object with which the engine, after finding a match, keeps looking for the next one. There is even an undocumented re.Scanner class, built on top of the SRE pattern scanner, that provides a slightly higher-level interface.
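Since it is undocumented, a short example of that re.Scanner class may help (the class does exist in CPython's re module; the token names here are my own):

import re

scanner = re.Scanner([
    (r'\d+',  lambda sc, tok: ('num', tok)),
    (r'[+*]', lambda sc, tok: ('op', tok)),
    (r'\s+',  None),   # a None action means: match, but emit nothing
])

tokens, remainder = scanner.scan('1 + 2 * 3')
print(tokens)      # [('num', '1'), ('op', '+'), ('num', '2'), ('op', '*'), ('num', '3')]
print(remainder)   # '' -- the unparsed tail, if any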

Unfortunately, the scanner that ships in the re module is not in itself very useful for speeding up the mismatch case. But if we read its source code, it becomes clear that it is implemented on top of the SRE primitives. It works as follows: it takes a list of regexes and the corresponding callbacks. For each match it invokes the callback with the matched text and builds a result list from the return values. Under the hood, this is implemented by creating SRE pattern and subpattern objects directly. In essence, it builds one big regular expression without parsing a combined source string. Armed with this knowledge, what can we do with this code?

from sre_parse import Pattern, SubPattern, parse
from sre_compile import compile as sre_compile
from sre_constants import BRANCH, SUBPATTERN

class Scanner(object):
    def __init__(self, rules, flags=0):
        pattern = Pattern()
        pattern.flags = flags
        # reserve one group per rule (group 0 is the whole match)
        pattern.groups = len(rules) + 1

        self.rules = [name for name, _ in rules]

        # build one big (rule1)|(rule2)|... branch directly from parsed
        # subpatterns, without ever parsing a combined source string
        self._scanner = sre_compile(SubPattern(pattern, [
            (BRANCH, (None, [SubPattern(pattern, [
                (SUBPATTERN, (group, parse(regex, flags, pattern))),
            ]) for group, (_, regex) in enumerate(rules, 1)]))
        ])).scanner

    def scan(self, string, skip=False):
        sc = self._scanner(string)

        match = None
        for match in iter(sc.search if skip else sc.match, None):
            yield self.rules[match.lastindex - 1], match

        # without skip, any unconsumed input is a lexing error
        if not skip and (match is None or match.end() < len(string)):
            raise EOFError(match.end() if match is not None else 0)

For example, like this:

scanner = Scanner([
    ('whitespace', r'\s+'),
    ('plus', r'\+'),
    ('minus', r'\-'),
    ('mult', r'\*'),
    ('div', r'/'),
    ('num', r'\d+'),
    ('paren_open', r'\('),
    ('paren_close', r'\)'),
])

for token, match in scanner.scan('(1 + 2) * 3'):
    print(token, match.group())

If lexical analysis fails, an EOFError is raised. If you pass skip=True, the parts that cannot be parsed are skipped instead, which makes it much easier to build things like lexers for wiki syntax.
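A quick check of that failure mode, reusing the arithmetic scanner above:

try:
    list(scanner.scan('1 + spam'))
except EOFError as exc:
    print('lexing stopped at offset', exc.args[0])   # offset 4, the first bad character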

Scanning with gaps


We can use match.start() and match.end() to delimit the stretches that were skipped. Example:

scanner = Scanner([
    ('bold', r'\*\*'),
    ('link', r'\[\[(.*?)\]\]'),
])

def tokenize(string):
    pos = 0
    for rule, match in scanner.scan(string, skip=True):
        # everything between the previous match and this one is plain text
        hole = string[pos:match.start()]
        if hole:
            yield 'text', hole
        yield rule, match.group()
        pos = match.end()
    hole = string[pos:]
    if hole:
        yield 'text', hole
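With the scanner doing the skipping, the output should match the naive tokenizer from earlier, only without the character-by-character Python loop:

>>> list(tokenize('Hello **World**!'))
[('text', 'Hello '), ('bold', '**'), ('text', 'World'), ('bold', '**'), ('text', '!')]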

Fixing up groups


One annoyance remains: our group indices are not local to our own rule's regular expression but to the combined one. This means, for example, that you cannot refer to a group by its index if a rule like (a|b) is involved. To fix this you have to tinker with a class that wraps the SRE match and adjusts the indices and group names. If you are interested in the details, you can explore the implementation of such a wrapper on GitHub, along with some usage examples.
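For a rough idea of what such a wrapper does, here is a minimal, hypothetical sketch (the names are made up; the real implementation the author links to is more complete): it records where a rule's own groups landed in the combined pattern and translates local group numbers on access.

import re

class LocalMatch(object):
    def __init__(self, match, outer, inner_start):
        self._match = match
        self._outer = outer              # combined index of the whole-rule group
        self._inner_start = inner_start  # combined index of the rule's own group 1

    def group(self, idx=0):
        if idx == 0:
            return self._match.group(self._outer)
        return self._match.group(self._inner_start + idx - 1)

# with the hand-combined pattern from before, the link rule's groups
# sit at outer=2 and inner_start=3 in the combined numbering
combined = re.compile(r'(\*\*)|(\[\[(.*?)\]\])')
m = combined.search('see [[page]]')
link = LocalMatch(m, outer=2, inner_start=3)
print(link.group())    # '[[page]]'
print(link.group(1))   # 'page' -- local group 1, as if the rule stood alone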

Source: https://habr.com/ru/post/274349/

