The subtleties of regular expressions. Part 1: Metacharacters Inside and Outside Character Classes

In place of an introduction

Anyone who has ever written a program knows that regular expressions exist. Some people cannot take a step without them, others fear them like fire, but it is extremely hard to imagine a modern programming language without regular expressions.

What happens when a novice programmer first encounters regular expressions? Most often the first acquaintance happens by trial and error, because at this stage there is usually neither knowledge of the field nor an understanding of how it works. Why does this happen?

It's no secret that many people think of Perl when regular expressions are mentioned. And for good reason! Perl is one of the few languages where regular expressions are built into the syntax, as a basic construct of the language. At the same time, Perl became notorious as a language whose programs are very hard to understand five minutes after they were written. The abundance of one- and two-character functions and variables does its job: the text looks more like a string of emoticons than a program, especially if it uses regular expressions.

But I seem to have strayed from the topic before even getting started. So, I will assume that you already know what regular expressions are and why they are used. Let's move on to more interesting things.

Regular Expression Dialects

Historically, regular expressions were developed (and still are) without strict standardization. Naturally, this gave rise to many discrepancies in syntax and semantics. Today the syntax is very similar across implementations, one could even say close, but differences remain.

Of course, a matter as important as standardization could not do without the omnipotent POSIX, especially since regular expressions originated in the Unix environment.

POSIX describes the syntax and semantics of regular expressions. There are two main standards: POSIX BRE (Basic Regular Expressions) and POSIX ERE (Extended Regular Expressions). They differ, as the names imply, in that the second extends the first. I will not describe in detail what each standard includes, let alone the semantics of what it includes, since that can always be looked up on Wikipedia. I will only say that even though such standards exist, developers of regular expression engines are in no hurry to follow them completely. Especially in semantics! And for good reason.

So, how do regular expression dialects differ across languages and utilities? Mainly, of course, in their metacharacters (characters that are interpreted in a special way rather than literally).

For example, consider the very common metacharacter . (dot). Probably everyone who has come across regular expressions at least once knows that this metacharacter means “any character”. Well, not quite! The dot is interpreted as “any character except a newline”. But again, not everywhere. In some languages that is the default interpretation, in others the default is simply “any character”, and many have modes for both interpretations.
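
To make the two interpretations of the dot concrete, here is a minimal sketch using Python's re module, chosen only for illustration. In that module the dot excludes the newline by default, and the re.DOTALL flag switches it to “any character”. Other engines may use different defaults and different mode names.

```python
import re

text = "a\nb"

# By default, "." does not match a newline in Python's re module.
default_match = re.search(r"a.b", text)

# With re.DOTALL, "." matches any character, including the newline.
dotall_match = re.search(r"a.b", text, re.DOTALL)

print(default_match)                # None
print(repr(dotall_match.group()))   # 'a\nb'
```

The same pattern gives two different results on the same input, depending only on the mode.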

The next frequent difference is in the interpretation of brackets: curly, round, square. In some dialects they must be escaped to be taken literally, in others not. For example, in .NET and Java parentheses must be escaped, because they are metacharacters. In the grep utility, by default, you do not need to escape parentheses! And to use grouping and related features there, you have to write expressions like \( \).
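
As a small illustration of the convention where parentheses are metacharacters, here is a sketch in Python, whose re module behaves like .NET and Java in this respect. The grep behavior described above is not reproduced here, since Python's re module has no POSIX BRE mode. The sample string is made up for the example.

```python
import re

text = "f(42)"

# Escaped parentheses match literal "(" and ")" characters.
literal = re.search(r"\(\d+\)", text)

# Unescaped parentheses form a capturing group and match no characters themselves.
group = re.search(r"(\d+)", text)

print(literal.group())   # (42)
print(group.group(1))    # 42
```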

Metacharacters inside character classes

And right away, before we forget about metacharacters, let's look at character classes. A very common beginner's mistake is escaping metacharacters inside character classes. This mistake is often harmless (often, not always), but it clearly shows that the person does not fully understand how character classes work.

Anyone who has used regular expressions has met character classes. I'm sure of it. For those who have forgotten what they are, let me remind you: character classes are, in layman's terms, sequences inside square brackets. Example: [abc0-9] means that at the position in a match where the character class stands, there must be the character a, b, or c, or a digit from 0 to 9. Simple enough.

But not as simple as we would like. The first thing to remember: a character class is a different world! As soon as you get inside the square brackets, all the rules of the game change. Some metacharacters cease to be metacharacters, and the semantics of others change radically. So as not to make unfounded claims, here are some examples:

The metacharacter ^: outside a character class it means “the beginning of the line” or “the beginning of a logical line”, depending on the mode of operation. Inside a character class it denotes negation of the class. Notice, I did not say “absence of a match”, because that is not what it means. When we negate a character class, its semantics is “there must be a character that is not in the class”, and not at all “there must not be a character that is in the class”. The difference in semantics is huge. Consider, for example, the regular expression ^abc[^abc] applied to the string abc. Under the first (correct) interpretation there is no match, because “nothing” cannot match a character. Under the second there would be a match, because at the 4th position of the string there is simply no character from the class.

But I digress. So, the same metacharacter is interpreted completely differently depending on whether it is inside a character class or outside it. But that's not all! The class is negated only if the ^ metacharacter is the first character after the opening square bracket! That is, in the character class [abc^] there is no negation any more, and the caret is just a caret.
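
The behavior of ^ described above can be checked with a short sketch in Python's re module, used here only as one convenient engine. The test strings other than abc are made up for the example.

```python
import re

# Outside the class ^ anchors to the start of the string;
# as the first character inside the class it negates the class.
pattern = re.compile(r"^abc[^abc]")

# "abc": nothing is left for [^abc] to consume, so there is no match.
no_match = pattern.search("abc")

# "abcd": 'd' is a character that is not in the class, so the match succeeds.
match = pattern.search("abcd")

# When ^ is not the first character in the class, it is an ordinary character.
literal_caret = re.findall(r"[abc^]", "a^z")

print(no_match)          # None
print(match.group())     # abcd
print(literal_caret)     # ['a', '^']
```

A negated class still has to consume one character, which is why the string abc alone does not match.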

The metacharacter -: outside a character class it is just a hyphen, not a metacharacter at all. Inside a character class it denotes a range. But there is a nuance. If this character comes immediately after the opening square bracket, it naturally cannot denote a range. It is then interpreted as ... just a hyphen, the same as outside a character class.

Another common mistake with the - metacharacter is specifying the wrong range in a character class. Take [a-Z], for example: instead of all lowercase and uppercase Latin letters, we get all the characters from 0x61 to ... 0x5A (in ASCII encoding). That is, an empty set (in some dialects we get just the characters a and Z). So, once again, it is very important to know the semantics of the hyphen: a range includes the characters whose codes lie between the codes of the start and the end of the range, inclusive. I have not encountered languages that interpret ranges in any special way (for example, the way the character classes \w or \d are interpreted).
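
Here is a minimal sketch of the hyphen rules in Python's re module, again picked only for illustration. Note that this particular engine does not treat the reversed range [a-Z] as an empty set or as the two characters a and Z, but rejects the pattern at compile time. The [A-z] example and the sample strings are additions for the demonstration.

```python
import re

# A hyphen right after the opening bracket is a literal hyphen.
leading_hyphen = re.findall(r"[-ab]", "a-b c")

# A range is defined by character codes: [A-z] spans 0x41..0x7A,
# so it also covers the punctuation between 'Z' and 'a', such as '[' and '_'.
code_range = re.findall(r"[A-z]", "a_Z[")

# A reversed range such as [a-Z] (0x61 down to 0x5A) is an error in Python's re.
try:
    re.compile(r"[a-Z]")
    reversed_range_error = None
except re.error as exc:
    reversed_range_error = str(exc)

print(leading_hyphen)                      # ['a', '-', 'b']
print(code_range)                          # ['a', '_', 'Z', '[']
print(reversed_range_error is not None)    # True
```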

I will not go through the other metacharacters for lack of space. It should now be clear why there is no need to write [\.\(\)\{\^]: inside a character class these characters are simply no longer metacharacters. By escaping them “just in case”, you show that you do not really understand what is happening inside.
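
A quick check of this claim, sketched in Python's re module as one example engine: the escaped class and its unescaped equivalent match exactly the same characters. The sample string is made up, and the caret is deliberately kept away from the first position so that it stays literal.

```python
import re

escaped = re.compile(r"[\.\(\)\{\^]")   # the "just in case" version
plain = re.compile(r"[.(){^]")          # the same class without escapes

sample = "a.b(c){d^e}"

escaped_hits = escaped.findall(sample)
plain_hits = plain.findall(sample)

print(escaped_hits)                  # ['.', '(', ')', '{', '^']
print(escaped_hits == plain_hits)    # True
```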

The article has turned out unexpectedly long. I wanted to write about the differences between regular expression implementations, the differences in their semantics, and the differences in how character classes are interpreted and how they work in general. So for now I will leave the article as it is, and if you like it, I will write the next part.

Based on the book by Jeffrey Friedl, Mastering Regular Expressions.
Part 2.

Source: https://habr.com/ru/post/112016/
