
The subtleties of regular expressions. Part 1: Metacharacters Inside and Outside Character Classes

In lieu of an introduction



Anyone who has ever written a program knows that there is such a miracle in the world as regular expressions. Some cannot take a step without them, some fear them like fire, but it is extremely difficult to imagine a modern programming language without regular expressions.

What happens when a novice programmer first finds out about regular expressions? Most often, the first acquaintance happens by trial and error, since at this stage there is usually neither knowledge of the field nor any understanding of how it works. Why does this happen?


It's no secret that many people think of Perl when regular expressions are mentioned. And for good reason! Perl is one of the few languages where regular expressions are built into the syntax, among the basic constructs of the language. At the same time, Perl became famous as a language whose programs are very hard to understand five minutes after they were written. The abundance of one- and two-character functions and variables does its job. The text looks more like a string of emoticons than a program, especially if it uses regular expressions.

But I seem to have digressed before I even began. So, I will assume that you already know what regular expressions are and why they are used. Let's move on to more interesting things.

Regular Expression Dialects



Historically, regular expressions were (and still are) developed without strict standardization. Naturally, this gave rise to many discrepancies in syntax and semantics. Today the syntax of the various implementations is very similar, one could even say close, but there are still differences.

Of course, such an important matter as standardization could not do without the omnipotent POSIX, especially since regular expressions originated in the Unix environment.

POSIX describes the syntax and semantics of regular expressions. There are two main standards: POSIX BRE (Basic Regular Expressions) and POSIX ERE (Extended Regular Expressions). They differ, as the names imply, in that the second extends the first. I will not describe in detail what each standard includes, let alone the semantics, since this can always be looked up on Wikipedia. I will only say that although such standards exist, developers of regular expression engines are in no hurry to follow them completely. Especially in semantics! And for good reason.

So, how do regular expression dialects differ between languages and utilities? Mainly, of course, in their metacharacters (characters that are interpreted in a special way rather than literally).

For example, consider the very common metacharacter . (dot). Probably everyone who has come across regular expressions at least once knows that this metacharacter means “any character”. Yes, but not quite! The dot is interpreted as “any character except a newline”. But again, not everywhere. In some languages that is the default interpretation, in others the default is simply “any character”, and many offer modes for both interpretations.
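The two interpretations of the dot can be seen in a minimal sketch. It assumes Python's `re` module, which the article does not name; there the default is "any character except a newline", and the `re.DOTALL` flag switches to "any character". The sample text is made up for illustration.

```python
import re

# Hypothetical sample: two words separated by a newline.
text = "first\nsecond"

# Default mode: "." does not match the newline, so the pattern
# cannot cross from "first" into "second".
default_match = re.search(r"first.second", text)

# DOTALL mode: "." matches any character, including the newline.
dotall_match = re.search(r"first.second", text, flags=re.DOTALL)

print(default_match)                   # None
print(dotall_match.group(0) == text)   # True
```

Other engines expose the same switch under different names, so the flag used here should not be assumed to carry over.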

The next frequent difference is in the interpretation of brackets: curly, round, square. In some dialects brackets must be escaped, in others not. For example, in .NET and Java brackets must be escaped to be matched literally, because they are metacharacters. In the grep utility, by default, parentheses do not need to be escaped! There, to use grouping and similar features, you need to write expressions like \(\) .
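As a rough illustration of the first convention, here is a sketch in Python's `re` module (an assumption; the article talks about .NET and Java, which treat parentheses the same way). Unescaped parentheses group, escaped ones are literal. grep's default dialect works the other way around and is not shown here.

```python
import re

text = "f(x)"

# Unescaped parentheses form a capturing group; they consume no
# characters of their own, so only "x" is matched.
group_match = re.search(r"(x)", text)

# Escaped parentheses match the literal "(" and ")" characters.
literal_match = re.search(r"\(x\)", text)

print(group_match.group(0))    # x
print(group_match.group(1))    # x
print(literal_match.group(0))  # (x)
```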

Metacharacters inside character classes



And right away, before we forget about metacharacters, let's consider character classes. A very common beginner's mistake is escaping metacharacters inside character classes. The mistake often (but only often) has no consequences, yet it clearly shows that the person does not fully understand how character classes work.

Anyone who has used regular expressions has met character classes. I'm sure of it. For those who have forgotten what they are, let me remind you: character classes are, in layman's terms, sequences inside square brackets. Example: [abc0-9] means that at the position the class occupies in the match there must be the character a, b, or c, or a digit from 0 to 9. Everything is simple.

But not as simple as we would like. The first thing to remember: a character class is a different world! As soon as you step inside the square brackets, all the rules of the game change. Some metacharacters stop being metacharacters, and the semantics of others change radically. So as not to make unfounded claims, here are some examples:
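The [abc0-9] class can be tried out directly. This is a small sketch using Python's `re` module (chosen only for illustration; the input string is hypothetical).

```python
import re

char_class = re.compile(r"[abc0-9]")

# Each match is a single character that belongs to the class.
matches = char_class.findall("a-d7z")
print(matches)  # ['a', '7']

# The class stands for exactly one position, so a two-character
# string cannot be matched by it in full.
whole = char_class.fullmatch("ab")
print(whole)    # None
```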
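A sketch of three such cases, written for Python's `re` module as an assumption (the article does not tie these rules to a particular engine, and details may differ between dialects). Each pair compares the same character outside and inside a class.

```python
import re

# "." is "any character" outside a class, but a literal dot inside one.
outside_dot = re.findall(r".", "a.b")
inside_dot = re.findall(r"[.]", "a.b")
print(outside_dot, inside_dot)        # ['a', '.', 'b'] ['.']

# "^" anchors to the start outside a class; as the first character
# inside a class it negates the class instead.
anchored = re.findall(r"^a", "aba")
negated = re.findall(r"[^a]", "aba")
print(anchored, negated)              # ['a'] ['b']

# "-" is an ordinary character outside a class, but between two
# class members it denotes a range.
literal_hyphen = re.findall(r"a-c", "a-c b")
range_hyphen = re.findall(r"[a-c]", "a-c b")
print(literal_hyphen, range_hyphen)   # ['a-c'] ['a', 'c', 'b']
```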


I will not consider other metacharacters for lack of space. Now it becomes clear why there is no need to write [\.\(\)\{\^] : inside a character class these characters are simply no longer metacharacters. By escaping them "just in case", you show that you do not really understand what is happening inside.

The article has turned out unexpectedly long. I also wanted to write about the differences between regular expression implementations, the differences in their semantics, and the differences in how character classes are interpreted. So for now I will leave the article as it is, and if you like it, I will write the next part.
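That the backslashes are redundant can be checked with a short sketch, again assuming Python's `re` module; the sample string is made up. The escaped and unescaped classes match exactly the same characters. The ^ is harmless here only because it is not the first character of the class.

```python
import re

escaped = re.compile(r"[\.\(\)\{\^]")   # escaped "just in case"
plain = re.compile(r"[.(){^]")          # same class without backslashes

# Hypothetical input containing every class member, plus "}" and a
# trailing backslash, neither of which belongs to the class.
sample = "f(x).{y}^z\\"

escaped_hits = escaped.findall(sample)
plain_hits = plain.findall(sample)

print(escaped_hits)                # ['(', ')', '.', '{', '^']
print(escaped_hits == plain_hits)  # True
```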

Based on the book by Jeffrey Friedl, Mastering Regular Expressions.
Part 2.

Source: https://habr.com/ru/post/112016/

