Some mistakes when writing regexps

Based on the translated article

I first saw regexps ~~back in school~~ in Perl, and I loved them at first sight (well, once I had figured out what they were :)). I set about applying them to everything with great enthusiasm. Naturally I collected my share of bumps along the way, but I never stopped loving them. Over time any sincere love matures into a deep affection, along with the understanding that the object of your feelings may be imperfect but is no less loved for it.

So, here are several ways to protect yourself from disappointment in this powerful and beautiful tool...

Use only those regexps that you wrote yourself.

Every expression is written for a specific task. Unfortunately, the task the expression's author had in mind and the task of the programmer reusing it can differ, and most likely will.
You can ignore this rule, but you will end up following it anyway once you start debugging.

Remember what metacharacters mean in context

It is very important to know and remember what the main metacharacters mean and in which context. For example, "^" at the start of an expression and "^" inside a character class "[..]" mean different things.
Unfortunately, the patience for memorizing tables runs out after the multiplication table, and the few tables from the reference only get driven into your head in the form of bumps.

A sign that you remember this rule and apply it successfully is correct escaping of metacharacters. An experienced regexper will not write [a-z0-9\.-]: a dot inside a character class is just a dot and needs no escaping, whereas a hyphen inside a class is a metacharacter. In this example the hyphen comes last and is correctly treated as an ordinary character, but it only takes another copy-paster "improving" the class by appending one more character. At best you get an invalid-range error, at worst a subtle bug: in [a-z0-9\.-\/] the parser reads "\.-\/" as a range, not as three separate characters as intended.
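Both points can be checked in a few lines. This is only a sketch in Python's re module (the article does not name an engine beyond Perl, and the "safer" class is my own suggestion, not the author's):

```python
import re

# "^" outside a class anchors the match; inside a class it negates the set.
anchored = re.search(r'^b', 'abc')    # None: the string does not start with "b"
negated = re.search(r'[^a]', 'abc')   # finds "b", the first character that is not "a"

original = re.compile(r'[a-z0-9\.-]')     # trailing hyphen is a literal "-"
improved = re.compile(r'[a-z0-9\.-\/]')   # "\.-\/" is now the range from "." to "/"
safer = re.compile(r'[a-z0-9./-]')        # unescaped dot, hyphen kept last

for name, pattern in [('original', original), ('improved', improved), ('safer', safer)]:
    # Show which of the three punctuation characters each class accepts.
    print(name, [ch for ch in '.-/' if pattern.fullmatch(ch)])
```

The "improved" class silently stops accepting the hyphen, because "." and "/" are adjacent in ASCII and the range between them contains nothing else.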

Think about how the parser works. Try to help it.

Of course, modern regex engines are too complex to walk through here, but in principle they work logically. If you write several subpatterns in a row, they are matched in that order. So the best approach is to write the subpatterns one after another and not forget the delimiters between them. For example:
/^<[^>]+>$/
which actually consists of three subpatterns:
- the character "<"
- one or more characters other than ">"
- the character ">"
You should also give the parser as many islands of stability as possible: literal text such as "http" or, failing that, a restricted character class such as "[\w\s]+" or a few alternatives such as "(http)|(ftp)". The fewer ".+" and "*" you use, the more stable and faster the result will be.

If you suddenly decide that this would be better:
/^<.+>$/
then you are most likely in for some fun problems. Standard implementations are "greedy" by default, so the match will run from the start of the first tag to the end of the last one, swallowing everything in between. More on this later.
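The difference between the two tag patterns is easy to see on a small input. A minimal sketch in Python; the sample string and the example.org URL are made up for illustration:

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # hypothetical sample input

# Greedy ".+" runs to the last ">", the negated class stops at the first one.
greedy = re.findall(r'<.+>', html)
bounded = re.findall(r'<[^>]+>', html)
print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(bounded)  # ['<b>', '</b>', '<i>', '</i>']

# The anchored forms from the text: only the first one insists on a single tag.
single_tag = re.compile(r'^<[^>]+>$')
anything = re.compile(r'^<.+>$')
print(bool(single_tag.match('<b>bold</b>')))  # False
print(bool(anything.match('<b>bold</b>')))    # True

# An "island of stability": literal alternatives instead of ".+".
scheme = re.compile(r'^(?:http|ftp)://')
print(bool(scheme.match('ftp://example.org/file')))  # True
```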

Also, try to avoid ambiguity. For example,
/([a-z]+)([^<]+)*>/
can split a sequence of letters in any number of ways. Again, because quantifiers are "greedy" by default, the first group will "correctly" take the word up to the first non-letter, and the second will take the rest up to ">". It would perhaps be more correct to write
/([a-z]+)([^a-z<>]*)>/
but, going back to the first rule, it depends on the task.
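Here is how the two variants behave, as a rough Python sketch. It assumes the letter classes are meant to be [a-z], and the names loose and strict are mine:

```python
import re

loose = re.compile(r'([a-z]+)([^<]+)*>')       # second group may also eat letters
strict = re.compile(r'([a-z]+)([^a-z<>]*)>')   # second group can never eat letters

# On simple input both split at the first non-letter.
print(loose.match('abc123>').groups())    # ('abc', '123')
print(strict.match('abc123>').groups())   # ('abc', '123')

# With more letters after the first word the two disagree:
# the loose pattern lets them into the second group, the strict one refuses.
print(loose.match('abc 1 def>').groups())  # ('abc', ' 1 def')
print(strict.match('abc 1 def>'))          # None
```

Which of the two answers is "right" for the last input is exactly the kind of thing that depends on the task.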

Do not use regular expressions to parse nested structures.

The point is that regexps cover only a subclass of algorithms, and not every algorithm can be implemented with them, unlike with recursion, for example. Nested structures are a striking example.

For example, take a bit of HTML:
texttexttext
As you can see, it has
- nested tags
- consecutive tags
Consecutive tags can be handled with "non-greedy" quantifiers or by stopping the expression at the start of the next tag (the "[^<]+" approach shown above), but handling nesting is almost impossible. It can be done for specific cases, but there will always be an input that breaks our tidy picture.
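To make the two cases concrete, here is a small Python sketch. The sample strings are hypothetical stand-ins, not the snippet from the article:

```python
import re

# Hypothetical sample inputs.
consecutive = '<b>one</b><b>two</b>'
nested = '<b>one<b>two</b>three</b>'

lazy = re.compile(r'<b>(.*?)</b>')    # non-greedy quantifier
greedy = re.compile(r'<b>(.*)</b>')   # default greedy quantifier

print(lazy.findall(consecutive))    # ['one', 'two']
print(greedy.findall(consecutive))  # ['one</b><b>two']
print(lazy.findall(nested))         # ['one<b>two']
print(greedy.findall(nested))       # ['one<b>two</b>three']
```

The non-greedy pattern fixes the consecutive case but cuts the outer element short as soon as tags are nested, and the greedy one fails the other way round.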

As a special case, you should not rely on regexps alone for syntax highlighting: errors are unavoidable. This does not mean that our beloved regexps have no place in such tasks. For example, they are convenient for extracting markup from text, since it must be atomic.
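One way to split the work, sketched in Python under simplifying assumptions: the regexp only pulls out atomic tags, and an ordinary stack tracks the nesting. The tag syntax here (lowercase names, no attributes) and the function name is_balanced are made up for the example:

```python
import re

# Hypothetical simplified tag syntax: "<name>" or "</name>", no attributes.
TAG = re.compile(r'<(/?)([a-z]+)>')

def is_balanced(html):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for match in TAG.finditer(html):
        closing, name = match.group(1), match.group(2)
        if not closing:
            stack.append(name)            # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                  # closing tag with no matching opener
    return not stack                      # leftovers mean unclosed tags

print(is_balanced('<b>one<i>two</i></b><b>three</b>'))  # True
print(is_balanced('<b><i>text</b></i>'))                # False
```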

Regular expressions are for parsing strings. Everything else is better done another way.

Take this IP "validation" code:
/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

Why parse the numbers? In this example the addresses 0.0.0.0 and 255.255.255.255 will be valid, not to mention multicast addresses and the like.
In such cases it is better to apply a simple filter against blatantly invalid input (such as SQL injection), /^\d+\.\d+\.\d+\.\d+$/, and to check what is actually behind the IP as a separate step.
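A possible shape for that split, as a Python sketch: the regexp only rejects input that does not even look like an address, and the standard ipaddress module does the numeric work. The helper names parse_ip and is_usable_host, and the policy of rejecting unspecified, multicast and broadcast addresses, are assumptions for illustration:

```python
import ipaddress
import re

# Shape filter only; re.ASCII keeps \d to the digits 0-9.
SHAPE = re.compile(r'\d+\.\d+\.\d+\.\d+', re.ASCII)
BROADCAST = ipaddress.IPv4Address('255.255.255.255')

def parse_ip(text):
    """Return an IPv4Address, or None if the text is not a valid address."""
    if not SHAPE.fullmatch(text):
        return None                       # not even address-shaped
    try:
        return ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return None                       # right shape, octet out of range

def is_usable_host(text):
    """Hypothetical policy check: what the address means, not how it looks."""
    addr = parse_ip(text)
    if addr is None:
        return False
    return not (addr.is_unspecified or addr.is_multicast or addr == BROADCAST)

print(parse_ip('1.2.3.4; DROP TABLE users'))  # None
print(is_usable_host('0.0.0.0'))              # False
print(is_usable_host('10.0.0.1'))             # True
```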

Articles like this are usually called "10 poses everyone should know"... I deliberately replaced <ol> with <ul> because I am not God, I am only learning.
I wrote only what I remember and what I have stumbled on myself. I think anyone with comparable experience can show off bumps of their own.

Source: https://habr.com/ru/post/67158/
