📜 ⬆️ ⬇️

Regexp is a “programming language.” The basics

A few years ago, I thought that regexp performs a linear search on the text, but what a surprise I was when I realized that it was not. Then I was convinced from my own experience that the result of a simple change of a and b in the scheme (... a ...) | (... b ...) was completely changed.

So now I’ll tell you how regexp actually works.
By understanding these simple principles and how it works, you can write any queries.
For example, I will analyze the complex at the first approximation, but in fact the simplest task is to identify all strings in quotes.


Branches in regexp

Regular expressions work according to the scheme of a directional tree.
For example, the branch (a ...) | (b ...) | (c ...) can be written:
If (char=='a'){
...
}else if (char =='b'){
...
}else if (char =='c'){
...
}



')
That is why rearrangement of the 'b ...' 'a ...' places affects the result - because execution priorities are set here (in what order you put it, it will be done so).

That is why it is worthwhile to ensure that the priority is given and to try to make the conditions optimally supplant each other.
And everything that for 'a' will be carried out provided that everything before 'a' (including 'a') is fulfilled.

We write quotes parser

Consider a simple example.

Let's take the test string 123"ABC\"D\EF"GHI""\"KEY and start scoffing at it:

The first thing that appears in the head is /".*"/ expression. It will act according to the algorithm:
1) Search for the first one "
2) While we have any character (including " ), we go further.
3) In the end should be too "
As a result, correctly, we got "ABC\"D\EF"GHI""\" .
That is, found the first quote. Further, while the condition was fulfilled, the following symbols were taken (including " ) and were done until the last one was " .

Now let's improve the algorithm - let's make it look for any character, excluding " .
Our regular expression has become /"[^"]*"/ . It will act according to the algorithm:
1) Search for the first one "
2) As long as we have any symbol not equal " , we go further
3) Stumbled upon " - the end.
The result has become more accurate - "ABC\", "GHI", "\" were chosen.



Now we need to define the characters \" leave them inside, not counting this as the end.
To do this, you must change the condition [^"] , adding there another comparison with \" .
It will now look like this - /"([^"]|(\\\\"))*"/ .

We added the condition \\\\" . Why four '\' ? Because every two \\ in the line = one \, that is, we wrote down \\ in the query string, and the regexp uses the expressions \ w, \ d, \ s and so on, to put one \, you must use \\.

But our expression will not work yet.
Why not? It will not work, because the condition [^"] occurs first, and then, if it is not met, then it is compared with \" :
1) Search for the first one "
2) As long as we have any symbol not equal, " we go further,
if it is equal to " (the previous condition was not fulfilled), we compare it to c \ (it is not equal to itself)
3) Stumbled upon " - the end.

Therefore, it will be correct to swap the conditions - / "((\\\\") | [^ "]) *" /, so that \" checked first, and then any other character is not " .
Now everything works correctly and the result selects "ABC\"D\EF", "" . It looks like magic, right? The algorithm worked correctly.



Immediately I will say that the option [^"\\\\]|(\\\\") does not fit, because when the algorithm finds \, it will go to the second condition \" (for \ must be " ) that will not be executed in our case \ E. For this, it will be necessary to put a condition - if after \ goes " , then we skip the symbol, otherwise we go further. That is, the expression will take the form /"([^"\\\\]|(\\\\(")?))*"/ .

Improving the algorithm

Let's add a parsing character to '.
In regular expressions, you can use the characters found in future checks - we will use this:

We start our expression by searching for any quotation mark / apostrophe /(["'])... - the found quote will fall into the special variable \ 1, which we can use further in the test. In this case, we will get there one character - either " or'. In the end, in order to check the closing of this quotation mark, you must use ...(\1)/ . Inside, check not for the absence " , but for the absence of \ 1.

We optimize the code a little and get /(["\'])(\\\\\1|.)*?\1/ ( ""''you) (( /(["\'])(\\\\\1|.)*?\1/ .) /(["\'])(\\\\\1|.)*?\1/ ? \1 /(["\'])(\\\\\1|.)*?\1/ . It should be noted that I used ? (Lazy) in the expression - to add the last \ 1 - condition that is, now everything else is still checking on. "
Why did I do this instead of [^ \ 1]? Because \ 1 does not work in [].

Now the code does the following:
1) Search for the first one " or' (write it to \ 1)
2) The next character " or'?
if not, then the next two characters are \" or \' (depending on the beginning)
if not, then just skip the symbol
3) Stumbled upon " - the end.
And the expression 1'2'a3"A'BC\"DEF"GHI""\"KEY parsed to '2', "A'BC\"DEF", "" .

This expression can be used to select string regions inside any objects.
For example, function:
function a () {
b = "{}" ;
}

Add curly brackets to the expression / /{((["\'])(\\\\.|[^\\\\])*?\2|.)*?}/ {b="{}";} /{((["\'])(\\\\.|[^\\\\])*?\2|.)*?}/ {b="{}";} . Since one more brackets appeared in the expression, \ 1 moved to \ 2 - be sure to follow this.

Upd. I forgot to mention the reverse. There is such a move, when the algorithm finds nothing, moving straight :). By this it is better instead . use [^ \\\\]. (see the last example) In this case, the location in the string "\" does not happen, as it should be.

Source: https://habr.com/ru/post/55766/


All Articles