Improve regular expressions

After reading a book about regular expressions (hereinafter simply REs), I had some thoughts about their readability. When REs first appeared and there were only a few symbols like \d, \w and the like, things were probably not so bad, although even then readability deserved some thought. Today, reading code that contains REs is quiet horror. If the RE is short there are no particular problems, but as it grows more complex and various brackets appear, everything becomes dreadful. The situation is made worse by the fact that in some languages (we will not point fingers) you constantly have to double the backslashes.

In addition, the RE notation now used in most programming languages forces you to resort to various tricks in some seemingly simple situations. The first example that comes to mind is writing a regular expression for: if "abc", then NOT "xyz".
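
For reference, the trick that current notation needs here is a negative lookahead. Below is a minimal sketch using Python's `re` module. It assumes the condition means "abc that is not immediately followed by xyz", and the function name is hypothetical.

```python
import re

# Classic notation: "abc" that is NOT immediately followed by "xyz".
# (?!...) is a negative lookahead; it checks the text ahead without consuming it.
ABC_NOT_XYZ = re.compile(r"abc(?!xyz)")


def has_abc_without_xyz(text: str) -> bool:
    """Return True if some 'abc' in text is not directly followed by 'xyz'."""
    return ABC_NOT_XYZ.search(text) is not None
```

Nothing in `(?!xyz)` hints at "not" to a reader who has not memorized the bracket forms, which is the readability problem in question.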

In my opinion, it is time to abandon the current notation and create a new one that is closer to an ordinary programming language, because RE notation is in fact a language, just an awfully designed one. The worst thing about today's notation is the abundance of brackets like (...), (?:...), (?=...), (?!...), (?<=...), (?<!...), (?<). It is because of them that expressions become confusing: you cannot take in the whole RE at a glance and say at once what it looks for. Instead you have to check every character in the string, without forgetting, for example, that ^ in the middle of an RE means the beginning of a line, while at the start of square brackets, [^...], it means negation. It is also wrong that whenever a new feature appears, developers invent a new designation, some (&^%$#@...), which in itself says absolutely nothing.

After all, what is the beauty of conventional programming languages (leaving extreme cases aside)? If we see an if or while statement in an unfamiliar language, we can immediately tell what it does. Yes, you could replace these statements with symbols like @#$% and #&$^ respectively, and you could even get used to them, but as the joke about the Russian language lesson in Georgia goes, "this you must memorize, for it cannot be understood."

Perhaps the situation could be improved by smart code editors that highlight the brackets (?:...), (?=...), etc. differently inside a regular expression, so that their scope is immediately visible. For most programming languages, however, this is almost impossible: an RE there is just a string, and the editor would have to work out from the string's content whether it is looking at an RE or plain text. And in any case, with deeply nested brackets the RE would turn into a multi-colored rainbow.

Generally speaking, readability has already improved quite a bit thanks to modes in which an RE can be written across several lines, and thanks to comments inside REs. There is even a construct with the self-explanatory form (?if ...); in Perl (not to be named after dark) program code can be embedded in a regular expression; and in .NET, instead of a simple replacement string, the replacement value can be generated by a ~~specially trained~~ special delegate. So you can write a more or less understandable RE even now, but it is still not the real thing; it is more like crutches.
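
As an illustration of these existing aids, here is a small sketch in Python rather than Perl or .NET. `re.VERBOSE` gives the multi-line mode with comments, and passing a function to `sub` plays roughly the role of the .NET delegate. The date pattern and the function names are made up for the example.

```python
import re

# Verbose mode: whitespace in the pattern is ignored and "#" starts a comment.
DATE = re.compile(r"""
    (?P<year>\d{4})  -   # four-digit year
    (?P<month>\d{2}) -   # two-digit month
    (?P<day>\d{2})       # two-digit day
""", re.VERBOSE)


def to_european(match):
    """Build the replacement text in code instead of a replacement string."""
    return f"{match['day']}.{match['month']}.{match['year']}"


def reformat_dates(text: str) -> str:
    """Rewrite every YYYY-MM-DD date in text as DD.MM.YYYY."""
    return DATE.sub(to_european, text)
```

This reads tolerably, but the named groups still rely on the `(?P<...>)` bracket form, so the crutches remark stands.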

It is time to create an RE language that resembles other "human" programming languages rather than Brainfuck. Then it would be possible to provide clear highlighting, IntelliSense-style hints and, in the future, perhaps even step-by-step debugging of REs.

Next, I would like to show what I would like REs to look like.

First, REs must somehow be distinguished from ordinary strings. The functions that work with them clearly require strings, and I am not sure that REs should be built into languages themselves, as is done in Perl. Let them remain strings, but to identify them inside the quotes some additional notation should be used. It can be anything: for example, instead of "\d\w" (for clarity, I will not double the backslashes) you would write "!\d\w!" or "<\d\w>", and then the editor can easily tell REs from strings. From here on I will use the form "!...!", but this, like the rest of the notation, is not important; what matters is the idea.

Second, REs should be written only in the mode where spaces and line breaks are ignored, and to separate literals, which always stay as they are, from the RE constructs themselves, literals can be enclosed in quotes (it does not matter which kind). For example, instead of "abcd\d\wxyz" you can write:

"! 'abcd' \d\w 'xyz' !"

Or even "! 'abcd' \d \w 'xyz' !"

The code editor can tint abcd and xyz separately here. It may be worth using the "+" sign to join these pieces. That would be even clearer: "! 'abcd' + \d\w + 'xyz' !", because the separate parts of the RE are more clearly set apart visually.
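
Something close to this can already be imitated with ordinary string concatenation. The sketch below is in Python, and `lit` is a hypothetical helper, not part of any library. It treats a quoted literal as text to be escaped, and "+" is then plain concatenation.

```python
import re


def lit(text: str) -> str:
    """Quoted literal: escape it so no character acts as an RE operator."""
    return re.escape(text)


# "! 'abcd' + \d\w + 'xyz' !" written as concatenation of separate parts.
PATTERN = lit("abcd") + r"\d\w" + lit("xyz")
```

Escaping matters as soon as a literal contains a special character: `lit("a.b")` yields `a\.b`, so the dot can no longer be mistaken for "any character".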

You may be bothered by the fact that the "+" sign currently means "1 or more matches", but that is no problem, because nobody is going to use it in that sense any more; that meaning is not logical. There are clear constructs such as {min, max}, so let us use them together with the "*" operator. The "*" operator should be used simply in the sense of "multiply": the expression "! 'abc' * 3 !" means that the string 'abc' must repeat 3 times, and the RE "! 'abc' * {1, 3} !" means that abc must repeat from 1 to 3 times. Similarly, "! 'abc' * {1,} !" can be used for "1 or more matches" instead of "+", and "! 'abc' * {0,} !" instead of the old "*" operator. The form "! 'abc' * {3, 3} !" is equivalent to the "! 'abc' * 3 !" we have already seen. The old "*" operator can also be written more briefly as "! 'abc' * {,} !".
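
The mapping from this "multiplication" to classic quantifiers is mechanical. Here is a minimal Python sketch, where `times` is a hypothetical helper name and a missing upper bound is represented by `None`.

```python
import re
from typing import Optional


def times(pattern: str, low: int = 0, high: Optional[int] = None) -> str:
    """Translate 'pattern * {low, high}' into a classic quantifier.

    The pattern is wrapped in a non-capturing group so that the
    quantifier applies to all of it, not only to its last character.
    """
    if high is not None and low == high:
        return f"(?:{pattern}){{{low}}}"            # 'abc' * 3
    upper = "" if high is None else str(high)
    return f"(?:{pattern}){{{low},{upper}}}"        # 'abc' * {1, 3}, {1,}, {,}


exactly_three = times("abc", 3, 3)      # (?:abc){3}
one_to_three = times("abc", 1, 3)       # (?:abc){1,3}
one_or_more = times("abc", 1)           # (?:abc){1,}   replaces the old "+"
zero_or_more = times("abc")             # (?:abc){0,}   replaces the old "*"
```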

Perhaps square or round brackets should be used instead of curly ones; that would be even closer to the mathematical notation for closed and open intervals.

The question remains how to denote the minimal (non-greedy) version of the "*" operator. The division operator could be used, but that is not logical either, so it can be written out directly as "! 'abc' min* 3 !". Here min* is a single operator, written without a space. I do not much like this form, but at least its name explains what it does.
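
For comparison, this is how the greedy and minimal variants differ in today's notation, shown in Python. The only visible difference is a trailing "?", which is easy to overlook.

```python
import re

text = "abcabcabc"

# Greedy: take as many repetitions as possible (up to 3).
greedy = re.match(r"(?:abc){1,3}", text).group()

# Minimal (non-greedy): stop as soon as the lower bound is satisfied.
lazy = re.match(r"(?:abc){1,3}?", text).group()

print(greedy)   # abcabcabc
print(lazy)     # abc
```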

Most brackets should be replaced with built-in functions. For example, instead of "[abc]" you would write "! any(a, b, c) !", and the expression "(?:abc)|(?:xyz)" can be replaced with "! any('abc', 'xyz') !", which also gets rid of the "|" operator. REs can be used as function arguments too, for example "! any(\d\w, 'abc') !".
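
A rough equivalent of such a function can be sketched in Python today. `any_of` is a hypothetical name, and the arguments are assumed to be already valid classic REs.

```python
import re


def any_of(*alternatives: str) -> str:
    """Translate any(p1, p2, ...) into a non-capturing alternation."""
    return "(?:" + "|".join(alternatives) + ")"


letters = any_of("a", "b", "c")         # same language as [abc]
words = any_of("abc", "xyz")            # same as (?:abc)|(?:xyz)
mixed = any_of(r"\d\w", "abc")          # REs as arguments
```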

We need to decide what to do with the simplest expressions like \w, \b, \d, etc. On the one hand they are quite compact, but I, for one, like the notation that can already be used inside square brackets: [:alnum:]. For convenience it could be replaced with the form _alnum_. Or perhaps the simplest ones, \d and \w, should be left as they are. Instead of ".", which is not very noticeable in the text, the form _any_ could be used. Spaces and tabs, which are ignored in the expression itself, can be written as _space_ or simply put in quotes.

A proper if - then - else operator is needed. Its meaning is that if the expression after if matches, the RE in the then branch is checked, otherwise the one in the else branch. I think the word then can be omitted. It will then be possible to write REs like this:

"! 'abc' if (\w * 3) { 'xyz' } else { \d * {1, } 'klmn' } !"

Here I used the syntax of C-like languages, but that is not essential. Literally, this expression means: first comes the string 'abc', then the RE '\w * 3' is checked; if it matches, 'xyz' must follow, otherwise at least one digit must follow, and then 'klmn'.
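
In classic notation such a branch can be emulated with a pair of lookaheads. The Python sketch below assumes that the condition is only tested at the current position and does not consume any text; that reading, the helper name `if_else` and the sample patterns are assumptions made for illustration. Python's own conditional, `(?(id)yes|no)`, can only test whether a group matched, so it is not used here.

```python
import re


def if_else(cond: str, then: str, otherwise: str) -> str:
    """Emulate 'if (cond) { then } else { otherwise }'.

    (?=cond) succeeds only where cond matches, (?!cond) only where it
    does not, so exactly one of the two branches can be taken.
    """
    return f"(?:(?={cond}){then}|(?!{cond}){otherwise})"


# "id:" followed by digits if a digit comes next, otherwise by lowercase letters.
PATTERN = re.compile("id:" + if_else(r"\d", r"\d+", r"[a-z]+"))
```

The generated text, `id:(?:(?=\d)\d+|(?!\d)[a-z]+)`, shows how much bracket noise the proposed if and else keywords would hide.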

It may even be worth introducing operators such as case, while and for. In addition, the logical operations AND, OR and NOT are needed for use in conditions. I am not sure about AND and OR, because the expression "! if ('abc' && 'xyz') !" is equivalent to "! if ('abcxyz') !", and "! if ('abc' || 'xyz') !" to "! if (any('abc', 'xyz')) !". But the negation operator is definitely needed, to state what must not appear at a given place.

We need a variable that holds the position in the string where the search is currently taking place (let it be the variable _pos_), and a variable that holds the string to which the RE is applied (_this_). Then the "^" operator can be replaced by the clearer "! _pos_ == 0 !", and "$" by "! _pos_ == (strlen(_this_) - 1) !". It may be worth introducing a separate designation for the end of the string, for example by analogy with Python: _pos_ == -1. The same variables will make lookahead and lookbehind checks possible.
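
To see which positions the classic anchors actually correspond to, one can ask an existing engine. This is a small Python sketch; other engines may count positions differently.

```python
import re

text = "hello"                      # plays the role of _this_

start = re.search(r"^", text)       # classic "^"
end = re.search(r"$", text)         # classic "$"

print(start.start())                # 0
print(end.start(), len(text))       # 5 5
```

In Python, at least, "$" matches at offset `len(text)`, one past the last character, so a literal translation of `_pos_ == strlen(_this_) - 1` would need adjusting for an engine that counts positions this way.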

Comments must be kept. What they look like is not important.

The assignment operator should work in two modes. The first is to match and assign to a variable the substring that corresponds to a regular expression, for which a form like "(?<foo>...)" is used today: "! foo = \w\d*; !". A semicolon will be needed to show where the assignment statement ends.
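
For comparison, here is the first mode in today's notation, sketched in Python, which spells the named group `(?P<foo>...)` rather than `(?<foo>...)`. The sample text is made up.

```python
import re

# Classic counterpart of "! foo = \w\d*; !": capture the match under the name foo.
match = re.search(r"(?P<foo>\w\d*)", "x42 y7")

foo = match.group("foo")    # the text that the sub-expression matched
print(foo)                  # x42
```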

The second assignment mode saves the regular expression without matching it. It is used for clarity, for example:

"! foo = !\d\w*! 'abc' foo 'xyz' foo !"

Here the expression !\d\w*! (note the exclamation marks) is subsequently used via the variable name foo.
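
Since REs are plain strings in most languages, this second mode can already be approximated with ordinary variables. Here is a minimal Python sketch; wrapping the fragment in a non-capturing group is an added precaution so that it behaves as one unit wherever it is pasted.

```python
import re

# "foo = !\d\w*!": store a fragment without matching anything yet.
foo = r"(?:\d\w*)"

# "'abc' foo 'xyz' foo": reuse the fragment by name.
PATTERN = re.compile("abc" + foo + "xyz" + foo)
```

The drawback is that the pattern is then scattered between the host language and the RE string, whereas the proposed notation keeps the definition inside the expression itself.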

These are the main ideas about REs that have occurred to me. It would be interesting to try such expressions in action, but unfortunately I am unlikely to get around to implementing such a parser. One could start by converting such expressions to classic RE form and then handing them to an existing library.
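
To give a feel for that first step, here is a toy converter in Python for a very small subset of the notation: quoted literals, shorthand classes such as \d, "+" as glue, and "*" with a count or a {min, max} range. It is a sketch under those assumptions, not a real parser. Functions, variables, if/else and nesting are left out, and the name `translate` is hypothetical.

```python
import re

TOKEN = re.compile(r"""
    \s*(?:
        '(?P<lit>[^']*)'                                     # quoted literal
      | (?P<cls>\\[A-Za-z])                                  # class such as \d
      | \*\s*(?P<count>\d+)                                  # * 3
      | \*\s*\{\s*(?P<low>\d*)\s*,\s*(?P<high>\d*)\s*\}      # * {1, 3}
      | (?P<plus>\+)                                         # + only glues parts
    )
""", re.VERBOSE)


def translate(source: str) -> str:
    """Convert a '! ... !' expression (small subset) to classic notation."""
    body = source.strip()
    if len(body) < 2 or not (body.startswith("!") and body.endswith("!")):
        raise ValueError("expected !...! delimiters")
    body = body[1:-1].strip()
    parts = []
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"cannot parse at offset {pos}: {body[pos:]!r}")
        pos = m.end()
        if m["lit"] is not None:
            # Group the literal so a following quantifier covers all of it.
            parts.append("(?:" + re.escape(m["lit"]) + ")")
        elif m["cls"]:
            parts.append(m["cls"])
        elif m["count"] or m["low"] is not None:
            if not parts:
                raise ValueError("'*' has nothing to repeat")
            if m["count"]:
                parts[-1] += "{" + m["count"] + "}"
            else:
                parts[-1] += "{" + (m["low"] or "0") + "," + m["high"] + "}"
        # A "+" token contributes nothing: concatenation is implicit.
    return "".join(parts)
```

For instance, `translate("! 'abc' * {1, 3} !")` returns `(?:abc){1,3}`, which can be passed straight to `re.compile`.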

Finally, a small example that finds a URL. It probably does not take everything into account; for example, it assumes that the top-level domain can only be com, net, info or a two-letter one.

"!
unicode = !% any(\d, A-F) * 2!   // Unicode
domain = !any('com', 'net', 'info', (a-z) * {1, 2})!
host = !any(\w, '_', unicode)!
"http://" (host '.') * {1,} domain '/' * {0, 1}
!"

I hope I have not made a mistake anywhere, but even if I have, it is no disaster; the main thing was to get the idea across.
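
For comparison, the URL example could be approximated in classic notation roughly as follows, written in Python with named fragments. This is a sketch with several assumptions: the escape is read as "%" followed by two hexadecimal digits, the two-letter zone as lowercase a-z, and each host label is taken to be one or more allowed characters.

```python
import re

unicode_escape = r"%[\dA-F]{2}"                     # "%" plus two hex digits
domain = r"(?:com|net|info|[a-z]{1,2})"             # allowed top-level domains
host = rf"(?:\w|_|{unicode_escape})+"               # one label (assumed: 1+ chars)

# "http://", one or more "label." groups, the domain, an optional slash.
URL = re.compile(rf"http://(?:{host}\.)+{domain}/?")
```

Even with named fragments, the final line is dense, which is exactly what the proposed notation is meant to address.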

In conclusion, I will say once more that the main goal of all this was to work out how to make REs more readable. Of course, the amount of text to type will increase, but for large REs it is worth it.

Source: https://habr.com/ru/post/62664/
