Regular expressions inside bash

I was busy optimizing the speed of my script. The algorithm was already fully polished and parallelized, and it already ran in a more than acceptable time. Still, while occasionally going over parts of the code, shaking out the places that called external commands and bringing them into harmony with shell builtins, I noticed one stagnant worker: the stream editor sed, which was still diligently processing regular expressions in my growing script.
There are plenty of places where people are at each other's throats, defending the honor of their favorite tool in the formidable war of sed vs awk vs grep vs ...
However, most people know that replacing external commands with builtins often significantly speeds up the critical spots of a script and makes the author smile, since he spends less time over a cup of coffee waiting for the processing to end. In a sense this is somewhat perverse if he knows C and could speed the program up far more by rewriting it in C; but do not write him off as crazy right away: some scripts are voluminous, carry a lot of code and use many different commands, so the C version would swell with cheeky exec() calls.
Be that as it may, in version 3 the bash developers gave us built-in regular expressions inside the [[ command, through the =~ operator.
Most search results about this bash feature deliver the same verdict: "using regular expressions inside bash is bad form".
In this article I will try to judge how bad things really are (and they really are not great).

Data to be processed

After some earlier experiments, I was left with a database of cars in CSV format, containing the make, the model and its modification. The file is about 1 MB, which is large for a plain-text e-book, so it leaves room for inventing a sample and lets us evaluate regex performance on something bigger than a short file.
So, suppose I am Frank the gardener, who has decided to buy a Bugatti (the choice fell on it because this somewhat old database contains only 4 cars of that brand; and besides, what gardener would not want to carry his seedlings in a car of a famous brand instead of an overcrowded bus, charming the local beauties along the way).
In the database, the cars are sorted by brand (although, for the validity of the tests, they were later re-sorted by model year). The database fragment used in the tests:

"";" ";" (/)";" ( 100/)";" (^3)";" (../. )";" ( 100 )";" ";" ()";" ()";" ()";" ";" ";" ";" ()"
...
"Brilliance M3 1.8";"-";"210";"10.1";"1793";"170/5500";"7.7";"";"4488";"1812";"1385";"2008";"3";"4";"400"
"Bugatti Veyron EB 16.4";"43 968 000";"407";"3";"7993";"1001/6000";"24.1";"";"4465";"2000";"1205";"2006";"2";"2";"-"
"Bugatti EB 110 GT";"-";"340";"3.6";"3500";"559/8000";"14.3";"";"4400";"1960";"1125";"1991";"2";"2";"72"
"Bugatti EB 110 S";"-";"350";"3.4";"3500";"620/8000";"13.5";"";"4400";"1960";"1125";"1991";"2";"2";"72"
"Bugatti EB 112 6.0";"-";"300";"4.7";"5995";"461/6300";"18.2";"";"5070";"1960";"1405";"1993";"4";"4";"365"
"Buick Enclave CX";"-";"180";"-";"3564";"279/6600";"-";"";"5126";"2006";"1846";"2007";"5";"8";"535-3259"
...

The task of the experiment is to estimate how fast the lines containing the Bugatti brand can be selected with the regular expressions built into bash. Lazy people will notice that this can be done without any of that, with one command:
grep Bugatti auto.csv
But the situation was invented for the test, not for real use. Really, what gardener can afford a Bugatti?

Testing method

The comparison uses the output of the time command for a function that uses the built-in regular expressions and for a function that uses the stream editor sed (any other tool could be chosen, but I like this one).
For simplicity, each function receives the data, already read from the file, as its first parameter and writes its results to the global array tmp.
So, the function that uses sed is:

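function test_sed()
{
    # Split the sed output on newlines only: one array element per matching line
    OLD_IFS=$IFS
    IFS=$'\n'
    # Print only the lines that contain Bugatti, dropping everything before the brand
    tmp=($(sed -n '/.*\(Bugatti[^\n]*\)/s//\1/p' <<< "$1"))
    # Restore the original field separator
    IFS=$OLD_IFS
}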
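
The measurement described above can be tried on a small scale. The following is only a sketch: the three-row `data` string is a hypothetical stand-in for the contents of auto.csv (in a real run it would be read with something like `data=$(< auto.csv)`), and the sed expression is a simplified equivalent of the one in the function above, not the exact benchmark code.

```bash
#!/usr/bin/env bash
# Hypothetical sample standing in for the real file (shortened rows).
data='"Brilliance M3 1.8";"-";"210"
"Bugatti EB 110 GT";"-";"340"
"Buick Enclave CX";"-";"180"'

test_sed() {
    # local keeps the IFS change inside the function
    local IFS=$'\n'
    tmp=($(sed -n 's/.*\(Bugatti.*\)/\1/p' <<< "$1"))
}

# time is a bash keyword, so it can time a shell function directly;
# the real/user/sys report goes to stderr.
time test_sed "$data"

printf '%s\n' "${tmp[@]}"
```

On a sample this small the timings are meaningless; the point is only the shape of the harness.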

Note that the regular expression leaves room for imagination: you can anchor it to the beginning and end of the line, or come up with a completely different pattern.
The average test result on my hardware is as follows:

real 0m0.805s
user 0m0.719s
sys 0m0.082s

The regular expressions built into bash are expected to significantly reduce the time spent in system calls and to somewhat reduce the total test time.

Function test_bash_rematch

Writing the function that tests regexps in bash took some time, so the obstacles I ran into deserve a description.
The general form of a pattern match in bash 3 looks like this: [[ $str =~ "$regex" ]]
The exit status of the command is 0 if a match for the pattern is found, 1 if it is not, and 2 if the pattern is syntactically incorrect. The match is stored in the array ${BASH_REMATCH}: index 0 holds the part that matched the whole pattern, and the following indices hold the parenthesized groups in the order they appear in the pattern.
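A minimal sketch of the three exit statuses and of the BASH_REMATCH indices. The sample string and the variable names (`whole`, `brand`, `model`, `no_match`, `bad_pattern`) are made up for illustration, and the patterns are kept in unquoted variables:

```bash
#!/usr/bin/env bash
str='"Bugatti EB 110 GT";"-";"340"'
re='(Bugatti) ([^"]*)'

if [[ $str =~ $re ]]; then          # status 0: a match was found
    whole=${BASH_REMATCH[0]}        # text matched by the whole pattern
    brand=${BASH_REMATCH[1]}        # first parenthesized group
    model=${BASH_REMATCH[2]}        # second parenthesized group
fi

[[ $str =~ Buick ]]
no_match=$?                         # status 1: no match

bad='('                             # unbalanced parenthesis
[[ $str =~ $bad ]]
bad_pattern=$?                      # status 2: invalid pattern

echo "$whole | $brand | $model | $no_match | $bad_pattern"
```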
The first obstacle: starting with a certain version, the pattern must not be enclosed in quotes (a quoted pattern is matched as a literal string), which I had overlooked in man bash.
The second pitfall: bash uses POSIX regular expressions, and the distinction between greedy and lazy quantifiers did not work for me.
And finally the whirlpool that slowed progress the most: one test against the whole text gives only a single match, and that match need not stop at the first occurrence but may run on through any of the later ones (this is not a stream editor that examines the text line by line).
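These pitfalls can be reproduced on a two-line sample. This is a sketch under assumptions: the `data` string is invented for illustration, and the behavior shown is that of bash 3.2 or later with default compatibility settings.

```bash
#!/usr/bin/env bash
# Hypothetical two-line sample (shortened rows).
data=$'"Bugatti EB 110 GT";"-"\n"Bugatti EB 110 S";"-"'

# Quoted pattern: compared as a literal string, so the text ".*" is not found.
[[ $data =~ "Bugatti.*GT" ]]
quoted=$?

# Unquoted pattern held in a variable: treated as a regular expression.
re='Bugatti.*GT'
[[ $data =~ $re ]]
unquoted=$?

# The dot also matches the newline, so a greedy .* runs on into the next line.
re='Bugatti.*'
[[ $data =~ $re ]] && greedy=${BASH_REMATCH[0]}

# One test gives one match; a negated class keeps it inside the first field.
re='Bugatti[^"]*'
[[ $data =~ $re ]] && first=${BASH_REMATCH[0]}

echo "quoted=$quoted unquoted=$unquoted first=$first"
```

This is also why the single-match function uses explicit character classes instead of `.*`: the classes do not include the newline, so the match stays within one line.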
As a result, testing was conducted using the following two functions:

function test_bash_rematch_single()
{
[[ "$1" =~ (Bugatti[[:alpha:][:digit:][:punct:][:blank:]]*) ]] && tmp[0]="${BASH_REMATCH[1]}"
}

It looks for a single match. Averaged test results:

real 0m0.678s
user 0m0.624s
sys 0m0.030s

Better than sed, but that is understandable: so far this is only a single match.
The second function uses the read builtin and a loop:

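function test_bash_rematch_while()
{
    i=0
    # IFS= and -r keep each line exactly as it is in the data
    while IFS= read -r line
    do
        [[ "$line" =~ (Bugatti.*) ]] && tmp[i]="${BASH_REMATCH[1]}" && ((i++))
    # Quote the parameter so the here-string keeps its newlines
    done <<< "$1"
}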

With results:

real 0m1.523s
user 0m1.360s
sys 0m0.030s

Conclusion

As the results show, the time spent in system calls really does drop when the built-in bash regular expressions are used. But at the moment bash 4.1.5(1) matches patterns very slowly, which rules out the built-in regexps in critical places. Where execution time does not matter, they are not worth using either, since they reduce portability between shells without giving any advantage.

PS

The pattern search could also be implemented recursively: on finding a match, the function splits the source text into two parts, before and after the match, and passes them to itself. Possibly (if the parts are processed in separate background processes) it would work faster, but the time spent in system calls would grow, and we would merely trade one problem for another.
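
The split-and-continue idea can be sketched without recursion or background processes. This is an iterative variant, not the recursive function described above, and it was not part of the benchmark; `collect_matches`, `matches` and the sample `data` are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Collect every match of a pattern by repeatedly cutting the text after each match.
# Assumes the pattern never matches the empty string (otherwise the loop would not end).
collect_matches() {
    local text=$1 re=$2
    matches=()
    while [[ $text =~ $re ]]; do
        matches+=("${BASH_REMATCH[0]}")
        # Drop everything up to and including the first occurrence of the matched text.
        text=${text#*"${BASH_REMATCH[0]}"}
    done
}

# Hypothetical sample (shortened rows).
data='"Brilliance M3 1.8";"-"
"Bugatti EB 110 GT";"-"
"Bugatti EB 110 S";"-"
"Buick Enclave CX";"-"'

collect_matches "$data" 'Bugatti[^"]*'
printf '%s\n' "${matches[@]}"
```

Each pass rescans the remaining text from its start, so this is unlikely to be faster than the line-by-line loop; it only shows how all matches can be gathered from one string.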

Source: https://habr.com/ru/post/128059/
