Good evening, Habr readers! Today I want to share a small performance assessment of ORegex for .NET.
If you read my previous article here, you may agree that presenting something without a comparative speed assessment is not very convincing. If so, read on below the cut.
I thought for a long time about what to include in the benchmark without tormenting myself too much. I did not want to spend a lot of time and effort on it, and since I do not have much imagination, I decided to take the simplest comparison: searching for a substring in text. In my opinion this is a good fit for a performance test, because a character is also an object. Note that all the code is available in the test project on GitHub; you can run it if you have nothing better to do =)
We have ORegex and the Regex engine from Microsoft. All that is needed is to describe equivalent patterns for both, plus a lambda for each character for ORegex, and to come up with test cases. I did not deliberate for long and decided to see how the tools cope with three tasks:
First, a page from a news site was selected, and the task was to find all p tags over 20 iterations. As expected, Regex comes out ahead. On the first (cold-start) iteration both tools showed roughly comparable results; from the second iteration on, Regex was about 10 times faster:
ORegex pattern: `{b1o}{p}{b1c}.*?{b1o}{slash}{p}{b1c}`; Regex pattern: `<p>.*?</p>`
No | ORegex | Regex | Ratio (Regex / ORegex) |
---|---|---|---|
1 | 00:00:00.0040204 | 00:00:00.0058571 | 1.46 |
2 | 00:00:00.0030944 | 00:00:00.0003172 | 0.1 |
3 | 00:00:00.0032093 | 00:00:00.0003195 | 0.1 |
4 | 00:00:00.0031040 | 00:00:00.0003172 | 0.1 |
5 | 00:00:00.0032354 | 00:00:00.0003149 | 0.1 |
6 | 00:00:00.0031703 | 00:00:00.0003153 | 0.1 |
7 | 00:00:00.0031220 | 00:00:00.0003187 | 0.1 |
8 | 00:00:00.0030883 | 00:00:00.0003187 | 0.1 |
9 | 00:00:00.0036790 | 00:00:00.0003674 | 0.1 |
10 | 00:00:00.0030902 | 00:00:00.0003145 | 0.1 |
11 | 00:00:00.0030787 | 00:00:00.0003130 | 0.1 |
12 | 00:00:00.0030752 | 00:00:00.0003149 | 0.1 |
13 | 00:00:00.0030975 | 00:00:00.0003183 | 0.1 |
14 | 00:00:00.0032250 | 00:00:00.0003777 | 0.12 |
15 | 00:00:00.0031166 | 00:00:00.0003179 | 0.1 |
16 | 00:00:00.0030852 | 00:00:00.0003141 | 0.1 |
17 | 00:00:00.0031178 | 00:00:00.0003160 | 0.1 |
18 | 00:00:00.0030913 | 00:00:00.0003160 | 0.1 |
19 | 00:00:00.0030787 | 00:00:00.0003133 | 0.1 |
20 | 00:00:00.0030818 | 00:00:00.0003118 | 0.1 |
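For readers who want to reproduce the shape of this measurement, here is a minimal sketch of such a timing loop. It is written in Python with the standard `re` module as a stand-in, not the .NET test project from the article, and `sample_html` is a hypothetical input rather than the news page that was actually used. The function name `time_iterations` is likewise invented for illustration.

```python
import re
import time


def time_iterations(pattern, text, iterations=20):
    """Scan the whole text `iterations` times and time each full scan."""
    compiled = re.compile(pattern, re.DOTALL)
    timings = []
    count = 0
    for _ in range(iterations):
        start = time.perf_counter()
        # Count every non-overlapping match in the text.
        count = sum(1 for _ in compiled.finditer(text))
        timings.append(time.perf_counter() - start)
    return count, timings


# Hypothetical stand-in for the news page (assumption, not the article's data).
sample_html = "<html><body>" + "<p>some text</p><div>x</div>" * 500 + "</body></html>"

count, timings = time_iterations(r"<p>.*?</p>", sample_html)
print(count, f"fastest={min(timings):.7f}s", f"slowest={max(timings):.7f}s")
```

Timing each iteration separately, as the table does, keeps one-off warm-up costs visible instead of averaging them away.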
Next, it is worth seeing what the difference is when the data follows no logic at all. For this test, a file was generated containing a completely random sequence drawn from a given set of characters:
ORegex pattern: `{a}({b}{a})+`; Regex pattern: `a(ba)+`
No | ORegex | Regex | Ratio (Regex / ORegex) |
---|---|---|---|
1 | 00:00:00.1785877 | 00:00:00.0622146 | 0.35 |
2 | 00:00:00.2141055 | 00:00:00.0578735 | 0.27 |
3 | 00:00:00.2148457 | 00:00:00.0539055 | 0.25 |
4 | 00:00:00.1984781 | 00:00:00.0499280 | 0.25 |
5 | 00:00:00.2073454 | 00:00:00.0634693 | 0.31 |
6 | 00:00:00.1592644 | 00:00:00.0842834 | 0.53 |
7 | 00:00:00.2167805 | 00:00:00.0527719 | 0.24 |
8 | 00:00:00.2012316 | 00:00:00.0511291 | 0.25 |
9 | 00:00:00.1928555 | 00:00:00.0504931 | 0.26 |
10 | 00:00:00.1887427 | 00:00:00.0546691 | 0.29 |
11 | 00:00:00.1663168 | 00:00:00.0741461 | 0.45 |
12 | 00:00:00.1628335 | 00:00:00.0742250 | 0.46 |
13 | 00:00:00.2077626 | 00:00:00.0481913 | 0.23 |
14 | 00:00:00.1709487 | 00:00:00.0501648 | 0.29 |
15 | 00:00:00.1869102 | 00:00:00.0477373 | 0.26 |
16 | 00:00:00.2173555 | 00:00:00.0629728 | 0.29 |
17 | 00:00:00.1897196 | 00:00:00.0495571 | 0.26 |
18 | 00:00:00.1939370 | 00:00:00.0494173 | 0.25 |
19 | 00:00:00.2249846 | 00:00:00.0479044 | 0.21 |
20 | 00:00:00.2037242 | 00:00:00.0815932 | 0.4 |
In this case ORegex lags behind Regex by only about three to four times (ratios mostly between 0.2 and 0.5).
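A random-input test of this kind can be sketched as follows, again in Python with `re` as a stand-in for the .NET engines. The alphabet `"ab"`, the length and the seed are assumptions, since the article does not say how the file was generated. A hand-written scanner (the hypothetical `count_manual`) serves as a cross-check that the pattern finds what we think it finds.

```python
import random
import re


def random_text(length, alphabet="ab", seed=0):
    """Build a reproducible random string (alphabet and seed are assumptions)."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))


def count_manual(text):
    """Count non-overlapping, leftmost, greedy matches of a(ba)+ by hand."""
    count = 0
    i = 0
    n = len(text)
    while i < n:
        if text[i] == "a":
            j = i + 1
            # Consume as many "ba" pairs as possible after the leading "a".
            while j + 1 < n and text[j] == "b" and text[j + 1] == "a":
                j += 2
            if j > i + 1:  # at least one "ba" pair was consumed
                count += 1
                i = j
                continue
        i += 1
    return count


text = random_text(100_000)
regex_count = sum(1 for _ in re.finditer(r"a(ba)+", text))
manual_count = count_manual(text)
print(regex_count, manual_count)
```

Fixing the seed keeps the input identical between runs, so timings of different engines are at least measured on the same data.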
Well, so far it is clear that ORegex loses when searching for character sequences both in highly structured data and in random data, but what about failing cases? This test simply checks how backtracking-proof the two tools are. A special pattern was used, and the input, a string consisting only of 'x' characters, was lengthened by 150 characters on each iteration. By the third iteration Regex was almost 30 times slower:
ORegex pattern: `{x}+{x}+{y}+`; Regex pattern: `x+x+y+`
No | ORegex | Regex | Ratio (Regex / ORegex) |
---|---|---|---|
1 | 00:00:00.0082144 | 00:00:00.0094231 | 1.15 |
2 | 00:00:00.0005045 | 00:00:00.0064406 | 12.76 |
3 | 00:00:00.0114383 | 00:00:00.3322330 | 29.05 |
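The effect is easy to observe with any backtracking engine. The sketch below uses Python's `re` (also a backtracking matcher) rather than .NET Regex, and assumes input lengths of 150, 300 and 450 characters, since the article only gives the step size. Because the input contains no 'y', the search must fail, and the engine tries many ways of splitting the run of 'x' between the two `x+` groups before giving up.

```python
import re
import time

PATTERN = re.compile(r"x+x+y+")


def time_failed_search(length):
    """Time one search over a string of `length` 'x' characters."""
    text = "x" * length
    start = time.perf_counter()
    match = PATTERN.search(text)
    elapsed = time.perf_counter() - start
    return match, elapsed


results = []
for length in (150, 300, 450):  # assumed sizes: +150 characters per iteration
    match, elapsed = time_failed_search(length)
    results.append((length, match is not None, elapsed))
    print(length, match is not None, f"{elapsed:.6f}s")
```

The absolute numbers will differ from the table above; what matters is how quickly the time grows as the string gets longer.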
Summing up, I want to say that using this tool to search for substrings is, of course, unwise. There are faster regex engines for that. But if the task is to find a pattern in a sequence of objects quickly, simply and clearly (a genome in a sequence of genes, simple entities in a sequence of words, an object in a memory dump, and so on), then such differences in speed are, in my opinion, quite acceptable.
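To make the idea of a pattern over a sequence of objects concrete, here is a small Python sketch. It does not use ORegex or reproduce its API. It swaps in a simpler technique: each object is mapped to a one-character symbol via a predicate, and an ordinary regex is then run over the resulting symbol string. The `Token` class, the predicate table and the `find_runs` helper are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str


# Hypothetical predicate table: one symbol per predicate, first match wins.
PREDICATES = [
    ("N", lambda t: t.text.isdigit()),       # number
    ("C", lambda t: t.text[:1].isupper()),   # capitalized word
    ("w", lambda t: True),                   # anything else
]


def encode(tokens):
    """Map each object to the symbol of the first predicate it satisfies."""
    return "".join(
        next(sym for sym, pred in PREDICATES if pred(t)) for t in tokens
    )


def find_runs(tokens, pattern):
    """Return the sub-sequences of objects whose symbols match the pattern."""
    encoded = encode(tokens)
    return [tokens[m.start():m.end()] for m in re.finditer(pattern, encoded)]


tokens = [Token(w) for w in "we met John Smith in 2016 near New York".split()]
names = [[t.text for t in run] for run in find_runs(tokens, r"CC+")]
print(names)  # [['John', 'Smith'], ['New', 'York']]
```

This encoding trick only works while a single character per object is enough; a dedicated object-regex engine avoids that limitation by evaluating predicates directly during matching.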
That's all. Thank you for your attention and patience! =)
Source: https://habr.com/ru/post/305256/