ORegex: Is it fast enough for objects?

Good evening, Habr readers! Today I want to share a small performance assessment of ORegex for .NET.
If you read my previous article, you may agree that presenting something without a comparative speed assessment was not very convincing. If so, read on.

I thought for a long time about what to include in the benchmark without turning it into an ordeal. I did not want to spend much time or effort on it, and since I am not very imaginative, I settled on the simplest comparison: searching for a substring in text. In my opinion this is the best option for a performance test, because a character is also an object. All the code is in the test project on GitHub, so you can run it yourself if you have nothing better to do =)

We have ORegex and Microsoft's Regex engine. All that is needed is to describe equivalent patterns for both, supply a lambda for each character for ORegex, and come up with test cases. That did not take long: I decided to see how the tools cope with three tasks:

Reality (parsing HTML tags: a good practical test on highly structured data)
Random (how the tools handle finding occurrences in a completely random data set)
Mistake (nobody has abolished backtracking, so we need to look at inputs that contain highly deceptive data)
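
To make the "one lambda per pattern token" idea concrete, here is a minimal Python sketch. It is not ORegex's API or its algorithm: the `predicates` table, `classify`, and `find_spans` are hypothetical names, and the trick of mapping each object to a one-character symbol and then running an ordinary regex over the symbols is only a simple stand-in for matching over objects.

```python
import re

# Hypothetical predicate table: one single-character name per pattern token,
# one lambda per name. This only illustrates the idea; it is not ORegex.
predicates = {
    "a": lambda ch: ch == "a",
    "b": lambda ch: ch == "b",
}

def classify(item):
    """Return the name of the first predicate the item satisfies, or '.' if none."""
    for name, test in predicates.items():
        if test(item):
            return name
    return "."

def find_spans(items, symbol_pattern):
    """Run a plain regex over the per-item symbols; return (start, end) index spans."""
    symbols = "".join(classify(item) for item in items)
    return [m.span() for m in re.finditer(symbol_pattern, symbols)]

# The items here are characters, but any objects would do as long as the
# lambdas in the table know how to test them.
spans = find_spans(list("xxababaxxbab"), r"a(ba)+")
```

The sketch assumes each object satisfies at most one predicate; an object that satisfies several would need a richer encoding than one symbol per item.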

Reality

A page from a news site was chosen, and the task was to find all p tags over 20 iterations. As expected, Regex comes out ahead. In the first iteration both tools showed roughly the same cold-start results. After the first iteration, Regex was about 10 times faster:

ORegex pattern: {b1o}{p}{b1c}.*?{b1o}{slash}{p}{b1c}; Regex pattern: <p>.*?</p>

No.	ORegex	Regex	Ratio (Regex / ORegex)
1	00:00:00.0040204	00:00:00.0058571	1.46
2	00:00:00.0030944	00:00:00.0003172	0.1
3	00:00:00.0032093	00:00:00.0003195	0.1
4	00:00:00.0031040	00:00:00.0003172	0.1
5	00:00:00.0032354	00:00:00.0003149	0.1
6	00:00:00.0031703	00:00:00.0003153	0.1
7	00:00:00.0031220	00:00:00.0003187	0.1
8	00:00:00.0030883	00:00:00.0003187	0.1
9	00:00:00.0036790	00:00:00.0003674	0.1
10	00:00:00.0030902	00:00:00.0003145	0.1
11	00:00:00.0030787	00:00:00.0003130	0.1
12	00:00:00.0030752	00:00:00.0003149	0.1
13	00:00:00.0030975	00:00:00.0003183	0.1
14	00:00:00.0032250	00:00:00.0003777	0.12
15	00:00:00.0031166	00:00:00.0003179	0.1
16	00:00:00.0030852	00:00:00.0003141	0.1
17	00:00:00.0031178	00:00:00.0003160	0.1
18	00:00:00.0030913	00:00:00.0003160	0.1
19	00:00:00.0030787	00:00:00.0003133	0.1
20	00:00:00.0030818	00:00:00.0003118	0.1
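
A per-iteration measurement like the one in this table can be sketched as follows. This is an illustration in Python with the standard `re` module, not the .NET code from the test project: `time_iterations` and `sample_html` are hypothetical names, and the input is a placeholder rather than the news page used here.

```python
import re
import time

def time_iterations(pattern, text, iterations=20):
    """Run the same search repeatedly; return (match_count, seconds) per iteration."""
    compiled = re.compile(pattern)
    results = []
    for _ in range(iterations):
        start = time.perf_counter()
        count = sum(1 for match in compiled.finditer(text))
        results.append((count, time.perf_counter() - start))
    return results

# Placeholder input, not the page that was actually measured.
sample_html = "<html><body><p>first</p><div>skip</div><p>second</p></body></html>"
timings = time_iterations(r"<p>.*?</p>", sample_html)

for number, (count, seconds) in enumerate(timings, start=1):
    print(f"{number}\t{count} matches\t{seconds:.7f} s")
```

Timing each iteration separately, rather than only the total, is what makes a cold-start effect in the first row visible.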

Random

Next, it is worth seeing what the difference is when the data follows no logic at all. For this test, a file was created with a completely random sequence drawn from a given set of characters:

ORegex pattern: {a}({b}{a})+; Regex pattern: a(ba)+

No.	ORegex	Regex	Ratio (Regex / ORegex)
1	00:00:00.1785877	00:00:00.0622146	0.35
2	00:00:00.2141055	00:00:00.0578735	0.27
3	00:00:00.2148457	00:00:00.0539055	0.25
4	00:00:00.1984781	00:00:00.0499280	0.25
5	00:00:00.2073454	00:00:00.0634693	0.31
6	00:00:00.1592644	00:00:00.0842834	0.53
7	00:00:00.2167805	00:00:00.0527719	0.24
8	00:00:00.2012316	00:00:00.0511291	0.25
9	00:00:00.1928555	00:00:00.0504931	0.26
10	00:00:00.1887427	00:00:00.0546691	0.29
11	00:00:00.1663168	00:00:00.0741461	0.45
12	00:00:00.1628335	00:00:00.0742250	0.46
13	00:00:00.2077626	00:00:00.0481913	0.23
14	00:00:00.1709487	00:00:00.0501648	0.29
15	00:00:00.1869102	00:00:00.0477373	0.26
16	00:00:00.2173555	00:00:00.0629728	0.29
17	00:00:00.1897196	00:00:00.0495571	0.26
18	00:00:00.1939370	00:00:00.0494173	0.25
19	00:00:00.2249846	00:00:00.0479044	0.21
20	00:00:00.2037242	00:00:00.0815932	0.4

In this case ORegex lags behind Regex by only about three times.
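
For readers who want to reproduce a similar input, here is a small Python sketch of generating random data and searching it with the same pattern. The alphabet "ab", the length, and the seed are assumptions made for illustration; the actual character set and file size of the test are not stated here, and `random_text` is a hypothetical helper.

```python
import random
import re

def random_text(length, alphabet="ab", seed=0):
    """Build a reproducible random string from the given characters."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(length))

# Assumed size and alphabet; adjust to taste.
data = random_text(10_000)
matches = [m.group() for m in re.finditer(r"a(ba)+", data)]
print(f"{len(matches)} matches in {len(data)} random characters")
```

Fixing the seed keeps the input identical between runs, so timing differences come from the engines and not from the data.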

Mistake

So far it is clear that ORegex loses when finding character sequences in highly structured data and in random sets, but what about the deceptive cases? The test is a simple backtracking check for both tools. A special pattern was chosen, and a string consisting only of 'x' characters was lengthened by 150 characters each iteration. By the third iteration, Regex was almost 30 times slower:

ORegex pattern: {x}+{x}+{y}+; Regex pattern: x+x+y+

No.	ORegex	Regex	Ratio (Regex / ORegex)
1	00:00:00.0082144	00:00:00.0094231	1.15
2	00:00:00.0005045	00:00:00.0064406	12.76
3	00:00:00.0114383	00:00:00.3322330	29.05
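
Why a string of only 'x' characters is so deceptive for a backtracking engine can be shown with a rough model. The Python sketch below counts how many times a naive backtracker would reach the y+ part of x+x+y+ and fail. It is a simplified model under my own assumptions, not a trace of the .NET engine, which may apply optimizations that change the real numbers; `backtracking_attempts` is a hypothetical helper.

```python
def backtracking_attempts(n):
    """Count failed y+ attempts of a naive backtracker for x+x+y+ on 'x' * n."""
    attempts = 0
    for start in range(n):                      # each start position is tried in turn
        remaining = n - start
        for first in range(remaining, 0, -1):   # first x+ gives back one char at a time
            for second in range(remaining - first, 0, -1):  # second x+ does the same
                attempts += 1                   # y+ is tried here and fails: no 'y'
    return attempts

for n in (50, 100, 150):
    print(n, backtracking_attempts(n))
```

In this model the count equals (n - 1) * n * (n + 1) / 6, so doubling the string length multiplies the work by roughly eight.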

Results

Summing up, using this tool to search for substrings is, of course, unwise. There are faster regex engines for that. But if the task is to find a pattern quickly, simply, and clearly in a sequence of objects (a genome in a sequence of genes, simple entities in words, an object in a memory dump, and so on), then such differences in speed are, in my opinion, quite acceptable.
That's all. Thank you for your attention and patience! =)

Source: https://habr.com/ru/post/305256/
