📜 ⬆️ ⬇️

Regular expression simplicity test

I saw a lot of problems with regular expressions, but last Friday, thanks to Chris and Sean, I found one regular schedule that allows you to check whether a given integer is simple. The original articles suggested the following regular expression to determine the simplicity of a number:


It should not be applied to the integer number itself. Instead, you need to create a string of units, where the number of ones corresponds to the number itself and already apply a regular expression to this line. If there is no match, the number is simple. This regular expression contains backlinks and therefore will not work on a DFA engine, such as PHP ereg * functions. But it works great with the preg_ * functions, or at least I thought so (more on this below).
So how does it work? It looks like a real puzzle, but in reality everything is simple. Ignore its part before the | character, since it is intended simply to find out if the line is completely empty or consists of one unit.

The subexpression (11+?) Matches the strings like 11, 111, and so on ... The "\ 1+" part will search further along the string to match the search results of the first subexpression. The first time a match occurs on the string "11" and then the search string "11" will be made again, and then again to the end of the line. If the search succeeds, the number is not simple. Why? Because it will be proved that the length of the string is divided by 2 and, accordingly, the number itself is also divided by two. If there is no coincidence, the regular expression engine will start searching for the string “111” and its repetitions (thus realizing further the sieve of Eratosthenes - approx. Lane). If the first subexpression becomes long enough (n / 2) and no match is still found, the number will be simple.

Recently, Sean has written a plugin to execute code for an IRC bot based on Phergie, which hangs on the same channel in which we communicate. The plugin itself is just a proxy to ideone.com, but it is useful for quick code tests. We experimented with this regular expression pattern and wrote a PHP function that returns a prime number following the transferred integer. The trouble started when Sean passed this function to 99999999 and it returned 10,000,0001. It seems like an error has occurred since 10,000,0001 is not a prime number (= 17 * 5882353). Wolfram Alpha confirmed this.

After several similar tests, we found more numbers that were not simple, but passed our small test. We wondered where might be the problem? The PHP code was too simple to have a bug, quite a lot of the answers received from our function were correct. It seems it's time to use the brute force method. I quickly wrote code to test the sequence of odd numbers to our pattern and began to check the responses of our function with the Eratosthenes sieve to see where the results are wrong. The first erroneously found number was 22201. The check by our regular program told us that it should be simple, but in general it is a square 149. After this limit, the number of erroneously identified simple ones increased.

Suddenly it dawned on me that the problem may be hiding in the mechanism of backlinks in regular expressions. In particular, in exactly how it is implemented in PCRE, the heart of regular expressions in PHP. As I mentioned in a post at Regex Clinic, the unlimited use of backlinks leads to a dramatic decrease in the speed of processing text with a regular expression and therefore you should think carefully before using it to write complex regular expressions. In order to eliminate the danger of abuse of this mechanism, a restrictive parameter pcre.backtrack_limit was implemented in PCRE several years ago. In our case, backlinks were used to break the text from units into a large number of parts and with very large lines, this could lead to exceeding the set limit, which by default is 100000. I thought that with a string of 22201 characters this limit was not enough. As soon as the limit was reached, the coincidence refused and the number was declared simple.

I increased the limit to 200,000 and, voila! .. 22201 was no longer defined as simple. In order to fix the work of the regular expression with the coincidence of 00100000001, the limit had to be seriously raised right up to 250000000! The regular expression run took about 14 seconds on my new i5 MacBook Pro.

Needless to say, the approach I have described for checking the number for simplicity should not be used in real life. Just appreciate its elegance. I appreciated. And I am glad that my small research has shown that abstract, clean, good ideas may simply not work in our crazy world of software and hardware.

Source: https://habr.com/ru/post/104768/

All Articles