📜 ⬆️ ⬇️

Building regexp on input lines S1..SN

That quite recently I ran into a problem, according to which I couldn’t dig, not that of any libraries, but even theories or algorithms. Because time was running out, he decided to deal with the task himself. He wrote an article for those who will face a similar task in the future, and interesting criticism. How would you solve this problem?

So, the task ...


At the input of the algorithm there is a set of strings S1..SN. It is required, according to the given lines, to construct such a minimal regular expression R, so that R (Si) = true, i [1, N] (N is of the order of several thousand).
The requirement of minimality is not strict, and it is not required to prove the minimality of the constructed regexp. If the lines S1..SN have some similar structure, then regexp should detect this structure. The standard assignment to the programmer is moderately specified, but also with some discretion.

Algorithm of the decision ...


  1. Find the smallest regular expression for the strings S1 and S2:
    • We add characters of the beginning and end of each line to each line, it turns out S1 = ^ S1 $
    • We are looking for the maximum common substring for S1 and S2. Let this U be the maximum common substring for S1, S2. (There are several well-known algorithms for finding the maximum common substring , and if you have the means, you can buy IBM's Teiresias algorithm )
    • So it turns out that the substring U divides each of the lines S1 and S2 into 2 parts - left and right:
      S1: S1L || U || S1R
      S2: S2L || U || S2R

      where SiL and SiR are the left and right substrings of the strings S1 and S2.
    • As can be seen, recursion with a binary tree suggests itself. We put the string U in the root of the tree. If both strings S1L and S2L are non-empty, then create a left descendant and run a comparison for them. Similarly for right substrings.
      U
      / \
      UL UR

    • If at some stage the common substrings are not found (or it is short, 1-2 characters), then put the ". *" In the current node and exit the last recursive call. This is a very rude expression, but it is suitable for my task. At this step there is a lot of room for the optimization of the algorithm for your needs.
    • After the tree is built, by symmetric traversal we get the desired regular expression SX for S1 and S2 (You may have to add the beginning and end of the expression again, because in some cases they turn into ". *")

  2. Comparing the resulting SX string with S3 using the same algorithm, we get SX - a regular expression for the first 3 lines.
  3. By induction over all strings, we get R = SX the desired regular expression.


UPD Example ...
Let's take as an example links to some articles of habr:
')
S1=http://habrahabr.ru/blogs/edu_2_0/40236/ true
S2=http://habrahabr.ru/blogs/microsites/40089/ true
S3=http://habrahabr.ru/blogs/google_chrome/38748/ true
S4=http://habrahabr.ru/blogs/show/37839/ true
S5=http://nikolaikopernik.habrahabr.ru/blog/54889/ true
S6=http://habrahabr.ru/blogs/telecom/39902/ true
-----------------
REGEXP = ^http://.*habrahabr.ru/blog.*/$
SIZE : 6
TIME : 0.0070 s


The example is made in Java. I tested it on 6000 lines - I built regexp for 2 seconds.

UPD Algorithm is implemented and works. It builds sufficiently specified regular expressions on decent amounts of data for a reasonable time. Criticize and use.

UPD Let me explain why this may be necessary and why a simple union ^ (S1 | .. | SN) $ does not roll. Imagine that the system works with a stream of lines F1, F2, F3, ... All the lines of similar structure (for example, urls). Some lines are “good”, others are “not good” :) And checking for “goodness” is a very expensive operation. So, from the general stream of lines after long checks, it was found out that the lines S1..SN are “good” lines. In the system, the assumption is valid that if the line Fi has a similar structure with the lines S1..SN, then with a high degree of probability it will be “good”. This is what regexp is built on based on N lines - so that only by the appearance of the line to determine its “goodness” :) I hope I didn’t be too smart with the description :)

Source: https://habr.com/ru/post/54889/


All Articles