Java Regular Expressions Prefix Optimization

I want to talk about a simple way to optimize regular expressions, or rather dictionaries. I saw some projects that optimize finite automata , packages that make quick dictionary layout in the text , but just to take the dictionary and put together a regular expression that could be passed to any regular expression engine - I haven’t seen one like this yet.

So the problem is, there is a huge dictionary of cities and you need to find all these phrases in the text. The naive approach consists in gluing or or these words together, thus we get the following expression (city1 | city2 | city3 | ... cityN). When processing such an expression, the usual NDA engine (to which the majority belongs, including the JDK) will do at least N checks for each character in the text, in the worst case (when the current word differs on the last letter from all words in the dictionary) checks will be equal to the number of letters in the dictionary.
This is bad, but you can do better.
A typical language property is redundancy. In this case, it is the repeatability of a sequence of letters. Here I will talk about the simplest optimization method - prefix optimization.
If words begin with the same prefix, then the calculation will be performed once for any prefix. So we build Trie in our dictionary, and then convert it to a regular expression string.

Tree class

class Node { char ch = START; List<Node> nodes = Lists.newArrayList(); void add(String str) { if (str.length() == 0) return; char chNew = str.charAt(0); for (Node n : nodes) { if (n.ch == chNew) { n.add(str.substring(1)); return; } } Node newNode = new Node(); newNode.ch = chNew; newNode.add(str.substring(1)); nodes.add(newNode); } String toRegexp() {...} }

As we see its main method, add checks whether there is a first character among its children, if not, it creates and gives to the subtree that starts with that character.
Thus, in this structure, any prefix is stored only once (the path through the tree) and is reused when it occurs in our lines.

The second method converts the tree into a regular expression.

  String toRegexp() { StringBuilder str = new StringBuilder(); if (ch == START) { } else if (ch == END) { } else { //convert special characters like {}[]. String newStr = escapeRegexp(String.valueOf(ch)); str.append(newStr); } if (nodes.size() > 1) { str.append("(?:"); for (Node n : nodes) { str.append(""); str.append(n.toRegexp()); str.append("|"); } str.setLength(str.length() - 1); str.append(')'); } else if (nodes.size() == 1) { str.append(nodes.get(0).toRegexp()); } return str.toString(); } }

')
Here is the working code

 public static String convertListToRegexp(final boolean useNonCapturingGroups, String... strs) { Arrays.sort(strs, new Comparator<String>() { public int compare(String o1, String o2) { int res = o2.length() - o1.length(); if (res != 0) { return res; } return o1.compareTo(o2); } }); Node root = new Node(); for (String str : strs) { root.add(str + "$"); } return root.toRegexp(); }

and example

 //create array of your entries String[] examples = new String[]{"javvva", "javggaaa", "javajava", "adsasd", "adasddsa"}; //convert them to optimal regexp String optimizedRegexp = RegExpUtils.convertListToRegexp(true, examples); Assert.assertEquals("(?:ad(?:asddsa|sasd)|jav(?:ajava|ggaaa|vva))", optimizedRegexp); //check that it is works for(String s : examples) Assert.assertTrue(s.matches(optimizedRegexp));

Source: https://habr.com/ru/post/117177/

All Articles

Java Regular Expressions Prefix Optimization

More articles: