
ORegex: From characters to objects

Good evening, Habr residents!
Today I want to share a young project called ORegex, or Object Regular Expressions. I have worked in computational linguistics for quite a while and, although I am not a linguist, I keep noticing well-established constructions and patterns in languages.
For those interested in how I decided to extract them, read on below the cut.

These patterns can be very simple, or they can be quite complex.


Basically, my job was to work out what to extract from a sequence of objects, and how. This can be done with grammars, with automata, or simply with a couple of nested loops. But once I got thoroughly tired of writing tons of parsers for various sequences (tokens, words, word combinations, etc.), each with its own complexity and a huge number of bugs, a reasonable question occurred to me: can this be made easier? The answer came intuitively: use regular expressions for the search.

But how? Regular expressions are, of course, good and fast at pattern search, but nearly all engines are written exclusively for character sequences, and the few that are designed for objects are disappointingly slow and mostly written for other languages. So, after some deliberation, I decided to build my own "bicycle with a proper gear system".
I decided to make the project open source for a number of reasons, but that is not the topic today.
Let's look at how it can be used in real-world conditions.

How to use?

I did not deliberate long over the syntax: I went with the standard .NET regex syntax, plus a notation for comments and readable names for atomic conditions. This makes it easy to keep patterns in separate files:

{MyPredicate1} | (? {MyPredicate2} {MyPredicate3} *)

It is worth noting that some features of the .NET Regex are not supported yet (conditional constructs, lookahead), but they will definitely appear in the near future. And now for the examples themselves. Suppose we have a sequence of numbers:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

And your task is to find all consecutive primes. To do this, you need to define a function that answers the question of whether a number is prime or not:

    private static bool IsPrime(int number)
    {
        // Numbers below 2 (including 0 and negatives) are not prime.
        if (number < 2) return false;
        if (number == 2) return true;
        int boundary = (int)Math.Floor(Math.Sqrt(number));
        for (int i = 2; i <= boundary; ++i)
        {
            if (number % i == 0) return false;
        }
        return true;
    }


And define a pattern for finding sequences of primes:

{IsPrime}(.{IsPrime})*

With that, the hard part is largely done. Now we describe the matching procedure itself:

    public void PrimeTest()
    {
        var oregex = new ORegex<int>("{0}(.{0})*", IsPrime);
        var input = new int[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13};
        foreach (var match in oregex.Matches(input))
        {
            Console.WriteLine(string.Join(",", match.Values));
        }

        //OUTPUT:
        //2
        //3,4,5,6,7
        //11,12,13
    }
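For comparison, here is roughly what the same search looks like when written by hand with loops, the kind of code the pattern is meant to replace. This is only a sketch that uses no ORegex calls at all; the class name `ManualPrimeScan` and the method `FindMatches` are made up for illustration, and the greedy, non-overlapping scan is my assumption about how the matches above are produced.

```csharp
using System;
using System.Collections.Generic;

public static class ManualPrimeScan
{
    // Same predicate as in the article: true only for prime numbers.
    public static bool IsPrime(int number)
    {
        if (number < 2) return false;
        for (int i = 2; (long)i * i <= number; ++i)
        {
            if (number % i == 0) return false;
        }
        return true;
    }

    // Hand-written equivalent of {IsPrime}(.{IsPrime})*:
    // one prime, then greedily as many (any element, prime) pairs as possible.
    // Hypothetical helper, not part of ORegex.
    public static List<int[]> FindMatches(int[] input)
    {
        var matches = new List<int[]>();
        int start = 0;
        while (start < input.Length)
        {
            if (!IsPrime(input[start]))
            {
                ++start;
                continue;
            }

            int end = start; // index of the last element in the current match
            while (end + 2 < input.Length && IsPrime(input[end + 2]))
            {
                end += 2;
            }

            var values = new int[end - start + 1];
            Array.Copy(input, start, values, 0, values.Length);
            matches.Add(values);
            start = end + 1; // resume after the match, so matches never overlap
        }
        return matches;
    }

    public static void Main()
    {
        var input = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13 };
        foreach (var match in FindMatches(input))
        {
            Console.WriteLine(string.Join(",", match));
        }
        // Prints:
        // 2
        // 3,4,5,6,7
        // 11,12,13
    }
}
```

Even for this tiny pattern the index bookkeeping takes more lines than the pattern itself, and every new pattern would need its own loop.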


That works, but it is not very convincing, right? Then let's try something a little more complicated. Imagine that we have a sequence of words that came to us from a lexico-morphological module. The question is how to quickly pick out the names of persons from that sequence. It is pretty simple.

We define the word and person classes:

    using System.Linq; // for ToArray() and First()

    public enum SemanticType
    {
        Name,
        FamilyName,
        Other,
    }

    public class Word
    {
        public readonly string Value;
        public readonly SemanticType SemType;

        public Word(string value, SemanticType semType)
        {
            Value = value;
            SemType = semType;
        }
    }

    public class Person
    {
        public readonly Word[] Words;
        public readonly string Name;

        public Person(OMatch<Word> match)
        {
            Words = match.Values.ToArray();
            Name = match.OCaptures["name"].First().Values.First().Value;
            //Now just normalize this name and you are good.
        }
    }


In addition, we need a helper function that determines whether a string is actually an initial:

    private static bool IsInitial(string str)
    {
        var inp = str.Trim(new[] { '.', ' ', '\t', '\n', '\r' });
        return inp.Length == 1 && char.IsUpper(inp[0]);
    }
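To make the behaviour of this check concrete, here is a small sketch that runs it on a few inputs. The function body is copied from the article; the class name `InitialCheckDemo` and the sample strings are hypothetical and chosen only to exercise each branch.

```csharp
using System;

public static class InitialCheckDemo
{
    // Copied from the article: strips dots and whitespace from both ends,
    // then requires exactly one upper-case character to remain.
    public static bool IsInitial(string str)
    {
        var inp = str.Trim(new[] { '.', ' ', '\t', '\n', '\r' });
        return inp.Length == 1 && char.IsUpper(inp[0]);
    }

    public static void Main()
    {
        // Hypothetical sample inputs, not taken from the article.
        foreach (var sample in new[] { "A.", " M. ", "a.", "AB", "." })
        {
            Console.WriteLine("\"{0}\" -> {1}", sample, IsInitial(sample));
        }
        // Prints:
        // "A." -> True
        // " M. " -> True
        // "a." -> False
        // "AB" -> False
        // "." -> False
    }
}
```

Note that a lone dot is rejected: after trimming nothing is left, so the length check fails.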


Without further ado, we build a predicate table and a pattern, and get our person objects:

    public void PersonSelectionTest()
    {
        //INPUT_TEXT:          ..
        var sentence = new Word[]
        {
            new Word("", SemanticType.FamilyName),
            new Word("", SemanticType.Name),
            new Word("", SemanticType.Other),
            new Word("", SemanticType.Other),
            new Word("", SemanticType.Name),
            new Word("", SemanticType.Other),
            new Word("", SemanticType.Other),
            new Word("", SemanticType.Other),
            new Word("", SemanticType.Name),
            new Word(".", SemanticType.Other),
            new Word("", SemanticType.Other),
        };

        // Build the predicate table.
        var pTable = new PredicateTable<Word>();
        pTable.AddPredicate("FamilyName", x => x.SemType == SemanticType.FamilyName); //Check if word is FamilyName.
        pTable.AddPredicate("Name", x => x.SemType == SemanticType.Name); //Check if word is simple Name.
        pTable.AddPredicate("Initial", x => IsInitial(x.Value)); //Complex check if Value is Initial character.

        // Build the pattern from the predicates in the table.
        var oregex = new ORegex<Word>(@"
            {FamilyName}(?<name>{Name})                    //Comments can be written inside pattern...
            |
            (?<name>{Name})({FamilyName}|{Initial}{1,2})?  /*...even complex ones.*/
        ", pTable);

        // Convert every match into a Person.
        var persons = oregex.Matches(sentence).Select(x => new Person(x)).ToArray();
        foreach (var person in persons)
        {
            Console.WriteLine("Person found: {0}, length: {1}", person.Name, person.Words.Length);
        }

        //OUTPUT:
        //Person found: , length: 2
        //Person found: , length: 1
        //Person found: , length: 3
    }


Well, that's all. I tried to describe everything briefly and clearly =)
If you are interested, the library is available both on NuGet and on GitHub.

Source: https://habr.com/ru/post/278219/