Strongly typed combinators for constructing a parser and a natural language synthesizer

The famous ParserCombinator 's and Parboiled are intended solely for parsing formal languages. We solve the problem of parsing a natural language and at the same time we want that with the help of the same grammar it was possible to carry out the synthesis of phrases in natural language, reflecting the semantics we need. It would be convenient to be able to describe language constructs along with abstraction / concretization rules.

For example,

Conversion of numbers to numbers (“ten” -> 10: Int)
and back (10: Int -> "ten" ("tenth", "ten" ...))
Conversion of numerals together with the unit of measurement ("ten rubles" <-> NumberWithMeasurement (10, RUB))
Partial address ("Yablochnaya St. <-> Address (street =" Apple "))
Address within the city (“Apple House street one hundred twenty three apartment forty five” <-> Address (street = “Apple”, building = 123, flat = 45))
Phone (256-00-21 (“two hundred fifty six zero zero twenty one”) <-> NumericalSequence (256,0,0,21))

And I would like to have the following system properties:
')

the uniqueness of the description of the rules of abstraction / specification
strictly typed representation of semantics at all levels of abstraction
the availability of alternative forms of representation of semantics and the ability to influence the choice of the form of representation of semantics
coordination of word forms for obtaining a phrase in pure Russian
the possibility of forming secondary structures based on the original rules. In particular, we would like to form parsing grammar corresponding to the rules.

Under the cut - a description of the approach implemented in the library synapse-typed-expressions. Only numerals are considered, but the approach naturally extends to the other aforementioned formal language constructs.

Levels of abstraction

The same semantics (meaningful information) is presented in different forms at different levels of abstraction. Consider for example, how the number 1234 may look at different levels of abstraction:

1234: Int - at the level of application business logic
Number (1234L)
TwoClassNumber (Number (1L), Order (1000), Number (234L)) - splitting a number into two classes - thousands and ones
Seq (Number (1L), Order (1000), Number (234L)) - a chain of components
Seq (Number (1L), Order (1000), TwoRangesNumber (Number (200L), 100, Number (34L)))
Seq (Number (1L), Order (1000), TwoRangesNumber (Number (200L), 100, TwoRangesNumber (Number (30L), 10, Number (4L)))) - reduced to the numeral structure
Seq (Word (1L), Word (1000L), Word (200L), Word (30L), Word (4L)) - a string of word identifiers
Seq (one, one thousand, two hundred, thirty, four) - a string of words after the application of the rules of harmonization of word forms
“One thousand two hundred thirty-four” is the text.

In this example, one can observe a descending process of concretization from the abstract strongly typed value 1234 to the text in Russian. Each step of instantiation can be described using substitution rules.

To implement a parser, it is necessary, with the same rules of substitution, to make matching attempts using, for example, a cut-off and backtracking mechanism, or a CKY parser.

It should be noted that the actual content or semantics at all levels remains unchanged. Only the form of representation of semantics changes. In this case, we clearly define the direction on the axis of abstraction / specification.

The next observation that can be made is that in the upward movement (increasing the level of abstraction) we consistently discard some details (for example, the shape of a word, splitting a number into its component parts). And with the downward movement (concretization, synthesis) we must complement the existing semantics with some details (the type of numeral is ordinal or quantitative, the form of a word, etc.). The specification of missing information is provided

partly by rules based on semantics proper,
partially - determined by the context of the entire transformation (for example, ordinal or cardinal),
partially - non-local rules that take into account neighboring elements (choice of the form of the word).

Grammar rules for numerals

We describe the rules of specificity for different ranges of numbers:

 "Number from 1 to 9": = ID1 or ID2 or ... ID9
 "Number from 10 to 19": = ID10 or ID11 or ... ID19
 "Number from 1 to 19": = ID1 or ID2 or ... ID19
 "Tens from 20 to 90": = ID20 or ID30 or ... ID90
 "Hundreds from 100 to 900": = ID100 or ID200 or ... ID900
 "Number represented by one word": = "Hundreds from 100 to 900" or "Tens from 20 to 90" or "Number from 1 to 19"

The above rules allow you to specify a number from the set (1, 2, ... 20, 30, ... 100, 200, ... 900) in the form of a single identifier of the Russian word. Identifiers are also denoted by numbers, and the form of the word is not yet specified. Our rules do not work for all numbers yet. If you try to work with the wrong number, we will have a failure. Also, our rules do not allow to recognize more than one word.

Add rules that allow you to recognize two or three words, organized according to the rules of the numerals of the Russian language:

 "Number represented by two words": = ("Hundreds from 100 to 900" then "Tens from 20 to 90") 
                                       or ("Hundreds from 100 to 900" then "Number from 1 to 19") 
                                       or ("Tens from 20 to 90" then "Number from 1 to 9")
 "Number represented by three words": = "Hundreds from 100 to 900" then "Dozens from 20 to 90" then "Number from 1 to 9"

To simplify further reasoning, we introduce an abbreviated notation for the rules that are matched with ranges of numbers:

`[100..900 / 100]` will denote numbers from the range from 100 to 900 in increments of 100.
Instead of “or” we will use the icon |, and instead of “then” we will use the icon ~.
Before the name of the rule we will write the keyword val, and the definition of the rule itself will be written after the equal sign.
Instead of ID20, we will use the functional record id (20)

This way of writing allows us to write the rules more compactly.

	 val `[0]` = id (0L)
	 val `[1..9]` = (1L to 9) .map (id)
	 val `[0..9]` = `[1..9]` |  `[0]`
	 val `[1..19]` = (1L to 19) .map (id)
	 val `[10..19]` = (10L to 19) .map (id)
	 val `[20..90 / 10]` = (20L to 90 by 10) .map (id)
	 val `[100..900 / 100]` = (100L to 900 by 100) .map (id)
	 val singleWordNumber = `[100..900 / 100]` |  `[20..90 / 10]` |  `[1..19]`

How to present numerals composed of two words? From a syntactic point of view, such numerals are represented by two expressions connected by a sequence operator ~. From the point of view of increasing the level of abstraction, we have two ways of communication - addition (“twenty” and “one” = 20 + 1 = 21) and multiplication (“five” “thousands” = 5 * 1000 = 5000). And with a decrease in the level of abstraction division with the remainder (21% 10 = 1, 21-1 = 20) and just division (5000/1000 = 5). We will present a method of conversion between levels of abstraction by an additional operator ^^, which reports that there is a specified transformation between levels. For example, we can present numerals represented by two words in the range from 20 to 99 in this way:

	 val `[20..99] without 20..90 / 10` =` [20..90 / 10] `~` [1..9] `^^ ModSplit (10L)

Single-word numerals are not included in this expression. They differ in that the second component of the pair (unit) is omitted, but from a mathematical point of view it is represented by zero. Grammatically, this situation can be represented using the Epsilon symbol, which is matched with an empty stream, and at the top level of abstraction is represented as a constant. We denote this phenomenon with a special icon |? :

	 val `[20..99]` = `[20..90 / 10]` ~ (`[1..9]` |? 0L) ^^ ModSplit (10L)

This expression, when parsed, makes it possible to compare with any numerals in the range from 20 to 99, and during synthesis it makes it possible to get a couple of numbers — tens and units, or only tens.

To represent a range from 1 to 99, we can combine two expressions using the operator | . To ensure an unambiguous choice during instantiation, add a LessThanSelector(20L) selector LessThanSelector(20L) that selects the right expression, in case the number is less than the threshold.

	 val `[1..99]` = `[20..99]` |  `[1..19]` selectBy LessThanSelector (20L)

Similarly, numbers in the range from 1 to 999 are represented.

	 val `[100..999]` = `[100..900 / 100]` ~ (`[1..99]` |? 0L) ^^ ModSplit (100L)
	 val `[1..999]` = `[100..999]` |  `[1..99]` selectBy LessThanSelector (100L) labelled "1..999"

To represent numbers over a thousand, it is necessary to use an OrderSplit(1000) transformation, which during the upward movement will multiply the number of thousands by a thousand, and during the downward movement will divide, thereby dividing the number into the highest class and the remainder:

	 val `[1 000..999 000/1000]` = `[1..999]` ~ `[1 000]` ^^ OrderSplit (1000L)
	 val `[1 000..999 999]` = `[1 000..999 000/1000]` ~ (`[1..999]` |? 0L) ^^ ModSplit (1000L)
	 val `[1..999 999]` = `[1 000..999 999]` |  `[1..999]` selectBy LessThanSelector (1000L)

The expression for numbers in an arbitrary range can be represented using the recursive function

   def range1To999Order (order: Long): NE = order match {
     case 1L => `[1..999]`
     case o if o> = 1000 =>
       val lower = range1To999Order (order / 1000)
       val ordNE = order: NE
       val upper = `[1..999]` ~ * ordNE
       ((upper ~ + lower) | upper selectBy OrderSplit (order)) |  lower selectBy RangeSelector (order)
   }

as an order here indicates a million, billion, etc.

Description of the rules of communication between levels of abstractions

Different forms of semantics can be represented in the programming language by different types of data. Therefore, the transition between levels of abstraction must be provided with two types - the type of a more specific form (L - Lower) and the type of a more abstract form (U - Upper):

	 sealed trait Expression [L, U]

The Expression type describes a grammar corresponding to a part of the input stream L on the one hand, and the object U on the other hand. You can consider this object as a syntactic component of the rules. Further transformation of the object to a higher level of abstraction is carried out by some function that can be represented by one of Transformer 's descendants:

	 sealed trait Transformer [M, U]

In general, the grammatical expression along with the functional transformation is also an expression:

	 case class Transformed [L, M, U] (e: Expression [L, M], t: Transformer [M, U]) extends Expression [L, U]

The transformation function already introduces some semantic interpretation to the matched expression. Therefore, the Transformer is a semantic component of the rules.

Conversion of forms using one of the rules may fail. Therefore, the conversion function should be able to report that the rule does not fit. For these purposes, one of the variants of the Result / Option / Try / Either pattern is suitable:

	 sealed trait ParseResult [T]
	 case class Success [T] (value: T, tail: LemmaStream) extends ParseResult [T]
	 case class Failure [T] (reason: String) extends ParseResult [T]

The rules themselves are represented using a set of Case classes, which are constructed as described above DSL - Pair, BooleanAlternative, MapAlternative, ConstExpr

Dictionary view

In relation to the dictionary, the parsing and synthesis tasks have slightly different requirements. For the purposes of parsing (enhancing abstraction), it is required to associate with each word only the information that is required at higher levels of abstraction. In particular, in most cases it is enough for us to know the numerical value associated with the word to form the final number at the top level.
However, for the purpose of synthesizing text, it is necessary to know the morphological form of the word so that it is possible to choose a suitable word in form. The morphological form can be represented as a series of meanings of grammatical categories, such as gender, number, case, etc.

Generally speaking, for the task of harmonizing the forms of numerals it is not necessary to describe the morphological form of a word using standard grammatical categories. Since only a limited set of combinations of grammatical categories is actually used, it is possible to completely manage with a surrogate grammatical category - “group of agreement”. There are only 3 such groups: one, from 2 to 4, and more than 4. All numerals within the “coordination group” behave the same way from the point of view of the choice of the form of a word.
However, since we consider the task somewhat broader than the transformation of numerals, and include formal language constructions such as address and telephone, it will be more convenient, nevertheless, to adhere to the standard grammatical classification.

We present both a grammatical category and its meaning (“grammemes”) as objects, so that it is convenient to use them as a key value.

	 case object Gender extends GrammarCategory {
	   val default = Masculine
	   case object Masculine extends Grammem
	   case object Femini extends Grammem
	   case object Neuter extends Grammem
	 }

The Grammem type Grammem declared inside the GrammarCategory , which ensures that the gramme type is compatible with the grammatical category at the compilation stage. In particular, if we need a grammar grammar (for example, the average Neuter ), then we can declare the type as Genger.Grammem . And at the same time, we can use the Gender object as a key in the list of values of grammatical categories.

Actually the dictionary is a collection of pairs - (string representation of a word, grammatical characteristic). The dictionary is formed using a small DSL, which allows compactly recording all the details:

   lazy val allWordForms = {
     associate ("three four five six seven eight nine",
       3L to 9, new WordFormDescription (Masculine, Cardinal, Nominative, Ones)) :::
       associate ("zero",
         List (0L), new WordFormDescription (Cardinal, Nominative, Ones)) :::
       associate ("one two",
         List (1, 2), new WordFormDescription (Femini, Cardinal, Nominative, Ones)) :::
       associate ("one two",
         List (1, 2), new WordFormDescription (Masculine, Cardinal, Nominative, Ones)) :::
       associate ("one",
         List (1), simpleNumerical WordForm (Ones, Neuter)) :::
       associate ("ten eleven twelve thirteen fourteen" +
         "fifteen sixteen seventeen eighteen nineteen",
         10L to 19, simpleNumerical WordForm (Teens)) :::
    ...
       associate ("zero first second third fourth fourth sixth seventh eighth eight",
         0L to 9, new WordFormDescription (Masculine, Ordinal, Genetive, Ones)) :::
    ...
       associateOrder ("million million million", 1000000L) :::
    ...
       associate ("minus", List (-1L), WordFormDescription.empty)

Parsing text using declarative rules

Based on the described grammar, you can either directly parse using any interpreter, or construct a parser by converting the original grammar. To parse natural text in which errors are permissible, a probabilistic parser is needed, for example, CKY.

Backtracking
The text section in the parsing process is matched with the template set by the above rules. If the rule returns Success, we consider that the matching was successful, and we can continue the analysis further. If the rule does not fit, then we perform backtracking to search for alternative matching methods. When using immutable data structures, the clipping step of the unsuccessful variant and returning to the alternative path selection point is implemented very simply - just ignore the result and continue the parsing. Data is not required to adjust to the original state.
CKY - parser
To use a probabilistic CKY-parser, you must first convert the rules into grammar in Chomsky's normal form. Since the grammar in this case will be a binary tree, it is possible to systematically sort through all possible ways of comparing this grammar with the original string of words. At the same time, a different probability will be assigned to different variants of analysis (which, in particular, immediately allows for the analysis of texts with replacement errors). In the course of such a transformation, it turns out to be useful to combine the terminals that are equivalent in terms of the parsing algorithm to a variety of non-terminals - preterminals.

Constructing a backtracking parser

For the purposes of this article, a backtracking algorithm is sufficient (“cut-off and return algorithm”). A parser is a function that accepts a stream of lemmas ( LemmaStream ) as input, and, if successful, returns some value and a tail of the stream. The tail can be used for further analysis. In case of failure, the “cut-off and return” algorithm will discard this parser and try an alternative variant. The “return”, therefore, is accomplished through the use of immutable structures and the presence of a tail of a stream of lemmas at each level.
The result of the parser can be represented by the ParseResult[T] with two descendants - Success and Failure .

Converting expressions into a parser is performed using the backTrackingParser function:

   def backTrackingParser [U] (e: Expression [LemmaStream, U]): Parser [U] = {
     implicit def uncheckedGenerics [T [_], O] (t: T [_]): T [O] = t.asInstanceOf [T [O]]
     implicit def uncheckedGenerics2 [T [_, _], O, P] (t: T [_, _]): T [O, P] = t.asInstanceOf [T [O, P]]

     def backTrackingParser0 (e: Expression [_, _]): Parser [_] = e match {
       case Epsilon (u) => s => Success (u, s)
       case ConstExpression (l: LemmaStream, u) => (s: LemmaStream) => startsWithAndTail (l, s) .map (t => u)
       case Labelled (_, expr) => backTrackingParser0 (expr)
       case Alternatives (expressions) =>
         val parsers = expressions.map (backTrackingParser0)
         (s: LemmaStream) =>
           parsers.
             map (parser => parser (s)).
             dropWhile (_. isFailure).
             headOption.getOrElse (Failure ())
       case Pair (e1, e2) =>
         val parsers = List (e1, e2) .map (backTrackingParser0)
         (s: LemmaStream) =>
           val res = parsers.foldLeft (Success [List [U]] (Nil, s): ParseResult [List [U]]) (_. next (_))
           res.map {lst => val list = lst.reverse;  (list.head, list.tail.head) .asInstanceOf [U]}
       case Transformed (innerExpression: Expression [_, _], t) =>
         val innerParser = backTrackingParser0 (innerExpression)
         val converter = defaultSequencerHandler (t)
         (s: LemmaStream) =>
           innerParser (s) .map (converter)
       case _ => throw new IllegalArgumentException (s "backTrackingParser0 is not implemented for expression $ e")
     }
     uncheckedGenerics (backTrackingParser0 (e))
   }

The returned parser accepts a stream of lemmas as input. To parse ordinary strings, you need to convert it to a string parser using this method:

   implicit def toSimpleParser [T] (p: Parser [T]): SimpleParser [T] = (text: String) => {
     val res = p (text.split ("") .map (wordToLemma))
     if (res.tail.nonEmpty)
       throw new IllegalArgumentException (s "Cannot parse \" $ text \ "")
     res.value
   }

Such a parser can already be directly used for parsing numbers.

	 assert (parse ("twenty seven million three thousand two hundred forty five") === 27003245L)

Text Generator Design

To generate text using the same rules, you must convert the rules into a text generator. Again, we do not generate the final text, but the flow of lemmas, and then select the appropriate words using this flow, taking into account the rules for choosing the morphological form of the word (coordination, management).

A generator is a function that converts the value of any type U into a chain of lemmas. Converting expression expressions into a generator is performed using pattern matching:

   def constructGenerator [U] (tGen: TransformerGenerator [U], selGen: BooleanSelectorGenerator [U]) (e: Expression [LemmaStream, U]): Generator [U] = {
     implicit def uncheckedGenerics [T [_], O] (t: T [_]): T [O] = t.asInstanceOf [T [O]]
     implicit def uncheckedGenerics2 [T [_, _], O, P] (t: T [_, _]): T [O, P] = t.asInstanceOf [T [O, P]]
     def constructGenerator0 (e: Expression [_, _]): Generator [Any] = e match {
       case ConstExpression (l: LemmaStream, u) => (t) => l
       case Labelled (_, e1) => constructGenerator0 (e1)
       case Epsilon (_) => (t) => Iterable ()
       case Pair (e1, e2) =>
         val g1 = constructGenerator0 (e1)
         val g2 = constructGenerator0 (e2)
         (u: (_, _)) => g1 (u._1) ++ g2 (u._2)
       case BooleanAlternative (sel: SemanticSelector [U], e1, e2) =>
         val selector = selGen (sel) .asInstanceOf [Any => Boolean]
         val g1 = constructGenerator0 (e1)
         val g2 = constructGenerator0 (e2)
         (u) =>
           if (selector (u))
             g2 (u)
           else
             g1 (u)
       case MapAlternative (lst) =>
         val map = lst.map (c => (c.upper, c.lower)). toMap: Map [Any, LemmaStream]
         (u) =>
           map.getOrElse (u, throw new IllegalArgumentException (s "Cannot generate text for $ u by $ e"))
       case Transformed (innerExpression, t) =>
         val innerGen = constructGenerator0 (innerExpression)
         val transformer = tGen (t)
         (u: U) => innerGen (transformer (u))
       case _ => throw IllegalArgumentException (s "constructGenerator is not implemented for expression $ e")
     }
     constructGenerator0 (e)
   }

Choosing the form of the word

To select the form of the word constituting the numeral, it suffices to use the context consisting of one word to the left and one word to the right of the current one (if available). Signature word selection rules:

	 def betterForm (lemma: LemmaInfo, left: Option [LemmaInfo], right: Option [LemmaInfo]): WordFormAssociation

The rules themselves are set using pattern-matching above the triplet - the current lemma, the context on the left, the context on the right.

Conclusion

The declarative presentation of the grammar allows the description of formal language constructs, such as

numerals,
addresses,
phones,
various alphanumeric codes,
time, dates, time intervals / dates,
etc.

According to the declarative description, the parser and the generator are built, which carry out the direct conversion of text into semantics and semantics into text in natural language.

PS Code to article

Source: https://habr.com/ru/post/241567/

All Articles