
Prolog: parsing and language problems

Grammar analysis is a topic that every programmer should know and be able to navigate, precisely because we apply it every day. True, we do not write new languages or edit grammars every day, but we do use regular expressions, think about complexity and computability, and write plenty of code that is directly related to grammars.

The purpose of this article is to show the connections between different fields of knowledge, such as programming and mathematics, philosophy and logic, and to demonstrate in action one of the most successful areas of application of the Prolog language: grammatical analysis.


Sometimes it seems that different sciences use different terms to define the same object. Then an inescapable feeling arises that everything should somehow be unified and classified, which is quite difficult to do in practice. After all, different scientific disciplines have different objects of study, different goals, different methodologies and, most unpleasantly, their own specific terminology.

This article considers one such interdisciplinary object: language. In principle, it is clear why so many sciences deal with language; without language there would be no knowledge and no rational human being.

A historical tour


Philosophy

Oddly enough, the first thing worth mentioning is philosophy, namely one of its twentieth-century branches, analytic philosophy. I think many programmers and mathematicians do not take it seriously enough, although it would be worth it. Many include learning new programming languages in their self-education; in my opinion, including philosophy is to some extent just as necessary. It is no coincidence that PhD stands for Doctor of Philosophy.

The specificity of analytic philosophy is that philosophers focused not only on the consistency of statements and conclusions, but also on whether truth can be expressed in principle and how it can be transmitted. Simply put, the object of their study was formal language. It was here that the concepts of predicate logic and proof were introduced. It is not surprising that philosophers such as Russell and Frege laid the foundations of mathematical logic and are rightly considered mathematicians.

Mathematical logic

In general, the need for philosophy to intervene in mathematics had matured by the end of the 19th century. As is well known, Euler proved a large number of theorems and statements, but he was not burdened by formalism at all. That is, Euler could easily sum divergent series without even introducing a definition of the grounds on which this could be done. True, he gave comments and reasoning, more than Fermat, who left only notes in the margins, but such reasoning would still not be accepted as a proof today. By the end of the 19th century a crisis had matured in mathematics: there were statements such as the continuum hypothesis and the axiom of choice that could not be resolved unambiguously, since they evoked different associations in different people; in other words, some believed in the truth of these statements and some did not. Mathematics then took a course toward formalism, after which foundational problems emerged (relating real objects to numbers and other mathematical objects, checking the axioms), but these were left to the discretion of philosophers. From this point on, mathematics could be called the language of science.

Mathematical logic immediately chose as its object not numbers or other entities, as other mathematical disciplines do, but the words and statements of some formal language. In essence, a definition was given to proof, the standard inference rules MP (Modus Ponens) and Gen (Generalization) were defined, and formal mathematical theories (of numbers, of groups) were described. In general, we got the mathematics we see now: with formal proofs and a posteriori checks (only the axioms remain a priori).

For mathematics this undoubtedly gave immediate results. Many programmers would have been distressed to learn that all existing proofs now had to be rewritten in a new language, but mathematicians were not frightened by this "refactoring" and slowly rewrote everything, carrying out Hilbert's program. Here we come very close both to programming and to the collapse of formal mathematics. In the 1930s a mathematician appeared who began to examine proofs and theorems not from the standpoint of their meaning, but as meaningless words and statements subject to grammatical rules (axioms and rules of inference). He considered the rules and their properties more important than the words themselves, which was not quite usual for mathematics. And having studied the properties of the rules, he constructed a statement for which no proof could be derived. That was Gödel.

It is important to note here that in mathematical logic the notion of derivability means the same as provability. Since we all know that proving theorems is very difficult and does not yield to a machine, we can expect that the algorithmic problem of checking whether a word can be derived in an arbitrary grammar is also unsolvable. That is, in general, what a program will print to the console may be unknowable :) In the simplest case, it is unknown whether the program will terminate at all.

Programming (Computer science)

It is not clear why Turing machines in particular were chosen as the default model for assessing the complexity of algorithms. There are at least three equivalent mathematical models: Markov algorithms, Gödel functions, and partial recursive functions (more commonly used in mathematical logic).

Linguistics

Of course, when speaking of language it is impossible not to mention the achievements of linguists in this field. Oddly enough, programmers use Chomsky's classification when describing new grammars and estimating their complexity. Thus regular grammars are described by nondeterministic finite automata, which can be transformed into deterministic ones; context-free grammars require automata with a stack; and context-sensitive grammars may require the full power of a Turing machine. A useful result in the theory of algorithms is that type 0 grammars (the most complex) are necessary and sufficient for recognition by a Turing machine.

To understand all the terms listed above thoroughly, the "dragon book" will undoubtedly be required, as well as good books on discrete mathematics and mathematical logic, which naturally cannot be covered in one article. The article assumes that the reader already has an understanding of these concepts.

Describing a formal grammar in Prolog


Take the definition of a formal language:
In mathematical logic and computer science, a formal language is a set of finite words (strings, chains) over a finite alphabet [ Wikipedia ].

First of all, we will be interested not so much in the language itself as in the way it is specified, more precisely, in its grammar:
A formal grammar, or simply a grammar in the theory of formal languages, is a way of describing a formal language, that is, of singling out a certain subset from the set of all words over a certain finite alphabet. A distinction is made between generative and recognizing (or analytic) grammars: the former specify rules with which any word of the language can be built, while the latter allow one to determine whether a given word belongs to the language or not [ Wikipedia ].

Take the standard description of the grammar of the language of even-length palindromes:
  S -> a S a.
  S.

Lowercase letters denote terminal symbols, and uppercase letters denote productions (nonterminals). It is surprising how close this syntax already is to Prolog:
  S :- a, S, a.
  S.


Of course, the semantic load is completely different, but the way the program executes is remarkably similar. This grammar is context-free, and if an LL parser is used to analyze it, the steps of the analysis will fully coincide with the steps of Prolog execution. That is, the rules will be examined from left to right, unlike in an LR parser. The main difference in the execution of Prolog programs is that backtracking is not limited. One of the advantages of LL parsers (ANTLR) over LR parsers (yacc) is the ease of debugging, since productions can be viewed as functions and their calls interpreted on the stack; the drawback is the smaller class of languages that can be described.

Prolog is a domain-oriented language, that is, it allows you to create a special notation for each subject area, and the task of describing grammars is no exception. The syntax is quite simple:

  s --> [a], s, [a].
  s --> [].


Terminal symbols are written in square brackets (there may be several of them), and the literals outside the brackets are productions (nonterminals). Such a syntax makes it possible to translate only a context-free grammar unambiguously, but with a slight extension of this syntax the analyzer can solve problems that require a full Turing machine, that is, the languages of type 0 grammars are covered.

A couple of examples of context-free grammars described in Prolog.

  s --> [].
  s --> [a], s, [b].

  sentence --> noun_phrase, verb_phrase.
  noun_phrase --> det, noun.
  verb_phrase --> verb, noun_phrase.
  det --> [the].
  det --> [a].
  noun --> [cat].
  noun --> [bat].
  verb --> [eats].
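
Assuming an implementation with standard DCG support (SWI-Prolog, for example), these grammars can be tried directly with the built-in phrase/2; the session below is only an illustration:

  ?- phrase(s, [a,a,b,b]).                    % succeeds: aabb belongs to the first language
  ?- phrase(sentence, [the,cat,eats,a,bat]).  % succeeds
  ?- phrase(sentence, S).                     % as a generator: S = [the,cat,eats,the,cat] ; ...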


Implementing an analyzer and generator of formal languages in Prolog



The most interesting point is the implementation of the analyzer, because this is one of those tasks that are implemented quite simply in Prolog. The key point (and disadvantage) is that Prolog cannot read from a file with pure predicates, so all the content has to be loaded into a list first (a list of tokens or of letters).

For the implementation, "difference lists" are used, X \ []: the left part is the input that has not yet been consumed, and the right part is what remains after the rule has done its work. Now imagine that each rule moves tokens from the left part of the difference list to the right. During backtracking the tokens are moved back, from right to left. Let us write this down as the simplest Prolog rules:

  s(L, L).
  s(L, L3) :- L = [a|L1], s(L1, L2), L2 = [b|L3].


To determine whether the word aabb belongs to the language, it is enough to run the query :- s([a,a,b,b], []). and the answer will be yes. Like any pure Prolog program, it has no fixed input and output, so the same grammar can be used as a generator!
For the query :- s(X, []). we will receive: [], [a,b], [a,a,b,b], ... . You can even pose queries with missing data (a data-recovery task): :- s([a,X,Y,b], []). gives X = a, Y = b. In general, we get all of this for free :)

Obviously, the process of generating predicates from grammar rules can be automated, and most interpreters/compilers do exactly that. For Prolog implementations that do not support DCG (Definite Clause Grammar) it can be done by hand, especially since this is an excellent exercise for those studying Prolog.
  :- op(1200, xfx, '-->').
  :- op(200, xfx, '\\').

  % the first three rules are just checks
  % (list/1 is assumed to be available; in SWI-Prolog it is called is_list/1)
  dcg_nonterminal(X) :- list(X), !, fail.
  dcg_nonterminal(_).

  dcg_terminals(Xs) :- list(Xs).

  % the interpreter itself
  dcg_parse(A, Tokens) :-
      dcg_nonterminal(A),
      (A --> B),
      dcg_parse(B, Tokens).
  dcg_parse((A, B), Tokens \ Xs) :-
      dcg_parse(A, Tokens \ Tokens1),
      dcg_parse(B, Tokens1 \ Xs).
  dcg_parse(A, Tokens) :-
      dcg_terminals(A),
      dcg_connect(A, Tokens).
  dcg_parse({A}, Xs \ Xs) :- call(A).

  dcg_connect([], Xs \ Xs).
  dcg_connect([W | Ws], [W | Xs] \ Ys) :- dcg_connect(Ws, Xs \ Ys).

  % Example:
  % ?- dcg_parse(s, [a,b] \ []).


I think this code is complete enough not to need further comments; the first three rules are just checks.
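
The example query at the end assumes that some grammar s has been stored as plain '-->' facts for the interpreter to look up (under the author's premise of a system without built-in DCG expansion; a system that does expand DCG would translate such clauses into predicates automatically instead). A minimal grammar for it, the a^n b^n language from above, might look like this:

  s --> [].
  s --> [a], s, [b].

  % ?- dcg_parse(s, [a,a,b,b] \ []).
  % yes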

Describing complex grammars and the result tree in Prolog


It is clear that no real analyzer works only with context-free grammars; every analyzer adds extensions to cover a greater number of languages. The easiest place to do this is in Prolog's DCG rules. All that is needed is to add context, namely parameters, to the rules (I do not understand why all LL analyzers do not do that). Take a short grammar (a..ab..bc..c, three letters with the same number of repetitions).

  rule(a:X, b:Y, c:Z) --> [a], rule(a:X+1, b:Y, c:Z).
  rule(a:X, b:Y, c:Z) --> [b], rule(a:X, b:Y+1, c:Z).
  rule(a:X, b:Y, c:Z) --> [c], rule(a:X, b:Y, c:Z+1).
  rule(a:N, b:N, c:N) --> [].


Extremely clear and visual. For those who worry about efficiency: this grammar will be analyzed by Prolog with the same efficiency as an LL(1) analyzer. If you place a cut after [a] (the terminals), the memory cost will also be the same. Because the + used in the rules is not declarative (the previous article gave an example of a declarative plus), this grammar is not generative. That is, you cannot write the query :- rule(a:5, b:_, c:_, Res, []).
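
If generativity matters, one option (a sketch of mine, not taken from the article) is to replace built-in arithmetic with declarative Peano-style counters, so the same rules recognize, generate and count; the nonterminal names abc, as, bs, cs are illustrative:

  % N is encoded as succ(succ(... zero ...))
  abc(N) --> as(N), bs(N), cs(N).

  as(zero)    --> [].
  as(succ(N)) --> [a], as(N).

  bs(zero)    --> [].
  bs(succ(N)) --> [b], bs(N).

  cs(zero)    --> [].
  cs(succ(N)) --> [c], cs(N).

  % ?- phrase(abc(N), [a,a,b,b,c,c]).   % N = succ(succ(zero))
  % ?- phrase(abc(succ(zero)), L).      % L = [a,b,c]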

As many have already noticed, the syntax lacks the + and * symbols familiar from BNF/EBNF, which are often used when describing grammars. But this is easy to fix with the powerful tool of meta-rules at hand.

  '*'(Rule) --> [].
  '*'(Rule) --> Rule, '*'(Rule).
  '+'(Rule) --> Rule, '*'(Rule).

  tree --> '+'(branch).
  forest --> '*'(tree).


It looks unfamiliar, of course, which is probably why this syntax did not take root in DCG. The restrictions of the Prolog reader, which supports only unary prefix and binary infix operators, get in the way of giving such expressions a more natural look.
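
Still, assuming an implementation that accepts a variable as a DCG body (SWI-Prolog does), the meta-rules above work as they are; a tiny illustrative example with made-up nonterminal names:

  letter_a --> [a].

  % one or more a's, via the '+' meta-rule
  a_run --> '+'(letter_a).

  % ?- phrase(a_run, [a,a,a]).   % succeeds
  % ?- phrase(a_run, []).        % fails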

The result tree in Prolog


Recognizing whether a word belongs to a language is a rather rare task in programming. More often we are interested in the parse tree of the grammar for subsequent interpretation/validation/analysis. Here is an example of a typical grammatical analysis in Prolog (just like in an English class):
  sentence(s(NP,VP)) --> noun_phrase(NP), verb_phrase(VP).
  noun_phrase(np(D,N)) --> det(D), noun(N).
  verb_phrase(vp(V,NP)) --> verb(V), noun_phrase(NP).
  det(d(the)) --> [the].
  det(d(a)) --> [a].
  noun(n(bat)) --> [bat].
  noun(n(cat)) --> [cat].
  verb(v(eats)) --> [eats].

  | ?- sentence(Parse_tree, [the,bat,eats,a,cat], []).
  Parse_tree = s(np(d(the),n(bat)),vp(v(eats),np(d(a),n(cat)))) ? ;


As we can see, the result of the analysis is carried through the context, which is represented by ordinary Prolog variables; although Prolog variables can be bound only once, they are well suited as output parameters.
To show the power of the analyzer more fully, here is an example of a grammar of numerical expressions whose result tree is simply the value of the expression.

  % goals in {} are ordinary Prolog calls executed during parsing
  expr(Res) --> summand(X), ['+'], expr(Y), {Res is X + Y}.
  expr(Res) --> summand(X), ['-'], expr(Y), {Res is X - Y}.
  expr(Res) --> summand(Res).

  summand(Res) --> multiplicator(X), ['*'], summand(Y), {Res is X * Y}.
  summand(Res) --> multiplicator(X), ['/'], summand(Y), {Res is X / Y}.
  summand(Res) --> multiplicator(Res).

  multiplicator(X) --> number(X).
  multiplicator(X) --> ['('], expr(X), [')'].

  number(X) --> [X], {number(X)}.
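
A query against this grammar (assuming the expression has already been tokenized into a list) might look as follows; the token list is illustrative:

  % ?- expr(Res, [2, '+', 3, '*', 4], []).
  % Res = 14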


In principle, using meta-predicates and {} to call Prolog's built-in predicates, we get the full power of a Turing machine. In the way Prolog programs are interpreted one can draw a parallel with deterministic and non-deterministic Turing machines. A deterministic Turing machine corresponds to the classic interpretation of a Prolog program, that is, selecting rules from top to bottom in strict order. The non-deterministic analogue of interpretation would look for the best-matching rule and follow that branch (just as a non-deterministic Turing machine chooses an extremely fortunate next command in order to reach the final goal of completing the program).

Specifics of the Prolog analyzer


Summing up this look at the DCG parser in Prolog, here are the pros (+) and cons (-) of using it:
1. + Expressibility of a large number of languages (up to type 0).
2. + Convenience of writing context-sensitive languages, building a result tree, and meta-rules.
3. + Tracing and console-output mechanisms are available as standard Prolog predicates, and all the usual Prolog tools can be used.
4. - A simple rearrangement of the rules may cause the program to loop. Example:
expr(Res) --> expr(X), ['+'], expr(Y).
A rule of this kind will be accepted by the analyzer and will sometimes produce the expected result, but more often it will send the program into an infinite loop or consume a huge amount of resources during backtracking.
5. - Using the cut to limit backtracking can unexpectedly affect program logic (see the sketch after this list).
6. - It is quite difficult to limit backtracking and the number of lookahead tokens.
7. - The entire text being analyzed must be held in memory (as a list of tokens or characters).
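
To illustrate points 5 and 6, here is a small sketch (my example, not from the article) of committing with a cut right after a terminal: it does curb backtracking, but the same rules then stop working as a generator:

  as_greedy --> [a], !, as_greedy.   % commit as soon as an 'a' has been consumed
  as_greedy --> [].

  % ?- phrase(as_greedy, [a,a,a]).   % succeeds deterministically
  % ?- phrase(as_greedy, X).         % as a generator this now loops forever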

Bibliography


1. Aho et al. Compilers: Principles, Techniques, and Tools (the "dragon book").
2. Chang and Lee. Symbolic Logic and Mechanical Theorem Proving.
3. Mendelson. Introduction to Mathematical Logic.

P.S. (a final word): does anyone know how many articles about Prolog need to be written for a Prolog hub to appear on Habr? :)

Source: https://habr.com/ru/post/136742/

