
Chomsky's Generative Grammars

Small preface


This text is a continuation of a post in which the author tried to describe the concepts of formal language and grammar as simply as possible, without heavy mathematics. That post drew many responses, and the author felt obliged to write a sequel.

What follows describes the formalism of Chomsky's generative grammars. Defining languages by means of generative grammars is quite popular today, especially for the computer processing of computer languages. But the study of generative grammars in courses on translators usually stops at context-free grammars. Those form a rather narrow special class of Chomsky's generative grammars and are typically used as a kind of categorial grammar (how this is done is shown below) for specifying parsers. This circumstance only obscures the understanding of Chomsky's approach. The presentation below is intended for those who want to understand what that approach consists of.



Definition of Generative Grammar


A grammar is a finite description of a formal language. A formal language, in turn, is an arbitrary set of strings composed of symbols of some finite alphabet. "Arbitrary" is understood here in the sense that the set may be infinite, finite, or empty.

The formalism of generative grammars was introduced by Noam Chomsky in the late 1950s. In a short time it gained extraordinary popularity. For some time generative grammars were regarded as a panacea, a universal approach for defining all sorts of languages, including natural languages (that is, the languages people use for everyday communication with each other). But time has shown that generative grammars are not very convenient for describing natural languages. Today they are used mainly to describe the syntax of formal languages, such as programming languages and other computer languages.

A Chomsky generative grammar is defined by a set of generation rules (productions). Each rule is simply a pair of strings (w', w'') and licenses replacing the left string with the right one while generating the strings of the language defined by the grammar. For this reason a rule is usually written in the form w' --> w'' , making explicit what may be replaced by what. The set of rules of a grammar must be non-empty and finite, and is usually denoted by the Latin letter P.

Strings in grammar rules may be composed of symbols from two alphabets: the alphabet of terminal symbols (terminals) and the alphabet of non-terminal symbols (non-terminals). The terminal alphabet is denoted by T. This alphabet coincides with the alphabet of the formal language that the grammar defines. The sense of the term "terminal" is that the left-hand side of a rule cannot be a string composed only of terminal symbols. Therefore, if a substitution produces a string consisting of terminals alone, the generation process stops (terminates). Non-terminal symbols appear in intermediate strings. The meaning a non-terminal carries in the generation algorithm can vary widely and usually depends on the type of grammar in which it is used. Various examples of the use of non-terminal symbols are discussed below.

But one non-terminal symbol always has the same meaning: it stands for all the strings of the language. This non-terminal is called the "initial non-terminal symbol of the generative grammar" and is usually denoted by the Latin letter S (start, or sentence). Every generative grammar must contain at least one rule whose left-hand side consists of the initial non-terminal alone; otherwise the grammar cannot generate a single string.

So, a Chomsky generative grammar is the quadruple G = {N, T, P, S} , where
N is the finite alphabet of non-terminal symbols,
T is the finite alphabet of terminal symbols,
P is the finite non-empty set of generation rules (productions),
S is the initial non-terminal (start symbol).


The Language of a Generative Grammar


A generative grammar defines its language by a finite number of string substitutions, starting from the initial non-terminal and driven by the generation rules. Let us describe this a little more precisely.

A generation step w' alpha w'' => w' beta w'' consists in replacing the substring alpha with the substring beta in accordance with a generation rule alpha --> beta . Obviously, the string w' alpha w'' thereby yields the string w' beta w'' . In other words, if some substring of the current string is the left-hand side of some grammar rule, we have every right to replace that left-hand side with the right-hand side. A finite sequence of generation steps is called a derivation. Zero or more steps are denoted by =>* . The notation alpha =>* beta means that the string beta is obtained from the string alpha by a finite number of substitutions based on the generation rules. It may also be that no substitution was applied at all, in which case the string alpha coincides with beta .

So, the language of a generative grammar G = {N, T, P, S} is the set of strings composed of terminal symbols and derived from the initial grammar symbol. As a formula: L = {w | S =>* w} .

For illustration, we give two simple examples.

An example of a very simple language

Let the language L consist of a single string, which consists of the single symbol a . In other words, L = {a} . One rule, S --> a , is enough to generate it. The only derivation possible in this grammar is S => a .

It should be noted that for this language we could also introduce another non-terminal symbol, say A , together with the rules S --> A and A --> a . The only derivation would then be S => A => a . Since the non-terminal alphabet of a grammar is chosen arbitrarily, it becomes clear that even for such a simple language there are infinitely many generative grammars defining it.

A language of simple arithmetic expressions

Consider the language A = {a+a, a+a+a, a+a+a+a, ...} . The strings of this language are sequences of a symbols separated by + symbols. How do we write generation rules for this language? Note that each string of the language starts with the symbol a , followed by one or more +a pairs. This suggests first producing the symbol a ; each string of the language is then obtained by attaching one or more +a pairs to it on the right. To separate these two stages of generation from each other, we introduce a non-terminal symbol A . We then get a grammar with the following rules: S --> aA, A --> +aA, A --> +a .

Consider, for example, the derivation of the string a+a+a : S => aA => a+aA => a+a+a . In this derivation all three rules were applied in turn: S --> aA, A --> +aA, A --> +a .
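The generation procedure just traced can be mechanized. Below is a minimal Python sketch (the function name derive_language and the convention "uppercase letters are non-terminals" are assumptions of this illustration, not part of the formalism): it performs a breadth-first search over all derivations, pruning sentential forms longer than a given bound, and collects the terminal strings it reaches.

```python
from collections import deque

def derive_language(rules, start="S", max_len=7):
    """Enumerate L(G) = {w | S =>* w} by breadth-first search over
    derivations.  `rules` is a list of (lhs, rhs) string pairs;
    uppercase letters are treated as non-terminals.  Sentential forms
    longer than max_len are pruned, so only words up to that length
    are found."""
    seen, words, queue = {start}, set(), deque([start])
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            words.add(form)          # no non-terminals left: a word of L
            continue
        for lhs, rhs in rules:       # try every rule at every position
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return words

# The grammar above: S --> aA, A --> +aA, A --> +a
rules = [("S", "aA"), ("A", "+aA"), ("A", "+a")]
print(sorted(derive_language(rules)))   # ['a+a', 'a+a+a', 'a+a+a+a']
```

Raising max_len yields more strings of the language; since the grammar is recursive, no bound exhausts it.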

The language A contains infinitely many strings, which means there is no bound on string length in this language. The only way to generate strings of unbounded length is to use recursive generation rules, i.e. rules whose left-hand side is contained in the right-hand side. In the example above this is the rule A --> +aA : its left-hand side is the single symbol A , which also occurs in the right-hand side. Such recursion allows the same rule to be applied over and over, growing the generated string as needed. Recursion can also be indirect, through intermediate rules. For example, the rules A --> aBc and B --> deA define an indirect recursion on the symbol A .

Grammar Classes


Noam Chomsky introduced classes of grammars (and the corresponding classes of languages) by imposing restrictions on the form of the generation rules. Each grammar class has its own descriptive power, which can be characterized as the ability to express certain syntactic relations in the rules of the grammar. Let us consider which syntactic relations the grammar classes define.

Type 3 grammars

This class of grammars defines an algorithm that generates strings by attaching terminal symbols at the right or left edge of the string being generated. Obviously, the rules for such a generation method must have the form A --> alpha B or A --> B alpha , where alpha is a string of terminal symbols. If there is an intermediate (in-progress) string X1..Xn A , then a substitution by the rule A --> alpha B yields the string X1..Xn alpha B . For example, with the rules S --> aaaA , A --> abcA and A --> bbb one can build the derivation S => aaaA => aaaabcA => aaaabcbbb .

The syntactic relation defined by type 3 grammars can be described by the phrase "to be next to". "Next to" here means both directly adjacent, when specified within the right-hand side of a single generation rule, and indirectly adjacent, through the non-terminal symbols of related rules.

For mathematical rigor, the string of terminals in a type 3 rule is split across several rules, each with a single terminal symbol in its right-hand side. For example, the rule A --> abcB can be replaced by the rules A --> a A1 , A1 --> b A2 , A2 --> cB , whose application generates the same string. In other words, the substitution A => abcB is equivalent to the sequence of substitutions A => a A1 => ab A2 => abcB . Grammars in which the non-terminal stands to the right of the terminal in the right-hand side are called right-linear; if the non-terminal stands to the left of the terminal, the grammar is called left-linear.
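This splitting is mechanical, so it can be sketched in a few lines of Python (the helper name split_right_linear and the generated non-terminal names A1, A2, ... are assumptions of this illustration):

```python
def split_right_linear(lhs, rhs):
    """Split a right-linear rule like A --> abcB into equivalent rules
    with a single terminal in each right-hand side.  `rhs` is a string
    of terminals optionally followed by one uppercase non-terminal."""
    if rhs and rhs[-1].isupper():
        body, tail = rhs[:-1], rhs[-1]   # terminals, trailing non-terminal
    else:
        body, tail = rhs, ""             # no trailing non-terminal
    rules, cur = [], lhs
    for i, t in enumerate(body):
        # last fresh name is the original tail; others are A1, A2, ...
        nxt = tail if i == len(body) - 1 else f"{lhs}{i + 1}"
        rules.append((cur, t + nxt))
        cur = nxt
    return rules

print(split_right_linear("A", "abcB"))
# [('A', 'aA1'), ('A1', 'bA2'), ('A2', 'cB')]
```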

As an example, let us define a left-linear grammar for the language A = {a+a, a+a+a, a+a+a+a, ...} . The type 3 rules discussed above were S --> aA, A --> +aA, A --> +a ; there the strings grew by attaching a pair of symbols on the right. Let us change the grammar so that symbols are attached on the left, and also add non-terminal symbols so that only one symbol is attached at a time. We get the grammar:
S --> Aa
A --> B+
B --> Aa
B --> a
This is how the string a+a+a is derived: S => Aa => B+a => Aa+a => B+a+a => a+a+a .
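Such a derivation can be checked mechanically. A small Python sketch (the function name replay is an assumption of this illustration) applies each rule at the leftmost occurrence of its left-hand side and records the intermediate strings:

```python
def replay(start, steps):
    """Replay a derivation: apply each (lhs, rhs) rule at the leftmost
    occurrence of lhs, returning the list of sentential forms."""
    forms = [start]
    for lhs, rhs in steps:
        cur = forms[-1]
        i = cur.index(lhs)   # raises ValueError if the rule does not apply
        forms.append(cur[:i] + rhs + cur[i + len(lhs):])
    return forms

# S => Aa => B+a => Aa+a => B+a+a => a+a+a
print(replay("S", [("S", "Aa"), ("A", "B+"),
                   ("B", "Aa"), ("A", "B+"), ("B", "a")]))
# ['S', 'Aa', 'B+a', 'Aa+a', 'B+a+a', 'a+a+a']
```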

The attentive reader has probably noticed that a type 3 grammar resembles a generating automaton in which the role of states is played by the non-terminal symbols. Indeed, one possible interpretation of such a grammar is a finite automaton.

Context-free grammars

Context-free (CF) grammars have rules of the form A --> alpha . The left-hand side must be a single (necessarily non-terminal) symbol, while the right-hand side may be any string of terminal and non-terminal symbols, including the empty one.

CF grammars define two kinds of syntactic relations: the relation "to be next to" and the relation "to be a part of", i.e. the relation of hierarchy. The hierarchy relation is the most natural to the human mind. It is human nature to classify things, i.e. to regard the specific objects of one's thought as instances of a general type (class). Every thing a person thinks about is an instance of some class; for example, a particular chair is an instance of the class "chair" with the corresponding attributes. It is also natural for the human mind to divide types into subtypes, moving from more concrete types to more abstract ones: a chair is a subtype of furniture, furniture is a subtype of physical object, and so on. The type-subtype relation is the hierarchy relation.

A CF grammar can be interpreted as a categorial grammar, i.e. a grammar of types. In this interpretation grammar symbols are thought of as types, and the rules define the hierarchy relation between types: non-terminals act as compound types, and terminals as atomic types that cannot have subtypes. This interpretation of CF grammars is very popular and is often used for building language translators. But when defining the context-free class, Chomsky had something else in mind.

Since a CF grammar is generative, it defines an algorithm (strictly speaking, not an algorithm but a calculus, a kind of nondeterministic algorithm) for generating the strings of a language. Here generation proceeds not only by attaching strings at the right or left edge of an existing string, but also by inserting a string somewhere inside it. The insertion replaces a non-terminal symbol in the string with the right-hand side of some rule whose left-hand side is that non-terminal. Say, the string aabBBACbbb can be converted into the string aabBBaaaCbbb if there is a rule A --> aaa . In this sense the generated string does not grow evenly from one edge but, as it were, "swells" from the inside.

Let us illustrate this with an example. Consider the language L = {a^n b^n | n = 1, 2, 3, ...} . The expression a^n here means the symbol a repeated n times; thus the language L consists of the strings ab, aabb, aaabbb , and so on. Let us define a CF grammar for this language. Note that a new string of the language can be obtained from an existing one by attaching the symbol a on the left and the symbol b on the right: from the string aabb , say, one obtains the string aaabbb . This observation gives the generation rule S --> aSb (recall that the strings of the language are derived from the initial non-terminal and can therefore be denoted by it). There is also the special case of a string that does not decompose into smaller ones: the string ab . For its generation we introduce the rule S --> ab . So the grammar of this language has the rules S --> aSb and S --> ab . Here is a derivation of the string aaabbb : S => aSb => aaSbb => aaabbb .
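The rule S --> aSb can also be read as a recursive recipe for building strings of the language. A one-line Python sketch (the function name anbn is an assumption of this illustration) mirrors it directly: the recursive call plays the role of S --> aSb , and the base case plays the role of S --> ab .

```python
def anbn(n):
    """Build a^n b^n by mimicking the derivation S => aSb => ... => ab:
    each recursive call wraps the inner string in 'a' ... 'b'."""
    return "ab" if n == 1 else "a" + anbn(n - 1) + "b"

print(anbn(3))  # aaabbb
```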

Context-sensitive grammars and unrestricted grammars

In a CF grammar, the non-terminal symbol on the left-hand side of a rule may be replaced by the right-hand side at any place in the generated string, wherever that symbol occurs. But sometimes one would like to distinguish the contexts in which a symbol occurs, replacing it in some cases and not in others. The rules of a CF grammar do not allow this, so a more powerful class of rules is needed for such cases.

A context-sensitive (CS) grammar has rules of the form w' A w'' --> w' alpha w'' . Here w' and w'' are (possibly empty) strings of terminal and non-terminal grammar symbols, and alpha is a non-empty string of the same symbols. In other words, the non-terminal symbol A is replaced by the string alpha only in the context of the strings w' and w'' .

Another class of grammars is closely related to CS grammars: the non-contracting grammars. Their rules must satisfy one condition: the right-hand side must be at least as long as the left-hand side. Since the rules of a CS grammar require the string alpha to be non-empty, CS grammars are also non-contracting. But the most interesting fact is that for every language defined by a non-contracting grammar, a CS grammar can be constructed that defines the same language. In other words, the classes of languages defined by CS grammars and by non-contracting grammars coincide.

Why is it worth singling out the class of languages defined by non-contracting grammars? Because for such languages a recognition algorithm can be specified. It is constructed as follows: given an input string, enumerate the derivations in order of the length of the generated string. Since the grammar is non-contracting, there are only finitely many derivations producing strings no longer than the input, and if none of them matches the input string, the answer is "no".
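This recognition procedure is easy to sketch in Python for a grammar given as (lhs, rhs) pairs (the function name recognizes and the encoding are assumptions of this illustration). Because no rule shortens a string, any sentential form longer than the input can be discarded, which makes the search finite:

```python
from collections import deque

def recognizes(rules, w, start="S"):
    """Decide membership of w for a NON-CONTRACTING grammar: explore
    all derivations, discarding sentential forms longer than the input.
    Non-contraction guarantees the search space is finite."""
    assert all(len(l) <= len(r) for l, r in rules), "grammar must be non-contracting"
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for lhs, rhs in rules:       # apply every rule at every position
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= len(w) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False                     # search exhausted: answer "no"

anbn_rules = [("S", "aSb"), ("S", "ab")]
print(recognizes(anbn_rules, "aaabbb"))  # True
print(recognizes(anbn_rules, "aab"))     # False
```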

For grammars with no restrictions on the form of the rules, such a recognition algorithm cannot be constructed in the general case. The generated string can behave like an accordion, swelling and shrinking in the course of the derivation. Therefore the fact that the generated string has reached a certain length does not guarantee that the input string will not appear later in the derivation.

Do CS languages really form a class of their own? Does this class not coincide with the class of CF languages? In other words, is there a language for which it is guaranteed that no CF grammar can be given, but a CS grammar can be defined? The answer is yes, such languages exist. An example is the language L = {ww} , whose strings consist of some string over a fixed alphabet repeated twice. We will not prove here that no CF grammar can be constructed for this language. A CS grammar can be built from the following idea: first generate a string w together with a non-terminal symbol, say A , obtaining the string Aw ; then move the symbol A through the string w , producing copies of its symbols along the way, and then move those copies to the right, roughly as it is done in the example below.

Consider an example of defining a grammar for the language L = {a^(n^2) | n = 1, 2, 3, ...} . The strings of this language consist of the symbol a , and the number of those symbols is a perfect square: 1, 4, 9, 16, and so on. We will define a grammar for this language with no restrictions on the rules. Generation will consist of the following stages:
  1. Generate n symbols A for some natural number n .
  2. From these n symbols derive n^2 symbols C .
  3. Convert the symbols C into symbols a .

To implement the first stage, add the rules
S --> LS'R
S' --> AS'B
S' --> AB

The first rule wraps the generated string in the delimiters L and R ; we will need them later, to implement the third stage. The other two rules simply generate a string of the form AA...ABB...B with equal numbers of A and B symbols.

Now we must derive n^2 symbols from the string AA...ABB...B . A simple trick does it: move the symbols B to the left and, at each crossing of a symbol A , produce one more symbol C . Through the symbols C , the symbol A flows freely to the right and the symbol B to the left. The rules of this stage are as follows:
AB --> BAC
AC --> CA
CB --> BC
When all the B symbols have reached the left edge by moving past the A symbols, there will be exactly n^2 symbols C (each of the n symbols B crosses each of the n symbols A once).

Now we need to get rid of the symbols L , A , B and R , and convert the symbols C into symbols a . To do this, we annihilate each symbol B as it reaches the left edge, i.e. meets the symbol L ; we act likewise with the symbol A at the right edge. After this strategy is carried out, a string of the form LCC...CR remains. To get rid of L and R , we move the left delimiter to the right and, when the two delimiters touch, erase them both; along the way, as L passes through the symbols C , it converts them into symbols a .
LB --> L
AR --> R
LC --> aL
LR --> epsilon
Here epsilon denotes the empty string.

Here is the derivation of the string aaaa (the case n = 2): S => LS'R => LAS'BR => LAABBR => LABACBR => LBACACBR => LACACBR => LACABCR => LACBACCR => LABCACCR => LBACCACCR => LACCACCR => LCACACCR => LCCAACCR => LCCACACR => LCCACCAR => LCCACCR => LCCCACR => LCCCCAR => LCCCCR => aLCCCR => aaLCCR => aaaLCR => aaaaLR => aaaa
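The whole grammar can be checked by a brute-force search over derivations. Unlike the non-contracting case, intermediate strings here may be longer than the final result, so the sketch below allows intermediate forms to exceed the target by a fixed margin (the names derivable and square, the renaming of S' to the single character T, and the slack bound are assumptions of this illustration; the slack happens to suffice for these small inputs, while for unrestricted grammars in general no such bound exists):

```python
from collections import deque

def derivable(rules, w, start="S", slack=8):
    """Search all derivations for the string w, discarding sentential
    forms longer than len(w) + slack.  Needed because rules such as
    LB --> L are contracting, so forms may shrink after growing."""
    bound = len(w) + slack
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for lhs, rhs in rules:
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= bound and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False

# The grammar above, with S' renamed to T to keep symbols one character
square = [("S", "LTR"), ("T", "ATB"), ("T", "AB"),
          ("AB", "BAC"), ("AC", "CA"), ("CB", "BC"),
          ("LB", "L"), ("AR", "R"), ("LC", "aL"), ("LR", "")]

print(derivable(square, "a"))     # True  (n = 1)
print(derivable(square, "aaaa"))  # True  (n = 2)
print(derivable(square, "aaa"))   # False (3 is not a perfect square)
```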


Source: https://habr.com/ru/post/177701/
