Motivation
From time to time, posts and translated articles devoted to various aspects of the theory of formal languages appear on Habr. Among such publications (I will not name specific works so as not to offend their authors), especially among those describing software tools for language processing, inaccuracies and confusion are common. The author is inclined to believe that one of the main reasons for this regrettable state of affairs is an insufficient understanding of the ideas underlying the theory of formal languages.
This text is intended as a popular introduction to the theory of formal languages and grammars. The theory is considered (fairly, I must say) quite complex and confusing. In lectures, students are usually bored by it, and exams inspire even less enthusiasm. As a result, there are not many researchers in this field. Suffice it to say that in all the time since the birth of the theory of formal grammars in the mid-1950s, only two doctoral dissertations have been defended in this area. One was written in the late 1960s by Alexey Vladimirovich Gladkiy; the second, already on the threshold of the new millennium, by Mati Pentus.
Below, in the most accessible form possible, two basic concepts of the theory of formal languages are described: the formal language and the formal grammar. If the text proves interesting to the audience, the author solemnly promises to produce a couple more opuses in the same vein.
Formal languages
In short, a formal language is a mathematical model of a real language. A real language here means a certain means of communication between subjects. To communicate, subjects use a finite set of signs (symbols), which are pronounced (written out) in a strict temporal order, i.e. form linear sequences. Such sequences are usually called words or sentences. Thus, only the so-called communicative function of language is considered here, and it is studied by mathematical methods. Other functions of language are not studied and are therefore not considered.
To understand better how formal languages are studied, one must first understand what distinguishes mathematical methods of study. According to Kolmogorov et al. (Aleksandrov, A. D., Kolmogorov, A. N., Lavrent'ev, M. A. Mathematics: Its Content, Methods and Meaning. Volume 1. Moscow: Publishing House of the Academy of Sciences of the USSR, 1956), the mathematical method, whatever it is applied to, always follows two basic principles:
- Generalization (abstraction). The objects of study in mathematics are special entities that exist only in mathematics and are intended to be studied by mathematicians. Mathematical objects are formed by generalizing real objects. Studying an object, the mathematician notes only some of its properties and abstracts away from the rest. Thus, the abstract mathematical object “number” may in reality denote the number of geese in a pond or the number of molecules in a drop of water; the main thing is that geese and water molecules can be treated as collections. One important property follows from this “idealization” of real objects: mathematics often operates with infinite sets, whereas in reality no such sets exist.
- Rigor of reasoning. In science, to verify the truth of a piece of reasoning, it is customary to compare its results with what actually exists, i.e. to conduct experiments. In mathematics this criterion does not work: conclusions are not verified experimentally. Instead, it is customary to establish their validity by strict reasoning that obeys certain rules. Such reasoning is called a proof, and proofs are the only way to justify the validity of a statement.
Thus, in order to study languages by mathematical methods, one must first single out those properties of a language that are important for the study, and then define those properties strictly. The abstraction obtained in this way is called a formal language: a mathematical model of a real language. The content of a particular model depends on which properties matter for the study, i.e. which properties one intends to isolate and investigate at the moment.
As a well-known example of such a mathematical abstraction, consider the model known as the “bag of words” (a name that sounds incongruous to the Russian ear). In this model, texts in a natural language (i.e. one of the languages people use in everyday communication) are studied. The main object of the bag-of-words model is a word supplied with a single attribute: the frequency of its occurrence in the source text. The model does not take into account how words are placed next to each other, only how many times each word occurs in the text. The bag of words is one of the main objects of study in machine learning on texts.
In the theory of formal languages, however, the important thing is precisely the laws governing the arrangement of words next to each other, i.e. the syntactic properties of texts. For this purpose the bag-of-words model is too poor. A formal language is therefore defined as a set of sequences composed of elements of a finite alphabet. Let us define this more strictly.
An alphabet is a finite non-empty set of elements, which we will call symbols. To denote an alphabet we will usually use the capital Latin letter V, and to denote its symbols, the initial lowercase letters of the Latin alphabet. For example, the expression
V = {a,b}
denotes an alphabet of two symbols, a and b.
A string is a finite sequence of symbols. For example,
abc
is a string of three symbols. When the symbols of a string need to be referred to individually, indices are used; the strings themselves are denoted by lowercase letters from the end of the Greek alphabet. For example,
omega = a1...an
is a string of n symbols. A string may be empty, i.e. contain no symbols at all; such strings will be denoted by the Greek letter epsilon.
Finally, a formal language L over an alphabet V is an arbitrary set of strings composed of symbols of the alphabet V. “Arbitrary” here means that the language may be empty, i.e. contain no strings at all, or infinite, i.e. consist of infinitely many strings. The latter fact often causes puzzlement: are there real languages that contain an infinite number of strings? Generally speaking, of course, no such languages exist in nature. But here infinity is used as the possibility of forming strings of unlimited length. For example, the language consisting of all possible variable names of the C++ programming language is infinite: variable names in C++ are not limited in length, so there are potentially infinitely many of them. In reality, of course, very long variable names make little sense: by the time you finish reading such a name, you have forgotten its beginning. But as a potential ability to form identifiers of unlimited length, this property is useful.
So, formal languages are simply sets of strings made up of symbols of some finite alphabet. But the question arises: how can a formal language be defined? If the language is finite, one can simply write out all its strings one by one (though one may wonder whether it makes sense to write out a language of, say, ten thousand strings, or indeed whether it makes sense to write it out at all). But what if the language is infinite: how can it be specified? At this point grammar enters the scene.
Formal grammar
A way of specifying a language is called a grammar of that language. Thus, we call a grammar any method of defining a language. For example, the grammar
L = {a^nb^n}
(here n is a positive integer) defines the language L, consisting of chains of the form
ab, aabb, aaabbb
, etc. The language L contains an infinite number of strings, yet its grammar (description) consists of only 10 characters, i.e. is finite.
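To make the example concrete, here is a minimal sketch in Python (the function name is my own, not from the text) of a membership test for L: a string belongs to L exactly when it consists of n symbols a followed by n symbols b, for some positive n.

```python
def in_L(s: str) -> bool:
    """Check whether s has the form a^n b^n for some positive integer n."""
    n = len(s) // 2
    # The string must split evenly into a-half and b-half of equal length.
    return n >= 1 and s == "a" * n + "b" * n

print(in_L("aabb"))  # True: a^2 b^2
print(in_L("aab"))   # False: unequal counts of a and b
```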
The purpose of a grammar is to specify a language. This specification must be finite, otherwise no person would be able to understand the grammar. But how can a finite specification describe infinite collections? This is possible only if the structure of all the strings of the language is built on uniform principles, of which there are finitely many. In the example above, the role of such a principle is played by the rule “every string of the language begins with several characters a, followed by the same number of characters b”. If a language is an infinite set of randomly assembled strings whose structure obeys no common principles, then obviously no grammar can be invented for it. Whether such a set should be considered a language at all is a separate question; for the sake of mathematical rigor and uniformity of approach, such sets are usually considered languages as well.
So the grammar of a language describes the laws of the internal structure of its strings. Such laws are usually called syntactic, and the definition of grammar can thus be rephrased as: a finite way of describing the syntactic laws of a language. For practice, it is not individual grammars that are of interest, but grammars that can be specified within a single approach (formalism, paradigm). In other words, what is needed is a common language (a metalanguage) in which the grammars of all formal languages can be described. Then one can devise an algorithm for a computer that takes as input a grammar description written in this metalanguage and does something with the strings of the language.
Such paradigms for describing grammars are called syntactic theories. A formal grammar is a mathematical model of a grammar described within some syntactic theory. There are quite a few such theories. The best-known metalanguage for defining grammars is, of course, Chomsky's generative grammars. But other formalisms exist as well. One of them, neighborhood grammars, will be described below.
From the algorithmic point of view, grammars can be divided according to how they define a language. There are three main such methods (types of grammars):
- Recognizing grammars. Such a grammar is a device (algorithm) that takes a string of the language as input and outputs “Yes” if the string belongs to the language and “No” otherwise.
- Generating grammars. A device of this kind produces strings of the language on demand. Figuratively speaking, each press of a button generates some string of the language.
- Enumerating grammars. Such grammars print out, one after another, all the strings of the language. Obviously, if the language consists of infinitely many strings, the enumeration process never stops, though of course it can be stopped forcibly at the right moment, e.g. when the desired string has been printed.
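The three types can be illustrated on the language L = {a^n b^n} introduced above. A sketch in Python (function names and encodings are my own choices, not part of any standard formalism):

```python
import itertools

def recognize(s: str) -> str:
    """Recognizing grammar: answer Yes/No for membership in L = {a^n b^n}."""
    n = len(s) // 2
    return "Yes" if n >= 1 and s == "a" * n + "b" * n else "No"

def generate(n: int) -> str:
    """Generating grammar: produce one string of L on demand."""
    return "a" * n + "b" * n

def enumerate_L():
    """Enumerating grammar: yield all strings of L, shortest first (never stops)."""
    for n in itertools.count(1):
        yield "a" * n + "b" * n

print(recognize("aabb"))                         # "Yes"
print(list(itertools.islice(enumerate_L(), 3)))  # ['ab', 'aabb', 'aaabbb']
```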
An interesting question is whether the types of grammars can be converted into one another. Is it possible, given a generating grammar, to construct, say, an enumerating one? The answer is yes: it suffices to generate strings, ordering them, say, by length and then alphabetically. But turning an enumerating grammar into a recognizing one is, in general, impossible. One can try the following method: given an input string, start the enumeration process and wait to see whether the enumerating grammar prints this string. If the string is printed, we end the enumeration and output “Yes”. If the string belongs to the language, it will necessarily be printed and thus recognized. But if the string does not belong to the language, the process will continue indefinitely: the recognizing program will loop forever. In this sense the power of recognizing grammars is less than that of generating and enumerating grammars. This should be borne in mind when comparing Chomsky's generative grammars with recognizing Turing machines.
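The conversion just described can be sketched as a semi-decision procedure (a hypothetical helper of my own, built on the L = {a^n b^n} enumerator): it halts with “Yes” when the input string is eventually enumerated, and loops forever when the string is not in the language.

```python
import itertools

def enumerate_L():
    """Enumerating grammar for L = {a^n b^n}, shortest strings first."""
    for n in itertools.count(1):
        yield "a" * n + "b" * n

def semi_decide(target: str) -> str:
    """Halts with "Yes" iff target is eventually printed by the enumerator.
    For a string outside the language this loop would never terminate."""
    for s in enumerate_L():
        if s == target:
            return "Yes"

print(semi_decide("aaabbb"))  # "Yes": the string appears third in the enumeration
# semi_decide("aab") would loop forever: "aab" is never enumerated
```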
Neighborhood grammars
In the mid-1960s, the Soviet mathematician Yuly Anatolyevich Schreider proposed a simple way to describe the syntax of languages based on so-called neighborhood grammars. For each symbol of the language, a finite number of its “neighborhoods” is specified: strings containing the given symbol (the center of the neighborhood) somewhere inside. The set of such neighborhoods for each symbol of the language's alphabet is called a neighborhood grammar. A string is considered to belong to the language defined by a neighborhood grammar if every symbol of the string occurs in it together with one of that symbol's neighborhoods.
As an example, consider the language
A = {a+a, a+a+a, a+a+a+a,...}
. This language is the simplest model of the language of arithmetic expressions, where the role of numbers is played by the symbol "a" and the role of operations by the symbol "+". Let us compose a neighborhood grammar for this language, starting with the neighborhoods of the symbol "a". The symbol "a" occurs in the strings of the language A in three syntactic contexts: at the beginning, between two "+" symbols, and at the end. To mark the beginning and end of a string we introduce the pseudo-symbol "#". The neighborhoods of the symbol "a" are then as follows:
#a+, +a+, +a#
. Usually the center of a neighborhood is underlined, since the string may contain other occurrences of the same symbol that are not the center; we will not do so here for lack of a simple technical means. The symbol "+" occurs only between two "a" symbols, so a single neighborhood is specified for it, the string
a+a
string.
Consider the chain
a+a+a
and check whether it belongs to the language. The first symbol "a" of the string occurs together with the neighborhood
#a+
. The second "+" symbol is included in the chain along with the neighborhood
a+a
. Similar checks succeed for the remaining symbols, so the string belongs to the language, as expected. But, for example, the string
a+aa
does not belong to the language A, since its last and penultimate "a" symbols occur with none of their neighborhoods.
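The membership check just performed by hand can be sketched in code. In the representation below (my own encoding, not Schreider's notation), a neighborhood is a pair (window, center index), and a string is accepted when every one of its symbols occurs together with at least one of its neighborhoods:

```python
# Neighborhood grammar for A = {a+a, a+a+a, ...}: each neighborhood is
# (window, center), where window[center] is the symbol being licensed
# and "#" is the pseudo-symbol marking the start/end of the string.
GRAMMAR = {
    "a": [("#a+", 1), ("+a+", 1), ("+a#", 1)],
    "+": [("a+a", 1)],
}

def accepts(s: str, grammar=GRAMMAR) -> bool:
    padded = "#" + s + "#"
    for i in range(1, len(padded) - 1):  # skip the "#" sentinels
        sym = padded[i]
        if sym not in grammar:
            return False
        # The symbol must occur together with at least one of its neighborhoods.
        ok = any(
            i - c >= 0 and padded[i - c : i - c + len(w)] == w
            for w, c in grammar[sym]
        )
        if not ok:
            return False
    return True

print(accepts("a+a+a"))  # True: every symbol has a matching neighborhood
print(accepts("a+aa"))   # False: the last two "a"s have no neighborhood
```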
Not every language can be described by a neighborhood grammar. Consider, for example, a language B whose strings begin either with the symbol "0" or with the symbol "1". If a string begins with "1", the symbols "a" and "b" may follow; if it begins with "0", only symbols "a" may follow. It is not hard to prove that no neighborhood grammar can be invented for this language: whether the symbol "b" may legitimately appear in a string is determined by the string's first symbol. For any neighborhood grammar that ties the symbol "b" to the symbol "1", one can pick a string long enough that no neighborhood of "b" reaches the beginning of the string. The initial symbol can then be replaced by "0", and the resulting string will still be accepted by the grammar, contrary to our intuitive notion of the syntactic structure of the strings of language B.
On the other hand, it is easy to build a finite automaton that recognizes this language. Hence the class of languages described by neighborhood grammars is narrower than the class of automaton (regular) languages. Languages specified by neighborhood grammars will be called Schreider languages. Thus, in the hierarchy of languages one can single out the class of Schreider languages as a subclass of the automaton languages.
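Such an automaton for the language B might look as follows (a sketch; the text does not say whether the one-symbol strings "0" and "1" belong to B, so this version accepts them):

```python
# Deterministic finite automaton for language B:
# strings starting with "0" may continue only with "a";
# strings starting with "1" may continue with "a" or "b".
TRANSITIONS = {
    ("start", "0"): "after0",
    ("start", "1"): "after1",
    ("after0", "a"): "after0",
    ("after1", "a"): "after1",
    ("after1", "b"): "after1",
}
ACCEPTING = {"after0", "after1"}

def dfa_accepts(s: str) -> bool:
    state = "start"
    for ch in s:
        # Any missing transition sends the automaton to a dead state.
        state = TRANSITIONS.get((state, ch), "dead")
        if state == "dead":
            return False
    return state in ACCEPTING

print(dfa_accepts("1abba"))  # True: "b" is allowed after an initial "1"
print(dfa_accepts("0ab"))    # False: "b" is forbidden after an initial "0"
```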
One may say that Schreider languages capture a single simple syntactic relation: “being nearby”, i.e. the relation of immediate precedence. The relation of distant precedence (clearly present in the language B) cannot be specified by a neighborhood grammar. But if the syntactic relations in the strings of a language are made explicit, so that the strings are turned into diagrams of relations, then a neighborhood grammar can be devised for such diagrams.