📜 ⬆️ ⬇️

We create a parser for ini-files. Theory

This article was conceived as a visual comparison of two similar libraries for creating parsers: Boost Spirit for C ++ and Parsec for Haskell. Then I decided that it was better to break the article into 3 parts. In the first part, I will tell you how to write context-free grammar to describe the ini-file contents.

ini files


Files with the ini extension are widely distributed not only in the Windows world, but also in other systems (for example, php.ini). The format of the ini-file is very simple: the file is divided into sections, in each section there can be an arbitrary number of records of the form “parameter = value”. The names of the parameters in different sections may coincide.
[_1]
1=1
2=2

[_2]
1=1
2=2

Each parameter can be addressed through the section name and the parameter name: something like '_1'.'2' .

Ini-files provide comments - lines starting with ";".

We build grammar


Let's try to describe this format as a context free grammar in the extended Backus-Naur notation (I hope that it will be clear even to those who are not familiar with it).
')
We describe what the ini file is. To do this, we describe all the constructions from the most complex (actually the ini-file itself) to the simplest (what is the identifier). Each such construction is associated with a special designation ( non-terminal ), which is defined through other non-terminals and ordinary symbols (terminals), which I will
set in quotes.

It remains to take into account that some parsers / people like to put extra spaces and blank lines.
For this we need to enter two more non-terminals: whitespace characters used in the string and just whitespace characters.
stringSpaces = {" " | "\t"} .
spaces = {" " | "\t" | "\n" | "\r"} .

Spaces can be almost anywhere. Therefore, a little bit correct the grammar:
inidata = spaces, {section} .
section = "[", ident, "]", stringSpaces, "\n", {entry} .
entry = ident, stringSpaces, "=", stringSpaces, value, "\n", spaces .
ident = identChar, {identChar} .
identChar = letter | digit | "_" | "." | "," | ":" | "(" | ")" | "{" | "}" | "-" | "#" | "@" | "&" |"*" | "|" .
value = {not "\n"} .
stringSpaces = {" " | "\t"} .
spaces = {" " | "\t" | "\n" | "\r"} .


Here, in general, and that concerns grammar =).

Someone probably noticed that I did not say anything about the comments. I have not forgotten - it’s just easier to cut them with “hands” =) (as an exercise, you can correct the grammar so that it takes comments into account).

Important: I cheated a little and built the grammar so that it did not have a left recursion. Both of the libraries I am considering build a recursive descending parser that is vulnerable to left recursion. Before you use these libraries in real projects, make sure that you understand what it is and how to deal with it =).

Now you can compare the use of this grammar to build a parser in C ++ and Haskell .

Ps. Thanks to maxshopen for the idea of ​​putting this article on the “Development” blog.

Source: https://habr.com/ru/post/50973/


All Articles