This article was conceived as a visual comparison of two similar libraries for creating parsers:
Boost Spirit for C ++ and
Parsec for Haskell. Then I decided that it was better to break the article into 3 parts. In the first part, I will tell you how to write context-free grammar to describe the ini-file contents.
ini files
Files with the ini extension are widely distributed not only in the Windows world, but also in other systems (for example, php.ini). The format of the ini-file is very simple: the file is divided into sections, in each section there can be an arbitrary number of records of the form “parameter = value”. The names of the parameters in different sections may coincide.
[_1]
1=1
2=2
[_2]
1=1
2=2
Each parameter can be addressed through the section name and the parameter name: something like
'_1'.'2'
.
Ini-files provide comments - lines starting with ";".
We build grammar
Let's try to describe this format as a context free grammar in the
extended Backus-Naur notation (I hope that it will be clear even to those who are not familiar with it).
')
We describe what the ini file is. To do this, we describe all the constructions from the most complex (actually the ini-file itself) to the simplest (what is the identifier). Each such construction is associated with a special designation (
non-terminal ), which is defined through other non-terminals and ordinary symbols (terminals), which I will
set in quotes.
- Data ini-file (inidata) contain several sections (curly brackets mean to repeat any number of times).
inidata = {section} .
- A section consists of the section name, enclosed in square brackets, followed by several entries (parameters) from the next line.
section = "[", ident, "]", "\n", {entry} .
- The entry consists of the parameter name, the "=" sign, the parameter value and ends with the end of the line.
entry = ident, "=", value, "\n" .
- We define what an identifier is: everything that consists of letters, numbers or signs "_.,: () {} - # @ & * |" (in fact, there may be other characters).
ident = {letter | digit | "_" | "." | "," | ":" | "(" | ")" | "{" | "}" | "-" | "#" | "@" | "&" |"*" | "|"} .
This definition is not entirely true, since The identifier must consist of at least one character. Redo it like this:
ident = identChar, {identChar} .
identChar = letter | digit | "_" | "." | "," | ":" | "(" | ")" | "{" | "}" | "-" | "#" | "@" | "&" |"*" | "|" .
- Now we define what is the value: everything except the end of the line (for brevity, the notation no was extended)
value = {not "\n"} .
It remains to take into account that some parsers / people like to put extra spaces and blank lines.
For this we need to enter two more non-terminals: whitespace characters used in the string and just whitespace characters.
stringSpaces = {" " | "\t"} .
spaces = {" " | "\t" | "\n" | "\r"} .
Spaces can be almost anywhere. Therefore, a little bit correct the grammar:
inidata = spaces, {section} .
section = "[", ident, "]", stringSpaces, "\n", {entry} .
entry = ident, stringSpaces, "=", stringSpaces, value, "\n", spaces .
ident = identChar, {identChar} .
identChar = letter | digit | "_" | "." | "," | ":" | "(" | ")" | "{" | "}" | "-" | "#" | "@" | "&" |"*" | "|" .
value = {not "\n"} .
stringSpaces = {" " | "\t"} .
spaces = {" " | "\t" | "\n" | "\r"} .
Here, in general, and that concerns grammar =).
Someone probably noticed that I did not say anything about the comments. I have not forgotten - it’s just easier to cut them with “hands” =) (as an exercise, you can correct the grammar so that it takes comments into account).
Important: I cheated a little and built the grammar so that it did not have a left recursion. Both of the libraries I am considering build a
recursive descending parser that is vulnerable to left recursion. Before you use these libraries in real projects, make sure that you understand what it is and how to deal with it =).
Now you can compare the use of this grammar to build a parser in
C ++ and
Haskell .
Ps. Thanks to
maxshopen for the idea of ​​putting this article on the “Development” blog.