
Practical use of Boost.Spirit

I have noticed that developers tend to have a strongly polarized attitude towards the Boost.Spirit library: they either can't stand it or they are fans of it. Of course, describing grammars in C++ is a pastime for enthusiasts, and I turned out to be such an enthusiast when I first met Spirit. I want to show how Spirit makes it quite easy to solve everyday text-parsing tasks.

A simple task: easy as pie


Spirit is very convenient for writing small parsers on the spot, right in the C++ code. For example, what do you do if you need to parse a string of the form "number-number" that specifies the range of pages to print? With Spirit it is one line:

  bool ok = parse(First, Last, (uint_ >> L"-" >> uint_), MinMax) && (First == Last);

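For completeness, here is a minimal self-contained sketch of that one-liner. It is only an illustration: narrow characters and a std::pair stand in for the wide-character iterators and the MinMax variable of the original, and the fusion std_pair adapter is included so the pair can receive the two numbers.

  #include <boost/spirit/include/qi.hpp>
  #include <boost/fusion/include/std_pair.hpp>
  #include <iostream>
  #include <string>
  #include <utility>

  int main()
  {
      using boost::spirit::qi::parse;
      using boost::spirit::qi::uint_;

      std::string input = "3-12";
      std::string::const_iterator first = input.begin(), last = input.end();

      std::pair<unsigned, unsigned> min_max;   // stands in for MinMax
      bool ok = parse(first, last, uint_ >> '-' >> uint_, min_max)
                && (first == last);

      if (ok)
          std::cout << min_max.first << ".." << min_max.second << "\n";
  }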

More difficult ...


With slightly more effort you can build more serious parsers. As an example, consider the mini-language parser I wrote for the Yandex.Bar API. The task was this: to make plug-ins easier to write, the Bar uses XML, which by itself is rather verbose. But XML is easier to load from JavaScript than parsing an arbitrary format would be (Firefox extensions, including the Bar itself, are written in JS).

So, what I needed was, given the usual infix notation at the input:

  Hello * Interval * 60 + xpath("number(//hello[id='" # id # "')", World)

get a standard AST in XML format:

  <add>
      <mul>
          <value-of name="Hello"/>
          <value-of name="Interval"/>
          <value type="number">60</value>
      </mul>
      <xpath>
          <concat>
              number(//hello[id='<value-of name="id"/>')
          </concat>
          <value-of name="World"/>
      </xpath>
  </add>

At the same time, I had to keep the option of writing plain XML alongside the formulas. But all that abundance of angle brackets, equals signs, quotes and closing tags drove me to despair, and I decided to rid my language of them. I decided to write the XML in this form:

  root
      child1: text
          @attribute1: text
          @attribute2 = formula
          grandchild
              grand-grand-child
      child2 = formula

That is, nesting is determined by the number of tabs, followed by the name of the XML node (element or attribute). After the name comes a separator character that determines what follows: text or a formula. Text must be passed through to the output as-is, formulas as their AST.

So I have two parsers: one parses a line to get the node name and the text or formula, the other parses formulas and builds the AST. The number of leading tabs I handle outside the parsers, with the good old std::find_if.
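A sketch of that tab handling; the helper name is mine, purely for illustration:

  #include <algorithm>
  #include <cstddef>
  #include <string>

  // Returns the nesting level (number of leading tabs) and strips them
  // from the line before it goes on to the Spirit parser.
  std::size_t strip_indent(std::string& line)
  {
      std::string::iterator it =
          std::find_if(line.begin(), line.end(),
                       [](char c) { return c != '\t'; });
      std::size_t level = static_cast<std::size_t>(it - line.begin());
      line.erase(line.begin(), it);
      return level;
  }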

Parsing strings. Semantic actions via Boost.Bind


I'll start with the simpler one, parsing lines. A line can look like this:

  tag
  tag: text
  tag = formula
  = formula
  name::(instance|widget) (setting|variable)
  name := formula

The parser is very simple:

  bool parse_definition(string::const_iterator& iter, string::const_iterator end, mini_xml& root)
  {
      qi::rule<string::const_iterator, string(), space_type> id, any_string, scope, type;
      id %= raw[lexeme[-char_('@') >> alpha >> *(alnum | '_' | '-' | (':' >> alnum))]];
      any_string %= lexeme[+char_];
      scope %= raw[lit("widget") | lit("instance")];
      type %= raw[lit("setting") | lit("variable")];

      return phrase_parse(iter, end,
          (
              (id >> "::" >> scope >> type)[bind(&add_identifier, ref(root), _1)] |
              (id >> ":=" >> any_string)[bind(&add_definition, ref(root), _1)] |
              (id >> ':' >> any_string)[bind(&add_raw, ref(root), _1)] |
              (id >> '=' >> any_string)[bind(&add_calculated, ref(root), _1)] |
              ('=' >> any_string)[bind(&add_expression, ref(root), _1)] |
              id[bind(&add_subnode, ref(root), _1)]
          ),
          space) && (iter == end);
  }

Using phrase_parse() instead of parse() lets Spirit handle the whitespace (spaces, tabs, etc.) between the parts of an expression for me. This allows writing both "tag:text" and "tag : text", and my code, as you can see, is free of whitespace handling: phrase_parse() does it all. I only use lexeme[] where I want to turn this behavior off, and raw[] where I want the source text without any spaces cut out.
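To make that concrete, here is a small self-contained sketch (not from the plug-in code; the rule definitions are simplified) showing a skipper-driven parse where the space around ':' is optional while lexeme[] protects the text part:

  #include <boost/spirit/include/qi.hpp>
  #include <boost/fusion/include/std_pair.hpp>
  #include <string>
  #include <utility>

  namespace qi = boost::spirit::qi;

  // Thanks to the skipper passed to phrase_parse(), both "tag:text" and
  // "tag : text" parse the same way; lexeme[] keeps the skipper from eating
  // the spaces inside the text part itself.
  bool parse_line(std::string const& input, std::pair<std::string, std::string>& out)
  {
      std::string::const_iterator first = input.begin(), last = input.end();

      qi::rule<std::string::const_iterator, std::string(), qi::space_type> id, any_string;
      id         = +(qi::alnum | qi::char_('_'));
      any_string = qi::lexeme[+qi::char_];

      return qi::phrase_parse(first, last, id >> ':' >> any_string, qi::space, out)
             && (first == last);
  }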

By the way, let me remind you that the syntax of the rules in Spirit is as follows:

  rule[semantic_action]

That is, after each rule you can specify, in square brackets, the action to be executed when the rule matches.

In my case, each type of line gets its own semantic action, and at the very beginning, to simplify the subsequent code, I introduced separate rules such as id and any_string. The code called when a line matches a specific rule is specified via function objects created with boost::bind. The bind syntax is very simple:

  boost::bind(function, argument, argument, ...)

As arguments you can pass special placeholders (_1, _2, ...) indicating where to substitute the arguments of the resulting function object. Each parser produces one value at its output, and that value is passed as an argument to our function. For example, the parser

  id >> '=' >> any_string 

produces a pair of strings at its output (as a boost::fusion::vector<string, string>), which I pass as the second parameter to my add_calculated function, which has this interface:

  void add_calculated(mini_xml& root, fusion::vector<string, string> const&);
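A minimal sketch of what such a handler might look like. The body is purely illustrative (how the XML node is actually built depends on mini_xml, which is defined elsewhere in my code), but it shows how the two strings are unpacked from the fusion::vector with at_c:

  #include <boost/fusion/include/at_c.hpp>
  #include <boost/fusion/include/vector.hpp>
  #include <string>

  // Illustrative only: unpack the pair produced by "id >> '=' >> any_string".
  void add_calculated(mini_xml& root,
                      boost::fusion::vector<std::string, std::string> const& v)
  {
      std::string const& name    = boost::fusion::at_c<0>(v);  // node name before '='
      std::string const& formula = boost::fusion::at_c<1>(v);  // formula text after '='
      // ... run 'formula' through the expression grammar and attach the
      //     resulting subtree to 'root' under 'name' ...
  }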

The first parameter I need to pass to this function is a reference to root, so the boost::bind call looks like this:

  bind(&add_calculated, ref(root), _1)

Putting the rule and the semantic action together:

  (id >> '=' >> any_string)[bind(&add_calculated, ref(root), _1)]
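For readers less familiar with boost::bind, here is a tiny self-contained illustration of the same pattern outside of Spirit (the function and variable names are made up for the example):

  #include <boost/bind/bind.hpp>
  #include <boost/ref.hpp>
  #include <iostream>

  // The first argument is bound up front, the second comes from the
  // placeholder at call time.
  void report(int& total, int value)
  {
      total += value;
      std::cout << "got " << value << ", running total " << total << "\n";
  }

  int main()
  {
      using boost::placeholders::_1;

      int total = 0;
      // Same pattern as in the parser: ref(total) is fixed now, _1 is filled in later.
      auto add = boost::bind(&report, boost::ref(total), _1);
      add(5);   // calls report(total, 5)
      add(7);   // calls report(total, 7)
  }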


Parsing formulas. Semantic actions via Boost.Phoenix


Let me remind you what kind of formula I need to parse:

  Hello * Interval * 60 + xpath("number(//hello[id='" # id # "')", World)

A formula can contain identifiers, function calls with parameters, string, numeric and boolean constants, and unary, binary and ternary operations.

To process the results of the parsing I created one big functor and use it in all semantic actions with the help of Boost.Phoenix. As with any functor, the individual actions are distinguished not by name but by the number and types of their parameters.

  struct compiler
  {
      // labels are needed to distinguish functions with the same arguments from each other
      struct identifier {};  // label "identifier"
      struct function {};    // label "function"
      struct parameter {};   // label "parameter"
      struct assignment {};  // label "assignment"

      void operator()(mini_xml& node, std::string const& id, identifier) const;   // id
      void operator()(mini_xml& node, std::string const& id, function) const;     // function
      void operator()(mini_xml& node, std::string const& id, parameter) const;    // function parameter
      void operator()(mini_xml& node, std::string const& id, assignment) const;   // assignment
      void operator()(mini_xml& node, std::string const& value, char const* type) const;  // <value>
      void operator()(mini_xml& node, mini_xml const& subnode) const;
      void operator()(mini_xml& node, mini_xml const& subnode, std::string const& id, bool allow_join = false) const;
  };

Inside the grammar class I added a member holding this functor:

  template <typename Iterator>
  struct expression_grammar : grammar<Iterator, mini_xml(), space_type>
  {
      expression_grammar() : expression_grammar::base_type(expressions)
      {
          expressions = ...;
      }

      rule<Iterator, mini_xml(), space_type> expressions, ...;
      boost::phoenix::function<compiler> op;
  };

P.S. The mini_xml type is the XML being generated.
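The article does not show mini_xml itself. Purely as an assumption, modelled on the mini_xml structure from the Spirit tutorials, such a type could look roughly like this: a node with a name, attributes and a list of children, where each child is either plain text or a nested node.

  #include <boost/variant.hpp>
  #include <string>
  #include <utility>
  #include <vector>

  // A guess at the shape of such a type, not the author's actual definition.
  struct mini_xml;

  typedef boost::variant<boost::recursive_wrapper<mini_xml>, std::string> mini_xml_node;

  struct mini_xml
  {
      std::string name;                                         // element or attribute name
      std::vector<std::pair<std::string, std::string> > attrs;  // name="value" attributes
      std::vector<mini_xml_node> children;                      // text and nested elements
  };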
The rules for parsing identifiers, strings, numbers, and boolean constants are very simple:

  id %= raw[lexeme[alpha >> *(alnum | '_' | ('-' >> alnum))]];
  quoted_string %= lexeme['"' >> *(char_ - '"') >> '"'];
  numeric_value %= raw[lexeme[-(char_('+') | char_('-')) >> +digit >> -(char_('.') >> +digit)]];
  boolean_value %= raw[lit("true") | lit("false")];

All of these rules produce a string (for example, the name of an identifier). The %= operator in the "rule %= parser" construct makes the value produced by the parser propagate directly to the attribute of the rule. Their results can then be used directly in other rules:

  string = quoted_string[op(_val, _1, "string")];
  number = numeric_value[op(_val, _1, "number")];
  boolean = boolean_value[op(_val, _1, "bool")];
  empty = lit("empty")[op(_val, std::string(), "empty")];
  identifier = id[op(_val, _1, compiler::identifier())];

As you can see, in each case a parser (for example, quoted_string) is run, and its result is then used to call the functor op. In the first case (the string rule) the functor receives: as the first argument, the value being built (in my case an element of the XML tree); as the second, the result of the quoted_string parser; and as the third, the literal "string". The functor then does all the necessary work on the XML tree.
Defining a function call is not much harder, in particular because I am generating XML: the function's parameters are simply attached to the function's XML node as children:

  function =
      id[op(_val, _1, compiler::function())]
      >> '('
      >> -(parameter[op(_val, _1)] % ',')
      >> ')';

The expression parameter[op(_val, _1)] is what attaches children to the function: the functor op is passed the parent (the function node that has just been filled in by op(_val, _1, compiler::function())) and the child (the node produced by the parameter parser).
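The construct parameter[...] % ',' uses Spirit's list operator: p % d matches one or more p separated by d and collects the attributes of p into a container. A small standalone illustration (not from the plug-in code):

  #include <boost/spirit/include/qi.hpp>
  #include <string>
  #include <vector>

  // int_ % ',' is shorthand for int_ >> *(',' >> int_); the attribute of each
  // matched int_ is appended to the container passed as the attribute.
  bool parse_csv(std::string const& input, std::vector<int>& out)
  {
      namespace qi = boost::spirit::qi;
      std::string::const_iterator first = input.begin(), last = input.end();
      return qi::phrase_parse(first, last, qi::int_ % ',', qi::space, out)
             && (first == last);
  }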

In total, leaving aside binary and ternary operations (operations with 2 or 3 arguments, such as * / + - ?:), we get the following rule:

  factor =
      function[_val = _1]
      | boolean[_val = _1]
      | empty[_val = _1]
      | identifier[_val = _1]
      | string[_val = _1]
      | number[_val = _1]
      | ('(' >> expression[_val = _1] >> ')')
      | (lit('!')[op(_val, "not", compiler::function())] >> factor[op(_val, _1)])
      | (lit('-')[op(_val, "neg", compiler::function())] >> factor[op(_val, _1)])
      | ('+' >> factor[_val = _1])
      ;

When handling operations one must not forget about their precedence. It is easy to implement by nesting the definition of one operation inside the definition of another:

  addition =
      multiplication[_val = _1]
      >> *(('+' >> multiplication[op(_val, _1, "add", true)])
          | ('-' >> multiplication[op(_val, _1, "sub", true)])
          )
      ;

  multiplication =
      factor[_val = _1]
      >> *(('*' >> factor[op(_val, _1, "mul", true)])
          | ('/' >> factor[op(_val, _1, "div", true)])
          )
      ;

With this arrangement multiplications and divisions are parsed before additions and subtractions, because the multiplication rule is nested inside the addition rule: to parse an addition, all the inner rules, including multiplication, have to be parsed first. Which is exactly what we need.
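Finally, a minimal sketch of how such a grammar is typically driven. The function name here is made up and the real code's top-level handling differs, but the general shape with phrase_parse() is this:

  #include <boost/spirit/include/qi.hpp>
  #include <string>

  // Illustrative driver: run a formula through the grammar defined above and
  // get the mini_xml AST back ('expression_grammar' and 'mini_xml' are the
  // types from the preceding sections).
  bool compile_formula(std::string const& input, mini_xml& ast)
  {
      using boost::spirit::qi::phrase_parse;
      using boost::spirit::qi::space;

      expression_grammar<std::string::const_iterator> grammar;
      std::string::const_iterator first = input.begin(), last = input.end();

      return phrase_parse(first, last, grammar, space, ast) && (first == last);
  }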

Summing it all up


The entire source code can be found here: http://download.yandex.ru/bar/tools/easierxb-src.zip (inside the archive is a project for building under Windows and MacOS).
Sample input file: http://download.yandex.ru/bar/tools/easierxb-example.zip

Source: https://habr.com/ru/post/86899/

