
Parboiled Pro (Part 3)

Part 3: Data Extraction



In this article we will build a parser for the configuration file format described earlier in this series, and implement a small DSL for convenient access to the elements of the resulting tree. You will also learn about rule types, parser actions, and the "dark matter" of Parboiled: the value stack.








Value Stack



Before we extract any data with our rules, a few words are in order about one of the core concepts of Parboiled: the value stack. It is indeed a stack, modified by parser actions: the results of matched rules are pushed onto it and popped from it. It is also this stack that we must give a hint about when declaring recursive rules. For elements to end up on the stack they must be captured explicitly, which is reflected in the shape of your rules. The rule types also reflect the number of captured items and their types: stack items may be of different types, and the typing of the value stack is checked at compile time.
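
As a minimal illustration of the hint needed for recursive rules (the rule name and the toy grammar here are my own assumption, not taken from the article): a self-referential rule must carry an explicit type annotation, because the compiler cannot infer the result type of a recursive method on its own.

 // Sketch only: without the ": Rule0" annotation the recursive call would not compile.
 def Parens: Rule0 = rule { '(' ~ optional(Parens) ~ ')' }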



Types of rules



In Parboiled2 the following rule types exist: Rule0 (pushes nothing onto the value stack), Rule1[T] (pushes a single value of type T), Rule2[A, B] (pushes two values), and the general RuleN[L <: HList] (pushes a whole HList of values).





If you wish, you can declare your own type aliases, as was done in Parboiled1. For example, this is how Rule2 is defined in the Parboiled2 sources:



 type Rule2[+A, +B] = RuleN[A :: B :: HNil] 


In Parboiled1 there was a separate type for each number of arguments from 0 to 7, which created the so-called "Rule7 problem": there simply is no Rule8 class, so pushing eight items onto the value stack does not work, no matter how much you want it to. There are various ways to work around this problem, and I will describe one of them in the next article.



Parser actions



Parser actions could just as well be called stack actions, since they let you take the data out of matched rules, transform it, and, if you are feeling particularly wicked, produce side effects with it (which in some cases is genuinely necessary, for example when the amount of extracted data is not known in advance). With actions you can build abstract syntax trees (AST), or compute results "on the spot", as is done in the calculator example.



Capturing values



To do anything useful with the data, we first need to capture it. The capture function exists for exactly this: it matches its inner rule against the input and, on success, pushes the matched text onto the value stack.



Suppose we have a rule of type Rule0 from which we want to get at least something:



 def User: Rule0 = rule { FirstName ~ Separator ~ LastName } 


We need to decide what to capture; obviously the separator is of no artistic value:



 def User: Rule2[String, String] = rule { capture(FirstName) ~ Separator ~ capture(LastName) } 


From this point on our rule is no longer Rule0 but Rule2, since it captures two strings and pushes them onto the value stack. The type annotation can actually be omitted: the compiler will figure everything out by itself.



The action operator ~>



... or the operator you will use most of all. It takes a lambda as its right operand and feeds it the values captured on the stack, letting the lambda work with them. Afterwards you may push values back onto the stack or build an AST node out of them, whichever you prefer. In any case, for the action to have anything to work with, the data must first be captured onto the stack with the capture function. Depending on the type of the return value, different forms of the ~> operator are used, which makes the operator simple and intuitive to use.



In Parboiled1 the capture was performed implicitly, which I found very inconvenient.



Now a little more about the lambda. Its signature depends on the number and types of the captured objects, and a lambda can take at most 22 arguments at a time. The types of the lambda arguments correspond to the types of the values popped from the stack, and the types of its return values correspond to the types of the values pushed back.



For example, let's try to extract at least one integer from the parser:



 def UnsignedInteger: Rule1[Int] = rule { capture(Digit.+) ~> (numStr => numStr.toInt) } 


In this situation the use of Scala's trademark placeholder syntax is encouraged:



 def UnsignedInteger: Rule1[Int] = rule { capture(Digit.+) ~> (_.toInt) } 


Here our lambda has the type String => Int, which determines the type of our rule: Rule1[Int]. The ~> operator can be applied to an already typed rule as well; for example, the following rule matches an integer but pushes twice its value onto the stack:



 def TwoTimesLarger = rule { UnsignedInteger ~> (i => i * 2) } 


The type of the TwoTimesLarger rule remains Rule1[Int]; only the value on the stack differs.



Explicitly annotating the types of the lambda arguments is not the best idea (at least at the time of this writing): the Scala compiler has a rather unpleasant bug that will prevent such code from compiling properly.



We have dealt with a single argument, but what if there are several? How does the lambda behave then? Simply and predictably: the last parameter corresponds to the value on top of the stack, the one before it to the value just below, and so on. As a result, the order of the lambda's arguments matches the textual order of the capture operations in the rule:



 def UserWithLambda: Rule2[String, String] = rule { capture(FirstName) ~ Separator ~ capture(LastName) ~> ((firstName, lastName) => ...) } 


Thanks to the action operator, we can reduce the number of values on the stack:



 def UserName = rule { User ~> ((firstName, lastName) => s"$firstName $lastName") } 


In the example above, the original type of the User rule was Rule2[String, String]; by applying a lambda to it we created a new UserName rule of type Rule1[String].



The lambda is not required to take all of the values from the stack; you may limit yourself to the last N of them (remember that the lambda takes its arguments from the top of the stack):



 (foo: Rule2[Int, String]) ~> (_.toDouble) // foo: Rule2[Int, Double]. 


Nothing prevents us from feeding the lambda a rule that has no captured values at all, with a predictable result:



 (foo: Rule0) ~> (() => 42) // foo: Rule1[Int]. 


Parboiled2 has even more powerful tools, for example the ability to return a whole group of values from the lambda back to the stack at once:



 (foo: Rule1[Event]) ~> (e => e::DateTime.now()::"localhost"::HNil) // foo: RuleN[Event::DateTime::String::HNil] 


In effect we are constructing shapeless's signature HList. The type of the resulting rule will be RuleN[Event::DateTime::String::HNil].



Similarly, you can take values from the value stack without giving anything back: for that, the lambda just has to "return" Unit. The type of the resulting rule, as you have probably guessed, is Rule0:



 (foo: Rule1[String]) ~> (println(_)) // foo: Rule0 


In addition, the action operator offers especially sweet sugar for case classes:



 case class Person(name: String, age: Int)

 (foo: Rule2[String, Int]) ~> Person // foo: Rule1[Person]


Note, however, that the compiler may fail to digest this sugar if a companion object is defined for the case class. In that case you have to fall back to a lambda with a couple of underscores and write: ~> (Person(_, _)).
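
For instance (a hedged sketch; the companion object here is contrived purely to show the workaround):

 case class Person(name: String, age: Int)
 object Person {                       // a companion object is present...
   val Unknown = Person("unknown", 0)
 }

 // ...so the bare `~> Person` form may no longer compile; spell the lambda out:
 (foo: Rule2[String, Int]) ~> (Person(_, _)) // foo: Rule1[Person]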



Sugar for case classes is ideal for building an AST; experienced users may notice that in this case it works much like the ~~> operator from Parboiled1. There are other ways to apply ~>, but you will learn about them from the documentation rather than from me. I will only note that the ~> operator is implemented in the Parboiled2 code in a highly non-trivial way, yet no matter how complicated its definition looks, it is a pleasure to use: perhaps the best technical decision made while designing the DSL.



run



A special version of the action operator for thrill-seekers. For the programmer, run behaves in most respects exactly like ~>, except for the small inconvenience that with run the compiler does not infer the argument types automatically and they have to be spelled out explicitly. The operator is a very convenient tool for producing hard-to-test side effects, for example:



 def RuleWithSideEffect = rule { capture(EmailAddress) ~ run { address: String => send(address, subj, message) } ~ EOI } 


The type of the resulting rule is Rule0: the matched string is consumed by the action and never reaches the value stack, which is sometimes exactly what you want. Parboiled1 users have probably noticed that in the context described above run behaves just like the ~% operator.



Warning: when using side effects, do not play games with the value stack. Yes, you can get direct access to it, but for a number of reasons it is better not to.



push



The push function places a value on the value stack when the preceding rule matches. In practice I have not had to use it often, since most of the work can be done with the ~> operator, but there is one example where push simply shines:



 sealed trait Bool
 case object True extends Bool
 case object False extends Bool

 def BoolMatch = rule { "true" ~ push(True) | "false" ~ push(False) }


Although this is not documented anywhere, push follows call-by-name semantics: its argument is re-evaluated on every match. This usually hurts performance, so push is best used with constants, and only with constants.
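
A small illustration of the pitfall (the Timestamped rule and its contents are my own assumption, not from the article): the expression inside push is evaluated anew each time the rule matches, whereas a constant such as True from the example above costs nothing.

 // Re-evaluated on every match of "tick": each match pushes a freshly computed value.
 def Timestamped = rule { "tick" ~ push(System.currentTimeMillis()) }

 // A constant: nothing is recomputed, which is the recommended usage.
 def Flag = rule { "true" ~ push(True) }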



As in the case of run and ~>, the type of the value passed to push determines the contents of the stack and the type of the rule being created.



Nested parsers



Parboiled2 supports nested parsers: after capturing some text and passing it through the ~> operator, we receive the matched string as a lambda parameter. After optionally massaging it, we can feed it to some sub-parser, and so on. I have not needed this in practice, but you should know the possibility exists.
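
A rough sketch of what this might look like (the parser names, the numbers= field and the grammar are my assumptions, not part of the article's configuration format): the outer rule captures the raw text of a field, and the action hands it to a second parser.

 import org.parboiled2._

 // Hypothetical inner parser: a comma-separated list of integers.
 class IntListParser(val input: ParserInput) extends Parser {
   def Ints: Rule1[Seq[Int]] = rule {
     oneOrMore(capture(oneOrMore(CharPredicate.Digit)) ~> (_.toInt)).separatedBy(',') ~ EOI
   }
 }

 // Hypothetical outer parser: captures everything after "numbers=" and
 // feeds the captured string to the nested parser inside the action.
 class OuterParser(val input: ParserInput) extends Parser {
   def NumbersField: Rule1[Seq[Int]] = rule {
     "numbers=" ~ capture(noneOf("\n").*) ~ EOI ~> (raw =>
       new IntListParser(raw).Ints.run().getOrElse(Seq.empty))
   }
 }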



AST generation



We now have all the knowledge needed to write a parser that produces a syntax tree. Syntax trees are built from nodes, so let's start with them, or rather with their description:



 sealed trait AstNode

 case class KeyValueNode(key: String, value: String) extends AstNode
 case class BlockNode(name: String, nodes: Seq[AstNode]) extends AstNode


Each case class corresponds to a particular kind of node; so far everything is clear. Still, let's look for something these nodes have in common: each of them has a name, except that for a key-value pair the name is simply the key. We also still need to be able to tell the node kinds apart.



 sealed trait AstNode {
   def name: String
 }

 case class KeyValueNode(override val name: String, value: String) extends AstNode
 case class BlockNode(override val name: String, nodes: Seq[AstNode]) extends AstNode


Let's start with the node for key-value pairs. We need to capture the key, capture the value, and assemble it all into the case class with the ~> operator. We will do the capturing "on the spot", inside the rules for the key and the value. The key comes first:



 // Capture the sequence of key characters
 def Key: Rule1[String] = rule { capture(oneOrMore(KeySymbol)) }


Just add capture and that's it: Parboiled takes care of the rest, and the matched string is pushed onto the stack. Capturing the value is a bit trickier. If we do the same thing we did for the key, we will get the string together with its quotes. Do we need them? No, so we capture only the string's body:



 def QuotedString: Rule1[String] = rule { '"' ~ capture(QuotedStringContent) ~ '"' } 


Nothing else needs to be done for the Value rule; it will automatically have the type Rule1, since the string's body was already captured and has not gone anywhere from the stack.
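
For completeness, the Value rule might look roughly like this (this is my assumption of its shape, since the article does not show it; it simply delegates to QuotedString, which already performed the capture):

 def Value: Rule1[String] = rule { QuotedString }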



Capture should be performed exactly once, and preferably in the rule where it logically belongs.



Now let's assemble the case class:



 def KeyValuePair: Rule1[AstNode] = rule { Key ~ MayBeWS ~ "=" ~ MayBeWS ~ Value ~> KeyValueNode } 


We use the syntactic sugar and elegantly pack the resulting key and value into a suitable node. Of course we could use the full lambda syntax and perform arbitrary transformations, but we don't need them here. Now let's deal with the list of nodes:



 // A node is either a key-value pair or a block
 def Node: Rule1[AstNode] = rule { KeyValuePair | Block }


Since each node is already captured, the Nodes rule needs no changes other than annotating the type of the value it places on the stack:



 def Nodes: Rule1[Seq[AstNode]] = rule { MayBeWS ~ zeroOrMore(Node).separatedBy(NewLine ~ MayBeWS) ~ MayBeWS } 


We have everything to describe the block node. The name will be captured in place, similar to the rule for the key:



 def BlockName: Rule1[String] = rule { capture(oneOrMore(BlockNameSymbol)) } 


Nodes have already been captured, so just collect the data in the case class:



 def Block: Rule1[AstNode] = rule { BlockName ~ MayBeWS ~ BlockBeginning ~ Nodes ~ BlockEnding ~> BlockNode } 


The rule describing the root of the tree also consists of nodes, so strictly speaking nothing more needs to be done. Everything seems to work and nothing begs to be changed, yet the result is not very pretty: we have two kinds of nodes, plus a root that is merely a list of nodes, and this third entity is clearly superfluous. We can instead represent the root as a block with a special name.



 def Root: Rule1[AstNode] = rule { Nodes ~ EOI ~> {nodes: Seq[AstNode] => BlockNode(RootNodeName, nodes)} } 


What name should we choose? We could give the block a meaningful name such as root, but then unpleasant surprises await us if someone decides to name their own block root. Knowing that BlockName is an identifier that only allows a limited set of characters, we could try names like "$root", "!root!" or "%root%". That would work. I prefer the empty string:



 val RootNodeName = "" 


The empty string cannot clash with any user-defined block name, since BlockName always requires at least one character.





The data is now captured. All that remains is to run the parser from the Root rule on a suitable piece of text.
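
A minimal sketch of how that run might look (it assumes the rules above live in a parser class called BkvParser, a name that also appears in the DSL trait below; the configText value is just a placeholder):

 import org.parboiled2._
 import scala.util.{Success, Failure}

 class BkvParser(val input: ParserInput) extends Parser {
   // ... the rules defined above: Root, Nodes, Block, KeyValuePair, Key, Value, etc.
 }

 val configText = "answer = \"42\""
 new BkvParser(configText).Root.run() match {
   case Success(tree)  => println(tree)                        // the BlockNode representing the root
   case Failure(error) => println(s"Parse failed: $error")
 }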



DSL for working with nodes



Having obtained a working parser that produces a syntax tree, we now need a comfortable way to work with that tree. A small DSL greatly simplifies the task. Say we need to descend into a child node by its name: we could write the same boilerplate every time, or we could write a small method (possibly duplicated by a symbolic operator) that returns the desired node. Below are the basic methods needed for working with AstNode; on top of them you can build many others, whichever suit your needs. If you like, give them symbolic names and admire the beauty of the resulting DSL.



 /**
  * Convenience methods for accessing nodes of the tree produced by the parboiled parser
  */
 trait NodeAccessDsl { this: AstNode =>

   def isRoot = this.name == BkvParser.RootNodeName

   lazy val isBlockNode = this match {
     case _: KeyValueNode => false
     case _ => true
   }

   /**
    * All key-value pairs contained directly in this node
    */
   def pairs: Seq[KeyValueNode] = this match {
     case BlockNode(_, nodes) => nodes collect { case node: KeyValueNode => node }
     case _ => Seq.empty
   }

   /**
    * All blocks contained directly in this node
    */
   def blocks: Seq[BlockNode] = this match {
     case BlockNode(_, nodes) => nodes collect { case node: BlockNode => node }
     case _ => Seq.empty
   }

   /**
    * The value of a key-value node
    */
   def getValue: Option[String] = this match {
     case KeyValueNode(_, value) => Some(value)
     case _ => None
   }
 }
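
A quick usage sketch (it assumes the AstNode case classes mix in NodeAccessDsl, that root is the tree returned by a successful parse, and that the "server" block and "host" key are hypothetical names):

 // Find the "host" value inside the "server" block, if both exist.
 val host: Option[String] =
   root.blocks.find(_.name == "server")
       .flatMap(_.pairs.find(_.name == "host"))
       .flatMap(_.getValue)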


Note that none of these methods is superfluous, and almost every time you will want a few more: a recursive search, the ability to change values inside the nodes (by mutating state or by using lenses). Having a rich set of helper methods for working with the tree makes life a lot easier.



As a result we have written a working parser with Parboiled2 and made working with the resulting syntax tree reasonably comfortable. In the next article I will cover the additional features of the library and the process of performance tuning, as well as migration from the previous version. I will also talk about the shortcomings, and how to live with them.



Parser code can be found here .

Source: https://habr.com/ru/post/270609/


