
About Parboiled (Part 2)

Part 2. Text Matching



In this second part of the series, we'll talk about the basic rules for matching characters in Parboiled. We won't cover every rule, since there is documentation for that; I just want you to feel confident with the basic syntax of the rules used in Parboiled.



To consolidate this knowledge, we will write a simple recognizer for a simple grammar. It is a recognizer, not a full-fledged parser, since it will only match the input text against the rules we describe (also called productions), but will not extract any values from the matched text. A recognizer can be useful in its own right, since it can work as a validator: if the input is incorrect, the recognizer will report it and tell you what went wrong and where. And our recognizer will become really powerful once we learn how to extract parsed values and find out what the mysterious "value stack" is for. Shall we begin?



Preparatory work



Before you start working with the library, add it to the classpath. With Maven, for example, it is done like this:



<dependency>
    <groupId>org.parboiled</groupId>
    <artifactId>parboiled_2.11</artifactId>
    <version>2.1.0</version>
</dependency>


I use Scala 2.11, but there are artifacts for 2.10 as well.



Rule Description Language (Rule DSL)



All Parboiled functionality is implemented on top of the Scala language as a specialized DSL. A parser description is therefore really nothing more than the declaration of a class derived from org.parboiled2.Parser. As an example, let's write a parser that does nothing, which doesn't stop it from existing and enjoying life:



import org.parboiled2._

class MyParser(val input: ParserInput) extends Parser {
  // rules will go here
}


DSL constructs and a number of useful classes are brought into scope with a single import directive. Note that the input parameter in the constructor is mandatory: this means that for each new input you need to create a new parser object. At first this scared me a lot, but I stopped worrying when I saw how fast it works.



Rules for individual characters



So, now that we have a useless parser, we need to add a few rules telling it how to process data. If you have worked with Parboiled1, feel free to skim this section, as my explanations may seem overly detailed.



Let's start with terminals. This term will be used later, so let's give it a (not entirely strict) definition here:



The terminal is the simplest atomic rule that does not require additional definitions.



Let's describe the two simplest rules: the first should recognize a particular known character, the second a known string:



def MyCharRule   = rule { ch('a') }
def MyStringRule = rule { str("string") }


Spelling out our intentions like this every time is tiresome. Here the implicit conversion mechanism comes to the rescue, allowing us to make the rules shorter:

def MyCharRule   = rule { 'a' }
def MyStringRule = rule { "string" }


Strings are matched case-sensitively. However, there are many case-insensitive languages (for example, SQL). For them there is the ignoreCase rule, which matches the input string regardless of case. The string passed to it must be in lower case:



 def StringWithCaseIgnored = rule { ignoreCase("string") } 


More details about rules (or "productions", if you prefer) will follow in the next article. All the rules above (and below) are of type Rule0. Rules come in different types, but for now all we need to know is that Rule0 means a rule simply matches the input against itself and reports whether it matched or not. We did not specify the type because the language's type inference handles it easily by itself. However, nothing prevents us from specifying the type explicitly:



 def StringWithCaseIgnored: Rule0 = rule { ignoreCase("string") } 


There are also a few special terminals in Parboiled:

ANY: matches any single character except EOI;
EOI: a virtual "end of input" marker, defined as the character '\uFFFF'.

Although the character U+FFFF is reserved for internal use by the Unicode standard, in practice it can easily occur in user input and change the parser's behavior. So be careful with the text that reaches the input.



In addition, if you do not add EOI at the end of your main rule and an error occurs during matching, you will not find out about it: the parser will assume the input simply hasn't ended yet and will wait for more data to arrive. So whatever you feed in, a meaningless Success awaits you at the output.
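The effect of a missing EOI can be illustrated without Parboiled at all, using plain Scala regular expressions as an analogy (this is a sketch, not parboiled2 code): a prefix match corresponds to a rule without EOI, an anchored whole-input match to a rule ending in EOI.

```scala
import scala.util.matching.Regex

object EoiAnalogy extends App {
  val digits: Regex = "[0-9]+".r

  // Like a rule without a trailing EOI: a prefix match succeeds
  // even though the rest of the input ("abc") is garbage.
  val prefixOnly = digits.findPrefixMatchOf("123abc").isDefined

  // Like a rule ending in EOI: the whole input has to match.
  val wholeInput = digits.pattern.matcher("123abc").matches()

  println(s"prefix: $prefixOnly, whole: $wholeInput")  // prefix: true, whole: false
}
```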



It is hard to build a useful parser out of ch and str alone, so the first step towards usefulness is the ability to define a range of valid characters. Parboiled2 makes this very easy:

def Digit      = rule { "0" - "9" }
def AlphaLower = rule { "a" - "z" }


Both of these rules match at most one character from their range at a time (or fail to match). Although writing these two rules in PB2 is trivial, there is no need to: they are already defined in the CharPredicate object. Parboiled1, by contrast, forced you to create such rules by hand almost every time you wrote another parser. That's why I carried my own library of primitives from project to project (and I'm sure I wasn't the only one). With the arrival of CharPredicate my library has slimmed down noticeably. It includes, for example, the following predicates (the names make it clear which character categories they match):

Alpha, LowerAlpha, UpperAlpha, AlphaNum, Digit, Digit19, HexDigit, Visible, Printable and others.

If the existing predicates do not suit you, you can always define your own character predicate using the from method:



 CharPredicate from (_.isSpaceChar) 


In addition, character predicates support the subtraction (--) and union (++) operators, which did not exist in PB1. I personally suffered a lot from their absence: I had to approach a rule "from the other side", enumerating a whole blacklist or whitelist of characters depending on the situation. Subtraction can also be called difference, since it plays the same role as the difference of two sets.



// Any visible character except double and single quotes.
def AllButQuotes = rule { CharPredicate.Visible -- "\"" -- "'" }

// A valid identifier. Note that AlphaNum does not include the underscore.
def ValidIdentifier = rule {
  CharPredicate.Alpha ~ zeroOrMore(CharPredicate.AlphaNum ++ "_")
}
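The set-difference intuition behind the -- operator can be checked with plain Scala sets (a hypothetical stand-in for the DSL, assuming CharPredicate.Visible covers printable ASCII excluding the space character):

```scala
object CharSetDifference extends App {
  // Model CharPredicate.Visible (0x21-0x7E) with a plain Scala Set.
  val visible: Set[Char] = (0x21 to 0x7e).map(_.toChar).toSet

  // The analogue of CharPredicate.Visible -- "\"" -- "'"
  val allButQuotes = visible -- Set('"', '\'')

  println(allButQuotes('a'))   // true
  println(allButQuotes('"'))   // false
}
```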


It is also useful to know about two more rules: anyOf and noneOf. They are very similar to ++ and --, but operate over the whole character space, like ANY. Most importantly, in that space they work faster. These functions take a string listing the characters to match. For example:



// Matches any single arithmetic operator character.
def ArithmeticOperation = rule { anyOf("+-*/^") }

// Any character that is not whitespace (noneOf also never matches EOI).
def WhiteSpaceChar = rule { noneOf(" \t\n") }


Sometimes the question arises: which to choose, anyOf/noneOf or CharPredicate? A predefined character predicate will be faster for 7-bit ASCII characters. "Predefined" is written here for a reason, and the Best Practices section in Part 4 will explain why. For very large character ranges, however, CharPredicate behaves frankly badly, and that is when anyOf and noneOf come to the rescue.



Chains of rules



N.times



Matching single characters is not that interesting, so let's move on to more complex rules. We'll start with times, which matches one rule several times in a row. The number of repetitions must be exact and known in advance.



def BartLearningParboiled = rule {
  100 times "I will never write a parser again. "
}


Some grammars require a repetition count within a fixed range, for example from two to five. In the new Parboiled this is easily arranged:



 def FutureOfCxx = rule { 'C' ~ (2 to 5).times('+') } 


In the old one there is the nTimes rule, which requires the exact number of repetitions. If the exact number of repetitions is not known in advance, the next couple of rules will help you.



zeroOrMore



As you may have guessed from the name, zeroOrMore matches a sequence of zero or more occurrences of the given rule. The attentive reader has already noticed this rule in the examples, and it probably looked familiar: in regular expressions the same operation is denoted by an asterisk, and fans of academic terminology also know that it is called the Kleene star. In any case, using this rule is very simple:



def Whitespace = rule { anyOf(" \n\t") }
def OptWs      = rule { zeroOrMore(Whitespace) }


oneOrMore



A rule similar to the previous one. It does almost the same as zeroOrMore, but requires at least one occurrence to be present in the input. It is equivalent to the Kleene plus in regular expressions.



 def UnsignedInteger = rule { oneOrMore(CharPredicate.Digit) } 


Chain delimiter: separatedBy



We often have to deal with sets of elements written one after another with some separator: CSV, list or array literals, comma-separated function arguments, and much more. In Parboiled2, parsing such sequences is easy and natural:



 def CommaSeparatedNumbers = rule { oneOrMore(UnsignedInteger).separatedBy(",") } 


However, the first version uses a less elegant syntax for this:



 def CommaSeparatedNumbers = rule { oneOrMore(UnsignedInteger, separator = ",") } 


Sequence operator (~)



To specify a sequence of rules, the ~ operator is used. Regular expressions have no need for such an operator: there, sequencing is expressed simply by writing things one after another, just as in some BNF dialects. As an example, let's write an (extremely simplified) rule matching a date in a particular format:



import CharPredicate.Digit

// Matches dates of the form "yyyy-mm-dd"
def SimplifiedRuleForDate = rule { Year ~ "-" ~ Month ~ "-" ~ Day }

def Year  = rule { Digit ~ Digit ~ Digit ~ Digit }
def Month = rule { Digit ~ Digit }
def Day   = rule { Digit ~ Digit }




As you can see, the rule is simplified as much as possible, and I am well aware that it allows 99 days and 99 months. It makes no sense to push all validation into the parser: we will pass the matched string to some date-and-time class anyway, which will perform the validation and return a result wrapped in an Option. The grammar, meanwhile, becomes noticeably simpler. Trying to make the parser perform every possible and impossible check often leads to similar results.



"Optional" rule (optional)



If there were a zeroOrOne rule, it would be called optional: either there is exactly one occurrence, or there is none at all. Consider the following example: different operating system families encode the end-of-line marker differently. In Unix-like systems only the \n character is used, whereas Windows historically uses a sequence of two characters: \r and \n. If we want to process text created on any of these systems, we can use the following end-of-line rule:



 def Newline = rule { optional('\r') ~ '\n' } 
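As a sanity check, this rule accepts exactly the same inputs as the regular expression \r?\n; here is that equivalence in plain Scala (an illustration, not parboiled2 code):

```scala
object NewlineEquivalence extends App {
  // optional('\r') ~ '\n' accepts exactly what the regex \r?\n accepts.
  val newline = "\r?\n".r.pattern

  println(newline.matcher("\n").matches())    // Unix:    true
  println(newline.matcher("\r\n").matches())  // Windows: true
  println(newline.matcher("\r").matches())    // lone CR: false
}
```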


Ordered selection (|)



The analogue of the | operator from regular expressions is called ordered choice, and not without reason. Suppose we need to recognize a number that may or may not have a sign. The sign, if present, can be of two kinds: positive or negative. Let's deal with it first:



 def Signum = rule { '+' | '-' } 


The sign may be completely absent in the record of a positive number:



 def MaybeSign = rule { optional(Signum) } 


Then the number itself will in any case be a sequence of an optional sign and its modulus, the unsigned number:



 def Integer = rule { MaybeSign ~ UnsignedInteger } 


The order of the alternatives in the Signum rule matters: the first alternative that matches wins, which rules out grammar ambiguity. And yes, this is how all PEG parsers work, without exception. So if you need to parse, say, C expressions, you must start the enumeration with the longest operators so that they are tried first, as the standard prescribes. In a simplified form, such a rule might look like this:



def Operator = rule {
  "+=" | "-=" | "*=" | "/=" | "%=" | "&=" | "^=" | "|=" |
  "<<=" | ">>=" | "<<" | ">>" | "<=" | ">=" | "==" | "!=" |
  "||" | "&&" | "->" | "++" | "--" |
  "<" | ">" | "+" | "-" | "&" | "|" | "." | "*" | "/" |
  "!" | "~" | "^" | "=" | ","
}


The exact order can vary, but you must ensure that + always comes after += and ++, and < after <= and << (and <<, in turn, after <<=). Otherwise the compound assignment operator <<= may be parsed as the sequence [<=, =], or even [<, <, =].
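The "first match wins" behavior can be sketched in a few lines of plain Scala (a toy model of ordered choice, not the parboiled2 machinery):

```scala
object OrderedChoiceDemo extends App {
  // An ordered choice commits to the first alternative that matches
  // the start of the input, so longer operators must come first.
  def firstMatch(alternatives: Seq[String], input: String): Option[String] =
    alternatives.find(input.startsWith)

  val correct    = Seq("<<=", "<<", "<=", "<")
  val misordered = Seq("<", "<=", "<<", "<<=")

  println(firstMatch(correct, "<<= x"))     // Some(<<=)  -- one token
  println(firstMatch(misordered, "<<= x"))  // Some(<)    -- wrong split
}
```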



If a choice rule grows overly complex and we don't want to rely on the order of its alternatives, it is worth grouping them by common prefixes (left-factoring the grammar):



def Operators = rule {
  ("+" ~ optional("=" | "+")) |
  ("<" ~ optional("=" | ("<" ~ optional("=")))) |
  ...
}


Note, however, that none of our examples automatically accounts for operator precedence; for that we would need more sophisticated rules.



A little sugar



For optional, oneOrMore and zeroOrMore there is syntactic sugar that makes definitions even shorter: .?, .+ and .*. Use it wisely: if abused, your rules will read hardly better than regular expressions. With these markers we can make our rules less verbose:



import CharPredicate.Digit

def SignedInteger = rule { ("+" | "-").? ~ Digit.+ }
def Newline       = rule { '\r'.? ~ '\n' }
def OptWs         = rule { WhitespaceChar.* }




Run parser



To make the parser do at least something useful, you need to call the run method of its main (root) rule. If you are writing unit tests for the parser, it may be worth calling this method on other rules too. The parentheses after run are required.



Let's put our useless parser to work; it can match only one string constant. Our parser is defined as follows (don't forget EOI):



import org.parboiled2._

class MyParser(val input: ParserInput) extends Parser {
  def MyStringRule: Rule0 = rule { ignoreCase("match") ~ EOI }
}


Now, somewhere else, we create several parser instances and feed them different data:



val p1 = new MyParser("match")
val p2 = new MyParser("Match")
val p3 = new MyParser("much")

// run() returns a scala.util.Try
p1.MyStringRule.run()  // Success
p2.MyStringRule.run()  // Success
p3.MyStringRule.run()  // Failure
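Since run() returns a scala.util.Try, the result is handled with the usual Success/Failure pattern match. Here is a sketch with a hypothetical stand-in for MyStringRule.run() (so the snippet stays self-contained without the library):

```scala
import scala.util.{Failure, Success, Try}

object HandlingRunResult extends App {
  // Hypothetical stand-in for MyStringRule.run(): Success on a
  // case-insensitive match, Failure otherwise.
  def run(input: String): Try[Unit] =
    if (input.equalsIgnoreCase("match")) Success(())
    else Failure(new IllegalArgumentException(s"cannot match '$input'"))

  Seq("match", "Match", "much").foreach { in =>
    run(in) match {
      case Success(_) => println(s"$in: Success")
      case Failure(e) => println(s"$in: Failure (${e.getMessage})")
    }
  }
}
```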


Running rules in Parboiled2 is much simpler than in Parboiled1, which has a whole zoo of parse runners that must be invoked separately. For more information, see the Error Reporting section in Part 4.



Nested data structures



Parsing recursive structures is something Parboiled can do and regular expressions cannot. In Parboiled it comes out simply and naturally, as the following examples will demonstrate. The only extra effort required of you is to explicitly declare the type of the rules involved in the recursion.



Parsing of recursive structures is usually illustrated with an arithmetic-expression calculator. In my opinion, that example is not vivid at all. Instead, we'll consider a fictional configuration file format consisting of named blocks that contain key-value pairs.



BKV format (Block-Key-Value)



As the example we'll use the "BKV" format, invented specifically for this tutorial. It was inspired by the HOCON format and is, in fact, a subset of it. BKV consists of key-value pairs and blocks within which pairs can be nested. It looks like this:



server.name = "webserver"
server {
  port = "8080"
  address = "192.168.88.88"
  settings {
    greeting_message = "Hello!\n It's me!"
  }
}


As you can see, the format is simple and straightforward, although string escaping may frighten those who have never written a parser. Escaping comes up very often in parsing, so we will definitely examine it in detail.



Escaped strings



To avoid problems with whitespace and non-printable characters, most grammars enclose strings in double or single quotes (or something similar; opening and closing angle brackets can be used, for instance). Non-printable characters and the quotes themselves are escaped.



To write a recognizer for escaped strings, we need to decide on the following syntax elements:

the quote character that delimits the string;
the escape character (in our case, the backslash);
the set of characters allowed without escaping;
the set of valid escape sequences.

First, let's try to describe a rule for a quoted string without escaping:



 def OverlySimplifiedQuotedString = rule { '"' ~ zeroOrMore(AllowedChar) ~ '"' } 


Since empty strings are also possible, we use zeroOrMore between the quotes. Obviously, the double quote itself is not on the list of valid characters. What is allowed, then? Anything that is not forbidden. So for our case the list of allowed characters looks like this:



 def AllowedChar = rule { noneOf("\"") } 


You can live without the double quote, but it's hard. What happens if we do put a quote inside the string? On encountering it, the parser will decide the string has ended and blow up with an error message at the next character.



The escape character warns the parser that the next character is special. The algorithm looks like this: the parser expects either one of the allowed characters or an escape sequence, and an escape sequence consists of the escape character followed by a choice of the characters allowed after it:



def AllowedChar = rule { noneOf("\"\\") | EscapeSequence }

// Escape sequences: \", \\, \n, \a, \f, \v.
def EscapeSequence = rule { '\\' ~ anyOf("\"\\nafv") }
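What this escape handling amounts to can be sketched as a small plain-Scala function (the escape table here is hypothetical and intentionally small; a real parser would also extract values instead of just rewriting the string):

```scala
object UnescapeSketch extends App {
  // After a backslash, the next character selects the real character.
  val escapes = Map('"' -> '"', '\\' -> '\\', 'n' -> '\n', 't' -> '\t')

  def unescape(s: String): String = {
    val out = new StringBuilder
    var i = 0
    while (i < s.length) {
      if (s(i) == '\\' && i + 1 < s.length && escapes.contains(s(i + 1))) {
        out += escapes(s(i + 1)); i += 2
      } else {
        out += s(i); i += 1
      }
    }
    out.toString
  }

  println(unescape("""say \"hi\""""))  // say "hi"
}
```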


Having understood how this works, we can move on to the final version of the escaping rules. For this, I suggest extracting them into a dedicated trait:



import org.parboiled2._

object QuotedStringSupport {
  val CharsToBeEscaped = "abfnrtv\\\""
  val Backslash = '\\'
  val AllowedChars = CharPredicate.Printable -- Backslash -- "\""
}

trait QuotedStringSupport { this: Parser =>
  import QuotedStringSupport._

  def QuotedString: Rule0 = rule {
    '"' ~ QuotedStringContent ~ '"'
  }

  def QuotedStringContent: Rule0 = rule {
    oneOrMore(AllowedChars | DoubleQuotedStringEscapeSequence)
  }

  def DoubleQuotedStringEscapeSequence = rule {
    '\\' ~ anyOf(CharsToBeEscaped)
  }
}


Now that we have dealt with strings and extracted the corresponding functionality into a separate trait, let's move on to the format itself.



Auxiliary terminals



There are two approaches to writing parsers: "from the general to the particular" and "from the particular to the general". Grammars are usually described with the first, but this is a tutorial, so let's start with the smaller details and then generalize.



We begin the description with the auxiliary elements, namely whitespace. In our case the whitespace characters will be ' ', '\n' and '\t'. Of course, there are more whitespace characters in nature, but in this example we will limit ourselves to three. Whitespace can be handled in different ways:

strip it from the input in a separate preprocessing pass;
describe it explicitly in the grammar rules.

We will use the latter. We will take into account that in some places there may be several whitespace characters, in others none at all, and somewhere whitespace could even be mandatory (although our format does not require mandatory whitespace):



val WhitespaceChars = "\n\t "

def WhiteSpace = rule { anyOf(WhitespaceChars) }
def OptWs      = rule { zeroOrMore(WhiteSpace) }


We have already declared the rule describing the line break earlier:



 def Newline = rule { optional('\r') ~ '\n' } 


The key and the block name are identifiers similar to those found in many programming languages. An identifier must begin with a letter of the English alphabet (case does not matter) or an underscore. In the middle it may contain digits as well as English letters (lowercase and uppercase). A dot in the middle of an identifier is also valid. Before declaring the key, we will declare a rule describing an identifier (a similar rule will apply to block names). We need two character predicates: one for the first character and one for the rest.



// The first character of an identifier
val IdentifierFirstChar = CharPredicate.Alpha ++ '_'
// Subsequent characters of an identifier
val IdentifierChar = CharPredicate.AlphaNum ++ '.' ++ '_'


Let's also declare the beginning and end of the block:



val BlockBeginning = '{'
val BlockEnding    = '}'


Now that we have all the necessary auxiliary terminals, let's move on to bigger building blocks.



Key-value pairs



Let us now turn to the syntax of key-value pairs. We require the key to be a valid identifier, as described above, and the value to be a quoted string, also as described above. Let's start by defining the identifier:



 def Identifier = rule { IdentifierFirstChar ~ zeroOrMore(IdentifierChar) } 


Perhaps we didn't have to constrain the identifier with such a rigid rule, but most grammars you are likely to encounter use similar ones. For example, identifiers are not allowed to begin with a digit because of the presence of integer literals, while various other characters may be valid operators. The rule describing the key will look like this:



 def Key = rule { Identifier } 
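As a sanity check, the identifier shape described above can be mirrored with a plain regular expression (an illustration, not part of the parser):

```scala
object IdentifierShape extends App {
  // First an English letter or underscore, then letters, digits,
  // dots or underscores, per the rules in the text.
  val identifier = "[A-Za-z_][A-Za-z0-9._]*".r.pattern

  println(identifier.matcher("server.name").matches())  // true
  println(identifier.matcher("_tmp").matches())         // true
  println(identifier.matcher("8080port").matches())     // false
}
```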


To describe the value, we use an already existing rule (we just need to mix in the trait we wrote earlier):



 def Value = rule { QuotedString } 


Now we describe the rule for the whole pair. It is worth recalling once more that Parboiled is a PEG, which means we constantly need to remember whitespace and tell the rule where it may occur.



 def KeyValuePair = rule { Key ~ OptWs ~ "=" ~ OptWs ~ Value } 


Nested blocks



A block is delimited by curly braces and can contain both key-value pairs and other blocks. So first we need to erase the difference between blocks and key-value pairs by calling both of them nodes of the syntax tree. The following code does this:



// A recursive rule: note the explicit type annotation!
def Node: Rule0 = rule { KeyValuePair | Block }


Since both a block and the root structure consist of a list of nodes, let's declare a rule for such a list:



 def Nodes = rule { OptWs ~ zeroOrMore(Node).separatedBy(Newline ~ OptWs) ~ OptWs } 


Optional whitespace may appear before the node list, after it, and between its elements, hence all the occurrences of OptWs in the rule. Now we define the block name; it is the same identifier used for key names:



 def BlockName = rule { Identifier } 


Finally, we have everything needed to declare a block, so let's declare it:



 def Block = rule { BlockName ~ "{" ~ Nodes ~ "}" } 


Remember that we defined BlockBeginning and BlockEnding? Let's use them in the declaration:



 def Block = rule { BlockName ~ BlockBeginning ~ Nodes ~ BlockEnding } 


Notice that Block refers to the rule Nodes, which refers to Node, and Node can in turn refer back to Block, forming a cycle. That is why we need to specify the rule's type explicitly, reassuring Parboiled. Since we are writing a recognizer, the rule type will always be Rule0 (more about rule types in the next article).



So we have everything except the entry point, or root, which is again nothing more than a list of nodes, for which we already have a ready-made rule. We use it, not forgetting possible whitespace, and finish the rule with EOI:



 def Root: Rule0 = rule { Nodes ~ EOI } 


And with that, we have written a recognizer. You can view its full source code here.



Since in practice you rarely just match values and constantly need to extract them from text, in the next article I will tell exciting stories about how that is done, as well as about rule types. There we will upgrade our recognizer into a full parser.

Source: https://habr.com/ru/post/270531/


