<dependency>
  <groupId>org.parboiled</groupId>
  <artifactId>parboiled_2.11</artifactId>
  <version>2.1.0</version>
</dependency>
To write a parser, extend org.parboiled2.Parser. As an example, let's write a parser that does nothing, which does not prevent it from existing and enjoying life:
import org.parboiled2._

class MyParser(val input: ParserInput) extends Parser {
  // The grammar rules go here
}
The input parameter in the constructor is mandatory: this means that you need to create a new parser object for each new set of input data. At first this scared me a lot, but I stopped worrying once I saw how fast it works.
A terminal is the simplest atomic rule, one that requires no additional definitions.
def MyCharRule = rule { ch('a') }
def MyStringRule = rule { str("string") }
Thanks to implicit conversions, the same rules can be written more concisely:
def MyCharRule = rule { 'a' }
def MyStringRule = rule { "string" }
There is also ignoreCase, which matches the input string regardless of its case. The string passed to it must be in lower case:
def StringWithCaseIgnored = rule { ignoreCase("string") }
All of the rules above have the type Rule0. Rules come in different types, but for now all we need to know is that Rule0 means the rule matches itself against the input and reports whether it matched or not. We did not specify the type because Scala's type inference still copes with it easily on its own. However, nothing prevents us from specifying the type explicitly:
def StringWithCaseIgnored: Rule0 = rule { ignoreCase("string") }
ANY: any character except EOI.
EOI (End of Input): a virtual marker character for the end of input, which you will definitely want to add to the main rule of your parser. EOI is defined as:
val EOI = '\uFFFF'
If you forget to put EOI at the end of the main rule and an error occurs during matching, you will never learn about it, because the parser will assume that the input has not ended yet and will wait for more data. So whatever you feed in, a meaningless Success awaits you at the output.
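A minimal sketch of the difference, assuming a hypothetical EoiDemo parser (the class, object and rule names are invented for illustration). The rule without EOI can report success after matching only a prefix of the input, while the rule with EOI rejects the trailing text:

```scala
import org.parboiled2._

// Hypothetical parser used only to illustrate the effect of EOI.
class EoiDemo(val input: ParserInput) extends Parser {
  // Succeeds as soon as the prefix "match" has been consumed.
  def WithoutEoi: Rule0 = rule { str("match") }
  // Succeeds only if nothing follows "match".
  def WithEoi: Rule0 = rule { str("match") ~ EOI }
}

object EoiDemoApp {
  def main(args: Array[String]): Unit = {
    val text = "match and some trailing garbage"
    println(new EoiDemo(text).WithoutEoi.run().isSuccess) // true
    println(new EoiDemo(text).WithEoi.run().isSuccess)    // false
  }
}
```

A fresh parser object is created for every run, because the input is bound in the constructor.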
With ch and str alone it is hardly possible to build a useful parser, so the first step toward something meaningful is the ability to define a range of valid characters. Parboiled2 makes this very easy:
// The bounds of a character range are written as single-character strings
def Digit = rule { "0" - "9" }
def AlphaLower = rule { "a" - "z" }
Rules like these are already defined in the CharPredicate object. Parboiled1, by contrast, forced you to create such rules by hand almost every time you wrote another parser. So I carried my own library of primitives from project to project (I am sure I was not the only one). Now my library has become noticeably leaner thanks to CharPredicate. It includes, for example, the following rules (I think the names make it clear which character categories they correspond to):
CharPredicate.All (works almost the same as ANY, but performs worse on large character ranges);
CharPredicate.Digit;
CharPredicate.Digit19;
CharPredicate.HexDigit;
and many others.
You can also build your own character predicate with the from method:
CharPredicate from (_.isSpaceChar)
In addition, the operators except (--) and union (++) are defined for character predicates, which PB1 did not have. Personally, I suffered greatly from their absence: I had to define the rule "from the other side", enumerating an entire blacklist or whitelist of characters depending on the situation. The except operator can also be called a difference, since it plays the same role as the difference of two sets.
// Any visible character except double and single quotes.
def AllButQuotes = rule { CharPredicate.Visible -- "\"" -- "'" }

// An identifier: a letter followed by any number of
// AlphaNum characters or underscores.
def ValidIdentifier = rule { CharPredicate.Alpha ~ zeroOrMore(CharPredicate.AlphaNum ++ "_") }
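Character predicates can also be tried out without a parser, since a CharPredicate can be applied directly to a Char. The sketch below is only an illustration; the value names (Space, IdentifierChar, NonZeroDigit) are made up here and are not part of the library:

```scala
import org.parboiled2.CharPredicate

object CharPredicateDemo {
  // A custom predicate built from an ordinary function.
  val Space = CharPredicate.from(_.isSpaceChar)
  // Union: alphanumeric characters plus the underscore.
  val IdentifierChar = CharPredicate.AlphaNum ++ "_"
  // Difference: all digits except zero.
  val NonZeroDigit = CharPredicate.Digit -- '0'

  def main(args: Array[String]): Unit = {
    println(Space(' '))          // true
    println(IdentifierChar('_')) // true
    println(IdentifierChar('-')) // false
    println(NonZeroDigit('0'))   // false
    println(NonZeroDigit('7'))   // true
  }
}
```

Defining such predicates as vals in an object means they are built once rather than on every rule invocation.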
Two more useful rules are anyOf and noneOf. They are very similar to except and union, but work on the whole character space of ANY. And most importantly, in this space they work faster. These functions take a string that lists the characters. For example:
// Matches one of the listed characters.
def ArithmeticOperation = rule { anyOf("+-*/^") }

// Matches any character except the listed whitespace characters and EOI.
def WhiteSpaceChar = rule { noneOf(" \t\n") }
Which should you choose: anyOf/noneOf or CharPredicate? A predefined character predicate will work faster for 7-bit ASCII characters. "Predefined" is written for a reason, and Part 4 of the Best Practices section will explain why. However, for very large character ranges CharPredicate performs frankly badly, and that is when anyOf and noneOf should come to the rescue.
The times rule lets you match a rule several times in a row. In its basic form, the number of repetitions must be exact and known in advance.
def BartLearningParboiled = rule { 100 times "I will never write a parser again. " }
A range of repetition counts can also be specified:
def FutureOfCxx = rule { 'C' ~ (2 to 5).times('+') }
In Parboiled1 the equivalent rule is nTimes, which requires specifying the exact number of repetitions. If the exact number of repetitions is not known in advance, the next pair of rules will help.
zeroOrMore matches a rule zero or more times, like the Kleene star in regular grammars:
def Whitespace = rule { anyOf(" \n\t") }
def OptWs = rule { zeroOrMore(Whitespace) }
oneOrMore is similar to zeroOrMore, but requires at least one repetition to be present in the input. It is identical to the Kleene plus in regular grammars.
def UnsignedInteger = rule { oneOrMore(CharPredicate.Digit) }
Repeated elements can be separated by a delimiter using separatedBy:
def CommaSeparatedNumbers = rule { oneOrMore(UnsignedInteger).separatedBy(",") }
In Parboiled1 the same thing is written with a separator parameter:
def CommaSeparatedNumbers = rule { oneOrMore(UnsignedInteger, separator = ",") }
A sequence of rules is specified with the ~ operator. Regular expressions need no such operator: sequencing is written there directly, just as in some BNF variants. For example, let's write an (extremely simplified) rule that matches a date in a particular format:
import CharPredicate.Digit

// Matches a date in the format "yyyy-mm-dd"
def SimplifiedRuleForDate = rule { Year ~ "-" ~ Month ~ "-" ~ Day }
def Year = rule { Digit ~ Digit ~ Digit ~ Digit }
def Month = rule { Digit ~ Digit }
def Day = rule { Digit ~ Digit }
If there were a zeroOrOne rule, it would be optional: either there is one occurrence, or there are none at all. Let's look at the following example: different operating system families encode the end-of-line marker differently. On Unix-like systems a single \n character is enough, whereas Windows has historically used a sequence of two characters: \r and \n. If we want to process text created on either of these systems, we can use the following rule for the end of a line:
def Newline = rule { optional('\r') ~ '\n' }
The | operator is the analogue of | in regular expressions, and it is called ordered choice for good reason. Suppose we need to recognize a number that may or may not have a sign. The sign, if present, can be of two kinds, positive or negative, so let's deal with it first:
def Signum = rule { '+' | '-' }
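The sign may also be absent altogether:
def MaybeSign = rule { optional(Signum) }
Then the number itself is an optional sign followed by an unsigned integer:
def Integer = rule { MaybeSign ~ UnsignedInteger }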
The order in which the alternatives are listed in a rule such as Signum matters: the first alternative that matches is chosen, which rules out grammar ambiguity. And yes, all PEG parsers without exception work this way. So, if you need to parse an expression in the C language, you must start the enumeration with the longest operators so that they match first, as the standard prescribes. In simplified form, the rule might look like this:
def Operator = rule { "+=" | "-=" | "*=" | "/=" | "%=" | "&=" | "^=" | "|=" | "<<=" | ">>=" | "<<" | ">>" | "<=" | ">=" | "==" | "!=" | "||" | "&&" | "->" | "++" | "--" | "<" | ">" | "+" | "-" | "&" | "|" | "." | "*" | "/" | "!" | "~" | "^" | "=" | "," }
The layout can be anything you like, as long as + always comes after += and ++, and < comes after <= and << (and <<, in turn, after <<=). Otherwise the compound assignment operator <<= may be parsed as the sequence [<<, =], or even as [<, <, =].
The same rule can also be written by factoring out the common prefixes:
def Operators = rule { ("+" ~ optional("=" | "+")) | ("<" ~ optional("=" | ("<" ~ optional("=")))) | ... }
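To see why the order matters, here is a small sketch with a hypothetical ChoiceOrderDemo parser (the names are invented for this illustration). Both rules list the same three operators and differ only in their order; EOI is appended so that a partial match counts as a failure:

```scala
import org.parboiled2._

// Hypothetical parser: the two rules differ only in the order of alternatives.
class ChoiceOrderDemo(val input: ParserInput) extends Parser {
  // "<" matches first, the choice is settled, and EOI then fails on the second "<".
  def ShortestFirst: Rule0 = rule { ("<" | "<<" | "<<=") ~ EOI }
  // The longest operator is tried first, so "<<=" is consumed as a whole.
  def LongestFirst: Rule0 = rule { ("<<=" | "<<" | "<") ~ EOI }
}

object ChoiceOrderDemoApp {
  def main(args: Array[String]): Unit = {
    println(new ChoiceOrderDemo("<<=").ShortestFirst.run().isSuccess) // false
    println(new ChoiceOrderDemo("<<=").LongestFirst.run().isSuccess)  // true
  }
}
```

Once an alternative of an ordered choice has matched, the parser does not come back to try the remaining ones, even if the rest of the sequence fails afterwards.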
For optional, oneOrMore and zeroOrMore there is syntactic sugar that makes definitions even shorter: .?, .+ and .*. Please use them wisely: if you overuse them, your rules will read only slightly better than regular expressions. With these shortcuts we can make the description of our rules less verbose:
import CharPredicate.Digit

def SignedInteger = rule { ("+" | "-").? ~ Digit.+ }
def Newline = rule { '\r'.? ~ '\n' }
def OptWs = rule { Whitespace.* }
To start the parser, call the run method of its main (root) rule. If you are writing unit tests for a parser, it may be worthwhile to call this method on other rules as well. The parentheses after the method name are required.
Let's teach our simple parser to recognize a single string (not forgetting EOI):
import org.parboiled2._

class MyParser(val input: ParserInput) extends Parser {
  def MyStringRule: Rule0 = rule { ignoreCase("match") ~ EOI }
}
Now let's run it on a few inputs:
val p1 = new MyParser("match")
val p2 = new MyParser("Match")
val p3 = new MyParser("much")

// By default, run() returns a scala.util.Try
p1.MyStringRule.run() // Success
p2.MyStringRule.run() // Success
p3.MyStringRule.run() // Failure
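Since the result is a Try, it can be handled with ordinary pattern matching. The sketch below assumes a hypothetical MatchParser equivalent to the parser above and an invented check helper; formatError is the method Parser provides for turning a ParseError into a readable message:

```scala
import org.parboiled2._
import scala.util.{Failure, Success}

// Hypothetical parser mirroring the example above.
class MatchParser(val input: ParserInput) extends Parser {
  def MyStringRule: Rule0 = rule { ignoreCase("match") ~ EOI }
}

object RunExample {
  // Runs the root rule and describes the outcome as a string.
  def check(text: String): String = {
    val parser = new MatchParser(text)
    parser.MyStringRule.run() match {
      case Success(_)             => s"'$text' matched"
      case Failure(e: ParseError) => s"'$text' did not match: ${parser.formatError(e)}"
      case Failure(e)             => s"'$text' failed unexpectedly: $e"
    }
  }

  def main(args: Array[String]): Unit =
    Seq("match", "Match", "much").foreach(text => println(check(text)))
}
```

The ParseError case comes first because that is what a grammar mismatch produces; the last case only catches exceptions thrown for other reasons.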
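As a practical example, let's write a recognizer for a simple configuration format that looks like this:
server.name = "webserver"
server {
  port = "8080"
  address = "192.168.88.88"
  settings {
    greeting_message = "Hello!\n It's me!"
  }
}
Let's start with strings in double quotes, which may contain escape sequences (such as '\n', '\t' and '\v').
def OverlySimplifiedQuotedString = rule { '"' ~ zeroOrMore(AllowedChar) ~ '"' }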
The main work is done by the zeroOrMore rule between the quotes. Obviously, a double quote is not on the list of valid characters. What is allowed, then? Anything that is not prohibited. So, for our case, the list of allowed characters is as follows:
def AllowedChar = rule { noneOf("\"") }
This, however, leaves no way to put a double quote inside a string, so let's also allow escape sequences:
def AllowedChar = rule { noneOf("\"\\") | EscapeSequence }

// Supported escape sequences: \", \\, \n, \a, \f, \v.
def EscapeSequence = rule { '\\' ~ anyOf("\"\\nafv") }
All of this can be packaged as a reusable trait:
import org.parboiled2._

object QuotedStringSupport {
  val CharsToBeEscaped = "abfnrtv\\\""
  val Backslash = '\\'
  val AllowedChars = CharPredicate.Printable -- Backslash -- "\""
}

trait QuotedStringSupport { this: Parser =>
  import QuotedStringSupport._

  def QuotedString: Rule0 = rule { '"' ~ QuotedStringContent ~ '"' }

  def QuotedStringContent: Rule0 = rule { oneOrMore(AllowedChars | DoubleQuotedStringEscapeSequence) }

  def DoubleQuotedStringEscapeSequence = rule { '\\' ~ anyOf(CharsToBeEscaped) }
}
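Whitespace characters (newline, tab and space) can be described in two ways:
with a CharPredicate containing these three characters;
with anyOf.
We will use anyOf:
val WhitespaceChars = "\n\t "
def WhiteSpace = rule { anyOf(WhitespaceChars) }
def OptWs = rule { zeroOrMore(WhiteSpace) }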
def Newline = rule { optional('\r') ~ '\n' }
// The first character of an identifier
val IdentifierFirstChar = CharPredicate.Alpha ++ '_'
// Any subsequent character of an identifier
val IdentifierChar = CharPredicate.AlphaNum ++ '.' ++ '_'
val BlockBeginning = '{'
val BlockEnding = '}'
def Identifier = rule { IdentifierFirstChar ~ zeroOrMore(IdentifierChar) }
def Key = rule { Identifier }
def Value = rule { QuotedString }
def KeyValuePair = rule { Key ~ OptWs ~ "=" ~ OptWs ~ Value }
// Note that the rule type is specified explicitly here!
def Node: Rule0 = rule { KeyValuePair | Block }
def Nodes = rule { OptWs ~ zeroOrMore(Node).separatedBy(Newline ~ OptWs) ~ OptWs }
Nodes are separated by a newline followed by optional whitespace, OptWs. Now let's define the name of a block: it is the same identifier that is used for the name of a key:
def BlockName = rule { Identifier }
def Block = rule { BlockName ~ OptWs ~ "{" ~ Nodes ~ "}" }
Remember BlockBeginning and BlockEnding? Let's use them in the declaration:
def Block = rule { BlockName ~ OptWs ~ BlockBeginning ~ Nodes ~ BlockEnding }
Notice that Block refers to the Nodes rule, which refers to the Node rule, which in turn may refer back to Block, and that creates a cycle. Therefore we need to specify the rule type explicitly, to reassure Parboiled. Since we are writing a recognizer, the rule type will always be Rule0 (rule types will be covered in more detail in the next article).
All that remains is to define the root rule, not forgetting EOI:
def Root: Rule0 = rule { Nodes ~ EOI }
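For reference, the pieces above can be assembled into a single class roughly as follows. This is a sketch under a few assumptions of mine rather than the definitive version: the class name ConfigRecognizer is invented, every rule is annotated as Rule0 for simplicity, the quoted-string content uses zeroOrMore so that empty strings are accepted, and optional whitespace is allowed between a block name and its opening brace so that "server {" is recognized.

```scala
import org.parboiled2._

object ConfigRecognizer {
  val IdentifierFirstChar = CharPredicate.Alpha ++ '_'
  val IdentifierChar      = CharPredicate.AlphaNum ++ '.' ++ '_'
  val AllowedChars        = CharPredicate.Printable -- '\\' -- '"'
}

// Hypothetical class name; the rules follow the ones developed in the text.
class ConfigRecognizer(val input: ParserInput) extends Parser {
  import ConfigRecognizer._

  def Root: Rule0 = rule { Nodes ~ EOI }
  def Nodes: Rule0 = rule { OptWs ~ zeroOrMore(Node).separatedBy(Newline ~ OptWs) ~ OptWs }
  def Node: Rule0 = rule { KeyValuePair | Block }
  // Assumption: OptWs before '{' lets a space separate the name and the brace.
  def Block: Rule0 = rule { Identifier ~ OptWs ~ '{' ~ Nodes ~ '}' }
  def KeyValuePair: Rule0 = rule { Identifier ~ OptWs ~ '=' ~ OptWs ~ QuotedString }
  def Identifier: Rule0 = rule { IdentifierFirstChar ~ zeroOrMore(IdentifierChar) }
  // Assumption: zeroOrMore, so that the empty string "" is also a valid value.
  def QuotedString: Rule0 = rule { '"' ~ zeroOrMore(AllowedChars | EscapeSequence) ~ '"' }
  def EscapeSequence: Rule0 = rule { '\\' ~ anyOf("abfnrtv\\\"") }
  def OptWs: Rule0 = rule { zeroOrMore(anyOf(" \t\n")) }
  def Newline: Rule0 = rule { optional('\r') ~ '\n' }
}

object ConfigRecognizerApp {
  // In a triple-quoted string, \n stays as two characters: a backslash and 'n'.
  val config: String =
    """server.name = "webserver"
      |server {
      |  port = "8080"
      |  settings {
      |    greeting_message = "Hello!\n It's me!"
      |  }
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    println(new ConfigRecognizer(config).Root.run().isSuccess)
    println(new ConfigRecognizer("key = unquoted").Root.run().isSuccess)
  }
}
```

The first call should report success and the second a failure, because the value is not a quoted string. Being a recognizer, the class only answers whether the input conforms to the grammar and extracts nothing from it.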
Source: https://habr.com/ru/post/270531/