Pro Parboiled (Part 4 final)

Part 4. The harsh reality

How to make Parboiled work even faster? What mistakes are better not to allow? What to do with the inheritance in the form of Parboiled1? The concluding article of the series is designed to answer these, as well as other questions.

Cycle structure:
')

Performance

Parboiled2 runs fast, but sometimes it can work even faster. In this section we will talk about available micro-optimizations. The main thing when performing optimizations is timeliness. But if it is possible to immediately write a little more optimal code, without losing expressiveness at the same time, this opportunity should definitely be used.

`n.times` for small n <= 4

You can win in performance if, for small n, instead of the repetition operator n.times you simply join several repetitive rules into a chain. How many repetitions it makes sense to unfold depends on the circumstances, but this number is hardly more than four.

 //  rule { 4 times Digit } //  rule { Digit ~ Digit ~ Digit ~ Digit }

The relevance of this optimization was declared by Matthias himself, although, hypothetically, the n.times operator could also perform it himself.

Accelerating stack operations for `n.times`

Using this technique will allow you to squeeze a little bit of performance when extracting data from a stack of values. For example, so it can be applied to the previous rule:

 def Digit4 = rule { Digit ~ Digit ~ Digit ~ Digit ~ push( #(charAt(-4))*1000 + #(charAt(-3))*100 + #(charAt(-2))*10 + #(lastChar) ) }

Do not recreate `CharPredicate`

It is perfectly normal to rejoice at the new features of the CharPredicate class, but you should not create your own instances of the CharPredicate type inside the rule block: your predicate will be recreated every time the rule is fulfilled, which will dramatically ruin your parser's performance. Therefore, instead of creating character predicates each time, define them inside your parser as a constant:

 class MyParser(val input: ParserInput) extends Parser { val Uppercase = CharPredicate.from(_.isUpper) ... }

or, even better, send this ad to your parser's companion object:

 class MyParser(val input: ParserInput) extends Parser { ... } object MyParser { val Uppercase = CharPredicate.from(_.isUpper) }

Use semantic predicates

The peculiarity of these rules is that they do not interact with the stack of values. In detail, they are described in the documentation, but the most important thing is that you should know about them:

When using semantic predicates, the parser does not make progress, that is, it does not move its cursor to the next character. Therefore, with their mindless use of the parser can loop.

Remember the example of declaring a character predicate for uppercase characters? You can do the same thing using the semantic predicate test :

 def JavaUpperCase = rule { oneOrMore(test(currentChar.isUpper) ~ ANY) }

Use `ANY` where you would like to see `CharPredicate.All`

Alas, CharPredicate.All is slow for large ranges of characters, ANY is faster. Take advantage of this knowledge.

Use an inverting predicate

Imagine that your parser should capture all characters before a line feed (for definiteness, in Unix style). Of course, this can be done with noneOf , but the inverting predicate will be faster:

 def foo = rule { capture(zeroOrMore(noneOf("\n"))) } // ? def foo = rule { capture(zeroOrMore(!'\n')) }

Unfortunately, this great looking example will loop, because the parser will not make any progress. To fix this, you need a rule that moves the cursor of the parser, but does not change the stack. For example, like this:

 def foo = rule { capture(zeroOrMore( !'\n' ~ ANY )) }

Now the rule foo will absorb absolutely everything except EOI and line feed.

Error reports

I don’t think that you want to work with the parser that produces meaningless messages with any incorrect input data. Parboiled2 is able to quite clearly talk about errors, if you help him in this.

Formatting

So, if something is screwed up, the parser will give you an object of type ParseError , which can be put into a readable form using the formatError method:

 val errorMessage = parser formatError error

If the default formatting does not suit you for some reason, you should express your wishes to the parser explicitly:

 val errorMessage parser.formatError(error, new ErrorFormatter(showTraces = true))

If you want to write your ErrorFormatter , you will have to figure out for yourself the structure of the ParseError class, which is declared in the depth of Parboiled in the following way:

 case class ParseError(position: Position, charCount: Int, traces: Seq[RuleTrace]) extends RuntimeException

It is also worth noting the presence of several schemes for delivering error messages to the user: at your request, ParseError can be represented not only as a Try object, but, for example, as a polymorphic type or Either . More details can be found here .

 def Foo = rule { "foo" | fail(" !") }

Fine tuning

There is an option to bypass the built-in error reporting mechanism. To do this, use the fail rule with the message you want to see in case of an error:

 def Goldfinger = rule { "talk" | fail("to die") }

Then at an opportunity, you will get back your error message in the following form:

 Invalid input 'Bond', expected to die. (line 1, column 1):

Named Rules

The use of this type of rules is very useful not only in order to catch errors. This mechanism is described in detail in the “Best Practices” section.

atomic

Parboiled2 generates PEG based parsers. This means that the parsers operate with characters, not strings (as many might have thought), so errors will be shown to you at the character level. Agree - a message of the form “You have X here, we expected Y or Z” will require more mental efforts than “You have XX here, and we expected to see XY or XZ”. In order to see the lines in the error reports as a whole, there is a atomi marker, all you have to do is wrap the rule in it:

 def AtomicRuleTest = rule { atomic("foo") | atomic("fob") | atomic("bar") }

To receive foxes at the entrance

 Invalid input "fox", expected "foo", "fob" or "bar" (line 1, column 1): foxes ^

quiet

When there are too many options to choose from, you don’t always want to notify the user about all possible alternatives. For example, in a certain place your parser expects a lot of whitespace characters in conjunction with a certain rule. To eliminate redundancy in a report, you may want to keep silent about spaces. Using the quiet marker is very simple:

 def OptionalWhitespaces = rule { quiet(zeroOrMore(anyOf(" \t\n"))) }

Honestly, there are no situations that encourage the use of this rule. As well as atomic , it is described in detail in the documentation .

Error recovery

Practically the only episode in which Parboiled1 wins, and in Parboiled2, things are not very good: the parser falls only on the type of the first error it encountered. For most scenarios, this is great: for example, it does not interfere with parsing logs, text protocols, configuration files (for some cases), but this will not please the developers of DSL or IDE-like tools. Matthias promises to fix this , so if you really need this functionality today, write to the bug tracker, perhaps this will speed up the development process.

Parboiled1 has a huge number of ParserRunners for all occasions. Look toward RecoveringParserRunner if you need to continue parsing in case of errors.

Testing

Parboiled developers use the specs2 framework for testing, which they supplemented with their auxiliary class TestParserSpec . It will seem inconvenient to those who use the scalatest, but his main idea can be adopted. In secret from Mathias, his decision is not particularly accurate, as it relies on a changeable state. Perhaps in the future we will have to wait for something similar to a full-fledged framework for testing.

Rules can be tested both separately and together. Personally, I prefer to write tests not for every rule, but to check only the main rule in “special” cases:

In many formats, even standardized, very interesting moments can occur. For example, in the BSD-like message format of RFC 3164 , there are always two positions for the day of the month, even if the number itself has one digit. Here is an example from the RFC itself:

If the day is less than 10, then it must be represented as the space and then the number. For example, the 7th day of August would be represented as "Aug 7" , with two spaces between the "g" and the "7" .

In addition to this kind of “interesting moments”, you can feed the parser strings with unclosed brackets, invalid characters, check the order of operations with the stack of values.

In testing there is another subtlety that you will immediately encounter. Suppose you want to test the following rule:

 def Decimal: Rule0 = rule { ("+" | "-").? ~ Digit.+ ~ "." ~ Digit.+ }

To do this, we will send the parser obviously incorrect input and wait for the output error:

 //         . val p = new MyParser("12.3.456").Decimal.run() // Success(()) p.isFailure shouldBe true //

But when you run the test, it turns out that the parser returned a successful result. Why is that? There is no EOI in our rule, but if we add an EOI to it, we spoil all the rules that Decimal use. Therefore, it is necessary to create a special testing rule, for example, using the cunning mechanism of meta-rules . Let's add an EOI at the end of our previous example, and make sure that the parser fell with an error:

 Failure(ParseError(Position(5,1,6), Position(5,1,6), <2 traces>))

Disadvantages of parboiled

Parboiled2

If people have shortcomings, why not have them? Here Parboiled2 is no exception.

Long, too general and completely incomprehensible error messages from the compiler, in the best traditions of C ++. An illustrative example is shown in the figure below (the ~ rule is inadvertently omitted in the rule). The reason is related to performing advanced checks on types that promise to be removed in future versions.

The compiler swears dirty

This problem no longer relates to Parboiled2, but to scalac. The compiler can tear down the roof if argument types are explicitly (not) defined on a lambda that captures values from the stack:
```
 //    def MyRule = rule { oneOrMore(Visible) ~> {s => "[" + s + "]"} } //    def MyRule = rule { oneOrMore(Visible) ~> {s: String => "[" + s + "]"} } 
```
What works and what doesn't depends on the version of your compiler.
Many IDEs have not yet learned how to support macro-expressions, and Parboiled2 was built not without their help. Therefore, do not believe the underscores of your development environment. Once, having forgotten about it, I spent the whole day searching for a non-existent error literally out of the blue.
Lack of recovery mechanism in case of unsuccessful parsing. Designing domain-specific languages, or those who want to use Parboiled2 as a front-end to their compiler, this will greatly disappoint. But work on it. If you want to see this opportunity - write, it will speed up the development.

I think that many developers of their small IDE and text editors would like to see more flexible error messages than those that are provided now. At the moment there are only two ways to influence them:
- named rules
- named nested rules.

Parboiled1

Most projects are still written in Parboiled1, and it is unlikely that something will change dramatically and drastically (in the enterprise), so it may be useful to know how to learn to put up with its shortcomings, of which Parboiled1 has a lot. In addition to the very limited DSL, Parboiled has a “Rule8” problem, which complicates writing a parser for logs. Parboiled1 is designed so that for each rule with N elements there is a class, by analogy with tuples: there is Rule0 , Rule1 , right up to Rule7 . This is quite enough to parse complex programming languages, such as Java, and generally does not cause significant problems when parsing tree structures. But if you need to extract data from a linear structure, for example, log-file messages, then it is very easy to rest against this restriction. This is solved using a tuple instead of a single resulting rule. Here is an example:

 def Event: Rule1[LogEvent] = rule { Header ~ " " ~ UserData ~ " " ~ Message ~~> { (header, data, message) => SyslogEvent ( header._1, header._2, header._3, header._4, header._5, data._1, data._2, message ) } }

Let it look poor, but the problem is solved.

Best practices

In this section I will talk about common truths that work for any parser combinator, as well as about the nuances specific to Parboiled2.

Charutils

There is one useful object not mentioned in the documentation: CharUtils . It contains a number of static methods that can facilitate your life, for example: changing the case of characters, escaping, converting integer values into the corresponding characters (strings). and others. Its use may save your time.

Write unit tests

One small unsuccessful change can break your grammar and provide acute rectal pain. This is a trivial advice that many people neglect. The parser is not as difficult to test as, say IO: you do not need Mock objects and other tricks for this routine, but very valuable work. We had a whole parser infrastructure. And believe me, the first thing I did when searching for errors was to sit down and write tests, if they were not.

Make parsers and rules small

Split your parsers into sub-parsers. Each component must do something well defined. For example, if you parse LogEvent, for which the Timestamp field is defined (especially if this Timestamp corresponds to some Rfc), then do not be lazy and take it out separately.

First, it will reduce the code of your main prasser, and make it clearer.
Secondly, it greatly facilitates testing. You will cover your podparser with unit tests. And then proceed to the development of the main parser

There are different approaches:

Break parser into traits and use self-typed reference (I prefer this method).
Declare parsers as separate entities and use composition.
Use the built-in mechanism to create subParsers.

The rules should be as compact as possible, but not more compact. The smaller your rules, the easier it is to find a mistake in the grammar. It is very difficult to understand the logic of the developer, if he makes the rules long, and at the same time reuses capture . Implicit capture can aggravate the situation. Specifying the type of rule also helps with support.

Send case objects instead of strings in the Value stack

This advice can also be attributed to optimizations, since this will make the parser work faster. Send meaningful objects to Value stack, not strings. This will make your parser faster, and the code more clearly.

Poorly:

 def logLevel = rule { capture("info" | "warning" | "error") ~ ':' }

Good:

 def logLevel = rule { “info:” ~ push(LogLevel.Info) | “warning" ~ push(LogLevel.Warning) | “error" ~ push(LogLevel.Error) }

Use the simplified syntax to build the object.

This beautiful way appeared in Parboiled1. No magic, just the case class constructor is invoked implicitly. The main thing is that the number and type of arguments placed on the Value Stack match the signature of the case class constructor.

Poorly:

 def charsAST: Rule1[AST] = rule { capture(Characters) ~> ((s: String) => AText(s)) }

Good:

 def charsAST = rule { capture(Characters) ~> AText }

Named rules

Named rules significantly simplify life when receiving error reports, since they make it possible to use a pseudonym instead of a vague rule name. Or mark the rules with a specific tag - "This expression" or "Modifies the stack." In any case, to know about this function will be useful.

Many Parboiled1 users have already loved this feature. For example, Neo4J developers using Parboiled to parse the Cypher language.

What it looks like in Parboiled1:

 def Header: Rule1[Header] = rule("I am header") { ... }

In Parboiled2:

 def Header: Rule1[Header] = namedRule("header is here") { ... }

It is also possible to give names to nested rules:

 def UserName = rule { Prefix ~ oneOrMore(NameChar).named("username") ~ PostFix }

Migration

Migration is often a simple process, but it takes a lot of time. Therefore, I will try to save at least a little precious hours of your life and describe the main pitfalls.

Classpath

In order to avoid conflicts with the first version, Parboiled2 uses the classpath org.parboiled2 (while the classpath uses the first version of org.parboiled ). The Mavenovsky groupId , however, remained old: org.parboiled . Due to this, it is possible to have both dependencies in one project and make a gradual transition to the new version. Which, by the way, works quite well in the presence of several autonomous parsers. If your parsers consist of many modules that are reused in different places (as was the case in my case), you will have to do the migration at once for all modules.

Test validation

Verify the availability and operability of the unit tests. Do you have them? Not? Write them. During the migration process, I had to refine some grammars because the new DSL became more powerful, and unintentional changes broke grammars. Falling tests saved a lot of time. I did not encounter serious problems, such as a breakdown of the entire grammar. Maybe someone will share their experience if this happened to him.

Code around parser

Now the parser will be recreated every time, which is not always convenient. With PB1, I really liked to create a parser once, and then repeatedly use it. Now this number will not work. Therefore, you will have to change the constructor of the parser and rewrite the code using it a little, and do not be afraid that this will worsen performance.

Warning Parboiled1 allows you to generate rules at run time. Therefore, if you have a similar parser, then you will most likely have to rewrite it: Parboiled2 uses macro expressions that make the dynamics very difficult, giving you better performance instead.

Composition

The approach to the composition of the elements of the parser has not changed, this is good news for migrants. However, Parser is no longer a treyt, but an abstract class. Traits are the most convenient means of composing software components, in PB1 this allowed Parser to be mixed into any modules, mixing the modules together. The change in favor of the abstract class had no effect on this possibility, but now for this you need to use self-typed reference :

 trait Numbers { this: Parser => //   }

Those who have not used such a language opportunity and each time mixed the Parser treit will have to change their taste preferences.

As an alternative method, you can make full parsers of your traits and import the necessary rules (as methods) from them into your main parser. I, however, still prefer to use the composition of the traits, because I find them more visual: it is clearer for me to see the parser assembled from pieces, instead of multiple imports.

Getting rid of primitives

During the migration process, be sure to arrange a revision of your personal library of primitive rules: delete everything that is in CharPredicate . Your library will lose weight, but it will not disappear altogether. Many people would like to add to parboiled support for various date formats, a grammar describing email, and HTTP headers. Parboiled is just a parser combinator: it was such, it will remain so. However, you must admit that it is very nice to throw out the old code.

Conclusion

In this series of articles, I tried to tell you about the most progressive and promising parsing tool that exists for the scala language.I made a small tutorial and talked about the problems that I had to face in practice. I hope that this article in the worst case will be useful for you, and at best - will be a guide to action.

Used sources

Thanks

I want to express my gratitude to Alexander and Matias for having a reason for the article, as well as a handy tool. Yana, thank you for reading and editing my many mistakes, I promise I will write more intelligently. Thank you firegurafiku and Too Taboo for helping us with the first article in Vertsk, proofreading, numerous corrections, and ideas for examples in subsequent ones. Thanks to Vlad Ledovskikh for proofreading and corrections in the last article of the series. Thanks nehaev , for the error found in the article code, and Igor Kustov for the idea of breaking the huge article into parts (I didn’t want to do this for a long time). Special thanks to Arseny Alesandrovich primetalk , for inaccuracies found and
useful suggestions. Thanks to all those who followed the cycle of articles and reached the last. I hope the work was not done in vain.

Source: https://habr.com/ru/post/271003/

All Articles