Part 1. Why Parboiled?
Today, in the light of the rapid growth of the popularity of functional programming languages, parser combinators, tools that facilitate the analysis of mere mortals, are increasingly being used. Libraries such as
Parsec (Haskell) and
Planck (OCaml) have already managed to prove themselves well in their ecosystems. Their convenience and capabilities at one time prompted the creator of the Scala language, Martin Oderski, to add their analogue -
Scala Parser Combinators (now rendered in
scala-modules ) to the standard library, and the knowledge and ability to use such tools - are attributed to mandatory requirements for Scala-developers
level A3 .
This series of articles is devoted to the
Parboiled library, a powerful alternative and possible replacement for the Scala Parser Combinators. In it, we will look in detail at working with the current version of the library - Parboiled2, and also pay attention to Parboiled1, since most of the existing code still uses it.
Cycle structure:')
Introduction
Parboiled is a library that allows you to easily parse (parsit) markup languages ββ(such as HTML, XML or JSON), programming languages, configuration files, logs, text protocols, and any text whatever. Parboiled will come in handy if you want to develop your own domain-specific language (
DSL ): with its help, you can quickly get an
abstract syntax tree and, remembering the
interpreter pattern, execute the commands of your domain language.
At the moment there are several versions of this library:
- Parboiled for Java is the very first version of the library. Written by Matthias Doeniz in Java and for Java. It is still popular, although it is in the βend of lifeβ state. If, by chance, you inherited it, or you consciously start a project in Java, I advise you to consider as an alternative grappa - fork Parboiled1, which is carefully maintained in a working condition by the user with the nickname fge .
- Parboiled - the library, now better known as Parboiled1, came into existence after Matthias was imbued with a rock. He made a Scala-frontend for Parboiled, at the same time abandoning support for the Java version. With the release of Parboiled2, the Scala version of Parboiled1 is no longer supported, but in spite of this, it is not worth it to debit it yet:
- Parboiled2 has not yet learned all the features of Parboiled1;
- Parboiled1 is still used much more widely than Parboiled2, so if you are suddenly transferred to some old Scala project, there is a high chance to encounter it.
- Parboiled2 is the newest version of the library, eliminating a number of flaws in PB1. Works faster and, most importantly, supported by developers.
I wrote this article with an emphasis on Parboiled2 (by the way, I will continue to write about him in the masculine, without the word βlibraryβ), but sometimes I will be distracted to talk about the important differences between the first and second versions.
Main features
Brief description of Parboiled2:
- Follows the principles of PEG .
- Generates single pass parsers. A separate lexer is not required.
- A type-safe DSL is used, which is a subset of the Scala language.
- Optimizations are performed at compile time.
In practice, this means:
- You do not need to write a parser with your bare hands.
- Readability, comparable with the best grades of BNF (in my opinion, PB is even better).
- You can use all the power of PEG and freely parse recursive data structures, while regular expressions cannot by definition . Yes, you will not parse either JSON or even the simplest arithmetic expression with regular expressions, let alone programming languages. StackOverflow has a well-known quote in the subject :
Paris hilton to write an operating system.
- Even if you need to parse the linear structure, Parboiled2 (when using proper optimizations) will work faster than regular expressions. Evidence is given in the next section.
- Unlike parser generators such as ANTLR , you are freed from the hassle of separately generating code and then compiling it. All Parboiled code is written in Scala, so you get syntax highlighting and type checking out of the box, as well as the lack of additional operations on grammar files, while the parser generated by ANTLR will have two phases of syntax parsing. True, despite this, ANTLR is still more powerful, documented and more stable, and therefore it may be preferable in many ( very non-trivial) cases.
- Skalovsky parser combinators are slow. So slow. Indecent slowly. Matthias compared performance of parsers for Jackson and JSON, written using Parboiled, Parboiled2 and Scala Parser Combinators. With disappointing results for the latter can be found further in the text.
- Unlike Language Workbenches , Parboiled is a small and easy-to-use library. You do not need to download a poorly documented inhibiting monster and spend precious hours of life on the exhausting search for the right menus and buttons just to describe a small DSL. On the other hand, you will not get a ready text editor with your DSL highlighted out of the box, instead you will have to write a plugin for Vim, Emacs or your IDE yourself, but this does not make Parboiled a less worthy alternative for developing small subject-oriented languages.
- Parboiled has successfully proven itself in many projects , including the bloody enterprise.
New in version two
This section will be mainly useful and understandable to those who have already worked with the first version of the library. Beginners, most likely, should return to this list after reading the entire series of articles.
First of all, Parboiled2 successfully eliminates a number of childhood diseases of the first version:
- Now you can use more roomy rules than
Rule7
. For this, the shapeless library with its famous HList
: now one rule can operate with a large number of values ββon the stack. This also means that Parboiled2 has an additional dependency, which was not in PB1 - the shapeless library itself. - Added missing constructions. Thus, in Parboiled1, it was impossible to specify a dynamic number of repetitions for the
nTimes
rule and one had to use the softer oneOrMore
rule, which did not give the required accuracy of the grammar description. - Added built-in primitive terminals. There was a new class
CharPredicate
, which contains such fields as AlphaNumeric
, Hex
, Printable
, Visible
and others. - Added the ability to expand and narrow the predicate. The need to exclude a few characters from the rule arose before, but now it can be easily taken and done, and not to create a white list of characters.
Besides:
- Parboiled2 uses macros, which allows you to generate a grammar at the compilation stage, and not at runtime, as it was in Parboiled1. This greatly increases the performance of your parser, as well as increases the number of checks. In this connection, the
rule
block became mandatory, although Parboiled1 allowed in some cases to do without it. You will notice this innovation first of all when you will do migration of an old code. - Improved error reporting system.
- There is support for scala.js . The demo project can be viewed here .
Performance comparisons
Parboiled1 is known for its sluggishness (in any case, in relation to parsers generated by ANTLR), due to the fact that all rule matching actions were performed in runtime and the compiler could not perform any significant optimizations on such a parser. In Parboiled2, performance was put at the forefront and many things were redone on macros, so the compiler got a lot of freedom in optimizing, and the user got the long-awaited performance. Below we will demonstrate what good results the developers have achieved.
Parboiled against JSON parsers written with straight hands
Parboiled is a generalized tool for creating parsers, and as you know, a specialized tool always turns out to be better than a generalized tool for solving its specialized task. In the Java world, there are a small number of JSON parsers written by hand by ancient elven masters, and Alexander Myltsev (one of the Parboiled2 developers) tested how much Parboiled is losing in performance to these artifacts.
The results were quite optimistic, especially in the case of Parboiled2.
- β , β βββββββββββββββββββββββββββββββββββββββΌββββββββββββΌβββββββββββββββββββββββββββββββββ Parboiled1JsonParser β 85.64 β ββββββββββββββββββββββββββββββ Parboiled2JsonParser β 13.17 β ββββ Json4SNative β 8.06 β βββ Argonaut β 7.01 β ββ Json4SJackson β 4.09 β β
Parboiled vs regular expressions
Thanks to the use of static optimizations, Parboiled2 is able to work much faster than regular expressions (at least those that come bundled with the Java class library). Here are some confirmations from
the mailing list :
- β , β βββββββββββββββββββββββββββββββββββββββΌββββββββββββΌβββββββββββββββββββββββββββββββββββ Parboiled2 (warmup) β 1621.21 β ββββββββββββββββββββββββββββββββ Parboiled2 β 409.16 β ββββββββ Parboiled2 w/ better types (warmup) β 488.92 β ββββββββββ Parboiled2 w/ better types β 134.68 β βββ Regex (warmup) β 621.95 β ββββββββββββ Regex β 620.38 β ββββββββββββ
Parboiled vs. Scala Parser Combinators
In the mailing list, you can find
another performance test , which is in good agreement with the first (about JSON) and contains data for comparison with Scala Parser Combinators. Everything is very, very sad.
- β , β βββββββββββββββββββββββββββββββββββββββΌββββββββββββΌβββββββββββββββββββββββββββββββββ Parboiled1JsonParser | 73.81 | β Parboiled2JsonParser | 10.49 | β ParserCombinators | 2385.78 | ββββββββββββββββββββββββββββββ
What parboiled can't
Most articles about parser combinators start with exhausting explanations of what PEG is, what it is and why it should be feared. In order to parse configs, a thorough understanding of this is not necessary, but you should still be aware of the limitations of this type of grammar. So, Parboiled basically does not know how:
- Parse left-recursive grammar. It is beyond the power of all downstream parsers (top-down parsers), which include PEG. However, the left-recursive grammar can be adapted .
- Disassemble grammars in indentation (indentation-based grammars), for example Python or YAML. It is impossible to do this due to the fact that the generated parser is single pass, without a separate lexer. Indentation is performed at the stage of lexical analysis. This problem has a simple solution: write a preprocessor that
INDENT
virtual markers before ( INDENT
) and after ( DEDENT
) indenting. In Parboiled1, there are standard tools for this, but for Parboiled2, a similar procedure so far will have to be done independently. - Use streaming input. PEGs use a return search, also known as backtracking . Theoretically, this disadvantage can be eliminated by stream buffering, but nothing prevents you from writing a grammar in which you return to the very beginning. Therefore, in order for this idea to work in practice, it is necessary to learn how to determine, by grammar, the boundaries of chunks, between which return is impossible. Matthias is very interested in the development of this feature, so its appearance is possible in the next releases.
In the next part, Iβll tell you how Parboiled describes a custom grammar, and we will write a simple recognizer for the tree-like format of the configuration files.