📜 ⬆️ ⬇️

Inheritance of combinatorial parsers on Julia

Combinatorial (monadic) parsers are quite well known ( wikibooks ). They are a library of small parsers that recognize simple grammar elements, and ways to combine several parsers into one (combine - from here and the name). They are monadic because one of the ways of combining, generating a parser of the rest of the text based on the result of parsing the beginning satisfies the conditions imposed on the mathematical object “monad”. In Haskell, this allows you to take advantage of the powerful service provided by the language and libraries. In other languages, the name “monadic” can be safely ignored - this will not interfere with their implementation and use, including the operation “bind” mentioned above.

The most simple combinatorial parsers are implemented in languages ​​with support for closures, but you can also use the classic OOP (an example is described by Rebecca Parsons in Martin Fowler's book "Subject-oriented languages").
The advantages of combinatorial parsers include ease of use (writing in a programming language is almost the same as the usual grammar description), preprocessor independence (like yacc / bison, happy or ocamlyacc ), the ability to implement some elements that do not fit into context-free grammar, right on general purpose programming language.

The disadvantages are the difficulty of compiling error messages, the inability to work with left-recursive grammar (leads to looping), as well as the fact that it is very easy to make this parser not efficient in speed and memory. (One of the reasons is that the compiler cannot optimize in terms of grammar, as it works at the level of a programming language. But there are other subtleties that need attention if efficiency is required.)
As an alternative, you can consider the implementation in the form of macros (for example, OCaml streams parsers ). In Perl6, grammar support is built into the language.
')

Inheritance

A percer of a specific language consists of many more specialized parsers that reference each other. In this regard, parsers resemble the methods of a certain object. There is a desire to generate parsers for new versions of languages, replacing individual sub-parsers (as is done in the design pattern “template method” from OOP). For experiments with this approach (as well as in the order of learning the next language), I chose the Julia language - dynamically-typed with a special approach to inheritance (similar to CLOS from Common Lisp and R).
Unlike conventional combinatorial parsers, the inheritance approach is experimental (although it is supported in some form by the OCaml macro library and Perl6). While it generates not very readable code. Source code is available on Github .

Implementation


Regular combinatorial parsers are a function that receives a string (or another immutable stream) and returns a container (array) of pairs — the result of parsing, the remainder of the unparsed text. If unsuccessful recognition returns an empty container. It is important that the input stream can be reused - using file descriptors will require special wrappers.

To control inheritance, a parameter has been added to all these functions - the grammar to which the parser refers. Julia allows for multiple dispatching (the choice of a specific method according to the types of several arguments - as opposed to method overloading in C ++ and Java dispatching occurs dynamically), but I use the type of only the first argument, as in normal OOP.

Inheritance tree in Julia can only be built on abstract types, but the leaves can be specific types, instances of which can be created.

abstract Grammar type Gmin <: Grammar end geof(g::Grammar, s) = if length(s) == 0 [((),s)] else empty end 

This describes the abstract type Grammar, a concrete borderless entry that is a subtype of Grammar, and a parser that recognizes the end of a line. If we want to transfer a text representation, other than lines, to the two simplest persers (geof and getTocken, extracting one character from the stream), we can use multiple dispatching and specialize the second argument s - the other parsers will work through them without knowing the flow representation. 'empty' is an array of type [Any].

Julia supports the redefinition of many operators (except those requiring special support in a language such as '&', which may not compute the second argument), and I used this:

  >(g::Grammar, p1, p2) = (g,s) -> reduce((a,r) -> let local v local s v,s = r [a, p2(g,s)] end, empty, p1(g,s)) 

The (combinator) '>' method concatenates two parsers (if the first is applied successfully, the second is applied to the remainder of the string), and the result of the first parser is ignored, and the result of the combination is the result of the second. Similarly defined methods are '<', '-' (concatenation of several parsers, the results of all are combined into an array), '|' (alternative - recognizes a string recognized by any of the parsers). There is also a '+' combinator - a limited alternative that ignores parsers after the first successfully applied. It does not allow organizing a return to an earlier partner if an error occurred later. This is important for efficiency, especially in lazy languages, such as Haskell, where the possibility of such a return leads to the accumulation in memory of a large number of uncalculated values, but it is also useful in energetic ones.

Check how it works:

 julia> include("Parsers.jl") julia> g = Gmin() Gmin() julia> (-(g,getTocken, getTocken, getTocken))(g,"12345") 1-element Array{Any,1}: (Char['1','2','3'],"45") julia> (|(g,getTocken, getTocken, getTocken))(g,"12345") 3-element Array{(Char,ASCIIString),1}: ('1',"2345") ('1',"2345") ('1',"2345") julia> (+(g,getTocken, getTocken, getTocken))(g,"12345") 1-element Array{(Char,ASCIIString),1}: ('1',"2345") 

There is a slight inconvenience - the need to add a grammar object everywhere (in this case of the Gmin type). I went for it for the sake of the possibility of inheritance - classical combinatorial parsers are written easier.

I will also note the useful functions of 'lift', which allows to “raise” the function in the parser and 'gfilter', which checks the value returned by the parser.

  lift(g::Grammar, f, p) = (g,s) -> map((r) -> (f(r[1]),r[2]), p(g,s)) julia> lift(g,int,getTocken)(g,"12") 1-element Array{(Int64,ASCIIString),1}: (49,"2") julia> (|(g,gfilter(g,(c) -> c=='+', getTocken),gfilter(g,(c) -> c=='*', getTocken)))(g,"*123") 1-element Array{(Char,ASCIIString),1}: ('*',"123") 


Parsers can be recursive:

 cut(g::Grammar,n) = if(n==0); mreturn(g,[]) else (-(g,getTocken,cut(g,n-1))) end julia> cut(g,4)(g,"12345678") 1-element Array{Any,1}: (Any['1','2','3','4'],"5678") 


Bit monadic


 mreturn(g::Grammar, v) = (g,s) -> [(v,s)] mbind(g::Grammar, p, fp) = (g,s) -> reduce((a,v)->[a,fp(g,v[1])(g,v[2])], empty, p(g,s)) 

In Haskell, these functions are called 'return' (this is a function, not a language operator) and '>> =' (often pronounced as 'bind').
What do these strange functions do?

'mreturn' is nothing - it just completes successfully, without either reading from the incoming line and returning a predetermined value. This is not only a deep mathematical sense, it often replaces complex logic, which otherwise would have to be applied with the help of 'lift'.

'mbind' is more complicated. It takes two arguments, a parser, and a function that is applied to the result of the first parser, and returns a parser that will be applied as follows:

 julia> mbind(g, lift(g, (c) -> c-'0', getTocken), cut)(g,"23456") 1-element Array{Any,1}: (Any['3','4'],"56") 

By the way, this technique is convenient to use when parsing binary formats, where it is often found that the length of the record is stored just before the record itself.

Arithmetic


 abstract Arith <: Grammar type ArithExpr <: Arith end 

For the arithmetic parser, the abstract subtype Grammar and its concrete implementation are declared.
It defines the gnumber function (g :: Arith, base), which creates a “greedy” number parser in a given number system and a 'snumber' parser that parses a number in the decimal number system.

“Greed” is expressed in the fact that if the parser found one digit, it will not stop until it reads everything. This allows you not to try to apply the following parsers to the under-parsed number, which significantly affects performance.

Now we define addition and multiplication.

 amul(g::Arith,s) = lift(g, (x) -> x[1]*x[2], -(g, snumber, +(g, >(g, gfilter(g, (c) -> c == '*', getTocken), amul), mreturn(g,1))))(g,s) asum(g::Arith,s) = lift(g, (x) -> x[1]+x[2], -(g, amul, +(g, >(g, gfilter(g, (c) -> c == '+', getTocken), asum), mreturn(g,0))))(g,s) 

Here it is worth paying attention that the rule is used (in BNF)

 amul = snumber ('*' amul | '') 

not simpler
 amul = snumber '*' amul | snumber 

The fact is that combinatorial parsers do not have the ability to optimize the grammar, like parser generators, and using the second rule will result in the number being parsed several times.

We try how it works:
 julia> a=ArithExpr() ArithExpr() julia> asum(a,"12+34*56") 1-element Array{Any,1}: (1916,"") julia> 12+34*56 1916 


And now inheritance!


Some languages ​​allow you to specify numbers in an arbitrary number system. For example, in Erlang, the number system can be specified before the number explicitly with the help of the '#' sign. Create a new “arithmetic” by redefining the snumber in it, which would write numbers like in Erlang.

A new snumber always starts with an old one. In order to address the redefined parser, we need the function 'cast', which allows you to take the parser from a specific grammar, bypassing inheritance.
 cast(g::Grammar,p) = (_,s) -> p(g,s) 


The implementation itself uses already familiar techniques:
  abstract GArith <: Arith type GExpr <: GArith end snumber(g::GArith, s) = mbind(g, cast(ArithExpr(),snumber), (g,v) -> (+(g, (>(g, gfilter(g, (c) -> c == '#', getTocken), gnumber(g,v))), mreturn(g,v))))(g,s) 


Check how it works:
 julia> c = GExpr() GExpr() julia> asum(a,"12+34*13#12") 1-element Array{Any,1}: (454,"#12") julia> asum(c,"12+34*13#12") 1-element Array{Any,1}: (522,"") 


A little about the language


Julia was clearly created under the influence of the R language, and is trying to correct some of his flaws. The language was developed with a bias in the direction of performance (which sometimes brings in R) - for this LLVM is involved. In comparison with R, Julia has a more convenient closure syntax, a more developed type system (in particular, there are tuples / tuples). But the rejection of curly braces in favor of the 'end' keyword seems to me to make the usual syntax more cumbersome.
Just as in R, indices start at one. In my opinion, this is an unfortunate decision, reflecting nematematists' fear of zero, but many like it.

Of course, so rich and well-organized libraries, as in R, in Julia yet.

An important application for Julia is likely to be, as for R, data analysis. And to extract this data, you will have to read various text formats. I hope my library will someday be useful for this.

Source: https://habr.com/ru/post/241632/


All Articles