📜 ⬆️ ⬇️

Scala: parser combinators on the example of the formula parser

From time to time I have a desire to invent my own little programming language and write an interpreter. This time I started writing on scala, I learned about the parser combinators library, and was amazed: it turns out, you can write parsers easily and simply. In order not to turn the article into an "owl drawing" manual, below is the implementation of parsing and calculating expressions like "1 + 2 * sin(pi / 2)" .


The parsing and calculation of the expression itself occupy only 44 non-empty lines - not that I strongly wanted to reduce their number, but it looks really simple and concise. Project on github .


For comparison:



So, if you can not wait to see the result:


Responsible for parsing a piece of code
 object FormulaParser extends RegexParsers with PackratParsers { def id: Parser[Id] = "[a-zA-Z][a-zA-Z0-9_]*".r ^^ Id def number: Parser[Number] = "-" ~> number ^^ (n => Number(-n.value)) | ("[0-9]+\\.[0-9]*".r | "[0-9]+".r) ^^ (s => Number(s.toDouble)) def funcCall: Parser[FuncCall] = id ~ ("(" ~> expression <~ ")") ^^ {case id ~ exp => FuncCall(id, exp)} def value: Parser[Expression] = number | funcCall | id | ("(" ~> expression <~ ")") lazy val term: PackratParser[Expression] = term ~ ("*" | "/") ~ value ^^ binOperation | value lazy val expression: PackratParser[Expression] = expression ~ ("+" | "-") ~ term ^^ binOperation | term ... } 

Look at the next line:


 def value: Parser[Expression] = number | funcCall | id | ("(" ~> expression <~ ")") 

It is suspiciously similar to a grammar description, but it is a valid code in which the development environment can immediately detect and highlight most errors.


This is possible for the following reasons:


  1. In scala, it is allowed to give methods remarkable names like "~", "~>", "<~", "|", "^^" . The combination of the parsers p and q written as p~q , and the ability to choose one of them: p|q . Reads much better than p.andThen(q) or p.or(q)
  2. Thanks to implicit conversions (implicits), the string "abc" and the regular expression "[0-9]+".r if necessary, turn into parsers.
  3. The language has a powerful static type system that allows you to catch errors right away.

I think I managed to interest you, so then everything will be in order.



Table of contents:


  1. Pegex parsers
  2. Packrat parsers
  3. whole parser code
  4. expression evaluation
  5. conclusion

Parser Combinators.


Once these classes were included in the standard library of the language, but then they were taken out in a separate library. At the end I gave links where you can find more detailed information.


Regex parsers


So, the simplest is RegexParsers. Add implicit conversions from strings and regular expressions to parsers.


 object SimpleExample extends RegexParsers { def boolTrue: Parser[Boolean] = "true" ^^ (_ => true) //    "true",   ,       boolean def bool: Parser[Boolean] = ("true" | "false") ^^ (_ == "true") //         def alternativeBool: Parser[Boolean] = "true" ^^ (_ => true) | "false" ^^ (_ => false) //       def int: Parser[Int] = "[0-9]+".r ^^ (_.toInt) //       . //  .r      def id: Parser[Id] = "[a-zA-Z][a-zA-Z0-9_]*".r ^^ Id // Id - ,       Id } 

By the way, the ~ icon indicates not only the method of the parser, but also the name of the case class that stores a pair of values. A piece of code from Parsers.scala :


 case class ~[+a, +b](_1: a, _2: b) { override def toString = "("+ _1 +"~"+ _2 +")" } 

Suppose we want to build one of several parsers:


 def intInBrackets: Parser[Int] = "(" ~ int ~ ")" ^^ (p => p._1._2) 

What will happen?


  1. "(" implicitly converts a string into a parser, which returns a String
  2. int parser returns Int
  3. "(" ~ int creates a parser for ~[String, Int]
  4. "(" ~ int ~ ")" creates a parser that returns ~[~[String, Int], String]
  5. the parser will call the ^^ method
  6. A function is passed to the method, which takes an argument p type ~[~[String, Int], String] and returns an Int

In this case, the brackets do not carry any useful information. You can do this:


 def intInBrackets: Parser[Int] = "(" ~> int <~ ")" 

This time the brackets will be dropped.



Expressions with the operator <~ are advised to be enclosed in parentheses, since its priority is not very high.
def


 def funcCall: Parser[FuncCall] = id ~ ("(" ~> expression <~ ")") ^^ (pair => FuncCall(pair._1, pair._2)) 

Now it should be clear what the following code does:


 def number: Parser[Number] = "-" ~> number ^^ (n => Number(-n.value)) | ("[0-9]+\\.[0-9]*".r | "[0-9]+".r) ^^ (s => Number(s.toDouble)) // s.toDouble    . def value: Parser[Expression] = number | funcCall | id | ("(" ~> expression <~ ")") private def binOperation(p: Expression ~ String ~ Expression) = p match { case e1 ~ op ~ e2 => BinOperation(e1, BinOperator(op), e2) } 

I was a little lazy and turn the string into a number by standard methods. Time must be saved)


Since our parser description is code, ambiguous grammars still work. In the example of parsing number | funcCall | id number | funcCall | id number | funcCall | id we are trying to parse the number , in case of failure - a function call, etc. The order may be important, for example (id | funcCall) when trying to parse "sin (x)" joyfully parse Id("sin") , and the funcCall parser funcCall not be called. To work correctly, it is better to write (funcCall | id) .


Packrat parsers


Suppose we want to parse a sequence of ones:


 object Ones extends RegexParsers { def ones: Parser[Any] = ones ~ "1" | "1" } 

Parsing ones starts with the fact that we call parsing ones , which again ...



Attempting to split ones will cause the stack to overflow.


In this case, you can change the description so that each time "absorbed" something. For example:


 def ones: Parser[Any] = "1" ~ ones | "1" 

But grammar is not always easy to rewrite. Type 3-2-1 expressions should be recognized exactly as (3-2)-1 , option 3-(2-1) will not work. There will be a similar problem with division. How to do this without complicating grammar?


Packrat - parsers will save us. Their idea is that the parser can store "for itself" some information about calls. For example, to save the result of the work and not to parse the same thing twice ... or to work correctly in cases of recursion.


 object Ones extends RegexParsers with PackratParsers{ lazy val ones: PackratParser[Any] = ones ~ "1" | "1" } 

Trait PackratParsers contains an implicit conversion of strings and other things into parsers of the "desired" type.


PackratParser is best created only once and stored in a variable. In addition, if the p parser uses q , and q uses p , then lazy initialization should be used.


  lazy val term: PackratParser[Expression] = term ~ ("*" | "/") ~ value ^^ binOperation | value lazy val expression: PackratParser[Expression] = expression ~ ("+" | "-") ~ term ^^ binOperation | term 

I think it is now clear how you can easily and naturally parse 3-2-1 as (3-2) -1.


Perhaps you have a question: where does the parser store information? If it is stored directly inside PackratParser, then calling the parser for other input may give incorrect results. So, the necessary information is stored together with the "input" data of the parser. You can look at the library code and see this:


 class PackratReader[+T](underlying: Reader[T]) extends Reader[T] { outer => private[PackratParsers] val cache = mutable.HashMap.empty[(Parser[_], Position), MemoEntry[_]] ... } 

Therefore, the parser accepts not a string as an input, but new PackratReader(new CharSequenceReader(string))


 def apply(code: String): Either[LexerError, Expression] = parse(expression, new PackratReader(new CharSequenceReader(code))) match { case Success(result, next) => Right(result) case NoSuccess(msg, next) => Left(LexerError(msg)) } 

What is the coolest - using packrat parsers does not oblige to anything, they can be combined with ordinary parsers and vice versa.


Parser ready


Complete code:


 object FormulaParser extends RegexParsers with PackratParsers { def id: Parser[Id] = "[a-zA-Z][a-zA-Z0-9_]*".r ^^ Id def number: Parser[Number] = "-" ~> number ^^ (n => Number(-n.value)) | ("[0-9]+\\.[0-9]*".r | "[0-9]+".r) ^^ (s => Number(s.toDouble)) def funcCall: Parser[FuncCall] = id ~ ("(" ~> expression <~ ")") ^^ {case id ~ exp => FuncCall(id, exp)} def value: Parser[Expression] = number | funcCall | id | ("(" ~> expression <~ ")") lazy val term: PackratParser[Expression] = term ~ ("*" | "/") ~ value ^^ binOperation | value lazy val expression: PackratParser[Expression] = expression ~ ("+" | "-") ~ term ^^ binOperation | term private def binOperation(p: Expression ~ String ~ Expression) = p match { case e1 ~ op ~ e2 => BinOperation(e1, BinOperator(op), e2) } def apply(code: String): Either[ParserError, Expression] = parse(expression, new PackratReader(new CharSequenceReader(code))) match { case Success(result, next) => Right(result) case NoSuccess(msg, next) => Left(ParserError(msg)) } case class ParserError(msg: String) } sealed trait Expression case class BinOperator(operator: String) case class Number(value: Double) extends Expression case class Id(name: String) extends Expression case class BinOperation(left: Expression, op: BinOperator, right: Expression) extends Expression case class FuncCall(funcName: Id, argument: Expression) extends Expression 

The result of parsing is either a tree or an error message.


case classes are simply wrapper classes over values, they all implement the Expression interface. the word sealed means that the classes implementing this interface should be contained in the same file. This makes it safe to say that Expression can be one of four types.


Expression evaluation


The code that evaluates expressions is also simple. I assume that the input is valid expressions.


 object Evaluator { def apply(expression: Expression, variables: (String) => Double = Map.empty, functions: (String) => (Double) => Double = Map.empty): Double = { def eval(exp: Expression) = this (exp, variables, functions) expression match { case Number(value) => value case Id(name) => variables(name) case BinOperation(left, op, right) => operator2func(op)(eval(left), eval(right)) case FuncCall(funcId, expr) => functions(funcId.name)(eval(expr)) } } def operator2func(binOperator: BinOperator): (Double, Double) => Double = binOperator.operator match { case "+" => (a, b) => a + b case "-" => (a, b) => a - b case "*" => (a, b) => a * b case "/" => (a, b) => a / b } } 

Scissor tiles - you can declare the eval function inside the apply function to increase readability of the code. The second thing is that we default the Map.empty as an argument by default. It is empty, so it can be of any type, it is immutable, so it will remain empty, and in fact we will get references to the same object - singlelton. Map.empty has a apply(a: In):Out method apply(a: In):Out - we can consider it as a function.


Almost all


Parsing and evaluating expressions are ready. Calculate the resulting lines of code (non-empty):


  1. Parser: 18 lines
  2. case classes for describing AST: 6
  3. expression evaluation: 20 lines.

And that's all - and the code is easy to read, easy to change, and it practically does not contain anything extra. Beauty!


Does it work?


It is worth thinking about this at the stage of writing the parser, but the checking code does not affect anything, because it is shown only now. (Of course, you can write testiki ... but this article is about writing parsers, not tests, so I made it as simple as possible)


The code that checks the work
 object Main extends App { def eval(code: String, variables: (String) => Double = Map.empty, functions: (String) => (Double) => Double = Map.empty) = { val parsed = FormulaParser(code) parsed.left.foreach(error => println(s"\'$code\' parsing error: $error")) parsed.right.map(expr => Evaluator(expr, variables, functions)).foreach(d => println(s"\'$code\' = $d")) } eval("1") eval("0.1") eval("1.") eval(" 1 ") eval("-0.1") eval("1+2") eval("2-1") eval("2*3") eval("4/2") val vars = Map( "pi" -> math.Pi, "e" -> math.E) val funcs: (String) => (Double) => Double = Map( "sin" -> math.sin, "cos" -> math.cos, "inc" -> { d: Double => d + 1 } ) eval("pi", vars) eval("inc(e)", vars, funcs) eval("2+2*2") eval("1+2*(3+4*5)") eval("8/2/2") eval("8-1-2") eval("1. + 2.0 * sin(pi / 2)", vars, funcs) } 

Conclusion


For serious purposes, there are parser generators and other things.


But if you want to parse something relatively simple, experiment and get instant feedback, you can use the approach described above. There is almost no information in Russian, and there is not much in English either. I hope the article will help someone.


Useful links:


  1. library on github
  2. DSL parsing example
  3. "Programming in scala", chapter "parser combinators"

I put the above code on github


How to start

The sbt build system is used. It is enough to install it, go to the folder with the project and type in the console "sbt run"


PS I still have a goal to add my interpreter of a lua-like language with chess and poetess. I seem to have thought through the syntax of the language, then in the process of writing the parser I came across contradictions - I had to abandon a couple of poetess and replace chess with checkers, but the result still turns out to be quite interesting.


')

Source: https://habr.com/ru/post/325446/


All Articles