
An LL(*) parser using Rust macros

Wow. Such Rust. Much macro. (picture © Servo's Twitter account)


The Rust language is rapidly gaining momentum. Some prophesy that it will replace C/C++; others simply like it. I belong, rather, to the second group. Its developers are trying to make it convenient and safe, and it has constructs and principles that will not appear in C++ any time soon, due to the inertia of the committee and many other reasons. That is why I prefer to use Rust for all my personal projects.


It so happens that I write compilers, with varying success. I haven't really managed to finish a single one, but the process itself interests me more than the result.


Once, when I got stuck on yet another syntax analyzer (aka parser), I noticed that I was writing a lot of code of the same shape, and that this boilerplate maps one-to-one onto the grammar in Backus-Naur form (BNF).


After thinking it over a bit, I decided that I needed a code generator driven by the grammar, and Rust macros are perfectly suited to the task.


This article describes the implementation of an LL(*) parser using macros, and implements a parser for simple mathematical expressions along the way.


As a result, a parser for the BNF grammar:


expr ::= sum
sum  ::= mul "+" sum | mul "-" sum | mul
mul  ::= atom "*" mul | atom "/" mul | atom
atom ::= "(" expr ")" | number | neg
neg  ::= "-" atom

can be generated with the following series of macro calls:


rule!(expr, sum);

rule!(sum, or![
    and![(mul, token('+'), sum) => make_operator],
    and![(mul, token('-'), sum) => make_operator],
    mul
]);

rule!(mul, or![
    and![(atom, token('*'), mul) => make_operator],
    and![(atom, token('/'), mul) => make_operator],
    atom
]);

rule!(atom, or![
    and![(token('('), expr, token(')')) => |_lbrace, stat, _rbrace| Some(stat)],
    num,
    neg
]);

rule!(neg, and![(token('-'), atom) => |_, number| Some(Box::new(expression::Expression::Negate(number)))]);

In this article I intentionally simplify some parts of the implementation and use unstable features of nightly Rust. This, I hope, will simplify understanding and improve readability.


So, as already mentioned, we will generate an LL(*) parser, capable of analyzing the family of grammars of the same name. If you don't feel like reading up on what makes this subset of parsers special, the short version is: they are easier to write by hand, but they cannot parse left-recursive grammars (which we don't need here anyway).
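To see the difference, compare a left-recursive formulation of the sum rule with the right-recursive one our grammar actually uses. With left recursion, the generated sum function would call itself before consuming a single token and recurse forever:

sum ::= sum "+" mul | mul    (left-recursive: unusable for an LL parser)
sum ::= mul "+" sum | mul    (right-recursive: what our grammar uses)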


To test our parser we will use the grammar above. It parses arithmetic expressions, it is an LL(*) grammar, and it is quite enough for our tests.


Let's start with the lexical analyzer (aka "lexer").


Lexical analyzer


For simplicity, our lexer will emit one character at a time, skipping spaces. We will not introduce a separate token type; we will work directly with characters. For the same reason, our numbers can only be single-digit.


For an LL(*) grammar we need a lexical analyzer that can roll back by an arbitrary number of characters. The lexer takes a string as input and yields characters one by one. In addition, it has functions for obtaining the current position and rolling back to a given position.


The lexer could work with the string lazily, but for simplicity we simply convert the entire string into a vector of characters:


#[derive(Debug)]
pub struct Lexer {
    /// Input string
    input: Vec<char>,
    /// Lexer position
    position: usize
}

type Position = usize;

impl Lexer {
    pub fn new(input: &str) -> Self {
        // Compiler bug. Can't drain string inplace
        let mut string = String::from(input);
        let vec = string.drain(..).collect();
        Lexer { input: vec, position: 0 }
    }

    pub fn position(&self) -> Position {
        self.position
    }

    pub fn next(&mut self) -> Option<char> {
        while let Some(result) = self.input.get(self.position).cloned() {
            self.position += 1;
            if result != ' ' {
                return Some(result)
            }
        }
        None
    }

    pub fn rollback(&mut self, position: Position) {
        self.position = position
    }
}

Regarding the bug comment

When writing the lexer I wanted to convert the string into a vector in place, but the compiler reported an error saying the string goes out of scope. The thing is, collect consumes the iterator, so there should be no problem here.


I went to the #rust-beginners IRC channel, and one of the language developers told me that this is a bug. So if you run into difficulties, go to the channel and ask away. The people on the Rust IRC channels are very friendly and will always try to help you.
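Before moving on, a quick sanity check of how this lexer behaves: spaces are skipped, and rollback restores a saved position:

let mut lexer = Lexer::new("1 + 2");
assert_eq!(lexer.next(), Some('1'));   // the space before '+' is skipped
let saved = lexer.position();
assert_eq!(lexer.next(), Some('+'));
lexer.rollback(saved);                 // back to just before the '+'
assert_eq!(lexer.next(), Some('+'));
assert_eq!(lexer.next(), Some('2'));
assert_eq!(lexer.next(), None);        // end of input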


The scenario for working with the lexer is as follows:


  1. Remember the lexer position;
  2. Read characters and check them against the rule;
  3. If a character is not accepted by the rule, roll back to the saved position and return an error (a minimal sketch follows).
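For illustration, here is a minimal hand-written rule that follows this scenario (the rule itself, accepting a single '+', is invented for the example):

fn expect_plus(lexer: &mut Lexer) -> Option<char> {
    let saved = lexer.position();   // 1. remember the position
    match lexer.next() {            // 2. read a character and check it against the rule
        Some('+') => Some('+'),
        _ => {                      // 3. not accepted: roll back and report failure
            lexer.rollback(saved);
            None
        }
    }
}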

Time to implement the expression type of our AST.


Expressions


For expressions, I made a few simplifications:


  1. I made the expression type an enum, which I strongly advise against doing in a real compiler. With any serious grammar, maintaining a single unified expression type becomes a very painful process; it is better to use traits and separate implementations;
  2. For numbers I used the f32 type. Floating-point numbers are not always the best choice, but f32 is enough for our purposes;
  3. I used the unstable feature #![feature(box_patterns)]. With this syntax, pattern matching looks prettier.

I also added an eval function for evaluating expressions.


We will support numbers, arithmetic operators, and negation:


#[derive(Debug)]
pub enum Expression {
    Operator { op: Box<Expression>, left: Box<Expression>, right: Box<Expression> },
    Number(f32),
    Token(char),
    Negate(Box<Expression>)
}

impl Expression {
    pub fn eval(self) -> f32 {
        match self {
            Expression::Operator { op: box Expression::Token('+'), left, right } => left.eval() + right.eval(),
            Expression::Operator { op: box Expression::Token('-'), left, right } => left.eval() - right.eval(),
            Expression::Operator { op: box Expression::Token('/'), left, right } => left.eval() / right.eval(),
            Expression::Operator { op: box Expression::Token('*'), left, right } => left.eval() * right.eval(),
            Expression::Number(val) => val,
            Expression::Negate(exp) => -exp.eval(),
            token => panic!("Got token inside an expression {:?}", token)
        }
    }
}
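As a quick check, here is eval applied to a hand-built tree for "1 + 2":

let tree = Expression::Operator {
    op: Box::new(Expression::Token('+')),
    left: Box::new(Expression::Number(1.0)),
    right: Box::new(Expression::Number(2.0))
};
assert_eq!(tree.eval(), 3.0);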

So, we have a lexical analyzer that gives us tokens, and we have an expression type. It's time to get to the parser, which will turn sequences of characters into expressions.


Syntax analyzer


Implementing the parser is our main task.


All parsing functions will accept the lexer and return Option<Box<expression::Expression>>: Some(expression) if the lexer output matches the rule, and None otherwise.


First, let's get the auxiliary functions out of the way so that they don't distract us later. I will hide their implementations under spoilers; the full code can be viewed via the link to the repository.


Two functions are used to parse a number and to match terminals (the symbols ()+-*/):


Number and terminal parsing functions
fn num(lexer: &mut lexer::Lexer) -> Option<Box<expression::Expression>> {
    let parser_pos = lexer.position();
    let result = lexer
        .next()
        .map(|c| c as char)
        .and_then(|c| {
            if c.is_numeric() {
                Some(Box::new(expression::Expression::Number(c.to_string().parse::<f32>().unwrap())))
            } else {
                lexer.rollback(parser_pos);
                None
            }});
    result
}

fn token(token_char: char) -> impl FnMut(&mut lexer::Lexer) -> Option<Box<expression::Expression>> {
    move |ref mut lexer| {
        let parser_pos = lexer.position();
        lexer
            .next()
            .map(|c| c as char)
            .and_then(|c| if c == token_char {
                Some(Box::new(expression::Expression::Token(c)))
            } else {
                lexer.rollback(parser_pos);
                None
            })
    }
}
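Note that token is a higher-order function: token('+') returns a closure with the same signature as any other rule, so the two compose freely:

let mut lexer = lexer::Lexer::new("+ 1");
let mut plus = token('+');
assert!(plus(&mut lexer).is_some());   // consumes '+'
assert!(num(&mut lexer).is_some());    // consumes '1'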

And one more function that creates an expression for an arithmetic operation. It is used by several rules, so it makes sense to factor it out:


The function of creating an arithmetic operation
fn make_operator(left: Box<expression::Expression>,
                 op: Box<expression::Expression>,
                 right: Box<expression::Expression>)
                 -> Option<Box<expression::Expression>> {
    Some(Box::new(expression::Expression::Operator{ op, left, right }))
}

There is also the debug_parser! macro. It is used to debug the parser (thanks, Captain Obvious).
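Its real implementation lives in the repository; here is a minimal sketch of what such a macro might look like (the "debug_parser" feature flag is my assumption, not necessarily the author's setup):

// Hypothetical sketch: forwards to println! when an (assumed) "debug_parser"
// feature is enabled, and compiles to nothing otherwise.
#[cfg(feature = "debug_parser")]
macro_rules! debug_parser {
    ($($args:tt)*) => { println!($($args)*) };
}

#[cfg(not(feature = "debug_parser"))]
macro_rules! debug_parser {
    ($($args:tt)*) => {};
}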


We will define three macros:


  1. rule! to generate a rule function;
  2. or! to generate the alternation ("|") function;
  3. and! to generate the sequence (",") function.

rule!


Let's start with rule!. The macro generates a function with the given name that matches the signature described above, so it can in turn be used inside another rule or function.


The macro is quite simple: it creates a function that, when called with a lexer, returns the result of calling the function passed to the macro (this sounds more complicated than it actually is).


macro_rules! rule {
    ($name: ident, $parse_func:expr) => {
        fn $name(lexer: &mut lexer::Lexer) -> Option<Box<expression::Expression>> {
            debug_parser!("Executing rule {}", stringify!($name));
            $parse_func(lexer)
        }
    };
}
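For example, rule!(expr, sum); from our grammar expands into roughly the following function:

fn expr(lexer: &mut lexer::Lexer) -> Option<Box<expression::Expression>> {
    debug_parser!("Executing rule {}", stringify!(expr));
    sum(lexer)
}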

or!


The next macro is or!. It accepts a list of rules and returns an anonymous function (a lambda, also known as a closure) which, when called with the lexer, tries the passed rules one by one and returns the first successful result, if any. Otherwise it returns None. The signature of the returned closure is the same as that of a rule.


If you are not familiar with macros in Rust, pay attention to how the list of rules unfolds in the macro body. For each passed rule, the $( ... )+ expression is expanded once; in our case it is a block with a function call and a result check. As a result, each passed rule is called exactly once.


Note that the closure remembers the lexer position before calling each rule and rolls back to it if the rule fails:


macro_rules! or {
    [$($parse_funcs: expr),+] => {
        |lexer: &mut lexer::Lexer| -> Option<Box<expression::Expression>> {
            $(
                let parser_pos = lexer.position();
                let result = $parse_funcs(lexer);
                if result.is_some() {
                    debug_parser!("Or statement rule {} accepted expression {:?}. Lexer state {:?}",
                                  stringify!($parse_funcs), result, lexer);
                    return result
                } else {
                    lexer.rollback(parser_pos);
                    debug_parser!("Or statement rule {} didn't accept lexer input {:?}",
                                  stringify!($parse_funcs), lexer);
                }
            )+;
            debug_parser!("Or statement fails");
            None
        }
    }
}
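To make the expansion concrete, or![num, neg] turns into roughly this closure (debug_parser! calls omitted):

|lexer: &mut lexer::Lexer| -> Option<Box<expression::Expression>> {
    let parser_pos = lexer.position();
    let result = num(lexer);
    if result.is_some() { return result } else { lexer.rollback(parser_pos); }

    let parser_pos = lexer.position();
    let result = neg(lexer);
    if result.is_some() { return result } else { lexer.rollback(parser_pos); }

    None
}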

and!


And finally, the and! macro. Its signature differs slightly from or!: it takes a list of rules and a handler function. The macro returns a closure that calls the passed rules on the lexer and checks that every one of them returns an expression. If all the rules succeed, it packs the results into a tuple and passes it to the handler function. If at least one rule fails, or the handler function returns None, the lexer is rolled back to its initial position. The signature of the closure is, by tradition, the same as that of a rule.


The handler function exists for convenience: it processes the sequence of parsed expressions and converts it into a form suitable for further processing. As an example, look at the parenthesized rule, which discards the parentheses and returns the expression between them (the parentheses are only needed to parse the order of operations correctly).


The handler function is passed after the => operator to improve the readability of the macro call.


macro_rules! and {
    [($($parse_funcs: expr),+) => $handler_func: expr] => {
        |lexer: &mut lexer::Lexer| -> Option<Box<expression::Expression>> {
            let parser_pos = lexer.position();
            let results = ($(match $parse_funcs(lexer) {
                Some(expression) => {
                    debug_parser!("And statement rule {} accepted expression {:?}. Lexer state {:?}",
                                  stringify!($parse_funcs), expression, lexer);
                    expression
                }
                _ => {
                    debug_parser!("And statement rule {} didn't accept lexer input {:?}",
                                  stringify!($parse_funcs), lexer);
                    lexer.rollback(parser_pos);
                    return None
                }
            }),+);
            match std::ops::Fn::call(&$handler_func, results) {
                expression @ Some(_) => {
                    debug_parser!("And handling function successfully handled expression and returned {:?}",
                                  expression);
                    expression
                }
                _ => {
                    debug_parser!("And handling function failed to process expressions");
                    lexer.rollback(parser_pos);
                    return None
                }
            }
        }
    };
}

Here it is worth pausing on the std::ops::Fn::call invocation. It is an unstable feature, but without it we would have to pass the handler a tuple, which is noticeably less convenient.
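The point of Fn::call is that it applies a function to a tuple of arguments, spreading the three parsed results across the three parameters of make_operator. Without it, every handler would have to accept a single tuple argument, something like this hypothetical variant:

// Hypothetical tuple-taking handler we would need without Fn::call:
fn make_operator_tuple(results: (Box<expression::Expression>,
                                 Box<expression::Expression>,
                                 Box<expression::Expression>))
                       -> Option<Box<expression::Expression>> {
    let (left, op, right) = results;
    Some(Box::new(expression::Expression::Operator { op, left, right }))
}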


Now we are ready to express our grammar using macros. Here is the code that was at the beginning of the article:


// expr = sum
// sum  = mul "+" sum | mul "-" sum | mul
// mul  = atom "*" mul | atom "/" mul | atom
// atom = "(" expr ")" | number | neg
// neg  = "-" atom
rule!(expr, sum);

rule!(sum, or![
    and![(mul, token('+'), sum) => make_operator],
    and![(mul, token('-'), sum) => make_operator],
    mul
]);

rule!(mul, or![
    and![(atom, token('*'), mul) => make_operator],
    and![(atom, token('/'), mul) => make_operator],
    atom
]);

rule!(atom, or![
    and![(token('('), expr, token(')')) => |_lbrace, stat, _rbrace| Some(stat)],
    num,
    neg
]);

rule!(neg, and![(token('-'), atom) => |_, number| Some(Box::new(expression::Expression::Negate(number)))]);

The result looks pretty good. Counting only the parser itself, we fit into 65 lines of code. Now it remains to write some test code and run it (yes, I'm not much of a fan of writing tests):


fn main() {
    let result0 = expr(&mut lexer::Lexer::new("1 + 2"));
    let result1 = expr(&mut lexer::Lexer::new("(1 + -2)"));
    let result2 = expr(&mut lexer::Lexer::new("(1 + 2) * 3"));
    let result3 = expr(&mut lexer::Lexer::new("1 * (2 - 3)"));
    let result4 = expr(&mut lexer::Lexer::new("1 * -2 + 3 * 4"));
    let result5 = expr(&mut lexer::Lexer::new("(1 * 2 + (-3 + -4))"));

    println!("0. Result {:?}", result0);
    println!("1. Result {:?}", result1);
    println!("2. Result {:?}", result2);
    println!("3. Result {:?}", result3);
    println!("4. Result {:?}", result4);
    println!("5. Result {:?}", result5);

    assert_eq!(result0.unwrap().eval(), 1f32 + 2f32);
    assert_eq!(result1.unwrap().eval(), 1f32 - 2f32);
    assert_eq!(result2.unwrap().eval(), (1f32 + 2f32) * 3f32);
    assert_eq!(result3.unwrap().eval(), 1f32 * (2f32 - 3f32));
    assert_eq!(result4.unwrap().eval(), 1f32 * -2f32 + 3f32 * 4f32);
    assert_eq!(result5.unwrap().eval(), 1f32 * 2f32 + (-3f32 + -4f32));
}

Output:


0. Result Some(Operator { op: Token('+'), left: Number(1), right: Number(2) })
1. Result Some(Operator { op: Token('+'), left: Number(1), right: Negate(Number(2)) })
2. Result Some(Operator { op: Token('*'), left: Operator { op: Token('+'), left: Number(1), right: Number(2) }, right: Number(3) })
3. Result Some(Operator { op: Token('*'), left: Number(1), right: Operator { op: Token('-'), left: Number(2), right: Number(3) } })
4. Result Some(Operator { op: Token('+'), left: Operator { op: Token('*'), left: Number(1), right: Negate(Number(2)) }, right: Operator { op: Token('*'), left: Number(3), right: Number(4) } })
5. Result Some(Operator { op: Token('+'), left: Operator { op: Token('*'), left: Number(1), right: Number(2) }, right: Operator { op: Token('+'), left: Negate(Number(3)), right: Negate(Number(4)) } })

Post Scriptum


Macros in Rust leave a good impression. Sometimes it feels like a few fundamental constructs are missing: for example, expanding a block N times without inserting a parameter, where N is the number of arguments.


But the developers add the missing features quite quickly, so there is hope (for example, HKT and non-lexical lifetimes are on the way).


All the code can be viewed on GitHub.



Source: https://habr.com/ru/post/349054/

