Writing a parser from scratch: is the devil so bad?

In the past topic, I talked about how my friend and I decided to write my embedded programming language for the .NET platform for fun. The first version had a serious drawback - the parser was implemented on F # using a third-party library. Because of this, a lot of dependencies were needed, the parser was slow, and its support was extremely dreary.

Obviously, the parser needed to be rewritten in C #, but at the thought of writing the parser from scratch, there were suddenly a dozen other urgent tasks. Thus, the task was thrown over and postponed for almost half a year and seemed overwhelming, and in the end was made in 4 days. Under the cat, I will tell you about a convenient way, which allowed you to implement a parser of a rather complex grammar without using third-party libraries and not moving with your mind, as well as how it improved LENS language.

But first things first.

First pancake

As mentioned above, we used the FParsec library as the core of the parser. The reasons for this choice are more historical than objective: I liked the lightweight syntax, I wanted to practice using F #, and the author of the library very quickly answered a few questions by email.
')
The main disadvantage of this library for our project was external dependencies:

Approximately ten megabyte F # Runtime
450 kb assemblies of FParsec itself

In addition, the compiler itself crashed into 3 assemblies: a parser, a syntax tree, and an entry point. Building the parser took an impressive 130 KB. For embedded language, this is absolutely indecent.

Another problem was the display of errors. A laconic grammar record on a local DSL with an incorrectly entered program produced an unreadable error with a listing of the expected lexemes:

> let x = > :           'new'  ...

Although custom error handling is possible, DSL is not intended for it. The description of the grammar is ugly swelling and becomes completely unsupported.

Another unpleasant moment was the speed of work. With the “cold start” compilation of any, even the simplest script, took about 350-380 milliseconds on my machine. Judging by the fact that the rerun of the same script took only 5-10 milliseconds, the delay was caused by JIT compilation.

At once I will make a reservation - for most real-time tasks, development time is much more critical than a couple of additional libraries or hundreds of milliseconds spent on parsing. From this point of view, writing a hand-to-hand parser is rather a learning or esoteric exercise.

Some theory

A spherical parser in a vacuum is a function that accepts the source code, and returns some intermediate representation, according to which it will be convenient to generate code for the virtual machine or processor being used. Most often, this representation has a tree structure and is called an abstract syntax tree - ASD (in the foreign literature - abstract syntactic tree, AST).

The tree structure is especially good in that its crawl in depth fits perfectly with the stack organization used in many modern virtual machines (for example, JVM or .NET). Code generation in this article will not be considered, but the elements of the syntax tree, as a result of the parser, will be mentioned from time to time.

So, at the entrance we have a string. Character set Working with it in this form directly is not very convenient - you have to take into account spaces, line breaks and comments. To simplify their lives, parser developers usually divide the parsing into several passes , each of which performs one simple task and transfers the result of its work to the following:

Lexical analyzer: string -> IEnumerable<Lexem>
Parser: IEnumerable<Lexem> -> IEnumerable<Node>
Semantic analyzer: IEnumerable<Node> -> ?

Since the semantic analyzer is a purely individual thing, its description is not included in this article. However, I will share some useful tricks for the first two analyzers.

Lexical analyzer

Requirements:

Work speed
Ease of expansion
Ease of implementation
Position tracking in source code

The lexer algorithm is simple: it scans the string from left to right, trying to match the current position in the string with each lexeme it knows. With a successful match, the lexer shifts the string to the right for as many characters as the previous lexeme took, and continues the search for a new one to the end of the string. Spaces, tabs, and line breaks can be ignored for most grammars.

All tokens should initially be divided into 2 types - static and dynamic . The first are those tokens that can be expressed in the usual line - keywords and operators. Tokens such as identifiers, numbers, or strings are easier to describe with a regular expression.

Static lexemes, in turn, have a reason to divide into operators and keywords . Keywords are matched only if the character following them is not valid for the identifier (or further - the end of the line). Otherwise, there will be problems with identifiers whose beginning coincides with the keyword: for example, "information" -> keyword(in), keyword(for), identifier(mation) .

Implementation example

 enum LexemKind { Var, Print, Plus, Minus, Multiply, Divide, Assign, Semicolon, Identifier, Number } class LocationEntity { public int Offset; public int Length; } class Lexem : LocationEntity { public LexemKind Kind; public string Value; } class LexemDefinition<T> { public LexemKind Kind { get; protected set; } public T Representation { get; protected set; } } class StaticLexemDefinition : LexemDefinition<string> { public bool IsKeyword; public StaticLexemDefinition(string rep, LexemKind kind, bool isKeyword = false) { Representation = rep; Kind = kind; IsKeyword = isKeyword; } } class DynamicLexemDefinition : LexemDefinition<Regex> { public DynamicLexemDefinition(string rep, LexemKind kind) { Representation = new Regex(@"\G" + rep, RegexOptions.Compiled); Kind = kind; } } static class LexemDefinitions { public static StaticLexemDefinition[] Statics = new [] { new StaticLexemDefinition("var", LexemKind.Var, true), new StaticLexemDefinition("print", LexemKind.Print, true), new StaticLexemDefinition("=", LexemKind.Assign), new StaticLexemDefinition("+", LexemKind.Plus), new StaticLexemDefinition("-", LexemKind.Minus), new StaticLexemDefinition("*", LexemKind.Multiply), new StaticLexemDefinition("/", LexemKind.Divide), new StaticLexemDefinition(";", LexemKind.Semicolon), }; public static DynamicLexemDefinition[] Dynamics = new [] { new DynamicLexemDefinition("[a-zA-Z_][a-zA-Z0-9_]*", LexemKind.Identifier), new DynamicLexemDefinition("(0|[1-9][0-9]*)", LexemKind.Number), }; } class Lexer { private char[] SpaceChars = new [] { ' ', '\n', '\r', '\t' }; private string Source; private int Offset; public IEnumerable<Lexem> Lexems { get; private set; } public Lexer(string src) { Source = src; Parse(); } private void Parse() { var lexems = new List<Lexem>(); while(InBounds()) { SkipSpaces(); if(!InBounds()) break; var lex = ProcessStatic() ?? ProcessDynamic(); if(lex == null) throw new Exception(string.Format("Unknown lexem at {0}", Offset)); lexems.Add(lex); } Lexems = lexems; } private void SkipSpaces() { while(InBounds() && Source[Offset].IsAnyOf(SpaceChars)) Offset++; } private Lexem ProcessStatic() { foreach(var def in LexemDefinitions.Statics) { var rep = def.Representation; var len = rep.Length; if(Offset + len > Source.Length || Source.Substring(Offset, len) != rep) continue; if(Offset + len < Source.Length && def.IsKeyword) { var nextChar = Source[Offset + len]; if(nextChar == '_' || char.IsLetterOrDigit(nextChar)) continue; } Offset += len; return new Lexem { Kind = def.Kind, Offset = Offset, Length = len }; } return null; } private Lexem ProcessDynamic() { foreach(var def in LexemDefinitions.Dynamics) { var match = def.Representation.Match(Source, Offset); if(!match.Success) continue; Offset += match.Length; return new Lexem { Kind = def.Kind, Offset = Offset, Length = match.Length, Value = match.Value }; } return null; } private bool InBounds() { return Offset < Source.Length; } }

Benefits:

Works fast
Elementary device, you can write for half an hour
New lexemes are added very simply.
The method is suitable for many grammars

Disadvantages:

Dances with a tambourine when parsing a language with significant spaces
The order of the declaration of tokens is important: it is desirable to sort by length

Syntactical analyzer

Requirements:

Ease of expansion when changing grammar
Ability to describe detailed error messages
The ability to look forward to an unlimited number of positions
Automatic tracking of the position in the source code
Conciseness, proximity to the original grammar

In order to simplify your life when writing a parser, you should arrange the grammar in a special way. No complicated designs! All rules can be divided into 3 types:

Description - one specific node:
(var_expr = "var" identifier "=" expr)
Repetition - one specific node is repeated many times, possibly with a separator:
main = { stmt ";" }
Alternative - select from multiple nodes
(stmt = var_expr | print_expr | assign_expr | other)

The description rule is sorted out in steps: checked the type of the current token, moved to the next one, and so on until the end of the rule. With each checked lexeme, you can take some action: give a detailed error in the event of a mismatch, save its value in the node, etc.

The enumeration rule is a cycle. To return a sequence of values, in C # there is a very convenient functionality for creating generators using yield return .

The alternate rule in turn calls the variant rules using a special wrapper that allows you to roll back to its original state. The rules are simply called in order, until at least one of them matches, connected by the coalesce ( ?? ) operator.

Here an inquisitive reader will ask:
- How is it, just called in order? But what about advanced checks? For example:

 if(CurrentLexem.Type == LexemType.Var) return parseVar(); if(CurrentLexem.Type == LexemType.For) return parseFor(); ...

I admit, I wrote my first serious parser that way. However, this is a bad idea!

First, you can only look at a fixed number of characters. For any for or var , of course, fit. But, let's say, we have such rules in grammar:

 assign = id_assign | member_assign | index_assign id_assign = identifier "=" expr member_assign = lvalue "." identifier "=" expr index_assign = lvalue "[" expr "]" "=" expr

If everything is clear with id_assign , then the other two rules start with a non-terminal lvalue , which can hide a kilometer expression. Obviously, there are no advance checks here.

Another problem is the mixing of areas of responsibility. For grammar to be extensible, the rules must be as independent as possible from each other. This approach requires that the external rule is aware of the composition of the internal ones, which increases coherence and complicates support when changing grammar.

So why do we need advanced checks? Let every rule itself know how far you need to look ahead to make sure that it is the most appropriate.

Consider the example above. Suppose we have the text: a.1 = 2 :

The first is the alternative called id_assign .
Identifier a successfully matches.
Next comes the point, and the expected is an equal sign. However, other rules may begin with an identifier, so the error is not thrown away.
The assign rule rolls back the state and tries further.
An alternative to member_assign .
Identifier and dot successfully match. There are no other rules in the grammar that start with an identifier and a dot, so it makes no sense to try to make further errors by rolling back the state.
The number 1 not an identifier, so an error is thrown out.

First, we write several useful methods:

Hidden text

 partial class Parser { private List<Lexem> Lexems; private int LexemId; #region Lexem handlers [DebuggerStepThrough] private bool Peek(params LexemType[] types) { var id = Math.Min(LexemId, Lexems.Length - 1); var lex = Lexems[id]; return lex.Type.IsAnyOf(types); } [DebuggerStepThrough] private Lexem Ensure(LexemType type, string msg, params object[] args) { var lex = Lexems[LexemId]; if(lex.Type != type) error(msg, args); Skip(); return lex; } [DebuggerStepThrough] private bool Check(LexemType lexem) { var lex = Lexems[LexemId]; if (lex.Type != lexem) return false; Skip(); return true; } [DebuggerStepThrough] private void Skip(int count = 1) { LexemId = Math.Min(LexemId + count, Lexems.Length - 1); } #endregion #region Node handlers [DebuggerStepThrough] private T Attempt<T>(Func<T> getter) where T : LocationEntity { var backup = LexemId; var result = Bind(getter); if (result == null) LexemId = backup; return result; } [DebuggerStepThrough] private T Ensure<T>(Func<T> getter, string msg) where T : LocationEntity { var result = Bind(getter); if (result == null) throw new Exception(msg); return result; } [DebuggerStepThrough] private T Bind<T>(Func<T> getter) where T : LocationEntity { var startId = LexemId; var start = Lexems[LexemId]; var result = getter(); if (result != null) { result.StartLocation = start.StartLocation; var endId = LexemId; if (endId > startId && endId > 0) result.EndLocation = Lexems[LexemId - 1].EndLocation; } return result; } #endregion }

With their help, the implementation of the above grammar becomes almost trivial:

 partial class Parser { public Node ParseAssign() { return Attempt(ParseIdAssign) ?? Attempt(ParseMemberAssign) ?? Ensure(ParseIndexAssign, "  !"); } public Node ParseIdAssign() { var id = TryGetValue(LexemType.Identifier); if (id == null) return null; if (!Check(LexemType.Assign)) return null; var expr = Ensure(ParseExpr, "  !"); return new IdAssignNode { Identifier = id, Expression = expr }; } public Node ParseMemberAssign() { var lvalue = Attempt(ParseLvalue); if (lvalue == null) return null; if (!Check(LexemType.Dot)) return null; var member = TryGetValue(LexemType.Identifier); if (member == null) return null; if (!Check(LexemType.Assign)) return null; var expr = Ensure(ParseExpr, "  !"); return new MemberAssignNode { Lvalue = lvalue, MemberName = member, Expression = expr }; } public Node ParseIndexAssign() { var lvalue = Attempt(ParseLvalue); if (lvalue == null) return null; if (!Check(LexemType.SquareBraceOpen)) return null; var index = Ensure(ParseExpr, "  !"); Ensure(LexemType.SquareBraceClose, "  !"); Ensure(LexemType.Assign, "  !"); var expr = Ensure(ParseExpr, "  !"); return new IndexAssignNode { Lvalue = lvalue, Index = index, Expression = expr }; } }

The DebuggerStepThrough attribute helps a lot when debugging. Since all calls to nested rules somehow pass through Attempt and Ensure, without this attribute, they will always be noticeable during Step Into and fill in the call stack.

The advantages of this method:

Status rollback is a very cheap operation.
Easy to control where to roll back
Easy to display detailed error messages.
No external libraries required.
Small amount of generated code

Disadvantages:

Manual implementation of the parser takes time
The complexity of writing and optimal performance depend on the quality of the grammar
Left-recursive grammars should be resolved independently

Operators and Priorities

Repeatedly I have seen in the descriptions of grammar the following rules, which indicate the priority of operations:

 expr = expr_1 { op_1 expr_1 } expr_1 = exp2_2 { op_2 expr_2 } expr_2 = exp2_3 { op_3 expr_3 } expr_3 = int | float | identifier op_1 = "+" | "-" op_2 = "*" | "/" | "%" op_3 = "**"

Now imagine that we still have boolean operators, comparison operators, shift operators, binary operators, or some proper ones. How many rules are obtained, and how much will you have to change, if you suddenly have to add a new operator with priority somewhere in the middle?

Instead, you can remove from the grammar in general all the description of priorities and code it declaratively.

Implementation example

 expr = sub_expr { op sub_expr } sub_expr = int | float | identifier

 partial class Parser { private static List<Dictionary<LexemType, Func<Node, Node, Node>>> Priorities = new List<Dictionary<LexemType, Func<Node, Node, Node>>> { new Dictionary<LexemType, Func<Node, Node, Node>> { { LexemType.Plus, (a, b) => new AddNode(a, b) }, { LexemType.Minus, (a, b) => new SubtractNode(a, b) } }, new Dictionary<LexemType, Func<Node, Node, Node>> { { LexemType.Divide, (a, b) => new DivideNode(a, b) }, { LexemType.Multiply, (a, b) => new MultiplyNode(a, b) }, { LexemType.Remainder, (a, b) => new RemainderNode(a, b) } }, new Dictionary<LexemType, Func<Node, Node, Node>> { { LexemType.Power, (a, b) => new PowerNode(a, b) } }, }; public NodeBase ProcessOperators(Func<Node> next, int priority = 0) { if (priority == Priorities.Count) return getter(); var node = ProcessOperators(next, priority + 1); var ops = Priorities[priority]; while (Lexems[LexemId].IsAnyOf(ops.Keys)) { foreach (var curr in ops) { if (check(curr.Key)) { node = curr.Value( node, ensure(() => ProcessOperators(next, priority + 1), " !") ); } } } return node; } }

Now, to add a new operator, it is only necessary to add the appropriate line to initialize the list of priorities.

Adding support for unary prefix operators is left as a training for particularly curious.

What did it give us?

The handwritten parser, oddly enough, has become much easier to maintain. Added a rule to the grammar, found the appropriate place in the code, added its use. Backtracking hell, which often arose when adding a new rule in the old parser and caused a sudden heap of a seemingly unrelated test, was left in the past.

Total comparative table of results:

Parameter	FParsec Parser	Pure C #
Parsing time at 1 run	220 ms	90 ms
Parsing time during further runs	5 ms	6 ms
Library size required	800 KB + F # Runtime	260 KB

Most likely, it is possible to optimize and squeeze more performance out of the parser, but so far this result is quite satisfactory.

Having got rid of a headache with changes in grammar, we were able to wash down some pleasant things in LENS :

Cycle for

It is used both to bypass sequences, and for ranges:

 var data = new [1; 2; 3; 4; 5] for x in data do println "value = {0}" x for x in 1..5 do println "square = {0}" x

Composition of functions

With the help of the operator :> you can create new functions, “stringing” the existing ones:

 let invConcat = (a:string b:string) -> b + a let invParse = incConcat :> int::Parse invParse "37" "13" // 1337

Partial application is possible using anonymous functions:

 fun add:int (x:int y:int) -> x + y let addTwo = int::TryParse<string> :> (x:int -> add 2 x) addTwo "40" // 42

Syntax improvements

Single line comments:
```
 somecode () // comment 
```
Uninitialized variables:
```
 var x : int 
```
The brackets around the single lambda argument are optional:
```
 var inc = x:int -> x + 1 
```

The brackets in the control structures are removed:

 if x then a () else b () while a < b do println "a = {0}" a a = a + 1

A try/finally block appeared.
The brackets when passing an index or field to a function are optional:
```
 print "{0} = {1}" a[1] SomeType::b 
```

The project, though slowly, is developing. There are still many interesting tasks. The following version is planned:

Declaration of generic types and functions
Ability to mark functions or types of attributes
Event support

You can also download the collected demos under Windows .

Source: https://habr.com/ru/post/202622/

All Articles