
Building a DSL in C# with parser combinators


A translation of an article by Nicholas Blumhardt, the well-known .NET developer and author of the Autofac IoC/DI container. In this article, Nicholas uses a real-life example to show how to write a parser for a domain-specific language with minimal effort using Sprache, a parser-combinator library.


Our current project includes a small workflow for submitting and approving requests to create user accounts. It is a good example for discussing domain-specific languages and Sprache. First, some requirements.

The set of user account types is unlimited; currently it is “employee”, “contractor”, “temporary employee” and so on. To obtain an account, the user must fill out the appropriate form.
When collecting and storing questionnaire data, the application does not care about its content until that information is presented to an administrator, who will ultimately approve or reject the request.


The complete example is available as a VS2010 solution.

Challenges


The design of the system is largely driven by the fact that the set of possible account types (and, therefore, of questionnaires) is open-ended. Creating new questionnaires must be possible without redeploying the application. In addition, the content of the questionnaires should be easy to change, perhaps by the end users themselves.

There are many possible ways to represent the questionnaires, and each has its advantages and disadvantages in terms of ease of implementation, maintainability, convenience and flexibility.

In this article we will consider another attractive option: creating a convenient mini-language for describing the questionnaires.

A questionnaire description language


You may have read some discussions of the differences between internal and external DSLs.

An internal DSL is a purpose-built API in a general-purpose language (such as C#) which, when used, reads more like a description of the problem than like a program that solves it.

An external DSL is a separate language that must be parsed from source text before the program can work with it. Also, and this is especially important for our task, an external DSL minimizes syntactic noise and can be changed without recompiling the program.

An example of a DSL form looks like this:

    identification "Personal Details"
    [
        name "Full Name"
        department "Department"
    ]

    employment "Current Employer"
    [
        name "Your Employer"
        contact "Contact Number"
        #months "Total Months Employed"
    ]

This is a two-step questionnaire that collects personal data and employment details.

The application will read the questionnaire description associated with the appropriate account type and present the user with a step-by-step interface.

Approaches to parsing questionnaire descriptions


Parsing is the process of taking text in a source language, such as the questionnaire above, and transforming it into some representation, usually an object model, that the program can work with. For a C# programmer, there are several ways to achieve this.

Handwritten parsers

Both the simplest and the most complex parsers are often written by hand. Simple ones are written when the solution is obvious (for example, "split a string by finding commas in a loop"). The most complex ones are written when the programmer needs an extraordinary level of control (for example, the C# compiler). Parsing anything in between by hand is usually not worth the effort, unless of course you are a professional in this field who knows exactly what they are doing (that is definitely not me!).

Regular expressions

Regular expressions are a convenient way to match and extract patterns from text. .NET includes the built-in System.Text.RegularExpressions.Regex class for working with them efficiently, so they are usually the first option considered when someone faces a parsing task. Despite their fairly simple grammar, regular expressions quickly become hard to read and maintain; this is perhaps their biggest drawback. In addition, there are many grammars that regular expressions simply cannot parse (starting with any that allow nesting).

Parser Generators

Parser generators, or "language toolkits," let you specify a grammar in a declarative format. The toolkit provides build-time tools that generate classes in a target language (such as C#) capable of parsing that grammar. Such tools take time to learn and to integrate into the build process. For small parsing tasks this may be overkill, but for anything substantially more complex, or requiring high parsing speed, learning and using them is strongly recommended.

Parser combinators

This function-based technique is the default choice in functional languages such as Haskell and F#, both of which have high-quality parser-combinator libraries. C# now has a young combinator library, Sprache, developed by yours truly and used throughout this article. Sprache makes it very easy to write and maintain simple parsers, with no steep learning curve and nothing to wire into the build process. It works well with test-driven development. Its current drawbacks are performance and, occasionally, the quality of error messages; neither is a big problem when parsing small DSLs. [Update: since this article was written, parsing speed and error reporting in Sprache have been greatly improved.]

Getting started


To get started, download Sprache.dll. This article is organized so that you can follow along, creating and testing the parsers in Visual Studio with NUnit.

Grammars are built from the bottom up. First, parsers are created for low-level syntax elements such as identifiers, strings and numbers. Then we gradually move upward, combining these simple parts into more complex ones, until we have a complete language.

Parsing identifiers


In our questionnaire description language, the most nested meaningful element is the question:

 name "Full Name" 


The main parts here are an identifier and some text in quotes. The first task for our parser is to parse the identifier, in this case 'name'.

    [Test]
    public void AnIdentifierIsASequenceOfCharacters()
    {
        var input = "name";
        var id = QuestionnaireGrammar.Identifier.Parse(input);
        Assert.AreEqual("name", id);
    }

Sprache parsers are static members of a class representing the grammar. QuestionnaireGrammar.Identifier is of type Parser<string>, i.e. it returns string values.

Parser definition:

    public static readonly Parser<string> Identifier =
        Parse.Letter.AtLeastOnce().Text().Token();

This code reads quite well - we are going to analyze a non-empty sequence of letters and return their textual representation. Here are the elements of the parser:

Parse.Letter - in Sprache, the Parse class contains helper methods and properties for common parsing tasks. Letter is a simple parser of type Parser<char> that reads one letter from the input and returns it as a char. If the next input character is not a letter, this parser will not match.

AtLeastOnce() - all parsers created with Sprache support repetition. AtLeastOnce() takes a parser of a single element of type T and returns a new parser that parses a sequence of such elements, returning an IEnumerable<T>.

Text() - the AtLeastOnce() modifier took our Parser<char> and turned it into a parser of type Parser<IEnumerable<char>>. The helper Text() takes a Parser<IEnumerable<char>> and converts it into a Parser<string>, which is more convenient to work with.

Token() is a modifier that accepts and then discards leading and trailing whitespace.
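To convince ourselves of the Token() behaviour, a small NUnit check (my own addition, not from the original article, and the test name is made up) can assert that surrounding whitespace is discarded:

```csharp
[Test]
public void AnIdentifierIgnoresSurroundingWhitespace()
{
    // Token() consumes leading and trailing whitespace,
    // so only the letters are returned.
    var input = "   name   ";
    var id = QuestionnaireGrammar.Identifier.Parse(input);
    Assert.AreEqual("name", id);
}
```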

Simple, right?

Here are a couple more interesting tests for the identifier parser.

    [Test]
    public void AnIdentifierDoesNotIncludeSpace()
    {
        var input = "a b";
        var parsed = QuestionnaireGrammar.Identifier.Parse(input);
        Assert.AreEqual("a", parsed);
    }

Each Sprache parser consumes as much input as it can. In this test, parsing succeeds but does not consume the entire input string. (Later we will see how to require that the end of input be reached.)

    [Test]
    public void AnIdentifierCannotStartWithQuote()
    {
        var input = "\"name";
        Assert.Throws<ParseException>(() => QuestionnaireGrammar.Identifier.Parse(input));
    }

The Parse() extension method throws a ParseException if the parser does not match. There is also a non-throwing TryParse() alternative.
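As a sketch of the non-throwing path (assuming Sprache's TryParse() returns an IResult<T> with WasSuccessful, Value and Message members, which is how the library exposes it today):

```csharp
// TryParse() reports failure through the result object instead of throwing.
var result = QuestionnaireGrammar.Identifier.TryParse("\"name");

if (result.WasSuccessful)
    Console.WriteLine("Parsed: " + result.Value);
else
    Console.WriteLine("Failed: " + result.Message);
```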

Happy that identifiers parse correctly, we can move on.

Parsing text in quotes


Parsing quoted text is not much harder than parsing an identifier; our simple version does not support escaped characters or anything fancy.

Look again at the input line:

 name "Full Name" 

To parse the quoted text, we need to match:
  1. An opening quote
  2. Everything that is not a quote
  3. A closing quote

Quotes by themselves are not particularly interesting to us, so we will only return text between them.

The test for the parser will look like this:

    [Test]
    public void QuotedTextReturnsAValueBetweenQuotes()
    {
        var input = "\"this is text\"";
        var content = QuestionnaireGrammar.QuotedText.Parse(input);
        Assert.AreEqual("this is text", content);
    }

Let's go straight to the parser:

    public static readonly Parser<string> QuotedText =
        (from open in Parse.Char('"')
         from content in Parse.CharExcept('"').Many().Text()
         from close in Parse.Char('"')
         select content).Token();


To my knowledge, this convenient reuse of LINQ query syntax was first described by Luke Hoban of the F# team. The from clauses pick out individual syntax elements, and select transforms them into the return value of the whole parser.
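For the curious, the query expression above is just sugar over Sprache's SelectMany extension method. A hand-desugared equivalent (illustrative; the field name is my own) looks roughly like this:

```csharp
// Approximately what the C# compiler generates for the query version:
// each 'from' clause becomes a SelectMany call, with an anonymous type
// carrying intermediate results forward.
public static readonly Parser<string> QuotedTextDesugared =
    Parse.Char('"')
        .SelectMany(open => Parse.CharExcept('"').Many().Text(),
                    (open, content) => new { open, content })
        .SelectMany(t => Parse.Char('"'),
                    (t, close) => t.content)
        .Token();
```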

Parsing a question


You may have already noticed that the parsers are strongly typed: the parser for the character returns char , and the parser for the text returns string . The parser for the question will return a Question !

    public class Question
    {
        public Question(string id, string prompt)
        {
            Id = id;
            Prompt = prompt;
        }

        public string Id { get; private set; }
        public string Prompt { get; private set; }
    }

This is a big advantage of combinator-based parsing. Once the semantic model of the task is built, the parser can translate the input directly into it.

    public static readonly Parser<Question> Question =
        from id in Identifier
        from prompt in QuotedText
        select new Question(id, prompt);

The unit test for the question now passes:

    [Test]
    public void AQuestionIsAnIdentifierFollowedByAPrompt()
    {
        var input = "name \"Full Name\"";
        var question = QuestionnaireGrammar.Question.Parse(input);
        Assert.AreEqual("name", question.Id);
        Assert.AreEqual("Full Name", question.Prompt);
    }

Parsing a section


Parsing a section works the same way as parsing a question: first we build the semantic model, and then, using the parsers we already have, we translate the source text into it.

Let me remind you that the section looks like this:

    identification "Personal Details"
    [
        name "Full Name"
        department "Department"
    ]

We can present it in the object model as follows:

    public class Section
    {
        public Section(string id, string title, IEnumerable<Question> questions)
        {
            Id = id;
            Title = title;
            Questions = questions;
        }

        public string Id { get; private set; }
        public string Title { get; private set; }
        public IEnumerable<Question> Questions { get; private set; }
    }

Developing the parser is as simple as developing the object model:

    public static readonly Parser<Section> Section =
        from id in Identifier
        from title in QuotedText
        from lbracket in Parse.Char('[').Token()
        from questions in Question.Many()
        from rbracket in Parse.Char(']').Token()
        select new Section(id, title, questions);

To complete the example, we have another model class:

    public class Questionnaire
    {
        public Questionnaire(IEnumerable<Section> sections)
        {
            Sections = sections;
        }

        public IEnumerable<Section> Sections { get; private set; }
    }

The corresponding parser (this time without query syntax):

    public static Parser<Questionnaire> Questionnaire =
        Section.Many()
            .Select(sections => new Questionnaire(sections))
            .End();


The .End() modifier requires that all input be consumed (i.e. that no junk is left at the end).
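Putting it together, a quick end-to-end run might look like this (a sketch of my own, assuming the Section model exposes a Title property, and using a C# verbatim string for the multi-line input):

```csharp
var input = @"
identification ""Personal Details""
[
    name ""Full Name""
    department ""Department""
]";

var form = QuestionnaireGrammar.Questionnaire.Parse(input);

// Walk the parsed model: one Section containing two Questions.
foreach (var section in form.Sections)
{
    Console.WriteLine(section.Title);
    foreach (var question in section.Questions)
        Console.WriteLine("  {0}: {1}", question.Id, question.Prompt);
}
```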

That is everything the example needs, except for the answer type qualifiers.

Support for answer data types


The final touch to our grammar is support for answer type qualifiers.

 #months "Total Months Employed" 

To represent them, we can use an enumeration of all the possible types.

    public enum AnswerType { Text, Natural, Number, Date, YesNo }

This is a rather limited set, so we will simply check each possible qualifier in turn.

    public static Parser<AnswerType> AnswerTypeIndicator =
        Parse.Char('#').Return(AnswerType.Natural)
            .Or(Parse.Char('$').Return(AnswerType.Number))
            .Or(Parse.Char('%').Return(AnswerType.Date))
            .Or(Parse.Char('?').Return(AnswerType.YesNo));
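A couple of quick checks of the qualifier parser (again my own illustrative test, not from the original article):

```csharp
[Test]
public void QualifierCharactersMapToAnswerTypes()
{
    // Each single-character qualifier selects the corresponding answer type.
    Assert.AreEqual(AnswerType.Natural,
        QuestionnaireGrammar.AnswerTypeIndicator.Parse("#"));
    Assert.AreEqual(AnswerType.YesNo,
        QuestionnaireGrammar.AnswerTypeIndicator.Parse("?"));
}
```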


The Question class is changed to accept AnswerType as a constructor parameter. A simple modification of the question parser completes our work.

    public static Parser<Question> Question =
        from at in AnswerTypeIndicator.Or(Parse.Return(AnswerType.Text))
        from id in Identifier
        from prompt in QuotedText
        select new Question(id, prompt, at);

Summary


The complete parser is just six rules in 25 well-formed lines of code.

Although robust parsing in the real world is a non-trivial task, I hope this article has shown that there are simple options that fill the gap between regular expressions and language toolkits.

Source: https://habr.com/ru/post/127642/

