DSL for regular expressions on Kotlin

Hello!

This article is about the implementation of one specific DSL ( domain specific language ) for regular expressions using Kotlin, but it may well give a general idea of how to write your DSL on Kotlin and what it usually will do "under the hood "any other DSL using the same language features.

Many have already used Kotlin, or at least tried to do it, and the rest could well have heard that Kotlin has to write elegant DSL, which has some brilliant examples - Anko and kotlinx.html .

Of course, for regular expressions, this has already been done (and yet: in Java , in Scala , in C # - there are many implementations, it seems, this is a common entertainment). But if you want to practice or try Kotlin's DSL-oriented language capabilities, then welcome to Cat.

What does DSL usually look like written in Kotlin?

In the worst case, probably so.

Most Java DSLs suggest using call chains for their constructs, as in this example Java Regex DSL:

Regex regex = RegexBuilder.create() .group("#timestamp") .number("#hour").constant(":") .number("#min").constant(":") .number("#secs").constant(":") .number("#ms") .end() .build();

We could use this approach, but it has a number of disadvantages, among which we can immediately note two:

inconvenient implementation of nested constructs ( group and end above), due to which you will have to fight with the formatter, and there is no banal verification of the opening and closing elements, you can write an extra .end() ;
poor possibilities for dynamic expression: if we want to execute arbitrary code before adding the next part to the query — for example, checking the condition — we will need to break the chain of calls and store the partially created expression in a variable.

With these shortcomings in Kotlin you can cope if you implement DSL in the style of Type-Safe Groovy-Style Builder (when explaining technical details, this article will largely repeat the documentation page by reference). Then the code on it will look like this Anko example:

Show code

 verticalLayout { val name = editText() button("Say Hello") { onClick { toast("Hello, ${name.text}!") } } }

Or to this example kotlinx.html:

Show code

 html { body { div { a("http://kotlinlang.org") { target = ATarget.blank +"Main site" } } } }

Looking ahead, I’ll say that the resulting language will look something like this:

Show code:

 val r = regex { group("prefix") { literally("+") oneOrMore { digit(); letter() } } 3 times { digit() } literally(";") matchGroup("prefix") }

Let's get started

 class RegexContext { } fun regex(block: RegexContext.() -> Unit): Regex { throw NotImplementedError() }

What is written here? The regex function returns the constructed Regex and takes a single argument — another function of the type RegexContext.() -> Unit . If you are already familiar with Kotlin, feel free to skip a couple of paragraphs explaining what it is.

The types of functions in Kotlin are written something like this: (Int, String) -> Boolean is a predicate of two arguments - or so: SomeType.(Int) -> Unit is a function that returns a Unit (analog of the void-function), and except the Int argument also accepts a receiver of type SomeType .

Functions that take a receiver help us a lot with building DSL due to the fact that you can pass a lambda expression as an argument of this type, and it will have an implicit this that is of the same type as the receiver. A simple example is the library function with :

 fun <T, R> with(t: T, block: T.() -> R): R = t.block() //  block       receiver //    ,      with(ArrayList<Int>()) { for(i in 1..10) { add(i) } //  add    receiver println(this) //    this --  ArrayList<Int> }

Great, now we can call regex { ... } and, inside the curly braces, work with some RegexContext instance as if this . It remains just a little - to implement the members RegexContext . :)

Why do I need a RegexContext?

Let's create a regular expression in parts - each statement of our DSL will simply append to the unfinished expression the next part. These parts will be stored by RegexContext .

 class RegexContext { internal val regexParts = mutableListOf<String>() //    private fun addPart(part: String) { //       regexParts.append(part) } }

Accordingly, the regex {...} function will now look like this:

 fun regex(block: RegexContext.() -> Unit): Regex { val context = RegexContext() context.block() //  block,  -   context val pattern = context.regexParts.toString() return Regex(pattern) //     -  Regex }

Next, we implement the RegexContext functions that add different parts to the regular expression.

The following functions, unless explicitly stated otherwise, are also located in the body of the class.

Everything is very simple

Right?

 fun anyChar(s: String) = addPart(".")

This call simply adds a point to the expression, which denotes a subexpression corresponding to any single character.

Similarly, we implement the functions digit() , letter() , alphaNumeric() , whitespace() , wordBoundary() , wordCharacter() and even startOfString() and endOfString() - they all look about the same.

Namely:

 fun digit() = addPart("\\d") fun letter() = addPart("[[:alpha:]]") fun alphaNumeric() = addPart("[A-Za-z0-9]") fun whitespace() = addPart("\\s") fun wordBoundary() = addPart("\\b") fun wordCharacter() = addPart("\\w") fun startOfString() = addPart("^") fun endOfString() = addPart("$")

But to add an arbitrary string to a regular expression, you have to first convert it so that the characters present in the string are not interpreted as service characters. The easiest way to do this is with the Regex.escape (...) function:

 fun literally(s: String) = addPart(Regex.escape(s))

For example, literally(".:[test]:.") will add a part of \Q.:[test]:.\E to the expression.

Go deeper

What about quantifiers? Obvious observation: the quantifier is hung on a subexpression, which in itself is also a valid regex. Let's add some nesting!

We want the nested block of code in curly braces to specify a subexpression of a quantifier, like this:

 val r = regex { oneOrMore { optional { anyChar() } literally("-") } literally(";") }

We will do this with the help of the RegexContext functions, which behave almost the same as regex {...} , but use the constructed subexpression themselves. We first add auxiliary functions:

  private fun addWithModifier(s: String, modifier: String) { addPart("(?:$s)$modifier") //  non-capturing group   } private fun pattern(block: RegexContext.() -> Unit): String { //      --     val innerContext = RegexContext() innerContext.block() // block    RegexContext return innerContext.regexParts.toString() //      }

And then we use them to implement our "quantifiers":

 fun optional(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "?") fun oneOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "+")

And so on (plus, functions that allow you not to wrap literally into lambda

 fun oneOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "+") fun oneOrMore(s: String) = oneOrMore { literally(s) } fun optional(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "?") fun optional(s: String) = optional { literally(s) } fun zeroOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "*") fun zeroOrMore(s: String) = zeroOrMore { literally(s) }

Even in regexes, it is possible to set the number of expected entries precisely or using a range. We want this too, right? It is also a good reason to use infix functions - functions of two arguments, one of which is receiver. Calls to such functions will be as follows:

 val r = regex { 3 times { anyChar() } 2 timesOrMore { whitespace() } 3..5 times { literally("x") } // 3..5 --  IntRange }

And the functions themselves are declared as:

 infix fun Int.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{$this}") infix fun IntRange.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{${first},${last}}")

And all together, again with functions for literal strings:

 infix fun Int.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{$this}") infix fun Int.times(s: String) = this times { literally(s) } infix fun IntRange.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{${first},${last}}") infix fun IntRange.times(s: String) = this times { literally(s) } infix fun Int.timesOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{$this,}") infix fun Int.timesOrMore(s: String) = this timesOrMore { literally(s) } infix fun Int.timesOrLess(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{0,$this}") infix fun Int.timesOrLess(s: String) = this timesOrLess { literally(s) }

Group up!

The tool for working with rehexes cannot be called such if it does not support groups, so let's support them, for example, in this form:

 val r = regex { anyChar() val separator = group { literally("+"); digit() } //    anyChar() matchGroup(separator) //    anyChar() }

However, the groups introduce new complexity into the regex structure: they are numbered "through and through" from left to right, ignoring the nesting of subexpressions. This means that the calls of group {...} cannot be considered independent of each other, and even more: all of our nested subexpressions are now also linked to each other.

To support the numbering of groups, slightly change the RegexContext : now it will remember how many groups it already has:

 class RegexContext(var lastGroup: Int = 0) { ... }

And so that our nested contexts know how many groups were there before them and report how many were added inside them, change the function pattern(...) :

 private fun pattern(block: RegexContext.() -> Unit): String { val innerContext = RegexContext(lastGroup) //   innerContext.block() lastGroup = innerContext.lastGroup //     return innerContext.regexParts.toString() }

Now nothing prevents us from correctly implementing the group :

 fun group(block: RegexContext.() -> Unit): Int { val result = ++lastGroup addPart("(${pattern(block)})") return result }

Named group case:

 fun group(name: String, block: RegexContext.() -> Unit): Int { val result = ++lastGroup addPart("(?<$name>${pattern(block)})") return result }

And matching groups, both indexed and named:

 fun matchGroup(index: Int) = addPart("\\$index") fun matchGroup(name: String) = addPart("\\k<$name>")

Something else?

Yes! We almost forgot the important construct of regular expressions - alternatives. For literals, alternatives are implemented trivially:

 fun anyOf(vararg terms: String) = addPart(terms.joinToString("|", "(?:", ")") { Regex.escape(it) }) //   terms    ,   , //     Regex.escape(...)

No more difficult implementation for nested expressions:

  fun anyOf(vararg blocks: RegexContext.() -> Unit) = addPart(blocks.joinToString("|", "(?:", ")") { pattern(it) })

The same for character sets:

 fun anyOf(vararg characters: Char) = addPart(characters.joinToString("", "[", "]").replace("\\", "\\\\").replace("^", "\\^")) fun anyOf(vararg ranges: CharRange) = addPart(ranges.joinToString("", "[", "]") { "${it.first}-${it.last}" })

But wait, what if we want to use different things as alternatives in the same anyOf(...) - for example, both a string and a block with a code for a nested subexpression? A slight disappointment awaits us here: in Kotlin, there are no union types (union types), and write the argument type String | RegexContext.() -> Unit | Char String | RegexContext.() -> Unit | Char String | RegexContext.() -> Unit | Char we can not. I was only able to get around this with frightening-looking crutches, which still do not make DSL better, so I decided to leave everything as it was written above - in the end, both String and Char can be written in nested subexpressions using the appropriate overload anyOf {...} .

Scary crutches

Use anyOf(vararg parts: Any) , where Any is the type to which any object belongs. Check which type is inside, respectively, and throw a IllegalArgumentException to a careless user who has passed a bad argument, and he will be very pleased.
Hardcore In Kotlin, a class can override the invoke operator () , and then objects of this class can be used as functions: myObject(arg) , and if the operator has several overloads, then the object will behave like several function overloads. Then you can try currying the anyOf(...) function, but since it has an arbitrary number of arguments, we don’t know when they will end — hence, each partial application should override the result of the previous one and then apply itself, as if its argument is the last .

If this is done neatly, it will even work, but we unexpectedly come to a head on the unpleasant moment in the Kotlin grammar: you cannot use consecutive calls with curly braces in the invoke operator's call chain.
```
 object anyOf { operator fun invoke(s: String) = anyOf //      operator fun invoke(r: RegexContext.() -> Unit) = anyOf } anyOf("a")("b")("c") //   anyOf("123") { anyChar() } { digit() } //    ! anyOf("123")({ anyChar() })({ digit() }) //   ((anyOf("123")) { anyChar() }) { digit() } //   
```
Well, and we need it so?

In addition, it would be nice to reuse regular expressions, both built by our DSL, and come to us from somewhere else. This is easy to do, the main thing is not to forget about the numbering of the groups. From the regex, you can pull out the number of its groups: Pattern.compile(pattern).matcher("").groupCount() , and all that remains is to implement the corresponding RegexContext function:

 fun include(regex: Regex) { val pattern = regex.pattern addPart(pattern) lastGroup += Pattern.compile(pattern).matcher("").groupCount() }

And on this, perhaps, the mandatory features end.

Conclusion

Thank you for reading to the end! What have we got? It is a viable DSL for regexes that can be used:

 fun RegexContext.byte() = anyOf({ literally("25"); anyOf('0'..'5') }, { literally("2"); anyOf('0'..'4'); digit() }, { anyOf("0", "1", ""); digit(); optional { digit() } }) val r = regex { 3 times { byte(); literally(".") } byte() }

(Question: what is this regex for? Really, is it simple?)

More advantages:

It is difficult to break the regex: you don't even have to write brackets with your hands, if the code is compiled and the groups are correct, then the regex is valid.
It turns out to visually form regex dynamically: it makes living code with any valid constructions like conditions, loops, and calls to third-party functions.
If you use indexed groups, the index is dynamically assigned to the group, and even changing a large regex written in DSL will not break group indexes.
Extensibility and reusability: in your code, you can write any extension function like byte() above and use it as an integral part of rehexes - russianLetter() , ipAddress() , time() ...

What did not work:

AnyOf anyOf(...) looks anyOf(...) , it was not possible to achieve better.
The recording density is much inferior to the traditional form, the rehex into a half-screen length turns into a block into a half-screen height. But, probably, readable.

Sources, tests, and dependencies ready to be added to a project are in the repository on Github .

What do you think about domain-specific regular expression languages? Have you ever used it? And other DSL?

I will be glad to discuss all that sounded.

Good luck!

Source: https://habr.com/ru/post/312776/

All Articles