Hello!
This article is about the implementation of one specific DSL ( domain specific language ) for regular expressions using Kotlin, but it may well give a general idea of how to write your DSL on Kotlin and what it usually will do "under the hood "any other DSL using the same language features.
Many have already used Kotlin, or at least tried to do it, and the rest could well have heard that Kotlin has to write elegant DSL, which has some brilliant examples - Anko and kotlinx.html .
Of course, for regular expressions, this has already been done (and yet: in Java , in Scala , in C # - there are many implementations, it seems, this is a common entertainment). But if you want to practice or try Kotlin's DSL-oriented language capabilities, then welcome to Cat.
In the worst case, probably so.
Most Java DSLs suggest using call chains for their constructs, as in this example Java Regex DSL:
Regex regex = RegexBuilder.create() .group("#timestamp") .number("#hour").constant(":") .number("#min").constant(":") .number("#secs").constant(":") .number("#ms") .end() .build();
We could use this approach, but it has a number of disadvantages, among which we can immediately note two:
inconvenient implementation of nested constructs ( group
and end
above), due to which you will have to fight with the formatter, and there is no banal verification of the opening and closing elements, you can write an extra .end()
;
With these shortcomings in Kotlin you can cope if you implement DSL in the style of Type-Safe Groovy-Style Builder (when explaining technical details, this article will largely repeat the documentation page by reference). Then the code on it will look like this Anko example:
verticalLayout { val name = editText() button("Say Hello") { onClick { toast("Hello, ${name.text}!") } } }
Or to this example kotlinx.html:
html { body { div { a("http://kotlinlang.org") { target = ATarget.blank +"Main site" } } } }
Looking ahead, I’ll say that the resulting language will look something like this:
val r = regex { group("prefix") { literally("+") oneOrMore { digit(); letter() } } 3 times { digit() } literally(";") matchGroup("prefix") }
class RegexContext { } fun regex(block: RegexContext.() -> Unit): Regex { throw NotImplementedError() }
What is written here? The regex
function returns the constructed Regex
and takes a single argument — another function of the type RegexContext.() -> Unit
. If you are already familiar with Kotlin, feel free to skip a couple of paragraphs explaining what it is.
The types of functions in Kotlin are written something like this: (Int, String) -> Boolean
is a predicate of two arguments - or so: SomeType.(Int) -> Unit
is a function that returns a Unit (analog of the void-function), and except the Int
argument also accepts a receiver of type SomeType
.
Functions that take a receiver help us a lot with building DSL due to the fact that you can pass a lambda expression as an argument of this type, and it will have an implicit this
that is of the same type as the receiver. A simple example is the library function with
:
fun <T, R> with(t: T, block: T.() -> R): R = t.block() // block receiver // , with(ArrayList<Int>()) { for(i in 1..10) { add(i) } // add receiver println(this) // this -- ArrayList<Int> }
Great, now we can call regex { ... }
and, inside the curly braces, work with some RegexContext
instance as if this
. It remains just a little - to implement the members RegexContext
. :)
Let's create a regular expression in parts - each statement of our DSL will simply append to the unfinished expression the next part. These parts will be stored by RegexContext
.
class RegexContext { internal val regexParts = mutableListOf<String>() // private fun addPart(part: String) { // regexParts.append(part) } }
Accordingly, the regex {...}
function will now look like this:
fun regex(block: RegexContext.() -> Unit): Regex { val context = RegexContext() context.block() // block, - context val pattern = context.regexParts.toString() return Regex(pattern) // - Regex }
Next, we implement the RegexContext
functions that add different parts to the regular expression.
The following functions, unless explicitly stated otherwise, are also located in the body of the class.
Right?
fun anyChar(s: String) = addPart(".")
This call simply adds a point to the expression, which denotes a subexpression corresponding to any single character.
Similarly, we implement the functions digit()
, letter()
, alphaNumeric()
, whitespace()
, wordBoundary()
, wordCharacter()
and even startOfString()
and endOfString()
- they all look about the same.
fun digit() = addPart("\\d") fun letter() = addPart("[[:alpha:]]") fun alphaNumeric() = addPart("[A-Za-z0-9]") fun whitespace() = addPart("\\s") fun wordBoundary() = addPart("\\b") fun wordCharacter() = addPart("\\w") fun startOfString() = addPart("^") fun endOfString() = addPart("$")
But to add an arbitrary string to a regular expression, you have to first convert it so that the characters present in the string are not interpreted as service characters. The easiest way to do this is with the Regex.escape (...) function:
fun literally(s: String) = addPart(Regex.escape(s))
For example, literally(".:[test]:.")
will add a part of \Q.:[test]:.\E
to the expression.
What about quantifiers? Obvious observation: the quantifier is hung on a subexpression, which in itself is also a valid regex. Let's add some nesting!
We want the nested block of code in curly braces to specify a subexpression of a quantifier, like this:
val r = regex { oneOrMore { optional { anyChar() } literally("-") } literally(";") }
We will do this with the help of the RegexContext
functions, which behave almost the same as regex {...}
, but use the constructed subexpression themselves. We first add auxiliary functions:
private fun addWithModifier(s: String, modifier: String) { addPart("(?:$s)$modifier") // non-capturing group } private fun pattern(block: RegexContext.() -> Unit): String { // -- val innerContext = RegexContext() innerContext.block() // block RegexContext return innerContext.regexParts.toString() // }
And then we use them to implement our "quantifiers":
fun optional(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "?") fun oneOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "+")
fun oneOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "+") fun oneOrMore(s: String) = oneOrMore { literally(s) } fun optional(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "?") fun optional(s: String) = optional { literally(s) } fun zeroOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "*") fun zeroOrMore(s: String) = zeroOrMore { literally(s) }
Even in regexes, it is possible to set the number of expected entries precisely or using a range. We want this too, right? It is also a good reason to use infix functions - functions of two arguments, one of which is receiver. Calls to such functions will be as follows:
val r = regex { 3 times { anyChar() } 2 timesOrMore { whitespace() } 3..5 times { literally("x") } // 3..5 -- IntRange }
And the functions themselves are declared as:
infix fun Int.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{$this}") infix fun IntRange.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{${first},${last}}")
infix fun Int.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{$this}") infix fun Int.times(s: String) = this times { literally(s) } infix fun IntRange.times(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{${first},${last}}") infix fun IntRange.times(s: String) = this times { literally(s) } infix fun Int.timesOrMore(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{$this,}") infix fun Int.timesOrMore(s: String) = this timesOrMore { literally(s) } infix fun Int.timesOrLess(block: RegexContext.() -> Unit) = addWithModifier(pattern(block), "{0,$this}") infix fun Int.timesOrLess(s: String) = this timesOrLess { literally(s) }
The tool for working with rehexes cannot be called such if it does not support groups, so let's support them, for example, in this form:
val r = regex { anyChar() val separator = group { literally("+"); digit() } // anyChar() matchGroup(separator) // anyChar() }
However, the groups introduce new complexity into the regex structure: they are numbered "through and through" from left to right, ignoring the nesting of subexpressions. This means that the calls of group {...}
cannot be considered independent of each other, and even more: all of our nested subexpressions are now also linked to each other.
To support the numbering of groups, slightly change the RegexContext
: now it will remember how many groups it already has:
class RegexContext(var lastGroup: Int = 0) { ... }
And so that our nested contexts know how many groups were there before them and report how many were added inside them, change the function pattern(...)
:
private fun pattern(block: RegexContext.() -> Unit): String { val innerContext = RegexContext(lastGroup) // innerContext.block() lastGroup = innerContext.lastGroup // return innerContext.regexParts.toString() }
Now nothing prevents us from correctly implementing the group
:
fun group(block: RegexContext.() -> Unit): Int { val result = ++lastGroup addPart("(${pattern(block)})") return result }
Named group case:
fun group(name: String, block: RegexContext.() -> Unit): Int { val result = ++lastGroup addPart("(?<$name>${pattern(block)})") return result }
And matching groups, both indexed and named:
fun matchGroup(index: Int) = addPart("\\$index") fun matchGroup(name: String) = addPart("\\k<$name>")
Yes! We almost forgot the important construct of regular expressions - alternatives. For literals, alternatives are implemented trivially:
fun anyOf(vararg terms: String) = addPart(terms.joinToString("|", "(?:", ")") { Regex.escape(it) }) // terms , , // Regex.escape(...)
No more difficult implementation for nested expressions:
fun anyOf(vararg blocks: RegexContext.() -> Unit) = addPart(blocks.joinToString("|", "(?:", ")") { pattern(it) })
fun anyOf(vararg characters: Char) = addPart(characters.joinToString("", "[", "]").replace("\\", "\\\\").replace("^", "\\^")) fun anyOf(vararg ranges: CharRange) = addPart(ranges.joinToString("", "[", "]") { "${it.first}-${it.last}" })
But wait, what if we want to use different things as alternatives in the same anyOf(...)
- for example, both a string and a block with a code for a nested subexpression? A slight disappointment awaits us here: in Kotlin, there are no union types (union types), and write the argument type String | RegexContext.() -> Unit | Char
String | RegexContext.() -> Unit | Char
String | RegexContext.() -> Unit | Char
we can not. I was only able to get around this with frightening-looking crutches, which still do not make DSL better, so I decided to leave everything as it was written above - in the end, both String
and Char
can be written in nested subexpressions using the appropriate overload anyOf {...}
.
Use anyOf(vararg parts: Any)
, where Any
is the type to which any object belongs. Check which type is inside, respectively, and throw a IllegalArgumentException
to a careless user who has passed a bad argument, and he will be very pleased.
Hardcore In Kotlin, a class can override the invoke operator () , and then objects of this class can be used as functions: myObject(arg)
, and if the operator has several overloads, then the object will behave like several function overloads. Then you can try currying the anyOf(...)
function, but since it has an arbitrary number of arguments, we don’t know when they will end — hence, each partial application should override the result of the previous one and then apply itself, as if its argument is the last .
If this is done neatly, it will even work, but we unexpectedly come to a head on the unpleasant moment in the Kotlin grammar: you cannot use consecutive calls with curly braces in the invoke
operator's call chain.
object anyOf { operator fun invoke(s: String) = anyOf // operator fun invoke(r: RegexContext.() -> Unit) = anyOf } anyOf("a")("b")("c") // anyOf("123") { anyChar() } { digit() } // ! anyOf("123")({ anyChar() })({ digit() }) // ((anyOf("123")) { anyChar() }) { digit() } //
Well, and we need it so?
In addition, it would be nice to reuse regular expressions, both built by our DSL, and come to us from somewhere else. This is easy to do, the main thing is not to forget about the numbering of the groups. From the regex, you can pull out the number of its groups: Pattern.compile(pattern).matcher("").groupCount()
, and all that remains is to implement the corresponding RegexContext
function:
fun include(regex: Regex) { val pattern = regex.pattern addPart(pattern) lastGroup += Pattern.compile(pattern).matcher("").groupCount() }
And on this, perhaps, the mandatory features end.
Thank you for reading to the end! What have we got? It is a viable DSL for regexes that can be used:
fun RegexContext.byte() = anyOf({ literally("25"); anyOf('0'..'5') }, { literally("2"); anyOf('0'..'4'); digit() }, { anyOf("0", "1", ""); digit(); optional { digit() } }) val r = regex { 3 times { byte(); literally(".") } byte() }
(Question: what is this regex for? Really, is it simple?)
More advantages:
byte()
above and use it as an integral part of rehexes - russianLetter()
, ipAddress()
, time()
...What did not work:
anyOf(...)
looks anyOf(...)
, it was not possible to achieve better.Sources, tests, and dependencies ready to be added to a project are in the repository on Github .
What do you think about domain-specific regular expression languages? Have you ever used it? And other DSL?
I will be glad to discuss all that sounded.
Good luck!
Source: https://habr.com/ru/post/312776/
All Articles