
A language within a language, or embedding XPath in Scala

Scala is a great language, and it is easy to fall in love with: the code can be concise yet understandable, flexible yet strongly typed. Well-thought-out tools let you express your ideas in the language instead of struggling against it.

But the same tools also allow you to write extremely complex code.
Intellectual acrobatics in the style of scalaz, or computations on the type system à la shapeless, practically guarantee that only a handful of people will understand your code.

In this article I will talk about something that, most likely, is not worth doing: I will show how to embed another language into Scala.

And although we will be embedding XPath, the described approach is suitable for any language for which you can build a syntax tree.

Motivation


In Scala you can use all the XML tools available in Java (and there are quite a few of them). But then the code will also look like good old Java code, which is not the most appealing prospect.

There is also XML literal support built into the syntax of the language:

  scala> <root>
       | <node attr="aaa"/>
       | <node attr="111"> text</node>
       | </root>
  res0: scala.xml.Elem =
  <root>
  <node attr="aaa"/>
  <node attr="111"> text</node>
  </root>

  scala> res0 \\ "node"
  res1: scala.xml.NodeSeq = NodeSeq(<node attr="aaa"/>, <node attr="111"> text</node>)

  scala> res1 \\ "@attr"
  res2: scala.xml.NodeSeq = NodeSeq(aaa, 111)

It might seem that this is all you need, but no. It only vaguely resembles XPath, and even slightly complex queries become cumbersome and unreadable.

But after some acquaintance with Scala it becomes clear that its creators do not call it a scalable (extensible) language without reason: if something is missing, it can be added.

The task I set myself: get as close to real XPath as possible while integrating smoothly into the language.

Result


All the code is here: https://github.com/senia-psm/scala-xpath.git
How to try it out.
If you do not have git and sbt yet, you will have to install them (git, sbt) and, if necessary, configure a proxy (for git and for sbt; in Program Files (x86)\SBT\ there is a dedicated txt file for such options).

Clone the repository:
 git clone https://github.com/senia-psm/scala-xpath.git 

Go to the repository folder (scala-xpath) and start the REPL inside the project:
 sbt console 

Many of the examples below also assume that the following imports have been performed:
 import senia.scala_xpath.macros._, senia.scala_xpath.model._ 




What and how


The way to achieve the goal is determined by the goal itself.
Embedding XPath as a DSL obviously will not do: otherwise it would not quite be XPath. An XPath expression can only live inside Scala code as a string.
Which means we will need:

  1. Parser combinators, to parse and validate the string.
  2. String interpolation, to embed variables and functions into XPath.
  3. Macros, to perform the checks at compile time.


Preparing the object model


We take the XPath 1.0 specification and rewrite it in Scala.
Almost all of the logic is expressed through the type system and Scala's inheritance mechanism; the exceptions are a couple of places where restrictions are enforced with require.
It is worth noting the keyword "sealed", which prohibits extending a class (or implementing a trait) outside the file where it is defined. In particular, when pattern matching on a sealed hierarchy, the compiler can check that all possible cases have been covered.
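
To give a feel for what this looks like, here is a minimal, simplified sketch of such a model; the names and the selection of cases are illustrative, not the actual definitions from the repository:

  // A hypothetical fragment of an XPath-like object model: grammar alternatives
  // become sealed hierarchies, extra restrictions are enforced with require.
  sealed trait AxisSpecifier
  case object Child     extends AxisSpecifier
  case object Attribute extends AxisSpecifier

  case class Name(value: String) {
    require(value.nonEmpty && !value.contains(":"), s"illegal name: $value")
  }

  sealed trait NodeTest
  case class NameTest(name: Name) extends NodeTest
  case object AnyNodeTest         extends NodeTest

  case class Step(axis: AxisSpecifier, test: NodeTest)

  // Pattern matching on a sealed hierarchy: the compiler checks exhaustiveness.
  def render(axis: AxisSpecifier): String = axis match {
    case Child     => "child::"
    case Attribute => "attribute::"
  }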

Parsing XPath


Introduction to Parsers
Parsers are functions that take a sequence of elements and, on success, return the result of processing together with the rest of the sequence.
Unsuccessful results come in two kinds: “failure” (Failure) and “error” (Error).
Figuratively speaking, a parser bites off part of the sequence from the beginning and converts the bitten-off piece into an object of a certain type.

The simplest parser is one that checks that the first element of the sequence equals a previously specified element and returns that element as a successful result. The remainder is the sequence without that element.

To create such a parser from an element, the accept method is used. It is declared implicit, so whenever the compiler encounters an element in a place where it expects a parser, it inserts an application of this method to that element.
Suppose we are parsing a sequence of characters:
  def elementParser: Parser[Char] = 'c'
  // equivalent to: def elementParser: Parser[Char] = accept('c')

So if, when combining parsers, you see a bare element where a parser is expected, you know that it stands for the elementary parser for that element.

In essence, this is the only parser defined explicitly.
All other parsers are obtained by combining parsers and transforming their results.
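
For illustration, here is a minimal self-contained sketch of such an elementary parser over a sequence of characters (the object name CharDemo is made up for the example):

  import scala.util.parsing.combinator.Parsers
  import scala.util.parsing.input.CharSequenceReader

  object CharDemo extends Parsers {
    type Elem = Char

    // A bare 'c' is lifted to a parser by the implicit accept method.
    def elementParser: Parser[Char] = 'c'
  }

  // CharDemo.elementParser(new CharSequenceReader("cat"))
  // succeeds with 'c' and leaves "at" as the remainder.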

Combining parsers

A white lie
Strictly speaking, there are no operators in Scala, only methods; but if you already know that, you probably do not need to be told about parsers either.

The binary operator "~" combines two parsers on the “and” principle: it succeeds only if the first parser succeeds and then the second succeeds on the remainder left by the first.
Figuratively speaking, the first parser bites off what suits it, and the second feasts on the leftovers.
The result is a container holding the results of both parsers.
 parser1 ~ parser2 

In this way you can combine any number of parsers.
This combinator has two relatives, "~>" and "<~", which work the same way but keep the result of only one of the combined parsers (the one the arrow points to).
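
A small sketch of "~", "~>" and "<~" on a made-up mini-grammar (RegexParsers and the name SeqDemo are used here only to keep the example short):

  import scala.util.parsing.combinator.RegexParsers

  object SeqDemo extends RegexParsers {
    def digit: Parser[String] = "[0-9]".r

    def pair: Parser[String ~ String] = digit ~ digit       // keeps both results in a ~ container
    def inParens: Parser[String]      = "(" ~> digit <~ ")" // parentheses are consumed, only the digit is kept
  }

  // SeqDemo.parseAll(SeqDemo.inParens, "(7)")  =>  parsed: 7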

The binary operator "|" combines on the “or” principle: it succeeds if at least one of the parsers succeeds on the original input. If the first parser returns a “failure” (but not an error), the same input is fed to the second one.

rep: repetition. Given a parser myParser, the parser formed by rep(myParser) keeps “biting off” input with myParser until the first unsuccessful application; the results of all the “bites” are collected into a list.
There are related combinators for a non-empty collection of results (rep1) and for a sequence with delimiters (repsep).
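
A sketch of rep, rep1 and repsep in the same style (the names are again invented for the example):

  import scala.util.parsing.combinator.RegexParsers

  object RepDemo extends RegexParsers {
    def digit: Parser[String] = "[0-9]".r

    def digits: Parser[List[String]]  = rep(digit)          // zero or more digits
    def digits1: Parser[List[String]] = rep1(digit)         // at least one digit
    def csv: Parser[List[String]]     = repsep(digit, ",")  // digits separated by commas
  }

  // RepDemo.parseAll(RepDemo.csv, "1,2,3")  =>  parsed: List(1, 2, 3)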

Transforming the result

If you want to transform the result of parsing, the operators ^^^ and ^^ come to the rescue:
^^^ replaces the result with a given constant, and ^^ transforms the result using a given function.
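
A sketch of both transformations (again a made-up mini-grammar):

  import scala.util.parsing.combinator.RegexParsers

  object MapDemo extends RegexParsers {
    def number: Parser[Int]      = "[0-9]+".r ^^ { _.toInt }  // transform the result with a function
    def wildcard: Parser[String] = "*" ^^^ "any element"      // replace the result with a constant
  }

  // MapDemo.parseAll(MapDemo.number, "42")  =>  parsed: 42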


Combining parsers (plus well-written w3c specifications) lets you write the parser almost without thinking.
In essence, we are rewriting the specification a second time. The only significant difference is that I replaced the recursive definitions with “cyclic” ones (rep and repsep).

For example:

Specification:
  [15] PrimaryExpr ::= VariableReference
                     | '(' Expr ')'
                     | Literal
                     | Number
                     | FunctionCall

Parser :
  def primaryExpr: Parser[PrimaryExpr] =
    variableReference |
    `(` ~> expr <~ `)` ^^ { GroupedExpr } |
    functionCall |
    number |
    literal

The only requirement is to make sure that the “strictest” parsers come before the rest in a "|" union. In this example, literal would succeed wherever functionCall succeeds, simply because it happily parses the function's name, so if literal were placed earlier, the input would never reach functionCall (see the small sketch below).
The entire set of parsers fits into a hundred and fifty lines, which is significantly shorter than the definition of the object model.
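
Here is a tiny sketch of the ordering issue on a made-up mini-grammar (the names keyword and ident are invented for the example and have nothing to do with the XPath parser itself):

  import scala.util.parsing.combinator.RegexParsers

  object OrderDemo extends RegexParsers {
    def keyword: Parser[String] = "div" ^^^ "KEYWORD"
    def ident: Parser[String]   = "[a-z]+".r ^^ { n => s"IDENT($n)" }

    def good: Parser[String] = keyword | ident  // parseAll(good, "div") => KEYWORD
    def bad: Parser[String]  = ident | keyword  // parseAll(bad, "div")  => IDENT(div): keyword never gets a chance
  }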

Mixing in variables


To add variables to an expression we will use the string interpolation mechanism introduced in version 2.10.
The mechanism is quite simple: when the compiler encounters a string literal immediately preceded (without a space) by a valid method name, it performs a simple rewrite:
 t"strinf $x interpolation ${ obj.name.toString } " StringContext("strinf ", " interpolation ", " ").t(x, { obj.name.toString }) 

The string is broken into pieces at the occurrences of variables and expressions, and the pieces are passed to the StringContext factory method. The name preceding the string is used as a method name, and all the variables and expressions are passed to that method as arguments.
For methods like “s” and “f” that is the end of the story; for methods that StringContext does not have, the compiler looks for an implicit class wrapping StringContext that provides the desired method. Such a search is a general Scala mechanism and is not specific to string interpolation.
The resulting code:
  new MyStringContextHelper(StringContext("strinf ", " interpolation ", " ")).t(x, { obj.name.toString }) 
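
For illustration, here is a minimal sketch of such a wrapper; the class name MyStringContextHelper comes from the example above, while the body of t (which simply glues the pieces back together) is made up:

  implicit class MyStringContextHelper(sc: StringContext) {
    def t(args: Any*): String =
      sc.parts.zipAll(args.map(_.toString), "", "").map { case (p, a) => p + a }.mkString
  }

  val x = 42
  // t"strinf $x interpolation "  =>  "strinf 42 interpolation "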


But what about our parser? We no longer have a continuous sequence of characters; we have a sequence of characters interleaved with something else.
Is all that work down the drain?
This is where the ability to parse more than just sequences of characters pays off.
We have a sequence of characters and something else (more on that later), which is naturally described by Either. A couple of articles about Either have been translated on Habr by Sigrlami.
To regain the full power of the parsers, it is enough to write a couple of auxiliary tools, in particular conversions from Char, String and Regex to the corresponding parsers.
All the necessary tools are in EitherParsers. It is worth paying attention to the abstract type R: no assumptions are made about it, so the toolkit is suitable for an as-yet-unknown way of representing variables.
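
A rough sketch of the idea (not the actual code from EitherParsers): the element type becomes an Either, plain characters are lifted back to elementary parsers, and anything on the Right side gets a parser of its own.

  import scala.util.parsing.combinator.Parsers

  trait EitherElemSketch extends Parsers {
    type R                          // representation of embedded variables, deliberately left abstract
    type Elem = Either[Char, R]

    // Lift a plain character back to an elementary parser over the new element type.
    implicit def char(c: Char): Parser[Char] =
      accept(s"'$c'", { case Left(`c`) => c })

    // Accept any embedded value.
    def embedded: Parser[R] =
      accept("embedded value", { case Right(r) => r })
  }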

Intervening in compilation


In my opinion there is too little documentation and too few sensible examples for macros. That does not mean I am going to write a comprehensive explanation of what macros are and what they are eaten with.
The main thing to know is that a macro is invoked when the compiler encounters a call to a method implemented with the macro keyword, and the macro implementation must return a newly constructed syntax tree.
Let's see what kind of tree we would have to produce, using the simplest example:

  scala> import scala.reflect.runtime.universe._
  import scala.reflect.runtime.universe._

  scala> showRaw(reify( "str" -> 'symb ))
  res0: String = Expr(Apply(Select(Apply(Select(Ident(scala.Predef), newTermName("any2ArrowAssoc")), List(Literal(Constant("str")))), newTermName("$minus$greater")), List(Apply(Select(Ident(scala.Symbol), newTermName("apply")), List(Literal(Constant("symb")))))))


There is no desire to build such a thing by hand.
Let's see what Scala offers that preserves typing and avoids manual work.
On the one hand, not much: the literal method, which converts a limited set of “basic types” into syntax trees, and reify, which does all the manual work for you. If you want to bring in an external variable that is itself such a tree, you use that tree's splice method, designed specifically to tell reify that you want to embed an expression of type Expr[T] as part of the new tree, with resulting type T.
On the other hand, these methods are quite enough: anything extra can be written on top of what is already available.
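
A small sketch of how reify and splice compose trees, using the runtime universe (inside a macro the same thing is done through the context c, which also provides the literal method mentioned above for the “basic types”):

  import scala.reflect.runtime.universe._

  val part: Expr[Int]  = reify { 1 + 2 }
  val whole: Expr[Int] = reify { part.splice * 10 }  // splice embeds the part tree into the new one

  // showRaw(whole.tree) prints the raw syntax tree of the combined expression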

Adding an interpolator that is processed by a macro is itself extremely concise:
  implicit class XPathContext(sc: StringContext) {
    def xp(as: Any*): LocationPath = macro xpathImpl
  }


The macro processing function is declared as follows:
 def xpathImpl(c: Context)(as: c.Expr[Any]*): c.Expr[LocationPath] 

It is clear where the variables come from, but how do we get hold of the string?
For that the context can be used to “look outside” the function, to look around, so to speak.
More precisely, to look at the expression on which the target method xp is called.
This can be done via c.prefix.
But what will we find there? As mentioned earlier, it should be an expression of the form StringContext("strinf ", " interpolation ", " ").
Let's look at the corresponding tree:
  scala> import scala.reflect.runtime.universe._
  import scala.reflect.runtime.universe._

  scala> showRaw(reify(StringContext("strinf ", " interpolation ", " ")))
  res0: String = Expr(Apply(Select(Ident(scala.StringContext), newTermName("apply")), List(Literal(Constant("strinf ")), Literal(Constant(" interpolation ")), Literal(Constant(" ")))))

As we can see, all the string parts can be extracted explicitly, which is exactly what we will do:
  val strings = c.prefix.tree match {
    case Apply(_, List(Apply(_, ss))) => ss
    case _ => c.abort(c.enclosingPosition, "not a interpolation of XPath. Cannot extract parts.")
  }

  val chars = strings.map {
    case c.universe.Literal(Constant(source: String)) => source.map{ Left(_) }
    case _ => c.abort(c.enclosingPosition, "not a interpolation of XPath. Cannot extract string.")
  }


But it is not only the input that has changed. The result of a parser can no longer be an object from our object model: such an object simply cannot be built when it depends not on a value but on a parameter of type c.Expr[Any].

We change our parser accordingly: wherever an external variable can somehow end up in the result, the parser can no longer return T and must return c.Expr[T] instead. For converting non-elementary types to the corresponding Expr we write literal helper methods on top of the available ones, for example:
  def literal(name: QName): lc.Expr[QName] = reify{ QName(literal(name.prefix).splice, literal(name.localPart).splice) } 

The principle behind all such functions is very simple: we take the argument apart into truly elementary pieces and assemble it again inside reify.

This will require some mechanical work, but our parser will not change much.

The final step is implementing several parsers that can handle a variable in the input.
Here is the parser for embedding a variable:
  accept("xc.Expr[Any]", { case Right(e) => e } ) ^? ({ case e: xc.Expr[BigInt] if confirmType(e, tagOfBigInt.tpe) => reify{ CustomIntVariableExpr(VariableReference(QName(None, NCName(xc.literal(nextVarName).splice))), e.splice) } case e: xc.Expr[Double] if confirmType(e, xc.universe.definitions.DoubleClass.toType) => reify{ CustomDoubleVariableExpr(VariableReference(QName(None, NCName(xc.literal(nextVarName).splice))), e.splice) } case e: xc.Expr[String] if confirmType(e, xc.universe.definitions.StringClass.toType) => reify{ CustomStringVariableExpr(VariableReference(QName(None, NCName(xc.literal(nextVarName).splice))), e.splice) } }, e => s"Int, Long, BigInt, Double or String expression expected, $e found." ) 

The initial accept("xc.Expr[Any]", { case Right(e) => e }) parser is very simple: it accepts any Right container holding a tree and returns that tree.
The subsequent conversion determines whether the variable can be used as one of the three desired types and, if so, converts it accordingly.

As a result, we get the following behavior:
  scala> val xml = <book attr="111"/>
  xml: scala.xml.Elem = <book attr="111"/>

  scala> val os = Option("111")
  os: Option[String] = Some(111)

  scala> xml \\ xp"*[@attr = $os]"   // an Option[String] is rejected
  <console>:16: error: Int, Long, BigInt, Double or String expression expected, Expr[Nothing](os) found.
                xml \\ xp"*[@attr = $os]"
                                    ^

  scala> xml \\ xp"*[@attr = ${ os.getOrElse("") } ]"   // a String expression is accepted
  res1: scala.xml.NodeSeq = NodeSeq(<book attr="111"/>)


And although the error messages still need some work, variables are already embedded quite conveniently.

Embedding functions required quite a lot of code (23 variants, one for each arity from 0 to 22 parameters) and is not very convenient to use, since the function has to accept Any, while what actually arrives is mostly a NodeList (though a String or a Double may arrive as well):

  scala> import org.w3c.dom.NodeList
  import org.w3c.dom.NodeList

  scala> val isAllowedAttributeOrText = (_: Any, _: Any) match {
       |   case (a: NodeList, t: NodeList) if a.getLength == 1 && t.getLength == 1 =>
       |     a.head.getTextContent == "aaa" ||
       |     t.head.getTextContent.length > 4
       |   case _ => false
       | }
  isAllowedAttributeOrText: (Any, Any) => Boolean = <function2>

  scala> val xml = <root attr="11111" ><inner attr="111" /><inner attr="aaa" >inner text</inner> text </root>
  xml: scala.xml.Elem = <root attr="11111"><inner attr="111"/><inner attr="aaa">inner text</inner> text </root>

  scala> xml \\ xp"*[$isAllowedAttributeOrText(@attr, text())]"
  res0: scala.xml.NodeSeq = NodeSeq(<root attr="11111"><inner attr="111"/><inner attr="aaa">inner text</inner> text </root>, <inner attr="aaa">inner text</inner>)


Here we get our first departure from XPath syntax (besides the ability to write expressions like ${ arbitrary code } in place of variables): the function being injected must be preceded by a dollar sign.

Method implementation


Naturally, the "\" and "\\" methods on scala.xml.NodeSeq did not appear by magic: they are added with an implicit class in the model's package object.

Similar methods are bolted onto org.w3c.dom.Node and NodeList in the same way.
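
To illustrate the general pattern (a sketch only; the repository's implementation lives in the model's package object and works with the already parsed LocationPath rather than a raw string), such methods can be pinned onto org.w3c.dom.Node roughly like this:

  import javax.xml.xpath.{XPathConstants, XPathFactory}
  import org.w3c.dom.{Node, NodeList}

  implicit class NodeXPathOps(node: Node) {
    // Evaluate an XPath expression relative to this node and return the matching node set.
    def \\(expression: String): NodeList = {
      val xpath = XPathFactory.newInstance().newXPath()
      xpath.evaluate(expression, node, XPathConstants.NODESET).asInstanceOf[NodeList]
    }
  }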

And it is in actually applying the XPath that certain problems arise.

Unresolved problems


One unresolved problem is getting rid of java.lang.System.setSecurityManager(null): judging by the implementation of com.sun.org.apache.xpath.internal.jaxp.XPathFactoryImpl, without it one cannot register one's own function handler.

Compile-time error messages need some work.
While for an incorrectly typed function the error message is ideal (a separate compliment to the compiler's thoroughness):
  scala> xml \\ xp"*[$isAllowedAttributeOrText(@attr)]"
  <console>:1: error: type mismatch;
   found   : (Any, Any) => Boolean
   required: Any => Any
                xml \\ xp"*[$isAllowedAttributeOrText(@attr)]"
                            ^

for all the other errors the standard message format is not followed and the reported position points at the beginning of the string.
Unlike the previous one, this problem is solvable.

Performance when working with scala.xml leaves much to be desired: the conversion from scala.xml to w3c.dom goes through a string first, and then back again.
The only real solution is to evaluate the XPath ourselves.
That would also let us get rid of the not-too-convenient typing of embedded functions.

Performance when working with w3c.dom could be improved slightly: the XPath is currently compiled from a string even though a ready-made object model exists. Converting between the object models directly could speed up XPath creation somewhat.

Conclusion


We embedded XPath into Scala without serious problems or limitations.
Variables and functions from the current scope can be used wherever the specification allows them.
When working with w3c.dom, with some modifications even a small speed-up is possible, since the expression is parsed at compile time.

Everything is much simpler than it seems at first glance.
At first, the very idea of intervening in compilation is intimidating, yet the result is achieved with minimal effort.
Yes, the compiler API is documented much worse than the standard library, but it is logical and understandable.
Yes, IDEA understands path-dependent types poorly, but it provides very convenient navigation, including into the compiler API, and it takes implicit conversions into account.

Source: https://habr.com/ru/post/176285/

