This topic talks about the possibility of using XPath to select objects from the database in cases when using SQL is undesirable.
Formulation of the problem
In many systems, for various purposes it is often required to select objects from the database as the values ​​of various fields. For example, select a supplier of goods from the list. This is part of the system user interface.
However, besides users, the system has programmers, tuning engineers, designers, etc., who also sometimes need to tell the system which objects to select. For example, it is possible that after selecting a supplier’s city, it is necessary to limit the list of suppliers. For this, the programmer usually writes the corresponding SQL, for example, “SELECT id, name FROM agents WHERE city =?”. That is, SQL is used to tell the system which objects to select from the database.
Most often, this choice is part of the system source code. Hidden in PHP or in Java code (or, better, in SQL bundles or in stored procedures) such code looks familiar. But sometimes the filter becomes not part of the source code, but part of the system setup that the system engineer or layout designer XSLT / HTML produces.
')
Example 1: Enterprise Management System. There is a list of objects "Employee" and a list of objects "Office". It is necessary to add an office selection field to the “Employee” object, while showing the list of offices in the same country where the employee is registered. The usual way is to write the corresponding SQL filter to the “Office employee” attribute properties of the “Employee” class. This SQL takes, for example, the input of an employee ID and calculates the corresponding list of offices.
Example 2: A content management system based on XSLT. An object is selected from the database (for example, an article), XML is built, after which the XSLT transform produces HTML and is given to the client. However, often, apart from the text of the article itself, additional elements should be attached to XML - for example, a list of other articles in the section for quick navigation. Or even a list of sections of the current site, if the design allows you to change it through the control system. To select the corresponding objects one could use the same SQL, taking as arguments the ID of the current object (article).
Using SQL in such examples has the following disadvantages:
1) A system engineer who is not familiar with Java / PHP / .NET should still know SQL
2) The engineer must know the internal structure of the system tables.
3) Using SQL for such settings of the application "freezes" the structure of the tables, making it part of the public API
4) This SQL has the right to read at least all the data from the relevant tables. Restricting rights to an application without using row-level security is unrealistic.
Therefore, in Arp.Site, another method of indication is used to select objects from template texts. Since XSLT is mainly used for templates, it was most logical to use the technology that is closest to XML. Namely XPath.
XPath allows you to select objects from the
tree (more precisely, from an acyclic oriented graph), taking into account
predicate constraints. Writing expressions to XPath is more compact and more comprehensible than in SQL.
Example 1:
//[@=$/@]
Example 2:
//folder[@active='true']/article
- to select articles from the same section.
//site[@active='true']/folder
- to select sections of the current site.
Technical implementation
For implementation it is possible to use two approaches: interpretation and compilation.
Interpretation is the construction of a real (for very small sites) or a virtual (for large systems) DOM tree of objects and “feeding” to an XPath engine of the XPath expression itself and the root (or current object) of the DOM tree. The result of the execution is a pointer to another object (or NodeList) of the DOM tree, which is converted into a set of system IDs.
Benefits:
- the fastest way to implement "from scratch"
- the most understandable, if there are no optimizations aimed at accelerating
disadvantages
- the most brake way, if each operation of transition on a tree will demand the appeal to a database
- It is very difficult to do execution optimization.
What is optimization and why are they needed? Example:
- part of the XPath expression is, for example,
//article[@name='123']
. Any XPath processor will go through the entire tree of objects, and for each will compare its type (article), as well as the presence of the
name property and its equality to the value '123'. The presence of optimizations makes it possible to replace all such operations, with, for example, one operation of selecting all the articles whose names are equal to '123'.
- XPath expression
//*[@active='true']
should select the list of “current” objects - site, section, subsection, article (for building the navigation path). However, enumeration of all objects on the tree is undesirable, therefore these expressions should also be optimized.
Arp.Site uses the Apache Commons JXPath library for interpretation. This library accepts any object as an input (bin, DOM node, JDOM node, etc), or any arbitrary if you explain to the library how to get properties and child objects from this object. Also, this library allows you to replace some standard handlers with your own. For example, replace the Step handler for Axis = DESCENDANT_OR_SELF (XPath operation "//") with your optimized one.
The input of the library is the object associated with the information tree object. If necessary, information about its descendants is loaded from the database.
SQL Compilation
The second approach was the preliminary compilation of XPath to SQL, after which it was cached for the future and executed.
Benefits
- significant acceleration due to the uniqueness of the SQL database query
disadvantages
- it is much more difficult to write code, you need to keep track of all used table alias, return values, correctly handle various predicates.
- Work on building SQL comes down to working on strings
- a significant number of optimizations are required to simplify and speed up SQL
An example of optimization. Suppose we have XPATH
//site[@active='true']/folder
. Without SQL optimization, it could look like this:
SELECT f.id FROM objects f, objects s WHERE s.id=f.parent AND s.class='site' AND f.class='folder' AND s.id IN (1,4,73,423)
(where 1, 4,73,423 - a list of "current" objects).
However, the system can use the following hints:
- site in the object tree cannot be a descendant of site (i.e. it is only one in the list of current objects)
- site exactly is in the database
- its id is "1"
then SQL can obviously be rewritten as
SELECT f.id FROM objects f WHERE f.parent=1 f.class='folder'
There are a lot of similar optimizations, although most of them are system-specific.
Examples of compilations in SQL:
//site[@ active='true']/*[@active='true']/*[@state='published']
SELECT t7.id FROM struct_cells t7, registry_objects t8 WHERE t7.obj=t8.id AND t7.erased=0 AND t8.erased=0 AND t8.state=2 AND t7.state=3 AND t7.parent=360965
//*[@current='true']/*[@state='published'] | //*[@current='true']/*[@state='published']/file[@state='published' or @state='archived']
(SELECT t4.id FROM struct_cells t4, registry_objects t5 WHERE t4.obj=t5.id AND t4.erased=0 AND t5.erased=0 AND t5.state=2 AND t4.state=3 AND t4.parent=1) UNION (SELECT t13.id FROM struct_cells t13, registry_objects t14, struct_cells t10, registry_objects t11 WHERE t13.obj=t14.id AND t13.erased=0 AND t14.erased=0 AND t14.class=10 AND (t14.state=2 AND t13.state IN (3,4)) AND t10.obj=t11.id AND t10.erased=0 AND t11.erased=0 AND t11.state=2 AND t10.state=3 AND t10.parent=1 AND t13.parent=t10.id))
Compilation in JPAQL
The third approach, developed after the appearance of JPA 2.0, which included the CriteriaQuery API, is to build a query using the corresponding API without direct operations with query strings.
Benefits
- significant simplification of query building code
- getting rid of typo errors
- the code will work on any database supported by JPA
disadvantages
- a significant number of optimizations are required to simplify and speed up EJBQL
- slight deceleration due to possible non-optimized translator code EJBQL -> SQL
- a significant slowdown due to the need to emulate functions like UNION that are missing in EJBQL.
- the lack of support for some operations, for example, limit / row_number in CriteriaQuery, makes it impossible to support position-predicates, for example,
//site/folder[2]
.
Examples of compilations in JPAQL:
//site[@active='true']/*[@active='true']/*[@state='published']
SELECT t0.id
FROM ru.arptek.arpsite.content.Cell as t1, ru.arptek.arpsite.content.Cell as t0
INNER JOIN t0.webObject as t2
WHERE ( t1.parent.id=360930 ) and ( t1.erased=0 ) and ( t1.id in (360965, 360930, 417026, 124316, 63316, 1) ) and ( t1.id=t0.parent.id ) and ( t0.erased=0 ) and ( t2.erased=0 ) and ( ( t2.stateId=2 ) and ( t0.stateId=3 ) )
//*[@current='true']/*[@state='published'] | //*[@current='true']/*[@state='published']/file[@state='published' or @state='archived']
SELECT t0.id
FROM ru.arptek.arpsite.content.Cell as t0
WHERE ( exists (
SELECT 1
FROM ru.arptek.arpsite.content.Cell as t1
INNER JOIN t1.webObject as t2
WHERE ( t1.parent.id=1 ) and ( t1.erased=0 ) and ( t2.erased=0 ) and ( ( t2.stateId=2 ) and ( t1.stateId=3 ) ) and ( t0=t1 )) ) or ( exists (
SELECT 1
FROM ru.arptek.arpsite.content.Cell as t3, ru.arptek.arpsite.content.Cell as t4
INNER JOIN t3.webObject as t5
INNER JOIN t4.webObject as t6
WHERE ( t4.parent.id=1 ) and ( t4.erased=0 ) and ( t6.erased=0 ) and ( ( t6.stateId=2 ) and ( t4.stateId=3 ) ) and ( t4.id=t3.parent.id ) and ( t3.erased=0 ) and ( t5.erased=0 ) and ( t5.objectClassId=10 ) and ( ( t5.stateId=2 ) and ( t3.stateId in (3, 4) ) ) and ( t0=t3 )) )
Legal notice
This idea has been used in Arp.Site since 2002 and has only undergone technological changes (interpretation, compilation in SQL, compilation in JPAQL). I have no direct relation to the implementation of this idea from my other employers (although I helped with tips), so the content of this topic does not fall under the relevant NDA. Due to the presence of prior art, this technology cannot be protected by patents dated after 2002. The presence of earlier patents has not been investigated.
The author of the initial implementation is Timofey Sysoev, the introduction of compilation technology for SQL and for JPAQL is yours truly.