📜 ⬆️ ⬇️

XPath: speeding up NodeList iteration

When trying to process a not-so-small regular XML file (actually only about a thousand records), I found that iterating over a NodeList along with extracting using XPath starts to slow down significantly (taking about 2 minutes on my file), and the brakes increase with processing each next node (node). This issue also rises

blog.astradele.com/2006/02/24/slow-xpath-evaluation-for-large-xml-documents-in-java-15

jbwhammie.blogspot.com/2011/02/make-java-xpath-work-on-large-files.html
')
As I understand it, this is a certain feature of the DOM & XPath implementation in the JDK, which is that if we receive an XPath-request from a Node from XML, then this Node remains referenced through parent to the parent XML Document. And the subsequent attempts of XPath requests to this Node lead to the fact that not only the Node itself is processed, but also the entire XML Document. Therefore, it is logical to attempt to null parent from Node, for example, by removing this Node from its parent Node. This is the way suggested by the links above and it works. Its only drawback is that it alters the XML Document itself, making it no longer suitable for further data extraction. I found that there is another way of not changing the XML Document itself — cloning a Node, because, according to the documentation, with clone parent = null .

Actually, the described problem and approaches to the solution are illustrated by the following code.

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;

public class SlowXPath {
public static void main( String [] args) throws Exception {
final DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();

final DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

String xmlString = prepareXml(5000);

// System.out.println(xmlString);
final Document xmlDoc = documentBuilder.parse( new ByteArrayInputStream(xmlString.getBytes()));

final XPathFactory xPathFactory = XPathFactory.newInstance();

final XPathExpression nodeXPath = xPathFactory.newXPath().compile( "//node" );
final XPathExpression iXPath = xPathFactory.newXPath().compile( "./i/text()" );

final NodeList nodeList = (NodeList) nodeXPath.evaluate(xmlDoc, XPathConstants.NODESET);

System. out .println( "Nodes number=" + nodeList.getLength());

timeIt( "Simple iterate" , new Runnable() {
@Override
public void run() {
int sum = 0;

for ( int i = 0; i < nodeList.getLength(); i++) {
final Node node = nodeList.item(i);
try {
final String iStr = ( String ) iXPath.evaluate(node, XPathConstants.STRING);
sum += Integer.parseInt(iStr.trim());
} catch (XPathExpressionException e) {
e.printStackTrace();
}
}

System. out .println( "Sum=" + sum);
}
});
timeIt( "Iterate with cloning" , new Runnable() {
@Override
public void run() {
int sum = 0;

for ( int i = 0; i < nodeList.getLength(); i++) {
final Node node = nodeList.item(i).cloneNode( true ); // <-- Note cloning here
try {
final String iStr = ( String ) iXPath.evaluate(node, XPathConstants.STRING);
sum += Integer.parseInt(iStr.trim());
} catch (XPathExpressionException e) {
e.printStackTrace();
}
}

System. out .println( "Sum=" + sum);
}
});
timeIt( "Iterate with detaching node" , new Runnable() {
@Override
public void run() {
int sum = 0;

for ( int i = 0; i < nodeList.getLength(); i++) {
final Node node = nodeList.item(i);

node.getParentNode().removeChild(node); // <-- Note detaching node

try {
final String iStr = ( String ) iXPath.evaluate(node, XPathConstants.STRING);
sum += Integer.parseInt(iStr.trim());
} catch (XPathExpressionException e) {
e.printStackTrace();
}
}

System. out .println( "Sum=" + sum);
}
});
}

private static String prepareXml( int n) {
StringBuilder sb = new StringBuilder ();

sb.append( "<root>" );

for ( int i = 0; i < n; i++) {
sb.append( "<node><i>\n" )
.append(i)
.append( "</i></node>\n" );
}

sb.append( "</root>" );

return sb.toString();
}

private static void timeIt( String name, Runnable runnable) {
long t0 = System.currentTimeMillis();
runnable.run();
long t1 = System.currentTimeMillis();

System. out .println(name + " executed " + ((t1 - t0) / 1000f) + "sec." );
}
}

* This source code was highlighted with Source Code Highlighter .


But the result of the work:

Nodes number=5000
Sum=12497500
Simple iterate executed 17.359sec.
Sum=12497500
Iterate with cloning executed 1.047sec.
Sum=12497500
Iterate with detaching node executed 1.031sec.

Source: https://habr.com/ru/post/128175/


All Articles