
JAXB and XSLT using StAX

In one project I had to process large XML files, from hundreds of megabytes to tens of gigabytes, and extract only certain tags located at different depths. Applying XSLT head-on failed with out-of-memory errors, so I had to look at stream parsers.

There are several XML processing models. The most famous are DOM and SAX.
DOM loads the entire XML document, builds an in-memory representation of it, and lets you navigate the whole document. SAX, by contrast, reads the input document and calls handlers as it recognizes each element.
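To make the handler-based style concrete, here is a minimal sketch using the JDK's SAX parser. The class name SaxPushDemo and the choice of collecting p1 values are assumptions made for illustration; they are not taken from the project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: collects the text of every <p1> element with SAX.
final class SaxPushDemo {
    static List<String> collectP1(String xml) throws Exception {
        List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only while inside <p1>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("p1".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, hence the buffer.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("p1".equals(qName)) {
                    values.add(text.toString());
                    text = null;
                }
            }
        };
        // The parser drives the callbacks; the state lives in handler fields.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return values;
    }
}
```

Note how even this trivial task needs three callbacks and a field that tracks whether we are inside the element.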

In my case, DOM was ruled out because of its memory consumption. The SAX API is built on handlers, which makes the code less readable. StAX is also a stream parser, like SAX, but its API is pull-based: recognized elements are pulled from the stream on demand.
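The pull style can be illustrated with a short sketch on top of the JDK's javax.xml.stream API. The class name StaxPullDemo and the task (collecting p1 values) are assumptions for illustration only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: collects the text of every <p1> element with StAX.
final class StaxPullDemo {
    static List<String> collectP1(String xml) throws XMLStreamException {
        List<String> values = new ArrayList<>();
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        try {
            // The caller asks for the next event; nothing is pushed at us.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "p1".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching END_ELEMENT.
                    values.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }
}
```

The control flow stays in one ordinary loop, which is what makes it easy to hand the reader over to another component in the middle of the document.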

Since the data structures to be processed were complex and varied, and the processing itself was fairly nontrivial, I decided to use JAXB to convert the XML into an internal representation.
The project data is covered by an NDA, so it is not used in this article.

So, we have the following XML document:
<data>
    <dtype_one>
        <p1>p1_data_1</p1>
        <p2>p1_data_1</p2>
        <p3>p1_data_1</p3>
        <p4>p1_data_1</p4>
        <p5>p1_data_1</p5>
    </dtype_one>
    <dtype_two>
        <p1>p1_data_2</p1>
        <p2>p1_data_2</p2>
        <p3>p1_data_2</p3>
        <p4>p1_data_2</p4>
        <p5>p1_data_2</p5>
    </dtype_two>
    <WS>
        <dtype_three>
            <p1>p1_data_3</p1>
            <p2>p1_data_3</p2>
            <p3>p1_data_3</p3>
            <p4>p1_data_3</p4>
            <p5>p1_data_3</p5>
        </dtype_three>
    </WS>
</data>

From it we need to select and process the dtype_one, dtype_two and dtype_three tags. These tags are repeated in the document. Take the document schema
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="data" type="dataType"/>
    <xs:element name="dtype_one" type="dtype_oneType"/>
    <xs:element name="dtype_two" type="dtype_twoType"/>
    <xs:element name="dtype_three" type="dtype_threeType"/>
    <xs:complexType name="dtype_oneType">
        <xs:sequence>
            <xs:element type="xs:string" name="p1"/>
            <xs:element type="xs:string" name="p2"/>
            <xs:element type="xs:string" name="p3"/>
            <xs:element type="xs:string" name="p4"/>
            <xs:element type="xs:string" name="p5"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="dataType">
        <xs:sequence>
            <xs:element type="dtype_oneType" name="dtype_one"/>
            <xs:element type="dtype_twoType" name="dtype_two"/>
            <xs:element type="WSType" name="WS"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="WSType">
        <xs:sequence>
            <xs:element type="dtype_threeType" name="dtype_three"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="dtype_twoType">
        <xs:sequence>
            <xs:element type="xs:string" name="p1"/>
            <xs:element type="xs:string" name="p2"/>
            <xs:element type="xs:string" name="p3"/>
            <xs:element type="xs:string" name="p4"/>
            <xs:element type="xs:string" name="p5"/>
        </xs:sequence>
    </xs:complexType>
    <xs:complexType name="dtype_threeType">
        <xs:sequence>
            <xs:element type="xs:string" name="p1"/>
            <xs:element type="xs:string" name="p2"/>
            <xs:element type="xs:string" name="p3"/>
            <xs:element type="xs:string" name="p4"/>
            <xs:element type="xs:string" name="p5"/>
        </xs:sequence>
    </xs:complexType>
</xs:schema>


and make sure that it contains "element" declarations for the tags we need:
<xs:element name="dtype_one" type="dtype_oneType"/>
<xs:element name="dtype_two" type="dtype_twoType"/>
<xs:element name="dtype_three" type="dtype_threeType"/>

If there is no schema, IDEA can generate one from an XML file.

This is needed so that XJC generates the @XmlRootElement annotation. The project is built with Maven, and XJC is invoked through maven-jaxb2-plugin. To have @XmlRootElement generated for all "element" declarations in the schema file, add the following lines to the bindings.xjb file:
<jaxb:bindings>
    <jaxb:globalBindings>
        <xjc:simple/>
    </jaxb:globalBindings>
</jaxb:bindings>

and reference it in the maven-jaxb2-plugin configuration in pom.xml:
 <bindingDirectory>${project.basedir}/xjb</bindingDirectory> 

Now to the code itself. The TagEngine class stores the list of tag handlers and does the parsing:
public void process(InputStream inputStream) throws FileNotFoundException, XMLStreamException, TransformerException {
    // Create the XMLStreamReader
    XMLInputFactory factory = XMLInputFactory.newFactory();
    XMLStreamReader streamReader = factory.createXMLStreamReader(inputStream);
    // Stack of the currently open tags
    Stack<String> tagStack = new Stack<String>();
    // Walk through the stream events
    while (streamReader.hasNext()) {
        // Fetch the next event
        int eventType = streamReader.next();
        // Opening tag
        if (eventType == XMLStreamConstants.START_ELEMENT) {
            // Push the tag name onto the stack
            tagStack.push(streamReader.getName().toString());
            // Look up a handler registered for the current tag path
            TagProcessor t = processorMap.get(tagStack);
            if (t != null) {
                // Found one: let it consume the element, then drop the element from the stack
                t.process(streamReader);
                tagStack.pop();
            }
        } else if (eventType == XMLStreamConstants.END_ELEMENT) {
            tagStack.pop();
        }
    }
}
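The dispatch idea can be tried in isolation with the following self-contained sketch. It is not the project's TagEngine: the names PathDispatchDemo, Handler and readChildTexts are hypothetical, the handlers use plain StAX instead of JAXB or XSLT, and the map is assumed to be keyed by the list of open tag names. The sketch also fixes an explicit contract that a handler must stop on the END_ELEMENT of the element it was given, so that the main loop can safely call next() afterwards.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical stand-in for a tag engine: dispatches on the path of open tags.
final class PathDispatchDemo {
    /** Contract (an assumption of this sketch): consume the element, stop on its END_ELEMENT. */
    interface Handler {
        void handle(XMLStreamReader reader) throws XMLStreamException;
    }

    private final Map<List<String>, Handler> handlers = new HashMap<>();

    /** Registers a handler for a slash-separated path such as "data/dtype_one". */
    void register(String path, Handler handler) {
        handlers.put(Arrays.asList(path.split("/")), handler);
    }

    void process(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        List<String> path = new ArrayList<>(); // names of the currently open elements
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    path.add(reader.getLocalName());
                    Handler handler = handlers.get(path); // List equality matches the key
                    if (handler != null) {
                        handler.handle(reader);       // reader is now on the END_ELEMENT
                        path.remove(path.size() - 1); // so close the element ourselves
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    path.remove(path.size() - 1);
                }
            }
        } finally {
            reader.close();
        }
    }

    /** Example handler body: reads flat <pN>text</pN> children into a list. */
    static List<String> readChildTexts(XMLStreamReader reader) throws XMLStreamException {
        List<String> texts = new ArrayList<>();
        // nextTag() skips whitespace and stops on the next START_ELEMENT or END_ELEMENT.
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            texts.add(reader.getElementText());
        }
        return texts; // the reader is now on the parent's END_ELEMENT
    }
}
```

A handler is then registered per path, for example engine.register("data/WS/dtype_three", r -> System.out.println(PathDispatchDemo.readChildTexts(r))). Whatever a real handler delegates to, it is worth checking where that component leaves the reader, because an extra or missing next() silently skips or repeats an element.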

The JAXBProcessor class handles unmarshalling of the selected elements, and the XSLTProcessor class invokes XSLT transformations. A class that does the actual work looks like this:
public class DataOne extends JAXBProcessor<DtypeOne> {
    private static final String TAG_NAME = "data/dtype_one";

    // Without schema validation
    public DataOne() throws JAXBException, SAXException {
        super(DtypeOne.class, TAG_NAME);
    }

    // With validation against the given schema file
    public DataOne(String schemaFileName) throws JAXBException, SAXException {
        super(DtypeOne.class, TAG_NAME, schemaFileName);
    }

    // Called for each unmarshalled XML element
    @Override
    public void doWork(DtypeOne element) {
        // System.out.println(element.getP1());
    }
}
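The article does not show how JAXBProcessor applies the schema. With JAXB the usual route is Unmarshaller.setSchema, which takes a javax.xml.validation.Schema. The sketch below only shows, without JAXB, how such a Schema object is built and what validating a single fragment looks like. The class name FragmentValidationDemo and the trimmed-down two-field schema are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class: validates one dtype_one fragment against an inline schema.
final class FragmentValidationDemo {
    // Trimmed-down variant of the schema above: dtype_one with only p1 and p2.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema' elementFormDefault='qualified'>"
          + "<xs:element name='dtype_one' type='dtype_oneType'/>"
          + "<xs:complexType name='dtype_oneType'><xs:sequence>"
          + "<xs:element type='xs:string' name='p1'/>"
          + "<xs:element type='xs:string' name='p2'/>"
          + "</xs:sequence></xs:complexType>"
          + "</xs:schema>";

    static boolean isValid(String xml) throws SAXException, IOException {
        // Compiling the schema is the expensive part; a real processor would do it once.
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // With no ErrorHandler set, the validator throws on the first violation.
            return false;
        }
    }
}
```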

For an example of using XSLT, see DataThreeXSLT.
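Since DataThreeXSLT itself is not reproduced here, the following is only a sketch of one way to run a stylesheet on a single fragment of a stream: position the XMLStreamReader on the element's START_ELEMENT and wrap it in a StAXSource. The class name FragmentXsltDemo and the stylesheet are made up for illustration; whether the project's XSLTProcessor works this way is an assumption.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: applies XSLT to the first occurrence of one element.
final class FragmentXsltDemo {
    // Made-up stylesheet: prints the p1 value of a dtype_three fragment as plain text.
    static final String XSLT =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/dtype_three'><xsl:value-of select='p1'/></xsl:template>"
          + "</xsl:stylesheet>";

    static String transformFirst(String xml, String tag)
            throws XMLStreamException, TransformerException {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && tag.equals(reader.getLocalName())) {
                    // Only this element's subtree is handed to the transformer,
                    // so memory use is bounded by the fragment, not the whole file.
                    StringWriter out = new StringWriter();
                    transformer.transform(new StAXSource(reader), new StreamResult(out));
                    return out.toString();
                }
            }
        } finally {
            reader.close();
        }
        throw new IllegalArgumentException("No <" + tag + "> element found");
    }
}
```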

Sample run (processing of a 277-megabyte file is emulated):
 JAXB unmarshall without schema validation
 Runtime: 8034ms, 277000015 bytes processed
 Used Memory: 80MB
 JAXB unmarshall with schema validation
 Runtime: 66180ms, 277000015 bytes processed
 Used Memory: 56MB
 XSLT processing
 Runtime: 10604ms, 277000015 bytes processed
 Used Memory: 231MB

Memory consumption is fine in all three cases. Validation, as expected, slows processing considerably: about 66 s versus about 8 s without it.

P.S. For the tests I used Mockito (previously I used jMock). I liked its spy feature: intercepting calls and their arguments while working with a live (non-mock) object.
P.P.S. The project is on GitHub.

Source: https://habr.com/ru/post/185360/

