
How I parsed docx using XSLT

Processing documents in docx format, as well as xlsx spreadsheets and pptx presentations, is far from trivial. In this article I will show how to parse, create, and process such documents using only XSLT and a ZIP archiver.


Why?


docx is the most popular document format, so the need to deliver information to the user in this format can arise at any time. One solution is to use a ready-made library, but such a library may not be suitable for a number of reasons.



Therefore, in this article we will use only the most basic tools for working with docx documents.


Docx structure


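First, let's look at what a docx document is. A docx file is a zip archive that physically contains two types of files: XML files (with the extensions xml and rels) and media files such as images.


Logically, it consists of three types of elements: content types, parts, and relationships between parts.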



They are described in detail in ECMA-376: Office Open XML File Formats, whose main part is a PDF document of 5000 pages, with another 2000 pages of bonus content.


Minimum docx


The simplest docx looks like this after unpacking:


  [Content_Types].xml
  _rels/.rels
  word/document.xml
  word/_rels/document.xml.rels


Let's see what it consists of.


[Content_Types].xml


It is located at the root of the document and lists the MIME types of the document's content:


<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>

_rels/.rels


The main list of the document's relationships. In this case only one relationship is defined: it associates the identifier rId1 with the file word/document.xml, the main body of the document.


<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>
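
As an illustration of how this relationship list can be read by a program, here is a minimal Python sketch (Python is used here only for illustration; the article itself relies on XSLT). The function name main_part_name is hypothetical, and the sketch assumes the relationship type URI shown above identifies the main document part.

```python
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOCUMENT = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)


def main_part_name(rels_xml):
    """Return the Target of the officeDocument relationship, or None if absent.

    rels_xml is the content of _rels/.rels. The name of this helper is
    hypothetical; it is not part of the article's code.
    """
    root = ET.fromstring(rels_xml)
    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOCUMENT:
            # Targets in _rels/.rels are relative to the package root.
            return rel.get("Target").lstrip("/")
    return None
```

For the file shown above the function would return word/document.xml, which is the part to open next.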

word/document.xml


The main content of the document.


<w:document
    xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
    xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
    xmlns:w10="urn:schemas-microsoft-com:office:word"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
    xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
    xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
    xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
    mc:Ignorable="w14 wp14">
  <w:body>
    <w:p w:rsidR="005F670F" w:rsidRDefault="005F79F5">
      <w:r>
        <w:t>Test</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:sectPr w:rsidR="005F670F">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

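Here <w:document> is the document itself, <w:body> is its body, <w:p> is a paragraph, <w:r> is a run (a fragment of text), <w:t> is the text itself, and <w:sectPr> describes the page.


If you open this document in a word processor, you will see a document consisting of the single word Test.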


word/_rels/document.xml.rels


This file lists the relationships of the part word/document.xml. The name of a relationships file is formed from the name of the part it belongs to plus the extension rels. The folder that holds it is called _rels and sits at the same level as the part. Since word/document.xml has no relationships, the file is empty as well:


<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
</Relationships>

Even if there are no relationships, this file must exist.
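
The four parts described above are enough to assemble a document by hand. The following Python sketch packs them into an in-memory zip archive; it only illustrates the structure and is not the article's XSLT-based code. The function name build_minimal_docx is hypothetical, and the document body is trimmed to a single paragraph, on the assumption that the page settings in <w:sectPr> are optional.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>"""

ROOT_RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>"""

DOCUMENT_RELS = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
</Relationships>"""

# Trimmed body: one paragraph with one run (an assumption made for brevity).
DOCUMENT = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p><w:r><w:t>{text}</w:t></w:r></w:p>
  </w:body>
</w:document>"""


def build_minimal_docx(text="Test"):
    """Return the bytes of a zip archive containing the four parts listed above."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", ROOT_RELS)
        archive.writestr("word/document.xml", DOCUMENT.format(text=escape(text)))
        archive.writestr("word/_rels/document.xml.rels", DOCUMENT_RELS)
    return buffer.getvalue()
```

Writing the returned bytes to a file with the docx extension should give a document that can be opened for inspection.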


docx and Microsoft Word


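A docx created with Microsoft Word, or in principle with any other editor, contains a few additional files.


Among them are docProps/app.xml and docProps/core.xml, word/settings.xml, word/styles.xml, word/webSettings.xml, word/fontTable.xml, and word/theme/theme1.xml.


Complex documents can contain many more parts.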


Reverse engineering docx


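So, the initial task is to find out how a given fragment of a document is stored in XML, so that we can create (or parse) similar documents ourselves. For this we need a ZIP archiver, a tool for formatting XML, and a way to commit folders and view the diff between versions.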



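Tools



You also need scripts that unpack or pack the archive and format the XML in one step.
Under Windows, unpack extracts a document into a folder (for example, unpack bold.docx bold), and pack packs a folder back into a document.



Usage under Linux is similar, except that unpack becomes ./unpack.sh and pack becomes ./pack.sh.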


Usage


The search for changes goes as follows:


  1. Create an empty docx file in the editor.
  2. Unpack it with unpack into a new folder.
  3. Commit the new folder.
  4. Add the element under study (a hyperlink, a table, etc.) to the file from step 1.
  5. Unpack the modified file into the existing folder.
  6. Study the diff, discarding unnecessary changes (reordered relationships, namespace order, etc.).
  7. Pack the folder and check that the resulting file opens.
  8. Commit the changed folder.
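
The unpack and pack steps used in this procedure can be sketched in Python with the standard library alone. This is only an assumption about what such scripts might do (extract the archive, re-indent every XML part, and zip the folder back); the function names are hypothetical, and skipping a .git folder is a guess made in case the repository lives inside the unpacked folder.

```python
import zipfile
from pathlib import Path
from xml.dom import minidom


def unpack(docx_path, target_dir):
    """Extract a docx archive and re-indent its XML parts so diffs are readable."""
    target = Path(target_dir)
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    for name in names:
        if name.endswith((".xml", ".rels")):
            part = target / name
            # Word writes each part as a single line; pretty-print it.
            pretty = minidom.parseString(part.read_bytes()).toprettyxml(
                indent="  ", encoding="UTF-8"
            )
            part.write_bytes(pretty)
    return names


def pack(source_dir, docx_path):
    """Zip a folder back into a docx archive, skipping any .git folder."""
    source = Path(source_dir)
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            relative = path.relative_to(source)
            if path.is_file() and ".git" not in relative.parts:
                archive.write(path, relative.as_posix())
```

Re-indenting adds whitespace between elements, so the sketch is meant for inspecting diffs; unpacking an already formatted folder a second time would add further blank lines.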

Example 1. Making text bold


Let's see in practice how to find the tag that makes text bold.


  1. Create a document bold.docx with the plain (non-bold) text Test.
  2. Unpack it: unpack bold.docx bold.
  3. Commit the result.
  4. Make the text Test bold.
  5. Unpack it again: unpack bold.docx bold.
  6. The initial diff touches several files.

Let's consider them in detail:


docProps/app.xml


@@ -1,9 +1,9 @@
-  <TotalTime>0</TotalTime>
+  <TotalTime>1</TotalTime>

The change in editing time is of no use to us.


docProps/core.xml


@@ -4,9 +4,9 @@
-  <cp:revision>1</cp:revision>
+  <cp:revision>2</cp:revision>
   <dcterms:created xsi:type="dcterms:W3CDTF">2017-02-07T19:37:00Z</dcterms:created>
-  <dcterms:modified xsi:type="dcterms:W3CDTF">2017-02-07T19:37:00Z</dcterms:modified>
+  <dcterms:modified xsi:type="dcterms:W3CDTF">2017-02-08T10:01:00Z</dcterms:modified>

The change of the document revision and of the modification date does not interest us either.


word/document.xml


@@ -1,24 +1,26 @@
   <w:body>
-    <w:p w:rsidR="0076695C" w:rsidRPr="00290C70" w:rsidRDefault="00290C70">
+    <w:p w:rsidR="0076695C" w:rsidRPr="00F752CF" w:rsidRDefault="00290C70">
       <w:pPr>
         <w:rPr>
+          <w:b/>
           <w:lang w:val="en-US"/>
         </w:rPr>
       </w:pPr>
-      <w:r>
+      <w:r w:rsidRPr="00F752CF">
         <w:rPr>
+          <w:b/>
           <w:lang w:val="en-US"/>
         </w:rPr>
         <w:t>Test</w:t>
       </w:r>
       <w:bookmarkStart w:id="0" w:name="_GoBack"/>
       <w:bookmarkEnd w:id="0"/>
     </w:p>
-    <w:sectPr w:rsidR="0076695C" w:rsidRPr="00290C70">
+    <w:sectPr w:rsidR="0076695C" w:rsidRPr="00F752CF">

The changes in the w:rsidR attributes are of no interest: this is internal information for Microsoft Word. The key change here is


         <w:rPr>
+          <w:b/>

in the paragraph with Test. Apparently the <w:b/> element makes the text bold. We keep this change and revert the rest.


word/settings.xml


@@ -1,8 +1,9 @@
+  <w:proofState w:spelling="clean"/>
@@ -17,10 +18,11 @@
+    <w:rsid w:val="00F752CF"/>

This file also contains nothing related to bold text. We revert it.


  7. Pack the folder with the single change (the added <w:b/>) and check that the document opens and shows what was expected.
  8. Commit the change.
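
Once the element is known, it can also be used for parsing. The Python sketch below collects the text of every run whose <w:rPr> contains <w:b/>; it is an illustration rather than the article's method, the function name bold_texts is hypothetical, and it assumes that a w:val of 0, false, or off switches bold off.

```python
import xml.etree.ElementTree as ET

# Namespace of the w: prefix, in ElementTree's {uri}name notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def bold_texts(document_xml):
    """Return the text of each run (<w:r>) that is marked bold with <w:b/>."""
    root = ET.fromstring(document_xml)
    result = []
    for run in root.iter(W + "r"):
        bold = run.find(f"{W}rPr/{W}b")
        # Assumption: an explicit w:val of 0/false/off disables bold.
        if bold is None or bold.get(W + "val", "true") in ("0", "false", "off"):
            continue
        result.append("".join(t.text or "" for t in run.findall(W + "t")))
    return result
```

Only formatting set directly on a run is considered here; bold that comes from a style in word/styles.xml is not detected by this sketch.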


Example 2. Footer


Now let's look at a more complicated example: adding a footer.
We start from an initial commit, add a footer with the text 123, and unpack the document. The initial diff again touches several files.


We immediately exclude the changes in docProps/app.xml and docProps/core.xml: they are the same as in the first example.


[Content_Types].xml


@@ -4,10 +4,13 @@
   <Default Extension="xml" ContentType="application/xml"/>
   <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
+  <Override PartName="/word/footnotes.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
+  <Override PartName="/word/endnotes.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
+  <Override PartName="/word/footer1.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>

The footer obviously looks like what we need, but what should we do with footnotes and endnotes? Are they required when adding a footer, or were they merely created at the same time? Answering this question is not always easy; one of the main ways is to read the documentation.
Let's move on for now.


word/_rels/document.xml.rels


Initially, the diff looks like this:


@@ -1,8 +1,11 @@
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
+  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
   <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
+  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
   <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
   <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
-  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
-  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
+  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>
+  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes" Target="endnotes.xml"/>
+  <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes" Target="footnotes.xml"/>
 </Relationships>

Some of these changes appear only because Word reordered the relationships. After removing them, we get:


@@ -3,6 +3,9 @@
+  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>
+  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes" Target="endnotes.xml"/>
+  <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes" Target="footnotes.xml"/>

The footer, footnotes, and endnotes appear again. All of them are related to the main document, so let's move on to it:


word/document.xml


@@ -15,10 +15,11 @@
       </w:r>
       <w:bookmarkStart w:id="0" w:name="_GoBack"/>
       <w:bookmarkEnd w:id="0"/>
     </w:p>
     <w:sectPr w:rsidR="0076695C" w:rsidRPr="00290C70">
+      <w:footerReference w:type="default" r:id="rId6"/>
       <w:pgSz w:w="11906" w:h="16838"/>
       <w:pgMar w:top="1134" w:right="850" w:bottom="1134" w:left="1701" w:header="708" w:footer="708" w:gutter="0"/>
       <w:cols w:space="708"/>
       <w:docGrid w:linePitch="360"/>
     </w:sectPr>

A rare case in which only the necessary changes are present. The explicit reference to the footer from sectPr is visible. And since the document contains no references to footnotes and endnotes, we can assume that we will not need them.


word/settings.xml


@@ -1,19 +1,30 @@
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <w:settings xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:sl="http://schemas.openxmlformats.org/schemaLibrary/2006/main" mc:Ignorable="w14 w15">
   <w:zoom w:percent="100"/>
+  <w:proofState w:spelling="clean"/>
   <w:defaultTabStop w:val="708"/>
   <w:characterSpacingControl w:val="doNotCompress"/>
+  <w:footnotePr>
+    <w:footnote w:id="-1"/>
+    <w:footnote w:id="0"/>
+  </w:footnotePr>
+  <w:endnotePr>
+    <w:endnote w:id="-1"/>
+    <w:endnote w:id="0"/>
+  </w:endnotePr>
   <w:compat>
     <w:compatSetting w:name="compatibilityMode" w:uri="http://schemas.microsoft.com/office/word" w:val="15"/>
     <w:compatSetting w:name="overrideTableStyleFontSizeAndJustification" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
     <w:compatSetting w:name="enableOpenTypeFeatures" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
     <w:compatSetting w:name="doNotFlipMirrorIndents" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
     <w:compatSetting w:name="differentiateMultirowTableHeaders" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
   </w:compat>
   <w:rsids>
     <w:rsidRoot w:val="00290C70"/>
+    <w:rsid w:val="000A7B7B"/>
+    <w:rsid w:val="001B0DE6"/>

And here are the references to footnotes and endnotes that add them to the document.


word/styles.xml


@@ -480,6 +480,50 @@
       <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
       <w:b/>
       <w:sz w:val="28"/>
     </w:rPr>
   </w:style>
+  <w:style w:type="paragraph" w:styleId="a4">
+    <w:name w:val="header"/>
+    <w:basedOn w:val="a"/>
+    <w:link w:val="a5"/>
+    <w:uiPriority w:val="99"/>
+    <w:unhideWhenUsed/>
+    <w:rsid w:val="000A7B7B"/>
+    <w:pPr>
+      <w:tabs>
+        <w:tab w:val="center" w:pos="4677"/>
+        <w:tab w:val="right" w:pos="9355"/>
+      </w:tabs>
+      <w:spacing w:after="0" w:line="240" w:lineRule="auto"/>
+    </w:pPr>
+  </w:style>
+  <w:style w:type="character" w:customStyle="1" w:styleId="a5">
+    <w:name w:val="  "/>
+    <w:basedOn w:val="a0"/>
+    <w:link w:val="a4"/>
+    <w:uiPriority w:val="99"/>
+    <w:rsid w:val="000A7B7B"/>
+  </w:style>
+  <w:style w:type="paragraph" w:styleId="a6">
+    <w:name w:val="footer"/>
+    <w:basedOn w:val="a"/>
+    <w:link w:val="a7"/>
+    <w:uiPriority w:val="99"/>
+    <w:unhideWhenUsed/>
+    <w:rsid w:val="000A7B7B"/>
+    <w:pPr>
+      <w:tabs>
+        <w:tab w:val="center" w:pos="4677"/>
+        <w:tab w:val="right" w:pos="9355"/>
+      </w:tabs>
+      <w:spacing w:after="0" w:line="240" w:lineRule="auto"/>
+    </w:pPr>
+  </w:style>
+  <w:style w:type="character" w:customStyle="1" w:styleId="a7">
+    <w:name w:val="  "/>
+    <w:basedOn w:val="a0"/>
+    <w:link w:val="a6"/>
+    <w:uiPriority w:val="99"/>
+    <w:rsid w:val="000A7B7B"/>
+  </w:style>
 </w:styles>

Changes in styles interest us only when we are looking for how to change a style. In this case the change can be removed.


word/footer1.xml


Now let's look at the footer itself (some of the namespaces are omitted for readability, but they must be present in the document):


<w:ftr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p w:rsidR="000A7B7B" w:rsidRDefault="000A7B7B">
    <w:pPr>
      <w:pStyle w:val="a6"/>
    </w:pPr>
    <w:r>
      <w:t>123</w:t>
    </w:r>
  </w:p>
</w:ftr>

Here the text 123 is visible. The only thing to correct is to remove the style reference <w:pStyle w:val="a6"/>.


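Having analyzed all the changes, we make the following assumptions: footnotes and endnotes are not needed; the changes in word/settings.xml only concern them and can be dropped; the changes in word/styles.xml can be dropped together with the style reference in word/footer1.xml.



We reduce the diff to the minimal changeset: the Override for /word/footer1.xml in [Content_Types].xml, the rId6 relationship in word/_rels/document.xml.rels, the <w:footerReference> in word/document.xml, and the file word/footer1.xml itself.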


Then we pack the document and open it.
If everything was done correctly, the document opens and has a footer with the text 123.


Thus, the search for changes comes down to finding a minimal set of changes sufficient to achieve the desired result.


Practice


Having found the change we are interested in, it is logical to proceed to the next stage, which could be any of creating, parsing, or transforming docx documents.



Here we need knowledge of XSLT and XPath.


Let's write a fairly simple transformation: replacing or adding a footer in an existing document. I will write it in Caché ObjectScript, but it does not matter if you do not know the language: basically, we will just call XSLT and the archiver, nothing more.


Algorithm


The algorithm is as follows:


  1. Unpack the document.
  2. Add our footer.
  3. Register it in [Content_Types].xml and word/_rels/document.xml.rels.
  4. Add a <w:footerReference> tag to each <w:sectPr> tag in word/document.xml, or replace the existing reference with one to our footer.
  5. Pack the document.

Let's get started


Unpacking


Caché ObjectScript can execute OS commands with the $zf(-1, oscommand) function. We call unzip to unpack the document through a wrapper over $zf(-1):


/// Command template: run %3 (unzip) on file %1, extracting into folder %2
Parameter UNZIP = "%3 %1 -d %2";

/// Unpack the source archive into the targetDir folder
ClassMethod executeUnzip(source, targetDir) As %Status
{
    set timeout = 100
    set cmd = $$$FormatText(..#UNZIP, source, targetDir, ..getUnzip())
    return ..execute(cmd, timeout)
}

Create a footer file


The footer text arrives as input; we write it to the file in.xml:


<xml>TEST</xml>

In XSLT (the file footer.xsl) we create a footer with the text from the xml tag (some of the namespaces are omitted):


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://schemas.openxmlformats.org/package/2006/relationships" version="1.0">
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes" standalone="yes"/>
  <xsl:template match="/">
    <w:ftr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:p>
        <w:r>
          <w:rPr>
            <w:lang w:val="en-US"/>
          </w:rPr>
          <w:t>
            <xsl:value-of select="//xml/text()"/>
          </w:t>
        </w:r>
      </w:p>
    </w:ftr>
  </xsl:template>
</xsl:stylesheet>

Now we call the XSLT transformer:


do ##class(%XML.XSLT.Transformer).TransformFile("in.xml", "footer.xsl", "footer0.xml")

The result is the footer file footer0.xml:


<w:ftr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:r>
      <w:rPr>
        <w:lang w:val="en-US"/>
      </w:rPr>
      <w:t>TEST</w:t>
    </w:r>
  </w:p>
</w:ftr>

Add a footer relationship to the main document's relationship list


A relationship with the identifier rId0 usually does not exist. However, you can use XPath to obtain an identifier that is guaranteed to be unused.
We add a relationship to footer0.xml with the identifier rId0 to word/_rels/document.xml.rels:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://schemas.openxmlformats.org/package/2006/relationships" version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="no" />
  <xsl:param name="new">
    <Relationship Id="rId0" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer0.xml"/>
  </xsl:param>
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:copy-of select="$new"/>
      <xsl:copy-of select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
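
The stylesheet above hard-codes rId0. As a sketch of the alternative mentioned in the text (computing an identifier that is certainly unused), the same idea is shown here in Python rather than XPath; the function name next_free_rid is hypothetical, and the sketch assumes identifiers follow the rIdN pattern that Word generates.

```python
import re
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def next_free_rid(rels_xml):
    """Return an rIdN identifier that no Relationship in rels_xml uses yet."""
    root = ET.fromstring(rels_xml)
    numbers = []
    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        match = re.fullmatch(r"rId(\d+)", rel.get("Id", ""))
        if match:
            numbers.append(int(match.group(1)))
    # One more than the largest number in use; rId1 for an empty list.
    return f"rId{max(numbers, default=0) + 1}"
```

Identifiers that do not match the rIdN pattern are ignored, since they cannot collide with the value returned.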

Register the references in the document


Next we need to add a <w:footerReference> tag to each <w:sectPr> tag, or replace the existing reference with one to our footer. It turns out that each <w:sectPr> can have three <w:footerReference> tags: for the first page, for even pages, and for everything else:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" />
  <xsl:template match="//@* | //node()">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="//w:sectPr">
    <xsl:element name="{name()}" namespace="{namespace-uri()}">
      <xsl:copy-of select="./namespace::*"/>
      <xsl:apply-templates select="@*"/>
      <xsl:copy-of select="./*[local-name() != 'footerReference']"/>
      <w:footerReference w:type="default" r:id="rId0"/>
      <w:footerReference w:type="first" r:id="rId0"/>
      <w:footerReference w:type="even" r:id="rId0"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>

Add the footer to [Content_Types].xml


We add to [Content_Types].xml the information that /word/footer0.xml has the type application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns="http://schemas.openxmlformats.org/package/2006/content-types" version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="no" />
  <xsl:param name="new">
    <Override PartName="/word/footer0.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
  </xsl:param>
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:copy-of select="@* | node()"/>
      <xsl:copy-of select="$new"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

Result


All the code is published. It works like this:


do ##class(Converter.Footer).modifyFooter("in.docx", "out.docx", "TEST")

Where in.docx is the source document, out.docx is the resulting document, and TEST is the text placed in the footer.



Conclusions


Using only XSLT and ZIP, you can successfully work with docx documents, xlsx spreadsheets, and pptx presentations.


Open questions


  1. Initially I wanted to use 7z instead of zip/unzip, since it is a single utility and is more common on Windows. However, I ran into a problem: documents packed with 7z under Linux cannot be opened in Microsoft Office. I tried many invocation options but could not get a positive result.
  2. I am looking for the XSD schemas of ECMA-376 version 5 with comments. The version 5 XSD without comments is available for download on the ECMA website, but without comments it is difficult to understand. The version 2 XSD with comments is available for download.


Source: https://habr.com/ru/post/321044/

