Processing documents in docx format, as well as xlsx tables and pptx presentations, is a nontrivial task. In this article I will show how to parse, create and process such documents using only XSLT and a ZIP archiver.
docx is the most popular document format, so the need to deliver information to the user in this format can arise at any time. One solution is to use a ready-made library, which may not be suitable for a number of reasons.
Therefore, in this article we will use only the most basic tools for working with docx documents.
First, let's look at what a docx document is. A docx is a zip archive that physically contains 2 types of files, xml and rels, and logically 3 types of elements: content types, parts and links (relationships).
They are described in detail in ECMA-376: Office Open XML File Formats, the main part of which is a PDF document of 5000 pages, with another 2000 pages of bonus content.
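Since a docx is an ordinary zip archive, its contents can be listed with any zip library. Here is a minimal Python sketch; `list_parts` is a name I made up for illustration, and the archive is a tiny stand-in built in memory rather than a real Word file.

```python
import io
import zipfile


def list_parts(docx):
    """Return the names of all files stored in a docx (a zip archive)."""
    with zipfile.ZipFile(docx) as archive:
        return sorted(archive.namelist())


# Build a tiny stand-in archive in memory so the example needs no real file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("_rels/.rels", "<Relationships/>")
    archive.writestr("word/document.xml", "<document/>")

parts = list_parts(buffer)
print(parts)
```

The same function should work with a path to a real document, for example `list_parts("bold.docx")`.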
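The simplest docx, after unpacking, consists of the files [Content_Types].xml, _rels/.rels, word/document.xml and word/_rels/document.xml.rels.
Let's see what each of them contains.
[Content_Types].xml is located at the root of the document and lists the MIME content types of the document: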
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
  <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>
_rels/.rels is the main list of document links. In this case only one link is defined: it maps the identifier rId1 to the file word/document.xml, the main body of the document.
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>
</Relationships>
word/document.xml holds the main content of the document.
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
            xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
            xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
            xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
            xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
            mc:Ignorable="w14 wp14">
  <w:body>
    <w:p w:rsidR="005F670F" w:rsidRDefault="005F79F5">
      <w:r>
        <w:t>Test</w:t>
      </w:r>
      <w:bookmarkStart w:id="0" w:name="_GoBack"/>
      <w:bookmarkEnd w:id="0"/>
    </w:p>
    <w:sectPr w:rsidR="005F670F">
      <w:pgSz w:w="12240" w:h="15840"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>
Here:
<w:document> - the document itself
<w:body> - document body
<w:p> - paragraph
<w:r> - run (fragment) of text
<w:t> - the text itself
<w:sectPr> - page description
If you open this document in a word processor, you will see a document consisting of the single word Test.
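The element hierarchy above is enough to pull the plain text out of word/document.xml. Below is a minimal sketch using Python's standard ElementTree; the function name `paragraphs` and the shortened sample document are my own assumptions, and real documents contain more constructs (tables, tabs, line breaks) that this ignores.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Shortened stand-in for word/document.xml (only the w namespace is declared).
DOCUMENT_XML = """<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t>Test</w:t></w:r>
    </w:p>
    <w:sectPr/>
  </w:body>
</w:document>"""


def paragraphs(document_xml):
    """Return the text of each w:p, joining the w:t elements of its runs."""
    root = ET.fromstring(document_xml)
    result = []
    for p in root.iterfind(".//w:p", NS):
        result.append("".join(t.text or "" for t in p.iterfind(".//w:r/w:t", NS)))
    return result


print(paragraphs(DOCUMENT_XML))
```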
word/_rels/document.xml.rels contains the list of links of the word/document.xml part. The name of a links file is formed from the name of the document part it belongs to plus the extension rels. The folder with the links file is called _rels and is located at the same level as the part it belongs to. Since there are no links in word/document.xml, the file is empty:
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships"> </Relationships>
Even if there are no links, this file must exist.
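The four parts described above can be assembled into an archive programmatically. The following is a minimal sketch, assuming the parts shown in this section are sufficient; `build_minimal_docx` is a hypothetical helper, the page settings from <w:sectPr> are left out, and I have not checked the result against every editor.

```python
import io
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PKG = "http://schemas.openxmlformats.org/package/2006"
DOC_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"
DOC_REL = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
XML_DECL = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'


def build_minimal_docx(text):
    """Return the bytes of a docx holding one paragraph with the given text."""
    parts = {
        "[Content_Types].xml": (
            f'<Types xmlns="{PKG}/content-types">'
            '<Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
            '<Default Extension="xml" ContentType="application/xml"/>'
            f'<Override PartName="/word/document.xml" ContentType="{DOC_TYPE}"/>'
            "</Types>"
        ),
        "_rels/.rels": (
            f'<Relationships xmlns="{PKG}/relationships">'
            f'<Relationship Id="rId1" Type="{DOC_REL}" Target="word/document.xml"/>'
            "</Relationships>"
        ),
        "word/document.xml": (
            f'<w:document xmlns:w="{W_NS}"><w:body>'
            f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>"
            "</w:body></w:document>"
        ),
        # Empty list of links for word/document.xml, as in the article.
        "word/_rels/document.xml.rels": f'<Relationships xmlns="{PKG}/relationships"/>',
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, xml in parts.items():
            archive.writestr(name, XML_DECL + xml)
    return buffer.getvalue()


docx_bytes = build_minimal_docx("Test")
with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
    names = archive.namelist()
    body = archive.read("word/document.xml").decode("utf-8")
print(names)
```

Writing `docx_bytes` to a file with the docx extension gives a document to experiment with.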
A docx created with Microsoft Word, or in principle with any other editor, contains several additional files.
Here is what they contain:
docProps/core.xml - the main metadata of the document according to the Open Packaging Conventions and Dublin Core [1], [2].
docProps/app.xml - general information about the document: the number of pages, words and characters, the name of the application in which the document was created, etc.
word/settings.xml - settings related to the current document.
word/styles.xml - styles applicable to the document. They separate the data from the presentation.
word/webSettings.xml - settings for displaying HTML parts of the document and for converting the document to HTML.
word/fontTable.xml - a list of fonts used in the document.
word/theme/theme1.xml - theme (consists of color scheme, fonts and formatting).
Complex documents can contain many more parts.
So, the initial task is to find out how any fragment of a document is stored in XML, in order to create (or parse) similar documents on our own. For this we need:
apt-get install zip unzip libxml2 libxml2-utils git
You also need scripts to automatically (un)archive and format XML.
Usage under Windows:
unpack file dir - unpacks the document file into the folder dir and formats the XML
pack dir file - packs the folder dir into the document file
Usage under Linux is similar, only with ./unpack.sh instead of unpack and ./pack.sh instead of pack.
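For readers without these scripts at hand, here is a rough Python stand-in for the same two operations. It is a sketch under my own assumptions, not the scripts from the repository: the functions `unpack` and `pack` only mirror their names, and the XML is re-indented with `xml.dom.minidom` instead of xmllint, which may add whitespace inside mixed content.

```python
import os
import tempfile
import zipfile
from xml.dom import minidom


def unpack(docx_path, target_dir):
    """Extract docx_path into target_dir and re-indent every XML part."""
    with zipfile.ZipFile(docx_path) as archive:
        archive.extractall(target_dir)
    for folder, _dirs, files in os.walk(target_dir):
        for name in files:
            if name.endswith((".xml", ".rels")):
                path = os.path.join(folder, name)
                pretty = minidom.parse(path).toprettyxml(indent="  ", encoding="UTF-8")
                with open(path, "wb") as handle:
                    handle.write(pretty)


def pack(source_dir, docx_path):
    """Zip the contents of source_dir into docx_path with forward-slash names."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _dirs, files in os.walk(source_dir):
            for name in files:
                path = os.path.join(folder, name)
                arcname = os.path.relpath(path, source_dir).replace(os.sep, "/")
                archive.write(path, arcname)


# Round trip on a tiny stand-in archive inside a temporary folder.
with tempfile.TemporaryDirectory() as work:
    original = os.path.join(work, "in.docx")
    with zipfile.ZipFile(original, "w") as archive:
        archive.writestr("word/document.xml", "<doc><p><t>Test</t></p></doc>")
    folder = os.path.join(work, "unpacked")
    unpack(original, folder)
    with open(os.path.join(folder, "word", "document.xml"), encoding="utf-8") as handle:
        formatted = handle.read()
    repacked = os.path.join(work, "out.docx")
    pack(folder, repacked)
    with zipfile.ZipFile(repacked) as archive:
        repacked_names = archive.namelist()

print(formatted)
```

Formatting every part on unpack is what makes the later diffs readable, since Word writes its XML without indentation.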
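The search for changes is as follows: unpack the original document with unpack into a new folder and commit it, make the change in the editor, unpack the changed document into the same folder, and study the diff.
Let's see in practice how to find the tag that makes text bold.
1 Create a bold.docx document with plain (non-bold) Test text.
2 Unpack it: unpack bold.docx bold.
3 Commit the result.
4 Make the Test text bold.
5 Unpack it again: unpack bold.docx bold.
6 Study the resulting diff.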
Consider it in detail.
docProps/app.xml:
@@ -1,9 +1,9 @@
-  <TotalTime>0</TotalTime>
+  <TotalTime>1</TotalTime>
We do not need the change in editing time.
docProps/core.xml:
@@ -4,9 +4,9 @@
-  <cp:revision>1</cp:revision>
+  <cp:revision>2</cp:revision>
   <dcterms:created xsi:type="dcterms:W3CDTF">2017-02-07T19:37:00Z</dcterms:created>
-  <dcterms:modified xsi:type="dcterms:W3CDTF">2017-02-07T19:37:00Z</dcterms:modified>
+  <dcterms:modified xsi:type="dcterms:W3CDTF">2017-02-08T10:01:00Z</dcterms:modified>
The change of the document revision and the modification date does not interest us either.
word/document.xml:
@@ -1,24 +1,26 @@
   <w:body>
-    <w:p w:rsidR="0076695C" w:rsidRPr="00290C70" w:rsidRDefault="00290C70">
+    <w:p w:rsidR="0076695C" w:rsidRPr="00F752CF" w:rsidRDefault="00290C70">
       <w:pPr>
         <w:rPr>
+          <w:b/>
           <w:lang w:val="en-US"/>
         </w:rPr>
       </w:pPr>
-      <w:r>
+      <w:r w:rsidRPr="00F752CF">
         <w:rPr>
+          <w:b/>
           <w:lang w:val="en-US"/>
         </w:rPr>
         <w:t>Test</w:t>
       </w:r>
       <w:bookmarkStart w:id="0" w:name="_GoBack"/>
       <w:bookmarkEnd w:id="0"/>
     </w:p>
-    <w:sectPr w:rsidR="0076695C" w:rsidRPr="00290C70">
+    <w:sectPr w:rsidR="0076695C" w:rsidRPr="00F752CF">
The changes in w:rsidR are not interesting: this is internal information for Microsoft Word. The key change here is the <w:b/> added inside <w:rPr> in the paragraph with Test. Apparently the <w:b/> element makes the text bold. We keep this change and revert the rest.
word/settings.xml:
@@ -1,8 +1,9 @@
+  <w:proofState w:spelling="clean"/>
@@ -17,10 +18,11 @@
+    <w:rsid w:val="00F752CF"/>
This file also contains nothing related to bold text. We revert it.
7 Pack the folder with the single remaining change (the added <w:b/>) and check that the document opens and shows what was expected.
8 Commit the change.
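Once the tag is known, the change can be applied by a program instead of by hand. Below is a minimal sketch with Python's ElementTree; `make_bold` is a hypothetical helper that bolds every run. It inserts <w:b/> at the front of <w:rPr>, which matches the diff above but is an assumption that may not respect the schema order when elements such as <w:rStyle> are present. Note also that ElementTree rewrites namespace prefixes it does not know, so a real document would need all its prefixes registered.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)  # keep the familiar w: prefix on output

# Shortened stand-in for word/document.xml with two runs.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:rPr><w:lang w:val="en-US"/></w:rPr><w:t>Test</w:t></w:r>'
    "<w:r><w:t>!</w:t></w:r>"
    "</w:p></w:body></w:document>"
)


def make_bold(document_xml):
    """Return document_xml with <w:b/> present in the rPr of every run."""
    root = ET.fromstring(document_xml)
    for run in list(root.iter(f"{{{W}}}r")):
        rpr = run.find(f"{{{W}}}rPr")
        if rpr is None:
            # Run properties go before the run content.
            rpr = ET.Element(f"{{{W}}}rPr")
            run.insert(0, rpr)
        if rpr.find(f"{{{W}}}b") is None:
            rpr.insert(0, ET.Element(f"{{{W}}}b"))
    return ET.tostring(root, encoding="unicode")


bold_xml = make_bold(SAMPLE)
bold_root = ET.fromstring(bold_xml)
# Local names of the children of each rPr, in document order.
rpr_children = [
    [child.tag.split("}")[1] for child in rpr]
    for rpr in bold_root.iter(f"{{{W}}}rPr")
]
print(rpr_children)
```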
Now let's look at a more complicated example: adding a footer.
Here is the initial commit. Add a footer with the text 123 and unpack the document to obtain the initial diff.
We immediately exclude the changes in docProps/app.xml and docProps/core.xml: they are the same as in the first example.
@@ -4,10 +4,13 @@
   <Default Extension="xml" ContentType="application/xml"/>
   <Override PartName="/word/document.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
+  <Override PartName="/word/footnotes.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footnotes+xml"/>
+  <Override PartName="/word/endnotes.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.endnotes+xml"/>
+  <Override PartName="/word/footer1.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
The footer is obviously what we need, but what about footnotes and endnotes? Are they required when adding a footer, or were they merely created at the same time? Answering this question is not always easy.
In word/_rels/document.xml.rels the diff initially looks like this:
@@ -1,8 +1,11 @@
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
+  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
   <Relationship Id="rId3" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/webSettings" Target="webSettings.xml"/>
+  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
   <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/settings" Target="settings.xml"/>
   <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
-  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/theme" Target="theme/theme1.xml"/>
-  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/fontTable" Target="fontTable.xml"/>
+  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>
+  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes" Target="endnotes.xml"/>
+  <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes" Target="footnotes.xml"/>
 </Relationships>
Some of the changes are caused only by Word changing the order of the links, so we remove them:
@@ -3,6 +3,9 @@
+  <Relationship Id="rId6" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer1.xml"/>
+  <Relationship Id="rId7" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/endnotes" Target="endnotes.xml"/>
+  <Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footnotes" Target="footnotes.xml"/>
The footer, footnotes and endnotes appear again. All of them are linked from the main document, so let's move on to it:
@@ -15,10 +15,11 @@
       </w:r>
       <w:bookmarkStart w:id="0" w:name="_GoBack"/>
       <w:bookmarkEnd w:id="0"/>
     </w:p>
     <w:sectPr w:rsidR="0076695C" w:rsidRPr="00290C70">
+      <w:footerReference w:type="default" r:id="rId6"/>
       <w:pgSz w:w="11906" w:h="16838"/>
       <w:pgMar w:top="1134" w:right="850" w:bottom="1134" w:left="1701" w:header="708" w:footer="708" w:gutter="0"/>
       <w:cols w:space="708"/>
       <w:docGrid w:linePitch="360"/>
     </w:sectPr>
This is a rare case where only the necessary changes are present. The explicit link to the footer from sectPr is visible. And since the document contains no references to footnotes or endnotes, we can assume that we will not need them.
@@ -1,19 +1,30 @@
 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
 <w:settings xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:sl="http://schemas.openxmlformats.org/schemaLibrary/2006/main" mc:Ignorable="w14 w15">
   <w:zoom w:percent="100"/>
+  <w:proofState w:spelling="clean"/>
   <w:defaultTabStop w:val="708"/>
   <w:characterSpacingControl w:val="doNotCompress"/>
+  <w:footnotePr>
+    <w:footnote w:id="-1"/>
+    <w:footnote w:id="0"/>
+  </w:footnotePr>
+  <w:endnotePr>
+    <w:endnote w:id="-1"/>
+    <w:endnote w:id="0"/>
+  </w:endnotePr>
   <w:compat>
     <w:compatSetting w:name="compatibilityMode" w:uri="http://schemas.microsoft.com/office/word" w:val="15"/>
     <w:compatSetting w:name="overrideTableStyleFontSizeAndJustification" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
     <w:compatSetting w:name="enableOpenTypeFeatures" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
     <w:compatSetting w:name="doNotFlipMirrorIndents" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
     <w:compatSetting w:name="differentiateMultirowTableHeaders" w:uri="http://schemas.microsoft.com/office/word" w:val="1"/>
   </w:compat>
   <w:rsids>
     <w:rsidRoot w:val="00290C70"/>
+    <w:rsid w:val="000A7B7B"/>
+    <w:rsid w:val="001B0DE6"/>
And here there are references to footnotes and endnotes, which add them to the document.
@@ -480,6 +480,50 @@
       <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
       <w:b/>
       <w:sz w:val="28"/>
     </w:rPr>
   </w:style>
+  <w:style w:type="paragraph" w:styleId="a4">
+    <w:name w:val="header"/>
+    <w:basedOn w:val="a"/>
+    <w:link w:val="a5"/>
+    <w:uiPriority w:val="99"/>
+    <w:unhideWhenUsed/>
+    <w:rsid w:val="000A7B7B"/>
+    <w:pPr>
+      <w:tabs>
+        <w:tab w:val="center" w:pos="4677"/>
+        <w:tab w:val="right" w:pos="9355"/>
+      </w:tabs>
+      <w:spacing w:after="0" w:line="240" w:lineRule="auto"/>
+    </w:pPr>
+  </w:style>
+  <w:style w:type="character" w:customStyle="1" w:styleId="a5">
+    <w:name w:val=" "/>
+    <w:basedOn w:val="a0"/>
+    <w:link w:val="a4"/>
+    <w:uiPriority w:val="99"/>
+    <w:rsid w:val="000A7B7B"/>
+  </w:style>
+  <w:style w:type="paragraph" w:styleId="a6">
+    <w:name w:val="footer"/>
+    <w:basedOn w:val="a"/>
+    <w:link w:val="a7"/>
+    <w:uiPriority w:val="99"/>
+    <w:unhideWhenUsed/>
+    <w:rsid w:val="000A7B7B"/>
+    <w:pPr>
+      <w:tabs>
+        <w:tab w:val="center" w:pos="4677"/>
+        <w:tab w:val="right" w:pos="9355"/>
+      </w:tabs>
+      <w:spacing w:after="0" w:line="240" w:lineRule="auto"/>
+    </w:pPr>
+  </w:style>
+  <w:style w:type="character" w:customStyle="1" w:styleId="a7">
+    <w:name w:val=" "/>
+    <w:basedOn w:val="a0"/>
+    <w:link w:val="a6"/>
+    <w:uiPriority w:val="99"/>
+    <w:rsid w:val="000A7B7B"/>
+  </w:style>
 </w:styles>
Changes in styles interest us only if we are looking for how to change the style. In this case, this change can be removed.
Now let's look at the footer itself (some of the namespaces are omitted for readability, but they must be in the document):
<w:ftr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p w:rsidR="000A7B7B" w:rsidRDefault="000A7B7B">
    <w:pPr>
      <w:pStyle w:val="a6"/>
    </w:pPr>
    <w:r>
      <w:t>123</w:t>
    </w:r>
  </w:p>
</w:ftr>
Here the text 123 is visible. The only thing that needs to be corrected is to remove the style reference <w:pStyle w:val="a6"/>, since we dropped the style changes that define a6.
As a result of the analysis of all changes, we make the following assumptions:
In [Content_Types].xml you need to add the footer
In word/_rels/document.xml.rels you need to add a link to the footer
In word/document.xml you need to add <w:footerReference> to <w:sectPr>
We reduce the diff to this set of changes, then pack the document and open it.
If everything is done correctly, the document will open with a footer containing the text 123. And here is the final commit.
Thus, the process of finding changes is reduced to finding a minimum set of changes sufficient to achieve a given result.
Having found the change we are interested in, it is logical to proceed to the next stage: creating, parsing or processing documents.
Here we need knowledge of XSLT and XPath.
Let's write a fairly simple conversion: replacing or adding a footer in an existing document. I will write in Caché ObjectScript, but it does not matter if you do not know it: basically, we will only call XSLT and the archiver. Nothing more.
The algorithm is as follows:
1 Unpack the document.
2 Add our footer.
3 Register it in [Content_Types].xml and word/_rels/document.xml.rels.
4 In word/document.xml, add a <w:footerReference> tag to each <w:sectPr> tag, or replace the link to our footer in it.
5 Pack the document.
Let's get started.
Caché ObjectScript can execute OS commands using the $zf(-1, oscommand) function. We call unzip to unpack the document using a wrapper over $zf(-1):
/// %3 (unzip) %1 %2
Parameter UNZIP = "%3 %1 -d %2";

/// source targetDir
ClassMethod executeUnzip(source, targetDir) As %Status
{
    set timeout = 100
    set cmd = $$$FormatText(..#UNZIP, source, targetDir, ..getUnzip())
    return ..execute(cmd, timeout)
}
The footer text comes in as input; we write it to the file in.xml:
<xml>TEST</xml>
In XSLT (file footer.xsl) we create a footer with the text from the xml tag (some of the namespaces are omitted, here is the complete list):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://schemas.openxmlformats.org/package/2006/relationships"
                version="1.0">
  <xsl:output method="xml" omit-xml-declaration="no" indent="yes" standalone="yes"/>
  <xsl:template match="/">
    <w:ftr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
      <w:p>
        <w:r>
          <w:rPr>
            <w:lang w:val="en-US"/>
          </w:rPr>
          <w:t>
            <xsl:value-of select="//xml/text()"/>
          </w:t>
        </w:r>
      </w:p>
    </w:ftr>
  </xsl:template>
</xsl:stylesheet>
Now call the XSLT converter:
do ##class(%XML.XSLT.Transformer).TransformFile("in.xml", "footer.xsl", "footer0.xml")
The result is the footer file footer0.xml:
<w:ftr xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:r>
      <w:rPr>
        <w:lang w:val="en-US"/>
      </w:rPr>
      <w:t>TEST</w:t>
    </w:r>
  </w:p>
</w:ftr>
Links with the identifier rId0 usually do not exist. However, you can use XPath to get an identifier that is guaranteed not to exist.
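The idea of computing a free identifier can be sketched as follows. This is not the XPath expression itself but the same logic in Python, assuming identifiers of the form rId followed by a number; `unused_relationship_id` is a name invented for this example.

```python
import re
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
REL_TYPES = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Shortened stand-in for word/_rels/document.xml.rels.
RELS_XML = (
    f'<Relationships xmlns="{RELS_NS}">'
    f'<Relationship Id="rId3" Type="{REL_TYPES}/webSettings" Target="webSettings.xml"/>'
    f'<Relationship Id="rId1" Type="{REL_TYPES}/styles" Target="styles.xml"/>'
    "</Relationships>"
)


def unused_relationship_id(rels_xml):
    """Return an rId one above the highest numeric rId already in use."""
    root = ET.fromstring(rels_xml)
    numbers = []
    for rel in root.iter(f"{{{RELS_NS}}}Relationship"):
        match = re.fullmatch(r"rId(\d+)", rel.get("Id", ""))
        if match:
            numbers.append(int(match.group(1)))
    return f"rId{max(numbers, default=0) + 1}"


free_id = unused_relationship_id(RELS_XML)
empty_id = unused_relationship_id(f'<Relationships xmlns="{RELS_NS}"/>')
print(free_id, empty_id)
```

Taking the maximum plus one avoids collisions even when the existing identifiers are not consecutive.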
Add a link to footer0.xml with the identifier rId0 to word/_rels/document.xml.rels:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://schemas.openxmlformats.org/package/2006/relationships"
                version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="no" />
  <xsl:param name="new">
    <Relationship Id="rId0" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/footer" Target="footer0.xml"/>
  </xsl:param>
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:copy-of select="$new"/>
      <xsl:copy-of select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
Next you need to add the <w:footerReference> tag to each <w:sectPr> tag, or replace the link to our footer in it. It turns out that each <w:sectPr> can have 3 <w:footerReference> tags: for the first page, for even pages, and for everything else:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
                xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
                version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="yes" />
  <xsl:template match="//@* | //node()">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="//w:sectPr">
    <xsl:element name="{name()}" namespace="{namespace-uri()}">
      <xsl:copy-of select="./namespace::*"/>
      <xsl:apply-templates select="@*"/>
      <xsl:copy-of select="./*[local-name() != 'footerReference']"/>
      <w:footerReference w:type="default" r:id="rId0"/>
      <w:footerReference w:type="first" r:id="rId0"/>
      <w:footerReference w:type="even" r:id="rId0"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
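For readers more at home in a general-purpose language, the same transformation can be sketched in Python. This mirrors the stylesheet above (old footer references are dropped and three new ones are appended after the other children) but it is my own illustration, not part of the published code; `set_footer` and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
ET.register_namespace("w", W)
ET.register_namespace("r", R)

# Shortened stand-in for word/document.xml with one existing footer link.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W}" xmlns:r="{R}"><w:body>'
    "<w:p><w:r><w:t>Test</w:t></w:r></w:p>"
    '<w:sectPr><w:footerReference w:type="default" r:id="rId6"/>'
    '<w:pgSz w:w="11906" w:h="16838"/></w:sectPr>'
    "</w:body></w:document>"
)


def set_footer(document_xml, rel_id):
    """Point every section at the footer identified by rel_id."""
    root = ET.fromstring(document_xml)
    for sect in list(root.iter(f"{{{W}}}sectPr")):
        # Drop whatever footer links the section already has.
        for old in sect.findall(f"{{{W}}}footerReference"):
            sect.remove(old)
        # Append links for the default, first and even pages, as the XSLT does.
        for kind in ("default", "first", "even"):
            ET.SubElement(
                sect,
                f"{{{W}}}footerReference",
                {f"{{{W}}}type": kind, f"{{{R}}}id": rel_id},
            )
    return ET.tostring(root, encoding="unicode")


updated = set_footer(DOCUMENT_XML, "rId0")
refs = ET.fromstring(updated).findall(f".//{{{W}}}sectPr/{{{W}}}footerReference")
ref_types = [ref.get(f"{{{W}}}type") for ref in refs]
ref_ids = {ref.get(f"{{{R}}}id") for ref in refs}
print(ref_types, ref_ids)
```

Word itself wrote the footer reference as the first child of <w:sectPr> in the diff earlier, so a stricter version might insert the new elements at the front instead.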
[Content_Types].xml
Add to [Content_Types].xml the information that /word/footer0.xml is of type application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://schemas.openxmlformats.org/package/2006/content-types"
                version="1.0">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="no" />
  <xsl:param name="new">
    <Override PartName="/word/footer0.xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"/>
  </xsl:param>
  <xsl:template match="/*">
    <xsl:copy>
      <xsl:copy-of select="@* | node()"/>
      <xsl:copy-of select="$new"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
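The stylesheet above always appends a new Override, so running it twice on the same document would register the part twice. Here is a small Python sketch of a variant that updates an existing entry instead; `register_part` is a hypothetical name and the duplicate check is my own addition, not something the published code does.

```python
import xml.etree.ElementTree as ET

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
ET.register_namespace("", CT_NS)  # serialize without a prefix, like the original
FOOTER_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.footer+xml"

# Shortened stand-in for [Content_Types].xml.
CONTENT_TYPES = (
    f'<Types xmlns="{CT_NS}">'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/vnd.'
    'openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    "</Types>"
)


def register_part(content_types_xml, part_name, content_type):
    """Add an Override for part_name, or update it if one already exists."""
    root = ET.fromstring(content_types_xml)
    for override in root.findall(f"{{{CT_NS}}}Override"):
        if override.get("PartName") == part_name:
            override.set("ContentType", content_type)
            break
    else:
        ET.SubElement(
            root,
            f"{{{CT_NS}}}Override",
            {"PartName": part_name, "ContentType": content_type},
        )
    return ET.tostring(root, encoding="unicode")


once = register_part(CONTENT_TYPES, "/word/footer0.xml", FOOTER_TYPE)
twice = register_part(once, "/word/footer0.xml", FOOTER_TYPE)
part_names = [o.get("PartName") for o in ET.fromstring(twice).iter(f"{{{CT_NS}}}Override")]
print(part_names)
```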
All the code is published. It works like this:
do ##class(Converter.Footer).modifyFooter("in.docx", "out.docx", "TEST")
Where:
in.docx - source document
out.docx - output document
TEST - text that is added to the footer
Using only XSLT and ZIP, you can successfully work with docx documents, xlsx tables and pptx presentations.
Source: https://habr.com/ru/post/321044/