
How I learned to work with XML

Honestly, I was quite surprised not to find an article on this topic on Habr. The topic is relevant and useful, so I will take the liberty of shedding some light on it.


A short detour


When working with XML in Python, many people use the fairly convenient built-in module xml.dom.minidom. It represents everything in the document, including the text inside tags, as nodes, and you work with those nodes directly. Here is a piece of code that processes an XML file:
import xml.dom.minidom as xmlp  # assuming xmlp is an alias for xml.dom.minidom

app_xml = xmlp.parse("base.xml")
root = app_xml.documentElement  # root element of the parsed document
att = "id"  # name of the tag to add
id = app_xml.createElement(att)
node = app_xml.createTextNode("simple.App")
id.appendChild(node)
root.appendChild(id)
res = open("base.xml", "w")
res.write(app_xml.toprettyxml())  # toprettyxml() returns a single string
res.close()

As you can easily guess, this code opens an existing XML file, parses it and adds a single id tag containing the string “simple.App”. Bulky? That is putting it mildly. On top of that, an unpleasant bug turned up: since Python 2.4 the toprettyxml() function, which is meant to output a node or a tree of nodes as text, for some reason inserts extra newline characters, so that instead of
<id>simple.App</id>

you get
<id>
simple.App
</id>

At first glance this is not critical, since the value itself stays intact. However, in some cases and for some parsers (for example, if you intend to feed the generated XML to other tools) it causes an error when numeric data is processed. The Adobe AIR application packager, for which I was actually writing a wrapper, is one such offender.
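To see why the extra line breaks matter, here is a minimal sketch that uses only the standard library. The two sample strings are made up for illustration. The point is that a DOM parser keeps the whitespace as part of the text node, so a strict consumer that does not strip it sees a different value.

```python
from xml.dom.minidom import parseString

# Two hypothetical fragments: the compact form and the form with line breaks.
compact = parseString("<id>simple.App</id>")
broken = parseString("<id>\nsimple.App\n</id>")


def text_of(doc):
    """Return the raw text stored in the root element's first child node."""
    return doc.documentElement.firstChild.data


compact_text = text_of(compact)
broken_text = text_of(broken)

print(repr(compact_text))
print(repr(broken_text))
# The line breaks stay in the value until the consumer strips them.
print(broken_text.strip() == compact_text)
```

A consumer that calls something like strip() on the text will not notice the difference; one that compares or converts the raw value may.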
A search on the Internet turned up an answer. It turned out that I either had to add a hack to my code in the form of an extra twenty-line function, or use the standard toxml(), which does generate valid files, but with all the information on a single line. In other words, my whole descriptor turned into a mush like this:
<?xml version='1.0' encoding='utf-8'?><application xmlns="http://ns.adobe.com/air/application/1.5"><id>simple.test.program</id><version>0.1</version><filename>testapp</filename><name>testapp</name><initialWindow><title>Test AIR Application</title><content>test.html</content><height>320</height><width>240</width><visible>true</visible><resizable>true</resizable></initialWindow></application>

Call me an aesthete, but with large files (especially ones with comments) hunting for wrong values in such a heap is a dubious pleasure.
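The difference between the two minidom serializers can be tried out with a small sketch like the one below. The document is built from scratch and the element values are only illustrative. Note that how toprettyxml() lays out text nodes depends on the Python version: as far as I know, recent releases keep short text on the same line as its tags, so the line-break bug described above may not reproduce there.

```python
from xml.dom.minidom import Document

# Build a tiny document; the element names mirror the descriptor above.
doc = Document()
root = doc.createElement("application")
doc.appendChild(root)

for tag, value in (("id", "simple.test.program"), ("version", "0.1")):
    element = doc.createElement(tag)
    element.appendChild(doc.createTextNode(value))
    root.appendChild(element)

one_line = doc.toxml()                 # everything on a single line
pretty = doc.toprettyxml(indent="  ")  # spread over several lines

print(one_line)
print(pretty)
```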
So I started looking for an alternative.

Light at the end of the tunnel


That alternative turned out to be the lxml module. It was mentioned in the very thread where xml.dom.minidom was being scolded for the mess it produces :)
Now let's look at the code that generates the application descriptor, without any add-ons or rewrites:
from lxml import etree

root = etree.Element("initialWindow")
etree.SubElement(root, "title").text = title
etree.SubElement(root, "content").text = content
etree.SubElement(root, "height").text = str(height)
etree.SubElement(root, "width").text = str(width)
app_window = etree.tostring(root)
...
root = etree.Element("application", xmlns="http://ns.adobe.com/air/application/1.5")
etree.SubElement(root, "id").text = id
etree.SubElement(root, "version").text = version
etree.SubElement(root, "filename").text = filename
etree.SubElement(root, "name").text = self.name
root.append(etree.XML(app_window))
# with an explicit encoding, tostring() returns bytes
handle = etree.tostring(root, pretty_print=True, encoding='utf-8', xml_declaration=True)
applic = open(self.fullpath + "/" + self.name + "-app.xml", "wb")  # binary mode, since handle is bytes
applic.write(handle)
applic.close()

Much clearer, isn't it? This code generates clean, tidy XML like this:
<?xml version='1.0' encoding='utf-8'?>
<application xmlns="http://ns.adobe.com/air/application/1.5">
  <id>simple.test.program</id>
  <version>0.1</version>
  <filename>testapp</filename>
  <name>testapp</name>
  <initialWindow>
    <title>Test AIR Application</title>
    <content>test.html</content>
    <height>320</height>
    <width>240</width>
  </initialWindow>
</application>

I think 90% of the code needs no explanation. All the magic is in the line handle = etree.tostring(root, pretty_print=True, encoding='utf-8', xml_declaration=True). Here pretty_print replaces the ill-fated toprettyxml(), except that, unlike it, it works properly. We also set the encoding and add the standard XML declaration line at the top of the document.
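Here is a self-contained sketch of that serialization step, so the effect of the three parameters can be tried in isolation. The helper names build_window and serialize are hypothetical, the namespace is left out for brevity, and the fallback to the standard library's xml.etree.ElementTree (for machines without lxml) is my assumption rather than part of the original code.

```python
try:
    from lxml import etree  # third-party; the module discussed in this article
    HAVE_LXML = True
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree
    HAVE_LXML = False


def build_window(title, content, height, width):
    """Build the <initialWindow> element; the values are illustrative."""
    window = etree.Element("initialWindow")
    etree.SubElement(window, "title").text = title
    etree.SubElement(window, "content").text = content
    etree.SubElement(window, "height").text = str(height)
    etree.SubElement(window, "width").text = str(width)
    return window


def serialize(element):
    """Return the element as indented UTF-8 bytes with an XML declaration."""
    if HAVE_LXML:
        return etree.tostring(element, pretty_print=True,
                              encoding="utf-8", xml_declaration=True)
    etree.indent(element)  # standard library only, needs Python 3.9+
    return etree.tostring(element, encoding="utf-8", xml_declaration=True)


window = build_window("Test AIR Application", "test.html", 320, 240)
data = serialize(window)
print(data.decode("utf-8"))
```

Because an explicit encoding is passed, the result is bytes rather than text, so on Python 3 it should be written to a file opened in binary mode.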
Reportedly this module, oddly enough, also runs about twice as fast as the standard one. Installation is trivial via setuptools:
$ sudo easy_install lxml

A decent tutorial is available on the official site.

Long live beautiful, convenient and valid code! Good luck to everyone, and do leave comments.

Source: https://habr.com/ru/post/61523/

