Python, xlsx with pictures, macros, media-art, embedded objects

Greetings, habrosoobschestvo.

I decided to write a little about python work with xlsx.
Yes, I know that a lot has been written about this, but nevertheless, I could not find a ready-made solution for my task.
You need to fill in a few dozen cells in the finished xls file.

Everything would be fine, but the file contained macros, images, media-art objects, embedded objects and much more.
After going through the popular solutions (openpyxl, xlutils.copy and a couple of others), I realized that they spoil the file a bit.
And the task was reduced precisely to a point correction of the cells and so that everything else remained as it was.
')

Prehistory

Unexpectedly, one bank in one country does not have an information system.
for making a decision on granting a loan.
It is replaced by the excel file, which was inherited from time immemorial
and went through the hands of "a little more than a lot" of "Excel programmers."

And you need to somehow transfer to the bank loan applications.
That is, you need to take this file, drive some data into it and send it by mail.

I thought “I am not the first, nor am I the last” and went to look for ready-made solutions.
Everything was not so simple.
openpyxl:
for some reason, I didn’t completely save the stroke of the cells that I didn’t even touch (leaving the border only at the bottom left corner)
Even with keep_vba = True, “broken” xlsm files (which may contain macros) were obtained.
xlutils:
I lost pictures and media-art objects and needed to work with xls, instead of xlsm
Having tried a couple more options, I decided to arm myself with a file, spanners and start writing my bicycle with a more comfortable saddle shape, which does not strive to get rid of a couple of unnecessary parts in the process of using for its intended purpose.

In fact, the task is extremely simple
because it’s just a zip archive with a bunch of XML files.
We just need to make a few changes and pack it all back, leaving everything else as it was.

So let's go

We need ZipFile, lxml and nothing else for this.

import os import re from datetime import datetime, date from StringIO import StringIO from zipfile import ZipFile, ZIP_DEFLATED from lxml import etree

Constructor.
Immediately find out where our sheets are located by calling self._get_sheet_locations ()
compile the regular list to find the row number from the cell index

 def __init__(self, zip_folder): u""" @param zip_folder:   ,      xlsx    '[Content_Types].xml'      """ self._zip_folder = zip_folder self._data = {} self._zip_stream = StringIO() self._row_finder = re.compile(r'\d+$') self._namespaces = { 'ws': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main', 'rel': 'http://schemas.openxmlformats.org/package/2006/relationships' } self._sheet_paths = self._get_sheet_locations() #   self._shared_strings = None self._shared_strings_root = None self._shared_strings_index = None

We do a function that gets us objects lxml etree
on the way to the xml file, so as not to do the same actions

 def _get_xml(self, file_path): u"""  XML-     @param file_path:       """ return etree.parse(os.path.join(self._zip_folder, file_path))

Strings in xlsx are stored in a separate file.
In the cells (xml nodes) the indexes on the entries in the dictionary are physically stored.
This is a bit like an archiver.

Initialize the dictionary.
Convert its xml object, convenient for adding new values.

This is not very correct, but we will not look for whether there is a string in the dictionary and reuse it.
Since the volume of changes is small (compared to the number of lines that already exist in the file),
we just add another value to the dictionary.

We will need
_shared_strings (then from it we will save the modified dictionary)
_shared_strings_root - we will add new lines to it
_shared_strings_index - current index (to track the count of the added string)

 def _init_shared_strings(self): u"""       .  -        .     self._add_shared_string """ self._shared_strings = self._get_xml('xl/sharedStrings.xml') self._shared_strings_root = self._shared_strings.xpath('/ws:sst', namespaces=self._namespaces)[0] self._shared_strings_index = int(self._shared_strings_root.attrib['uniqueCount'])

Add a string to the dictionary and return its index.
This will be necessary to make changes to the string values.

 def _add_shared_string(self, value): u"""     sharedStrings    ,      .  -  -     . uniqueCount  Count   (    ) @param value:      @return:    sharedStrings """ if self._shared_strings is None: self._init_shared_strings() node_t = etree.Element('t') node_t.text = value node_si = etree.Element('si') node_si.append(node_t) self._shared_strings_root.append(node_si) self._shared_strings_index += 1 return (self._shared_strings_index - 1)

We collect data on where a sheet is in the archive.
This data is distributed across 2 XML files:
- xl / workbook.xml
where information is stored that is in the book
- xl / _rels / workbook.xml.rels
where is the information about where it all lies

 def _get_sheet_locations(self): u"""     @return: . {_: __xml} """ #  sheets_id = {} workbook_xml = self._get_xml('xl/workbook.xml') for sheet_xml in workbook_xml.xpath('/ws:workbook/ws:sheets/ws:sheet', namespaces=self._namespaces): sheet_name = sheet_xml.attrib['name'] sheet_rid = sheet_xml.attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id'] sheets_id[sheet_rid] = sheet_name #   paths = {} xml = self._get_xml('xl/_rels/workbook.xml.rels') for node in xml.xpath('/rel:Relationships/rel:Relationship', namespaces=self._namespaces): r_id = node.attrib['Id'] path = os.path.join('xl', node.attrib['Target']) if r_id in sheets_id: sheet_label = sheets_id[r_id] paths[sheet_label] = path return paths

Create a zip-archive, collecting in it all the files that have not changed (without exclude_files)

 def _create_base_zip(self, exclude_files): u"""         zip_folder  .        .       _add_changes @param exclude_files:    @return:  ZipFile """ zip_file = ZipFile(self._zip_stream, mode='a', compression=ZIP_DEFLATED) for path, dirs, files in os.walk(self._zip_folder): rel_path = path[len(self._zip_folder):] for file_name in files: if rel_path == '': zip_name = file_name else: zip_name = os.path.join(rel_path, file_name) if zip_name not in exclude_files: zip_file.write(os.path.join(path, file_name), zip_name) return zip_file

Add the modified sheets to the zip file, passed as an argument.
We look at all the accumulated data on the sheets, request the modified sheets and add them to the archive.

 def _add_changes(self, zip_file): u"""  .         zip     @param zip_file:  ZipFile  ,      """ #     for sheet_name, data in self._data.items(): sheet_file = self._sheet_paths[sheet_name] sheet_content = self._get_changed_sheet(sheet_file=sheet_file, data=data) zip_file.writestr(sheet_file, sheet_content)

Make changes to the worksheet and return the XML string as amended.

 def _get_changed_sheet(self, sheet_file, data): u"""         ZIP- @param sheet_file:   xml-   @param data:    {cell: value} @return: xml-    """ xml = etree.parse(os.path.join(self._zip_folder, sheet_file)) for cell, value in data.items(): self._change_cell(xml, cell, value) return etree.tostring(xml, xml_declaration=True, encoding="UTF-8", standalone="yes")

We edit a single cell, taking the sheet, the cell address and the value as input.
There is a very important point.
The cell must be in XML (because we are considering exactly the change in the values in the cells).
This cell is usually formatted and colored.
This solution does not consider the absence of a node with a cell.
You can always add the processing of cell presence in the row.
or the line itself in the book.

 def _change_cell(self, xml, cell, value): u"""      xml       xml @param xml:  lxml @param cell:     "C2" @param value:  """ row_index = self._row_finder.search(cell).group() value_type = type(value) pattern_params = {'row_index': row_index, 'cell': cell} pattern = '/ws:worksheet/ws:sheetData/ws:row[@r="%(row_index)s"]/ws:c[@r="%(cell)s"]' % pattern_params node_c = xml.xpath(pattern, namespaces=self._namespaces)[0] node_v = node_c.find('ws:v', namespaces=self._namespaces) #     -    if node_v is None: node_v = etree.Element('v') node_c.append(node_v) #  if value == None: node_c.remove(node_v) if node_c.attrib.get('t') == 's': del node_c.attrib['t'] #   elif value_type in (unicode, str): value = str(self._add_shared_string(value)) node_c.attrib['t'] = 's' #       else: if node_c.attrib.get('t') == 's': del node_c.attrib['t'] if value_type == datetime: value = value.date() if value_type == date: value = (value - date(1899, 12, 30)).days node_v.text = unicode(value)

There are only two public methods.

We collect changes to the dictionary. At this stage we do not make changes.
Only simple data types can be transferred here:
- None
- Number (int, float)
- String (str, unicode)
- Date

 def write(self, sheet, cell, value): u"""           .       - xlsx @param sheet:   @param cell:   ( C4) @param value:      """ if value is not None and type(value) not in (int, float, str, unicode): raise TypeError(u' None, int, float, str, unicode') if sheet not in self._data: self._data[sheet] = {} self._data[sheet][cell] = value

Get the content of the archive.
I didn’t do it as a file, zip-archive, because the content is more universal
and it is convenient to give via HttpResponse

 def get_content(self): u"""    xlsx  . ,     ,   """ exclude_files = ['/%s' % e[1] for e in self._sheet_paths.items() if e[0] in self._data.keys()] exclude_files.append('/xl/sharedStrings.xml') zip_file = self._create_base_zip(exclude_files=exclude_files) self._add_changes(zip_file) zip_file.writestr('xl/sharedStrings.xml', etree.tostring(self._shared_strings, xml_declaration=True, encoding="UTF-8", standalone="yes")) zip_file.close() return self._zip_stream.getvalue()

It seems, and all.
Use it like this:

 xlsx = XLSXEdit('path_to_unzip_folder') xlsx.write('Sheet1', 'A1', 333) xlsx.write('Sheet1', 'A2', 44444) xlsx.write('Sheet1', 'A3', datetime.now()) xlsx.write('Sheet1', 'A4', u'') with open('/Users/dibrovsd/Desktop/out.xlsx', 'w') as zip_file: zip_file.write(xlsx.get_content())

I think that the biggest advantage of the decision is that it is clear what is happening
and there is nothing complicated. If necessary, you can modify the solution,
If in the current form it does not suit you.

The result is this

 import os import re from datetime import datetime, date from StringIO import StringIO from zipfile import ZipFile, ZIP_DEFLATED from lxml import etree class XLSXEdit(object): u"""   xlsx     xml :   ,        , ,  -,   , MS query  (-  excel)         :      XML       """ def __init__(self, zip_folder): u""" @param zip_folder:   ,      xlsx    '[Content_Types].xml'      """ self._zip_folder = zip_folder self._data = {} self._zip_stream = StringIO() self._row_finder = re.compile(r'\d+$') self._namespaces = { 'ws': 'http://schemas.openxmlformats.org/spreadsheetml/2006/main', 'rel': 'http://schemas.openxmlformats.org/package/2006/relationships' } self._sheet_paths = self._get_sheet_locations() #   self._shared_strings = None self._shared_strings_root = None self._shared_strings_index = None def write(self, sheet, cell, value): u"""           .       - xlsx @param sheet:   @param cell:   ( C4) @param value:      """ if value is not None and type(value) not in (int, float, str, unicode, date, datetime): raise TypeError(u' None, int, float, str, unicode') if sheet not in self._data: self._data[sheet] = {} self._data[sheet][cell] = value def get_content(self): u"""    xlsx  . ,     ,   """ exclude_files = ['/%s' % e[1] for e in self._sheet_paths.items() if e[0] in self._data.keys()] exclude_files.append('/xl/sharedStrings.xml') zip_file = self._create_base_zip(exclude_files=exclude_files) self._add_changes(zip_file) zip_file.writestr('xl/sharedStrings.xml', etree.tostring(self._shared_strings, xml_declaration=True, encoding="UTF-8", standalone="yes")) zip_file.close() return self._zip_stream.getvalue() def _get_xml(self, file_path): u"""  XML-     @param file_path:       """ return etree.parse(os.path.join(self._zip_folder, file_path)) def _init_shared_strings(self): u"""       .  -        .     self._add_shared_string """ self._shared_strings = self._get_xml('xl/sharedStrings.xml') self._shared_strings_root = self._shared_strings.xpath('/ws:sst', namespaces=self._namespaces)[0] self._shared_strings_index = int(self._shared_strings_root.attrib['uniqueCount']) def _add_shared_string(self, value): u"""     sharedStrings    ,      .  -  -     . uniqueCount  Count   (    ) @param value:      @return:    sharedStrings """ if self._shared_strings is None: self._init_shared_strings() node_t = etree.Element('t') node_t.text = value node_si = etree.Element('si') node_si.append(node_t) self._shared_strings_root.append(node_si) self._shared_strings_index += 1 return (self._shared_strings_index - 1) def _get_sheet_locations(self): u"""     @return: . {_: __xml} """ #  sheets_id = {} workbook_xml = self._get_xml('xl/workbook.xml') for sheet_xml in workbook_xml.xpath('/ws:workbook/ws:sheets/ws:sheet', namespaces=self._namespaces): sheet_name = sheet_xml.attrib['name'] sheet_rid = sheet_xml.attrib['{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id'] sheets_id[sheet_rid] = sheet_name #   paths = {} xml = self._get_xml('xl/_rels/workbook.xml.rels') for node in xml.xpath('/rel:Relationships/rel:Relationship', namespaces=self._namespaces): r_id = node.attrib['Id'] path = os.path.join('xl', node.attrib['Target']) if r_id in sheets_id: sheet_label = sheets_id[r_id] paths[sheet_label] = path return paths def _create_base_zip(self, exclude_files): u"""         zip_folder  .        .       _add_changes @param exclude_files:    @return:  ZipFile """ zip_file = ZipFile(self._zip_stream, mode='a', compression=ZIP_DEFLATED) for path, dirs, files in os.walk(self._zip_folder): rel_path = path[len(self._zip_folder):] for file_name in files: if rel_path == '': zip_name = file_name else: zip_name = os.path.join(rel_path, file_name) if zip_name not in exclude_files: zip_file.write(os.path.join(path, file_name), zip_name) return zip_file def _add_changes(self, zip_file): u"""  .         zip     @param zip_file:  ZipFile  ,      """ #     for sheet_name, data in self._data.items(): sheet_file = self._sheet_paths[sheet_name] sheet_content = self._get_changed_sheet(sheet_file=sheet_file, data=data) zip_file.writestr(sheet_file, sheet_content) def _get_changed_sheet(self, sheet_file, data): u"""         ZIP- @param sheet_file:   xml-   @param data:    {cell: value} @return: xml-    """ xml = etree.parse(os.path.join(self._zip_folder, sheet_file)) for cell, value in data.items(): self._change_cell(xml, cell, value) return etree.tostring(xml, xml_declaration=True, encoding="UTF-8", standalone="yes") def _change_cell(self, xml, cell, value): u"""      xml       xml @param xml:  lxml @param cell:     "C2" @param value:  """ row_index = self._row_finder.search(cell).group() value_type = type(value) pattern_params = {'row_index': row_index, 'cell': cell} pattern = '/ws:worksheet/ws:sheetData/ws:row[@r="%(row_index)s"]/ws:c[@r="%(cell)s"]' % pattern_params node_c = xml.xpath(pattern, namespaces=self._namespaces)[0] node_v = node_c.find('ws:v', namespaces=self._namespaces) #     -    if node_v is None: node_v = etree.Element('v') node_c.append(node_v) #  if value == None: node_c.remove(node_v) if node_c.attrib.get('t') == 's': del node_c.attrib['t'] #   elif value_type in (unicode, str): value = str(self._add_shared_string(value)) node_c.attrib['t'] = 's' #       else: if node_c.attrib.get('t') == 's': del node_c.attrib['t'] if value_type == datetime: value = value.date() if value_type == date: value = (value - date(1899, 12, 30)).days node_v.text = unicode(value)

Thank you for your attention, if you have read it to the end.

Source: https://habr.com/ru/post/254167/

All Articles

Python, xlsx with pictures, macros, media-art, embedded objects

More articles: