OOXML parser (docx, xlsx, pptx) in Ruby: our mistakes and discoveries

We have released our parser for OOXML formats in Ruby as open source. It is available on GitHub and RubyGems.org, free of charge and distributed under the AGPLv3 license. Everything the way fashionable Ruby developers do it.

Why we did not use third-party parsers

It is no secret that ours is not the first OOXML parser in Ruby. We could have taken a third-party product, but decided against it. The solutions we managed to find share a number of problems:

a) they were abandoned by their developers long ago;
b) they support only basic functionality;
c) they are usually distributed as three separate libraries. The docx parser and the xlsx parser were often written by different people, so their interfaces can differ completely. You will agree that this is inconvenient.

What makes our parser different

We wrote it for ourselves and our own tasks (testing document editors), but then realized that it might help other Ruby developers, because it:

a) is actively developed;
b) supports all the functionality of our editors, which is a lot;
c) is called OOXML parser because it works with docx, xlsx and pptx alike.

Let us dwell separately on point b), functionality. Do we support every possible feature of the standard? No. The ECMA-376 standard is four volumes with over 9000 pages in total (just kidding). In fact, it is about 7 thousand. You can exhale.

In short, you get the idea: not everything is implemented. But all the essentials are there and more: paragraphs, tables and autoshapes are recognized. There is also support for more complex things, such as:

Color schemes;
Paragraph and table styles;
Built-in charts;
Properties of autoshapes;
Columns;
Lists.

Why a parser was needed

Why do we need parsers at all?
Ours was born in the testing department.
From the very beginning of test automation, we adopted a single concept for functional tests.

Take the simplest test:

1. Create a new document.
2. Type some text and set the Bold property.
3. Check that Bold is set.

The ONLYOFFICE editor renders documents on Canvas, that is, the text in the document is a picture. Checking the font weight in a picture is extremely difficult. And Bold can be applied to any font!

In some fonts (such as Arial Black), Bold may not be visible at all. You will agree that comparing images with ImageMagick is not the best option.

Therefore, verification was moved into a separate step of the test, namely:

4. Download the resulting file in docx format and check that the Bold parameter is set for the text.
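
Step 4 can be illustrated with a minimal sketch. It assumes the contents of word/document.xml have already been extracted from the downloaded archive, and it uses REXML from the standard Ruby distribution so that the example has no gem dependencies. The helper names `run_bold?` and `bold_flags` are hypothetical and are not part of the gem's API.

```ruby
require 'rexml/document'

# WordprocessingML main namespace, as defined by ECMA-376.
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical helper: true when the run properties contain <w:b/>
# that is not explicitly switched off.
def run_bold?(run)
  bold = REXML::XPath.first(run, 'w:rPr/w:b', 'w' => W_NS)
  return false if bold.nil?

  # Assumes the conventional "w" prefix is used in the document.
  !%w[false 0 off].include?(bold.attributes['w:val'])
end

# Hypothetical helper: pairs the text of every run with its Bold flag.
def bold_flags(document_xml)
  doc = REXML::Document.new(document_xml)
  REXML::XPath.match(doc, '//w:r', 'w' => W_NS).map do |run|
    text = REXML::XPath.match(run, 'w:t', 'w' => W_NS).map(&:text).join
    [text, run_bold?(run)]
  end
end

sample = <<~XML
  <w:document xmlns:w="#{W_NS}">
    <w:body><w:p>
      <w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>
      <w:r><w:t>world</w:t></w:r>
    </w:p></w:body>
  </w:document>
XML

p bold_flags(sample)
```

The check reads the property from the file rather than from pixels, so it gives the same answer for Arial Black as for any other font.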

There are hundreds of such parameters. Yet none of the existing solutions supported anything beyond the simplest extraction of text, tables and a couple of other things. So we decided to create our own library.

Wait, you may ask: you are developing a document editor that can open all these formats for editing! Why not reuse the editor's own ready-made parser and verify the tests through it?

Why not?

1. In the server part of the editors, the parser is written in C++, while the whole automated testing process is built on Ruby. Offhand, it was not entirely clear how to tie the two together.

2. Now we have a Linux version (and it is the main one), but when the testing infrastructure was being put together, the server part of the document editors supported only Windows. Meanwhile, testing always ran on Ubuntu and its derivatives. Gluing all this together would have required devising clever schemes.

3. Can the server parser even be considered a reference? Verifying the product's output with the product itself? A dubious idea.

How the parser works

If you have ever tried to archive a docx file, you might have noticed that the compression ratio is very low. Why is that? It's simple: OOXML files are just zipped sets of XML files. Their structure is rather trivial.
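
One quick way to see this for yourself is to look at the first bytes of the file. The sketch below checks for the local file header signature that a ZIP archive normally starts with; the helper names `zip_container?` and `zip_file?` are hypothetical.

```ruby
# A ZIP archive that begins with a file entry starts with the
# local file header signature "PK\x03\x04".
ZIP_SIGNATURE = "PK\x03\x04".b

# Hypothetical helper: true when the given bytes start with the signature.
def zip_container?(bytes)
  bytes.b.start_with?(ZIP_SIGNATURE)
end

# Hypothetical helper: reads only the first four bytes of a file on disk.
def zip_file?(path)
  zip_container?(File.binread(path, 4).to_s)
end
```

Calling `zip_file?` on a docx, xlsx or pptx file should return true, which is why compressing such a file a second time gains almost nothing.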

For example, let's create a simple greeting file in our ONLYOFFICE editor and download it as docx. Then we unzip it like an ordinary zip file and look for the meat of the document.

We will see the following structure:

# tree
├── [Content_Types].xml
├── docProps
│   ├── app.xml
│   └── core.xml
├── _rels
└── word
    ├── document.xml
    ├── fontTable.xml
    ├── _rels
    │   └── document.xml.rels
    ├── settings.xml
    ├── styles.xml
    ├── theme
    │   ├── _rels
    │   │   └── theme1.xml.rels
    │   └── theme1.xml
    └── webSettings.xml

Let's dig into the guts, in order.

[Content_Types].xml - a list of MIME types in the document. Cold.

app.xml - document metadata, creator application, statistics. Getting warmer: the information is interesting and useful.

core.xml - metadata about the latest modifications.

document.xml - oh, bingo. The content of our document hides in this file; we will look at it later.

fontTable.xml - table of fonts in the document. It will be useful.

document.xml.rels - a list of the other files in the archive that document.xml refers to; this list becomes very useful for complex documents with pictures and charts.

settings.xml - as the name suggests, it stores various document parameters, such as the default zoom, number separators and so on.

styles.xml, theme1.xml and theme1.xml.rels are very cumbersome, very detailed files containing style and theme parameters. The ability to understand these documents is one of the key features of the product.

webSettings.xml - settings for the web version of the document. Not the most popular functionality for docx, so we will skip it.
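
To make the role of document.xml.rels more concrete, here is a minimal sketch that turns its contents into a hash from relationship Id to target file. It assumes the XML has already been read from the archive and uses REXML from the standard Ruby distribution; the helper name `relationship_targets` is hypothetical.

```ruby
require 'rexml/document'

# Namespace of relationship parts, as defined by ECMA-376.
RELS_NS = 'http://schemas.openxmlformats.org/package/2006/relationships'

# Hypothetical helper: maps every relationship Id to its Target.
def relationship_targets(rels_xml)
  doc = REXML::Document.new(rels_xml)
  REXML::XPath.match(doc, '//r:Relationship', 'r' => RELS_NS).to_h do |rel|
    [rel.attributes['Id'], rel.attributes['Target']]
  end
end

sample = <<~XML
  <Relationships xmlns="#{RELS_NS}">
    <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
    <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="media/image1.png"/>
  </Relationships>
XML

p relationship_targets(sample)
```

When document.xml mentions a picture by an Id such as rId2, this hash tells the parser which file inside the archive holds it.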

So, for a simple document, the interesting file turns out to be word/document.xml.

Simple XML. Fortunately, parsing XML in Ruby is no problem: take Nokogiri and get a DOM tree. After that it is a matter of technique: either read the standard (if you are not too lazy, the document is very large) or use good old reverse engineering to figure out where the needed parameter hides in the document.
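
As an illustration of walking such a DOM tree, here is a minimal sketch that collects the text of every paragraph. It swaps Nokogiri for REXML from the standard Ruby distribution purely to avoid a gem dependency; the helper name `paragraph_texts` is hypothetical, and the real parser handles far more than this.

```ruby
require 'rexml/document'

# WordprocessingML main namespace, as defined by ECMA-376.
WORD_NAMESPACES = {
  'w' => 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
}.freeze

# Hypothetical helper: returns the text of every paragraph in the document.
def paragraph_texts(document_xml)
  doc = REXML::Document.new(document_xml)
  REXML::XPath.match(doc, '//w:p', WORD_NAMESPACES).map do |paragraph|
    # A paragraph (w:p) holds runs (w:r); each run holds text elements (w:t).
    REXML::XPath.match(paragraph, 'w:r/w:t', WORD_NAMESPACES).map(&:text).join
  end
end

sample = <<~XML
  <w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <w:body>
      <w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t>, world!</w:t></w:r></w:p>
      <w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>
    </w:body>
  </w:document>
XML

p paragraph_texts(sample)
```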

How to write the parser

At the beginning of our work, we made a number of mistakes, which we corrected as our understanding grew. The two most significant ones are described below; thankfully they are in the past, and we are no longer ashamed of them. We hope our experience helps others avoid stepping on the same rake.

Huge files
So, we have the task of processing three different document formats. How do we organize the code for this? Of course: three files of 4000 lines each (in fact, even four files of 4000 lines, because there were also methods shared between the formats).

Solving this problem took the most time. We had to bring the whole mess into a tidy form (although a 300-line file still pops up now and then), split the methods into neat classes, and so on. We now have over 200 source files instead of four. Fixing bugs has become easier.

Lack of tests
The logic was this: we are writing a parser to test our main product ONLYOFFICE Document Server, why should we test the parser itself?

NO. NO. NO!!!

A scene from life:

- We need to fix something here, the shape color is detected incorrectly.

- Sure, one moment. There was a typo there, I fixed one letter and committed.

Result:

Everything fell. The parser, the editor, the dollar, Humpty Dumpty, self-esteem.

All we had to do was create a basic `spec`, put a couple of hundred files in it and check a bunch of parameters, in order to sleep well at night knowing that the commit made just before leaving work would not break verification of an option buried in a third-level nested menu. Or, as we call it, “third star to the left.”

But it was not all misses. We had some sound ideas too. The best ones:
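
The idea of such a safety net can be sketched without any test framework. Everything below is hypothetical: the file names, the expected values and the stand-in parser are invented for illustration, and in a real `spec` each table row would become a separate example that calls the actual parser.

```ruby
# Hypothetical table: fixture file => values the parser must report.
EXPECTATIONS = {
  'bold_text.docx'  => { bold: true,  font: 'Arial' },
  'plain_text.docx' => { bold: false, font: 'Arial' }
}.freeze

# Returns a list of human-readable mismatches. An empty list means
# the commit did not break anything covered by the table.
def regressions(expectations, parser)
  expectations.flat_map do |file, expected|
    actual = parser.call(file)
    expected.reject { |key, value| actual[key] == value }
            .map { |key, value| "#{file}: #{key} expected #{value.inspect}, got #{actual[key].inspect}" }
  end
end

# Stand-in for a real parser call; it only looks at the file name.
fake_parser = ->(file) { { bold: file.start_with?('bold'), font: 'Arial' } }

puts regressions(EXPECTATIONS, fake_parser).empty? ? 'no regressions' : 'something broke'
```

The value of the table grows with every bug fixed: each corrected file and parameter becomes one more row that a careless one-letter commit cannot silently break.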

Using RuboCop
RuboCop is a static code analyzer for Ruby, and we love it. Very, very much. And we always listen to its opinion. It helps keep the code in good shape, prevents silly mistakes and strictly ensures that the code does not get dirtier or worse after each commit (thanks to integration via overcommit).

Here is how it works: after a hard day's work, you forget that variable names in Ruby start with a lowercase letter and try to commit code like this:

- path_to_zip_file = copy_file_and_rename_to_zip(path_to_file)
+ ZIP_file = copy_file_and_rename_to_zip(path_to_file)

In this case, an error will occur:

Analyze with RuboCop ............................................ [RuboCop] FAILED
Errors on modified lines:
ooxml_parser/lib/ooxml_parser/common_parser/parser.rb:8:7: E: dynamic constant assignment

Committing this code without extra manipulations (`SKIP=RuboCop git commit -av`) will not work. An excellent safeguard against foolishness.
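
The error has a simple cause: Ruby treats a name that starts with a capital letter as a constant, and a constant cannot be assigned inside a method body. The sketch below reproduces this, assuming that the flagged line sits inside a method (the article does not show the surrounding code, so the method name `parse` is an assumption). It relies on `RubyVM::InstructionSequence`, which is specific to the reference Ruby implementation.

```ruby
# The flagged assignment wrapped in a method; the wrapper is an assumption.
BROKEN_SOURCE = <<~RUBY
  def parse(path_to_file)
    ZIP_file = copy_file_and_rename_to_zip(path_to_file)
  end
RUBY

# Compiles the source without running it and returns the parser's
# complaint, or nil when the source is valid Ruby.
def syntax_error_message(source)
  RubyVM::InstructionSequence.compile(source)
  nil
rescue SyntaxError => e
  e.message
end

puts syntax_error_message(BROKEN_SOURCE)
```

So the check catches more than a style violation here: the file would not even load.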

Following the example of open-source projects
Practically from the very start of parser development, we took our cues from other open-source projects. Although we were not sure that our code would ever be published as open source, we were always ready for it. When the order to publish arrived, we simply pressed the “make public” button on GitHub and that was it, no extra tidying up needed.

Much of the credit goes to that same RuboCop: we often peeked into its code when deciding how best to organize one thing or another, for example the changelog or the gem structure. In addition, all development, commits, change history and so on were in English from the start.

Using the document base
When testing the parser, our earlier work came in handy: a large collection of all sorts of strange, huge and incomprehensible files in the three formats.

Long ago, at an early stage of ONLYOFFICE editor development, we collected these files on the Internet and used them to check the rendering of complex and non-standard documents. A few years later, we ran the parser through the same collection. This surfaced quite a lot of problems of varying complexity, and after a couple of weeks spent fixing them we got an excellent product.
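
A run over such a collection might look like the sketch below. It is not taken from the project: `parse_document_base` is a hypothetical helper, and `parser` is any callable that raises on a file it cannot handle (with the real gem it would wrap the gem's parse entry point).

```ruby
# Hypothetical helper: runs the parser over every file matching the
# pattern and collects failures instead of stopping at the first one.
def parse_document_base(pattern, parser)
  Dir.glob(pattern).sort.each_with_object({}) do |path, failures|
    parser.call(path)
  rescue StandardError => e
    failures[path] = "#{e.class}: #{e.message}"
  end
end

# Stand-in parser for illustration: it rejects empty files.
stand_in = ->(path) { raise ArgumentError, 'empty file' if File.zero?(path) }

failures = parse_document_base('documents/**/*.{docx,xlsx,pptx}', stand_in)
failures.each { |path, error| puts "#{path}: #{error}" }
```

Collecting every failure in one pass gives a full list of problem files to work through, rather than one crash at a time.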

Summary

So, everything is available: take it, add it to your Ruby application, parse docx files and build statistics on them, analyze how your xlsx bookkeeping works, find out what note your PM hid on the fourth slide of the product presentation. And all this for free.

You can also find problematic files and create an issue on GitHub, and we will sort it out. You can even fix things yourself and send pull requests.

Source: https://habr.com/ru/post/302826/
