Article from February 19, 2008Last week, Microsoft
published the specification of binary file formats for Office. These formats look crazy. The file format of Excel 97-2003 is a 349-page PDF file. And that is not all! The document contains the following comment:
Each sheet of [workbook] in Excel is stored in a compound file.
You see, Excel 97-2003 files are OLE master documents, which in turn represent a kind of file system in a single file. To understand this, you need to read the 9 pages of documentation. And the specifications themselves are more like data structures in C than what we used to call specifications. This is a hierarchical file system.
If you thought that you were reading these formats and, over the weekend, you would jot down a utility for exporting Word documents to your blog, or creating Excel plates based on your personal financial data, then the complexity and length of these specifications should have discouraged you. A normal programmer will decide that the format of binaries from Office:
')
- made confusing specially
- coined by some representative of the cybernetic borg race suffering from senile insanity
- created by insanely bad programmers
- cannot be created or read correctly
And make a mistake. If you dig, you can show how and why these formats have become so complex, why they have nothing to do with bad programming, and what can be done about it.
The first thing you need to understand is that the goals of developers of binary formats were radically different from those of developers, say, HTML.
They had to work very quickly on very old computers . At the time of the first versions of Excel for Windows, 1 megabyte of memory was not uncommon, and the program should work quite comfortably on 80386 processors with a frequency of 20 MHz. Many optimizations are made to speed up the opening and saving of files:
- These are binary file formats, so loading a record usually means copying a sequence of bytes from the disk into memory, in which the C data structure appears. No parsing or lexical analysis of the data takes place, as this is several times slower than simple copying.
- The file format is confused in the right places to speed up typical operations. For example, Excel 95 and 97 had a “simple save” function, which was used as an accelerated version of the OLE document, the full version of which was not too fast for general use. Word had something similar called " quick save ." To quickly save long documents 14 times out of 15, all changes were simply added to the end of the file, and the entire file was not overwritten from scratch. For hard drives of that time, this meant that it was possible to have time to save the document, say, for 1 second instead of 30. It also meant that the deleted parts of the document were still stored in the file — and people, as it turned out, did not need it .
They were designed with libraries in mind . If you needed to write a binary import function from scratch, you needed to support things like Windows Metafile Format (for drawing) and OLE Compound Storage. When working under Windows, there are libraries for them, therefore support for such functions was a trivial matter. But if you write everything from scratch, you would have to do everything yourself.
Office had support for compound documents, for example, it was possible to include a spreadsheet in a Word file. The perfect Word parser should have been able to do something clever with the table turned on.
They were not designed for use in other applications . A reasonably reasonable assumption at that time was that the Word format would only write and read Word. Therefore, when a programmer from the Word development team decided to change the file format, he was only worried about a) speed and b) minimum number of lines in Word code. Ideas such as SGML and HTML, replaceable, open, and standardized formats were not popular until the Internet made such things practical. This moment came 10 years after the development of Office file formats. It was always supposed to use programs for export and import. Word has a format support for easy sharing of documents called
RTF , which existed almost from the very beginning.
They needed to reflect the complexity of the applications . Every tick, every formatting option and every Microsoft Office function needed to be stored in files. Therefore, to create the perfect clone of Word that reads its files, it was necessary to implement all its functions. If you create a program for working with text - a competitor of Word, which should be able to download its files, it may take a little time for you to load the options specified in the file. But the actual display of them all on the page is a more complicated task. And if you don’t solve it, then your customers will open the Word file in your clone, and all formatting may break.
They needed to reflect the history of program development. Many complex things in formats are old, complex, unnecessary and rarely used functions. They are there only for backward compatibility and because developers don’t have to leave the code alone. But in order to thoroughly do the work of disassembling or recording these files, you will have to repeat all this work that has been done at Microsoft for 15 years. Thousands of man-hours of work are invested in current versions of Word and Excel, and to clone these programs you have to invest your thousands of man-hours. The file format is simply a brief summary of all the functions supported by the application.
Just for example, let us analyze one possibility in more detail. An Excel sheet is a collection of different BIFF records. Consider the very first entry in the specification - an entry under the name
1904 .
The specification about this record is written very vaguely. It is simply written that “record 1904 indicates whether the date system is 1904”. A classic example of useless specs. If you were a developer, and stumbled upon such an “explanation”, you would very reasonably have come to the conclusion that Microsoft is hiding something. Such a description is not enough by itself, you need to look for
additional information . I will explain: there are two types of Excel sheets. In some dates, dates start on 1/1/1900 (in them, the same
leap year
error was specifically created for compatibility with the 1-2-3 format), in others - on 1/1/1904. Excel supports both options - since its first version, for Mac, used the second version, which was systemic, and Excel for Windows should have the ability to import 1-2-3 files using dates from 1/1/1900. Already in this place you can cry.
Both types of files, 1900 and 1904, are found in abundance in the wild, depending on whether they came from Mac or Windows. Automatic date conversion can lead to errors, so Excel does not change the file type itself. To disassemble Excel files you have to work with both. And this means that you do not just need to load this bit from the file, but also rewrite all the code for parsing and displaying dates in order to process both options. This is a work for a few days.
While working on the Excel clone, you will find many such hidden details on working with dates. When does Excel convert numbers to dates? How does the formatting work? Why is 1/31 interpreted as January 31 of the current year, and 1/50 as January 1, 1950? All these details can not be described without the resulting description is not equal in volume to the source Excel.
And this is only one of hundreds of BIFF records, and one of the simplest. Most of them are so complex that they can make an adult programmer cry.
The only possible solution would be as follows. Of course, Microsoft did a great service by publishing file formats, but it won't be easier to import them or save them in them. These are insanely complex applications, and you
can’t just implement 20% of the most popular features and expect that 80% of other people will be happy . Binary specifications at best will save you a couple of minutes when reverse engineering a complex system.
But I promised to tell you what to do about it. Almost all popular applications do not need to read and write Office binaries. There are two alternatives: letting Office work itself, or using simpler file formats.
Let Office work by itself . Word and Excel have very complete object models that are available through COM Automation, so you can do everything in the program. In many cases, it is better to reuse code from Office instead of trying to re-write it. Examples:
- You have a web application that needs to output Word files in PDF format. I would do it this way: several lines on Word VBA load the file and save it to PDF through the mechanism built into Word 2007. This code can be called directly, even from ASP or ASP.NET, running under IIS. The first launch of Word will take a few seconds. Subsequently, Word will be in memory under control of the COM system. For ordinary web applications, this system works pretty quickly.
- The same task under Linux. Buy one Windows 2003 server, install the licensed Word on it, and set up a simple web service. Half-day work using C # and ASP.NET
- The same task but with the ability to scale. Set up load balancing before any required number of boxes from the second step. No programming required.
This approach will work with all common tasks related to Office. For example:
- Opening an Excel sheet, saving some data in cells, counting and outputting results.
- Use Excel to create graphs in GIF format
- Retrieving any information from an Excel file without parsing file formats
- Convert Excel file to CSV (another approach is to use Excel ODBC drivers and collect data through SQL queries)
- Editing Word Documents
- Filling out forms in Word
- Convert files between different formats that Office supports (there are possibilities to import dozens of formats of word processors and spreadsheets).
In all these cases, it is possible to explain to Office objects that they are not working interactively, so they do not need to refresh the screen and request user input. There are several pitfalls on this path, and it is not officially supported by Microsoft, so read the
relevant material before you begin.
Use simpler formats . If you just need to programmatically create documents for Office, there is almost always a better format, which can then be freely opened in Word or Excel.
- Use CSV to create tabular data and use it in Excel.
- CSV does not support calculations, so use the WK1 format inherited from Lotus 1-2-3 to use them. It is much easier than Excel, and opens them on time.
- If you really need to create native Excel files, take a very old version of it. 3.0 is a good choice, there were no composite documents. Save a minimal file in Excel 3.0 with only the functions you need, and use it as an example of a minimal BIFF record.
- For Word documents, use HTML.
- If you really need to create a file for Word with fashionable formatting, the easiest way to create an RTF. Everything that Word can do can be written in RTF, but this format is textual, so something in it can be changed by hands and it will work. You can create a beautifully formatted file in Word, save to RTF and use a simple text replacement.
In general, if you are not trying to create a competitor for Office that can read and write all files from Office (for which you will need thousands of hours of work), then attempts to read or write to its binary files will be the most expensive way for any task that you have a.