Hello!
This article opens a series on the security of Microsoft Office components. It focuses on data formats, encryption and obtaining symbol information.
When Microsoft conceived and developed its large-scale office suite, Microsoft Office, its creators probably hoped for success. It is hard to say whether they counted on its triumphal march around the world, on the product becoming a de facto standard, or on its lasting for decades. What can be said with confidence is that the sheer size of the applications and the number of man-hours spent on creating, developing and keeping the product components backward compatible produced a "heavy legacy": outdated code that remains at the core of the applications even in recent versions. The requirements placed on code have changed over the past twenty years. Today, cross-platform support, scalability and security are at the forefront. At the same time, significant changes to the product are so costly that Microsoft prefers the "if it isn't broken, don't touch it" approach and diligently maintains backward compatibility with the most ancient document formats. Pressure from commercial and government organizations also plays a part: they update their own technology slowly and reluctantly, preferring familiar tools to the detriment of progress and security.

Having rummaged through the wilds of the Microsoft Office file handlers, we are ready to present this small study.
Component Object Model and Data Storage
We will start from afar, with the Component Object Model (COM). As you probably know, Microsoft likes to build its products on this technology, and Microsoft Office is no exception.
COM is a standard that allows software to use services provided by other software, regardless of where that software is located (in the same process, in another process or on another computer) and how it is implemented (executable files, managed code or even scripting languages). The client program, called a container, does not need to know the implementation details of the service (or component). It only needs to know the component's class identifier and, if that class is registered, an ingenious marshaling system provides transparent, seamless and fairly reliable interaction.
This is how a container application can see heterogeneous components, regardless of their location.
In practice, "COM" usually refers not so much to the standard itself as to its implementation in the Windows family of operating systems. The first versions of COM were developed for 16-bit Windows as the basis for OLE (now ActiveX). These subsystems were originally created to make compound Word and Excel documents possible (!) on Windows 3.x, and they were released around 1991 (sources differ on the dates).
Imagine a C# application that uses several ActiveX controls (ActiveX controls are COM components designed to interact with the user; this definition is imprecise, but an exact one does not seem to exist). The container application loads an in-process component written in plain C to draw an image, a component in another process on the same computer to load a web page, and a form for entering string data that is drawn by a component running on another computer.

While working with the application, the user changes the image, the contents of the browser window and the text in the input fields. The application itself interacts with the components by invoking their methods and setting their properties, in other words, by changing the internal state of objects whose internal structure it knows nothing about.
At some point the user decides to save the work and presses the "Save" button. Our application now faces a daunting task: writing to disk a set of data in completely different formats, most of which (both the data and the formats) are unknown to the application and out of its reach! To solve this problem, Microsoft developed, alongside the Component Object Model itself, its native file format, the Compound File Binary Format (CFBF), together with a system of interfaces for working with this format and their implementations, collectively known as Structured Storage.
COM Structured Storage
The library interfaces IStorage and IStream and the corresponding APIs were developed to give applications and components uniform access to the complex and, moreover, closed CFBF format, and to allow one format to be replaced with another transparently for both the container and the components. The virtual data structure that an application accesses through these interfaces is a system of nested directories (Storages), each of which may contain a number of byte sequences (Streams) in which the data is stored.
CFBF (Structured Storage) virtual representation
Information in streams can be stored in any convenient form: text, an image in any format, encrypted or compressed data, or even other CFBF files. Executable code (including malicious code) can just as easily be placed in a stream.
Using the appropriate APIs (see the Structured Storage Reference on MSDN), an application can create a file-based storage and give each component a second-level (third-level, and so on) storage or one or more streams in which to save its state in any format. The container does not need to know in what form a component writes its data, and the standard library takes care of placing the information in the file. When a saved state needs to be loaded, the container opens the storage and lets the loaded components read the streams as needed.
Structured Storage was created with the following goals:
- eliminating the need for applications to save numerous separate files for different types of data, which also saves disk space
- creation of a unified interface for working with data, facilitating the creation of applications, and, in particular, COM components working with composite documents
- the ability to save the current state of the data item at any time
- data access acceleration
The last point deserves separate consideration. Structured Storage was developed at the dawn of COM (the early 1990s), when limited hardware resources placed high demands on the speed of complex systems, including reading and writing files on disk. The storage system therefore had to be optimized for speed as much as possible. This ruled out, for example, text formats that require significant preprocessing. Binary formats that allow data to be copied into memory with minimal changes were preferred instead.
The result was the on-disk implementation of Structured Storage, the Microsoft Compound File Binary Format. The format remained closed for a long time; the manufacturer published the specification in 2006.
The CFBF format is a “file system within a file” and has a file allocation table (FAT), a sector table, directories, and “streams” - an analogue of disk files.
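To make the "file system within a file" idea more tangible, here is a minimal Python sketch that checks the signature of a compound file and decodes a few fixed fields of its 512-byte header. The field offsets follow the published [MS-CFB] specification as we read it; the function name parse_cfbf_header and the choice of fields are our own illustrative assumptions, not part of any Microsoft API.

```python
import struct

# Magic bytes found at the start of every compound file.
CFBF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_cfbf_header(data: bytes) -> dict:
    """Decode a few fixed fields of a CFBF header (illustrative sketch only)."""
    if len(data) < 512 or data[:8] != CFBF_SIGNATURE:
        raise ValueError("not a Compound File Binary Format file")
    # Offsets 24..33: minor version, major version, byte order mark,
    # sector shift and mini sector shift (five little-endian 16-bit values).
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<HHHHH", data, 24
    )
    # Offsets 44..51: number of FAT sectors and first directory sector.
    fat_sectors, first_dir_sector = struct.unpack_from("<II", data, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_directory_sector": first_dir_sector,
    }
```

A typical call would pass the first 512 bytes of a legacy document read in binary mode. The sketch deliberately stops at the header and does not walk the FAT or the directory entries.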
Technical representation of the CFBF format
There are several versions of the CFBF format, all of which must be supported by the latest versions of the OS for backward compatibility. Although Structured Storage is largely a legacy and obsolete technology, it is the natural built-in data storage system of COM, and COM pervades the user-mode part of Windows. A significant number of applications, including Microsoft Office and many built-in Windows applications, were developed long ago and contain a large amount of code that must remain backward compatible, which makes it difficult to switch to modern, universal, open formats. As a result, the technology is firmly entrenched and is actively used in modern versions of Microsoft products.
Examples include:
- Shortcuts
- Image and Result Cache
- Installation files (msi and msp)
- Windows Notes
Compound File Binary Format in Microsoft Office applications
The document format used by Microsoft Office was also originally CFBF.
Word document opened in a Structured Storage viewer
Modern versions of the suite use the open, XML-based Office Open XML format as the primary one, but CFBF support has not been discontinued, in order to maintain compatibility. Note that a significant amount of the code responsible for working with the old document formats was written a long time ago (about 20 years ago).
Application | Extension | Description |
Word | .doc | Legacy Word document; Microsoft Office calls it "Microsoft Word 97-2003 Document" |
Word | .dot | Legacy Word template; officially "Microsoft Word 97-2003 Template" |
Word | .wbk | Legacy Word document backup; referred to as "Microsoft Word Backup Document" |
Excel | .xls | Legacy Excel worksheet; officially "Microsoft Excel 97-2003 Worksheet" |
Excel | .xlt | Legacy Excel template; officially "Microsoft Excel 97-2003 Template" |
Excel | .xlm | Legacy Excel macro |
PowerPoint | .ppt | Legacy PowerPoint presentation |
PowerPoint | .pot | Legacy PowerPoint template |
PowerPoint | .pps | Legacy PowerPoint slideshow |
Publisher | .pub | Microsoft Publisher publication |
Examples of obsolete but still supported Office formats
A simple search of the websites of government bodies and enterprises of the Russian Federation (government procurement, websites of administrative units) reveals a discouragingly large amount of official documentation published in CFBF, often created in ancient versions of Office such as Office 2003. We leave this experiment to the reader.
The use of CFBF in Microsoft Office applications is not limited to supporting legacy document formats that have modern XML equivalents. Microsoft Publisher still uses only CFBF documents. The .msg format of Outlook messages is also based on CFBF.
If an Office Open XML document includes OLE elements, their current state can be saved in CFBF files. In that case, the document will contain embedded binary Compound Files.
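Since an Office Open XML document is a ZIP package, such embedded Compound Files can be spotted by their signature. The following Python sketch lists the package parts that begin with the CFBF magic bytes. The function name find_embedded_cfbf is hypothetical, and the check is a simple heuristic that assumes the embedded storage is stored as a separate part of the package.

```python
import zipfile

# Magic bytes found at the start of every compound file.
CFBF_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def find_embedded_cfbf(ooxml_file) -> list:
    """Return the names of package parts that start with the CFBF signature.

    ooxml_file may be a path or a binary file-like object.
    """
    found = []
    with zipfile.ZipFile(ooxml_file) as package:
        for info in package.infolist():
            # Only the first 8 bytes of each part are needed for the check.
            with package.open(info) as part:
                if part.read(8) == CFBF_SIGNATURE:
                    found.append(info.filename)
    return found
```

In Word documents such parts are usually found under a path like word/embeddings/, but the sketch does not rely on that and inspects every part.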
Binary CFBF file inside an Office Open XML document
Although the documents of the various Office applications are all based on CFBF, each state storage of an OLE / ActiveX element has its own additional format on top of it. Keep in mind that these formats largely evolved historically and were optimized for maximum performance on weak computers.
Some published specifications of Office formats based on Structured Storage
OLE Storage Support in RTF
Rich Text Format is generally considered a fairly safe text-based markup format. However, Microsoft could not help including OLE / ActiveX support in its implementation. RTF documents in Microsoft applications can contain and display embedded OLE elements and must be able to save their current state. To do this, control words such as '\object', '\objclass' and '\objdata' were added to the format. They allow RTF documents to be supplemented with ActiveX controls registered in the system. The storage for the component is provided by the OLE subsystem, and the ActiveX executable code transparently uses the standard IStorage and IStream interfaces. Responsibility for document security rests with the container application, which may rely on legacy code and fail to take into account all the modern nuances of working with ActiveX.
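As a rough illustration of how embedded objects appear in an RTF file, the Python sketch below collects the class names that follow '\objclass' and decodes the hexadecimal payload that follows '\objdata'. The function name list_rtf_objects is our own, and the regular expressions are a deliberately naive assumption: real-world malicious documents use obfuscation tricks that a scanner this simple will miss.

```python
import re

# Class name follows \objclass up to the next brace or control word.
OBJCLASS_RE = re.compile(rb"\\objclass\s*([^{}\\]+)")
# Payload follows \objdata as hex digits, possibly split by whitespace.
OBJDATA_RE = re.compile(rb"\\objdata(?![a-zA-Z])([0-9a-fA-F\s]*)")


def list_rtf_objects(rtf: bytes):
    """Return (class names, decoded payloads) found in an RTF byte string."""
    classes = [
        m.group(1).strip().decode("ascii", "replace")
        for m in OBJCLASS_RE.finditer(rtf)
    ]
    payloads = []
    for m in OBJDATA_RE.finditer(rtf):
        hex_digits = re.sub(rb"\s+", b"", m.group(1))
        if len(hex_digits) % 2:
            # Drop a dangling half-byte instead of failing on odd input.
            hex_digits = hex_digits[:-1]
        payloads.append(bytes.fromhex(hex_digits.decode("ascii")))
    return classes, payloads
```

The decoded payload can then be examined further, for example to check whether it contains the signature of a compound file.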
Microsoft EQUATION formula display OLE component
An example of a COM Structured Storage vulnerability is CVE-2017-11882.

The vulnerability was discovered in a Microsoft Office component so ancient that the manufacturer had lost its source code.
The Formula Editor (Microsoft Equation Editor) used structured storage streams to save the state of its elements. Insufficient validation of the data in these streams led to numerous vulnerabilities, the first of which, CVE-2017-11882, was found by Embedi.
Equation Editor structured storage stream
Although by default the component is blocked from loading in .doc and .docx documents, it loaded successfully from .rtf files, allowing an attacker to execute malicious code.
After unsuccessful attempts to manually fix vulnerabilities in executable code in the absence of source code, Microsoft was forced to remove the Formula Editor component from the Office suite.
Some other legacy binary formats used in Microsoft Office
EPS graphic filter
The EPS graphic filter is the Office component responsible for handling EPS images. These images are vector-based and are rendered by interpreting a program written in Encapsulated PostScript (a version of regular PostScript with some restrictions).
PostScript is a rich language that supports a wide variety of constructs. Because of this, memory corruption vulnerabilities in the EPS graphic filter are fairly easy to exploit. The richness of the language makes it possible to use heap spraying (for example, by means of loops) and heap feng shui (thanks to the predictability of the interpreter's memory allocation). Even though the image is rendered on a virtual printer and the EPS program runs inside an isolated interpreter, the favorable conditions for exploitation and the old code base made EPS the most common attack vector against office applications.
The module was originally developed by Access Softek and later transferred to Microsoft, and a significant number of previously unknown (zero-day) vulnerabilities have been found and successfully exploited in it. For example, in April 2017 FireEye Inc. found the vulnerabilities CVE-2017-0261 and CVE-2017-0262. These two memory corruption vulnerabilities allowed attackers to build read / write primitives, with which they achieved execution of their code outside the isolated process (sandbox) of the PostScript interpreter. With such primitives, attackers can read and write arbitrary chunks of memory in the address space of the vulnerable process and can, for example, search for the ROP gadgets needed to build a ROP chain that makes the rest of the shellcode executable.
In both cases, the attackers achieved the execution of arbitrary code in a similar way: they created an object in memory with controlled content (it was possible to do this using R / W primitives) and called one of its methods using the PostScript function.
These vulnerabilities made the EPS graphic filter such a popular attack vector that in April 2017 Microsoft released an update that disables the filter completely. However, the patch applies only to Microsoft Office 2010 SP2 and later.
Access databases
The Microsoft Access database management system is a powerful tool for managing a relatively small amount of data, for example, the inventory of an organization. Access allows you to easily create reports based on information in the database. The application can also be used as a front-end for managing other DBMS, including Microsoft SQL Server (using ODBC drivers).
The application and the database format were developed a long time ago and have a number of architectural flaws:
- the use of VBA macros as a kind of triggers and stored procedures;
- the ability to use links to other databases;
- the closed database format, which prevents existing databases from being used in other environments.
The first drawback is very serious, since VBA macros are equivalent in their capabilities to ordinary executable files. For this reason, using Access can be a security issue.
The user must trust the database they work with and be sure that it does not contain malicious code planted by an attacker. On the other hand, prohibiting macro execution significantly reduces the application's functionality for working with data in tables and views.
Outlook Personal Folder Files
The Microsoft Office mail client uses its own storage file format for messages, the user's folder structure, attachments, the address book and so on. It is a multi-level format closely tied to the MAPI subsystem, which provides access to personal folder files through its own interfaces.
The Outlook Personal Folders (.pst) specification was published by the manufacturer:
https://msdn.microsoft.com/en-us/library/ff385210.aspx .
The .ost specification has not been published, and formally accessing the Offline Storage Table file can only be done through MAPI. In fact, these formats are very similar and editing .ost is also possible. It must be borne in mind that synchronization of the edited content of the Offline Storage Table file with the data on the Exchange server may lead to irreversible data corruption and loss of significant information.
OwnerFile
The problem of collaborating on Office documents stored on network shares was once solved with temporary files in a simple format, the so-called OwnerFile. If a document is currently locked for editing, the application looks in the same directory for a file with a name of the form "~$name.doc". This file contains the name of the user who opened the document, in both ASCII and Unicode, and in each case a fixed-size array is reserved for the name. When the file is created, the unused bytes of the arrays are filled with garbage values from the application's memory, which could potentially disclose sensitive information (although, given the small size of the file, this is unlikely). The username in the owner file is also easy to forge.
Microsoft Word message when trying to open a locked document
Office document encryption mechanism
Password protection of documents first appeared in Office 95. At that time little attention was paid to the strength of the encryption algorithms, and as a result the algorithms used were open to practical attacks. This was the impetus for changing the mechanism in later versions of the suite.
The table lists, in chronological order, the most common versions of the office suite currently in use and their default encryption algorithms.
Version | Hashing | Encryption |
Office 2003 | None | RC4 |
Office 2007 | SHA-1 x 50,000 | AES-128 |
Office 2010 | SHA-1 x 100,000 | AES-128 |
Office 2013 | SHA-512 x 100,000 | AES-128 |
Office 2016 | SHA-512 x 100,000 | AES-128 |
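The "x 50,000" and "x 100,000" entries in the Hashing column refer to the number of hashing iterations applied to the password, which slows down brute-force attacks. The Python sketch below shows this kind of iterated hashing as we understand it from the published [MS-OFFCRYPTO] specification: the salt and the UTF-16LE password are hashed once, and then each round hashes a 32-bit little-endian counter followed by the previous digest. The function name iterated_password_hash is our own, and the final steps that turn the digest into an actual AES key are deliberately omitted, so this is an illustration and not a working decryptor.

```python
import hashlib
import struct


def iterated_password_hash(
    password: str, salt: bytes, spin_count: int, algorithm: str = "sha1"
) -> bytes:
    """Iterated password hash in the style used by Office 2007 and later.

    Assumption: H0 = H(salt + password), Hn = H(counter + Hn-1),
    with the counter starting at zero. Key derivation is not included.
    """
    digest = hashlib.new(algorithm, salt + password.encode("utf-16-le")).digest()
    for counter in range(spin_count):
        digest = hashlib.new(
            algorithm, struct.pack("<I", counter) + digest
        ).digest()
    return digest
```

For instance, the Office 2007 row would correspond to a call with spin_count=50000 and the default SHA-1, and the Office 2013 row to spin_count=100000 with algorithm="sha512". The salt is stored in the document alongside the other encryption parameters.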
Despite the use of strong encryption algorithms, the document is not encrypted immediately after a password is set, but only the next time it is saved. Given the number of attacks that rely on the carelessness of office suite users, this is an important nuance.
Also, we must not forget that the implementation of cryptographic algorithms is a laborious task, even for highly skilled developers, so the presence of errors in them cannot be ruled out. A striking example of such an error affecting the protection of Excel documents is a key generation vulnerability fixed by patch MS15-110.
A few words about the sources of information for the researcher
If you decide to seriously look under the hood of Office, any additional information that helps you understand the purpose of data structures will be useful.
- The format specifications mentioned above, kindly (albeit under some pressure) published by Microsoft, are very useful in this regard. They can be found on the MSDN website:
https://msdn.microsoft.com/en-us/library/cc313105.aspx
Note that this documentation contains gaps and inaccuracies, so writing a document parser based on it alone is rather difficult. It does, however, provide information about many mysterious structures and identifiers, including their names.
- Visual Studio 2010 SDK
Among its header files there is a set for Microsoft Office that contains information about interfaces, types, enumerations and procedure call formats: https://www.microsoft.com/en-us/download/details.aspx?id=2680 (requires an installed VS2010).
- Office 97
In the executable files of modern versions of the suite, shared library procedures are imported by ordinal. In earlier versions of Office they have meaningful names that give an idea of their purpose. The ordinal numbers mostly match the modern ones (but there are differences!).
- Office 2010
Office 2010 executables contain run-time type information (RTTI), which makes it possible to recover class names and virtual function tables. Tools such as Class Informer can be used for this. Beginning with Office 2013, this information is encrypted.
- offparser.dll
A 64-bit Server 2003 component that contains simplified implementations of classes and interfaces. Unlike Office, symbol information (.pdb) for this module is available for download from the Microsoft server. It makes it possible to obtain method names, the internal structure of class instances and the names of CLSIDs and GUIDs.
- Outlook Express / Microsoft Mail
The mail client and mail subsystem included in the operating system. They contain a simplified version of MAPI (the Outlook mail subsystem). As with other Windows components, symbol information is available for download. Some of the code for these components is contained in the published Windows source code.