
What is written with a pen, or how to check documents in MS Office formats

One of the tasks that comes up again and again when building applications is generating documents in one of the popular formats. There are several common ways to get there, from plugging in a ready-made library to painstakingly studying the format specification and writing the necessary code yourself. Whichever option you choose, it is worth checking that the resulting document will open and edit cleanly in the standard tools. Some methods of such verification are discussed below.



Let me say up front that by checking I mean checking that the result conforms to the standard of the chosen format. Conformance makes it possible to assert with reasonably high probability that the document will at least open in Word or Excel without unpleasant reports of problems and an offer to try to repair the damaged file. However serious the application under development and however extensive its set of supported file types, that set is likely to include the standard Office Open XML formats, better known as docx and xlsx, as well as their binary predecessors, doc and xls. I will talk about some of the mechanisms for working with them.

Walking with dinosaurs


Although Microsoft has published the official specifications of the doc and xls formats on its website, their descriptions are often terse. Keep in mind that the formats have changed significantly over time (while maintaining backward compatibility), and that the same storage approach was used not only for Word documents but also for Excel spreadsheets and PowerPoint presentations. Simply put, even with the manuals in hand, opportunities to feel like Georges Cuvier, reconstructing the appearance of a mysterious creature from scattered bones, abound. Fortunately, there is a way to localize the problems in a document without having to wade through byte porridge in your favorite hex editor.

That way is the Microsoft Office Binary File Format Validator. It is a fairly simple command-line utility that ships with three dlls (one for each supported format: doc, xls, ppt). Although only the beta version has been available on the site since the first official announcement, the tool works well and copes with its tasks. The only significant problem I ran into while working with the validator is the lack of support for Cyrillic file names. To run the utility, just enter the command

bffvalidator.exe [-l log.xml] filename.ext

where filename.ext is the name of the file being examined, and -l log.xml is an optional parameter indicating where to save the log (by default, the log is written to the same folder as the document being scanned).
To make life easier and cut down on routine actions, I use two scenarios for working with the validator. For checking a single file, Far Manager is convenient: create a separate folder, for example c:\Temp\Bff, put the validator and the accompanying dll files there, and then register the validator command via F9 - Commands - File Associations.




After that, checking a suspicious file takes literally a couple of keystrokes. The other scenario that makes sense is to have the application generate a set of test files and then run the check over the whole set with some simple code, for example:

    using System;
    using System.Diagnostics;
    using System.IO;

    public class FileFormatValidationFailedException : Exception
    {
        public FileFormatValidationFailedException(string msg) : base(msg) { }
    }

    // A method of the test-harness class; StageName and outputManager are members of that class.
    public void RunBFFValidator(string filePath)
    {
        string fileName = Path.GetFileName(filePath);
        string workingDirectory = Path.GetDirectoryName(filePath);
        // BFFValidator.exe is expected to sit next to the running executable.
        string startupPath = Path.GetDirectoryName(Process.GetCurrentProcess().MainModule.FileName);

        StageName = String.Format("RUNNING BFFValidator for FILE {0}", fileName);
        outputManager.BeginWriteInfoLine(String.Format("Running BFFValidator for saved file '{0}'", fileName));

        ProcessStartInfo startInfo = new ProcessStartInfo(Path.Combine(startupPath, "BFFValidator.exe"));
        startInfo.Arguments = string.Format("-l bfflog.xml \"{0}\"", Path.Combine(workingDirectory, fileName));
        startInfo.WorkingDirectory = workingDirectory;
        startInfo.WindowStyle = ProcessWindowStyle.Hidden;

        using (Process validatorProcess = Process.Start(startInfo))
        {
            validatorProcess.WaitForExit();
            // A non-zero exit code is treated as a failed check; the log becomes the exception message.
            if (validatorProcess.ExitCode != 0)
            {
                using (StreamReader reader = new StreamReader(Path.Combine(workingDirectory, "bfflog.xml")))
                {
                    string logContent = reader.ReadToEnd();
                    throw new FileFormatValidationFailedException(logContent);
                }
            }
        }
    }
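The method above checks a single file. Running it over the whole generated set could look roughly like the sketch below. The BatchValidation class, the ValidateAll name and the idea of passing the check as a delegate are my own illustration and not part of the validator; the sketch catches any exception to stay self-contained, whereas in the harness above you would probably catch only FileFormatValidationFailedException.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BatchValidation
{
    // Calls validateFile for every file in the directory whose extension is in the list
    // (for example ".doc", ".xls", ".ppt") and returns the failures as path -> error text.
    // Hypothetical helper: names and shape are illustrative only.
    public static Dictionary<string, string> ValidateAll(
        string directory, IEnumerable<string> extensions, Action<string> validateFile)
    {
        var wanted = new HashSet<string>(extensions, StringComparer.OrdinalIgnoreCase);
        var failures = new Dictionary<string, string>();

        foreach (string path in Directory.EnumerateFiles(directory))
        {
            // Compare the extension exactly, so ".xls" does not pick up ".xlsx" files.
            if (!wanted.Contains(Path.GetExtension(path)))
                continue;

            try
            {
                validateFile(path);
            }
            catch (Exception ex)
            {
                // Keep going: one broken file should not hide problems in the others.
                failures[path] = ex.Message;
            }
        }

        return failures;
    }
}
```

A call such as BatchValidation.ValidateAll(outputFolder, new[] { ".doc", ".xls", ".ppt" }, RunBFFValidator), where outputFolder is whatever folder your application writes the test files to, then gives one report for the whole set instead of stopping at the first failure.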

Part two of the Merlaison Ballet


The situation with Office Open XML is much simpler. If you open several such files in a hex editor, you will see the initials of Phil Katz, the bytes 50 4B ("PK"), at the very beginning of the data.
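The same glance at the first bytes can be done in code, which is handy for deciding which of the two checks to run on a file. The sketch below is only an illustration: the ContainerSniffer and OfficeContainer names are mine, and a matching signature says nothing about whether the rest of the file is valid. The second signature is the header of the Compound File Binary container used by the old doc, xls and ppt formats.

```csharp
using System;
using System.IO;

public enum OfficeContainer { Unknown, Zip, CompoundFile }

public static class ContainerSniffer
{
    // "PK" 03 04: local file header that starts a ZIP archive (docx, xlsx, pptx).
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    // Header signature of the Compound File Binary format (doc, xls, ppt).
    private static readonly byte[] CompoundSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Reads up to eight bytes from the stream and compares them with the known signatures.
    public static OfficeContainer Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break; // the stream is shorter than eight bytes
            read += n;
        }

        if (StartsWith(header, read, ZipSignature)) return OfficeContainer.Zip;
        if (StartsWith(header, read, CompoundSignature)) return OfficeContainer.CompoundFile;
        return OfficeContainer.Unknown;
    }

    private static bool StartsWith(byte[] data, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

For a file on disk, something like using (var fs = File.OpenRead(path)) { var kind = ContainerSniffer.Detect(fs); } is enough.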


The PK signature means that the files are ZIP archives renamed to docx / xlsx, which can be opened to reveal a quite readable structure. Even so, there is no need to hunt manually for discrepancies with the documentation from Redmond: the analysis can be entrusted to tools designed specifically for the job. Download the Open XML SDK 2.5 and install it (we will need OpenXMLSDKV25.msi and OpenXMLSDKToolV25.msi). After that, you can write an application that checks files for invalid markup (you will need a reference to DocumentFormat.OpenXml.dll). The simplest code for analyzing documents is as follows:

    using System;
    using System.IO;
    using System.Text;
    using DocumentFormat.OpenXml;
    using DocumentFormat.OpenXml.Packaging;
    using DocumentFormat.OpenXml.Validation;

    // A method of the same test-harness class; StageName and outputManager are its members.
    public void RunOpenXmlValidation(string filePath, string openXmlFormatVersion)
    {
        string fileName = Path.GetFileName(filePath);
        StageName = String.Format("RUNNING OpenXmlValidation for FILE {0}", fileName);
        outputManager.BeginWriteInfoLine(String.Format("Running OpenXmlValidation for saved file '{0}'", fileName));

        // Open the document read-only; validation does not modify it.
        using (WordprocessingDocument wordDoc = WordprocessingDocument.Open(filePath, false))
        {
            // Office 2010 is the default target unless another version is requested.
            FileFormatVersions formatVersion = FileFormatVersions.Office2010;
            if (openXmlFormatVersion == "office2007")
                formatVersion = FileFormatVersions.Office2007;
            else if (openXmlFormatVersion == "office2013")
                formatVersion = FileFormatVersions.Office2013;

            OpenXmlValidator validator = new OpenXmlValidator(formatVersion);
            var errors = validator.Validate(wordDoc);

            StringBuilder builder = new StringBuilder();
            foreach (ValidationErrorInfo error in errors)
            {
                // Part, Path and Node may be missing for some errors, hence the null checks.
                string errorMsg = string.Format("{0}: {1}, {2}, {3}",
                    error.ErrorType, error.Part?.Uri, error.Path?.XPath, error.Node?.LocalName);
                builder.AppendLine(errorMsg);
                builder.AppendLine(error.Description);
            }

            string logContent = builder.ToString();
            if (!string.IsNullOrEmpty(logContent))
                throw new FileFormatValidationFailedException(logContent);
        }
    }
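The listing above only opens Word documents. Since the same validator accepts any Open XML package, a variant that also covers xlsx and pptx might look like the sketch below. The OpenXmlChecks class and its method names are my own, and choosing the package type by file extension is an assumption made for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Validation;

public static class OpenXmlChecks
{
    // Returns one line per validation error; an empty list means the validator found nothing.
    // Hypothetical helper: names and the extension-based dispatch are illustrative only.
    public static List<string> Validate(string filePath, FileFormatVersions version)
    {
        using (OpenXmlPackage package = OpenPackage(filePath))
        {
            var validator = new OpenXmlValidator(version);
            // ToList() forces enumeration while the package is still open.
            return validator.Validate(package)
                .Select(e => string.Format("{0}: {1}", e.ErrorType, e.Description))
                .ToList();
        }
    }

    // Picks the package class by extension; all three are opened read-only.
    private static OpenXmlPackage OpenPackage(string filePath)
    {
        switch (Path.GetExtension(filePath).ToLowerInvariant())
        {
            case ".docx": return WordprocessingDocument.Open(filePath, false);
            case ".xlsx": return SpreadsheetDocument.Open(filePath, false);
            case ".pptx": return PresentationDocument.Open(filePath, false);
            default:
                throw new NotSupportedException("Unsupported extension: " + filePath);
        }
    }
}
```

Returning the messages instead of throwing leaves it to the caller to decide whether any error should fail the test run.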

We need to go deeper...


The validation methods discussed above should be treated as a way to quickly find where the problems in a document might be, not as an absolute guarantee that everything is in order. It is quite possible for the validator to report an error while Word or Excel opens the file normally, and vice versa: a document that passed validation may fail to open. So if a more reliable check is needed, you cannot do without COM. This approach requires Microsoft Office to be installed, is not thread-safe and needs some extra steps on x64, but it lets you make sure that the document meets the requirements of MS Office and trace its structure from the point of view of the target platform.
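As a rough illustration of the COM route, the sketch below asks Word itself to open a file through the Primary Interop Assemblies. It assumes Word is installed, that the project references Microsoft.Office.Interop.Word, and that the call is made from an STA thread; the WordComCheck class and TryOpen method are names I made up. I also assume that Word reports a file it cannot open as a COMException, which is how it has behaved for me, so treat that as something to verify on your own setup.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;
using Word = Microsoft.Office.Interop.Word;

public static class WordComCheck
{
    // Returns null if Word opened the document, otherwise the error text reported by Word.
    // Hypothetical helper; requires an installed copy of Microsoft Word.
    public static string TryOpen(string filePath)
    {
        Word.Application app = null;
        Word.Document doc = null;
        try
        {
            app = new Word.Application();
            app.Visible = false;
            app.DisplayAlerts = Word.WdAlertLevel.wdAlertsNone; // no dialogs during the check

            doc = app.Documents.Open(
                FileName: Path.GetFullPath(filePath), // Word resolves relative paths on its own
                ConfirmConversions: false,
                ReadOnly: true,
                AddToRecentFiles: false,
                OpenAndRepair: false);                // do not let Word silently fix the file
            return null;
        }
        catch (COMException ex)
        {
            return ex.Message;
        }
        finally
        {
            // Always close the document and quit Word, otherwise WINWORD.EXE stays running.
            if (doc != null)
            {
                ((Word._Document)doc).Close(SaveChanges: false);
                Marshal.ReleaseComObject(doc);
            }
            if (app != null)
            {
                ((Word._Application)app).Quit(SaveChanges: false);
                Marshal.ReleaseComObject(app);
            }
        }
    }
}
```

Starting Word for every file is slow, so for a large test set it may be worth keeping one Application instance alive and opening the documents one after another on the same thread.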

Extracurricular reading


If you need a more serious analysis of a file with an in-depth understanding of its structure, you can turn directly to the documentation for the Open XML SDK and for the formats themselves.

Also, when examining the internal structure of documents, the OffVis utility can help.

A few useful links on interacting with the Office applications: Primary Interop Assemblies (PIAs), the Microsoft.Office.Interop.Excel namespace, and the Microsoft.Office.Interop.Word namespace.

I hope that these useful tools from Microsoft will save you time and nerves. Thanks for your attention!

Source: https://habr.com/ru/post/270205/

