
Automating the conversion of Word files to other formats

Some government agencies produce their reports as doc files. Some do it by hand, others automatically. Imagine you have been assigned to process a ton of such documents. You may need to extract specific information or simply check the contents. All we need is the plain text, without graphics or pictures. Such data is, for example, easier to feed into a neural network for further analysis.

Here are some options available to an ordinary user:


It is the last of these options that will be discussed here.
A vbs script will come to our rescue. It can be called from the command line, and that can be done from any programming language.

Create a file converter.vbs

    Const wdFormatText = 2
    Set objWord = CreateObject("Word.Application")
    Set objDoc = objWord.Documents.Open(WScript.Arguments.Item(0), True)
    objDoc.SaveAs WScript.Arguments.Item(1), wdFormatText
    objWord.Quit

The first line specifies the format to convert to: 2 for txt, 17 for pdf.
The second line starts Word itself. Right after it you can add the following line:

    objWord.Visible = TRUE

This makes the Word window visible while the script runs. It is useful when an error occurs at some point and Word does not close by itself: without this line the process can only be killed through Task Manager, whereas with it we can simply click the close button.

From the command prompt, the script is run as follows:

    converter.vbs full_path_to_file\file_name.docx full_path_to_store\file_name_without_extension

WScript.Arguments.Item(0) is full_path_to_file\file_name.docx
WScript.Arguments.Item(1) is full_path_to_store\file_name_without_extension
Accordingly, the third line of the script opens the file, and the next line saves it in the specified format. Finally, we close Word.

One more small trick is needed. Sometimes, because of differing Word versions or for other reasons, Word may complain that the file is damaged. Opening such a file by hand shows a warning along the lines of "The table is corrupted. Continue opening the file?". All it takes is clicking "Yes", but by that moment the script has already stopped.

VBScript's counterpart of "try catch" is very clumsy. The problem can be worked around by adding just two lines. The complete, stable script looks like this:

    Const wdFormatText = 2
    Set objWord = CreateObject("Word.Application")
    objWord.Visible = TRUE
    On Error Resume Next
    Set objDoc = objWord.Documents.Open(WScript.Arguments.Item(0), True)
    Set objDoc = objWord.Documents.Open(WScript.Arguments.Item(0), True)
    objDoc.SaveAs WScript.Arguments.Item(1), wdFormatText
    objWord.Quit

As you can see, the call that opens the file is duplicated. If the file is fine, it is simply opened twice. If an error occurs, the script does not stop, and the second call carries on opening the file.
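Note that On Error Resume Next silences every error, not only the corrupted-table warning, so a failed conversion leaves no trace in the script itself. One way to notice it from the calling program is to check that the output file actually appeared. A minimal sketch in Python, assuming Word appends the ".txt" extension to the name it was given; the helper name find_failed is hypothetical:

```python
import os

def find_failed(pairs, extension=".txt"):
    """Return the source paths whose converted file does not exist.

    pairs: iterable of (source_path, target_path_without_extension).
    """
    failed = []
    for src, dst in pairs:
        # Assumption: Word added `extension` to the target name on save.
        if not os.path.isfile(dst + extension):
            failed.append(src)
    return failed
```

The returned list can be retried or inspected by hand with objWord.Visible switched on.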

And just in case, here is an example of what a Python function calling the script might look like:

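    import os

    folder_from = os.path.join(os.getcwd(), 'words')  # folder with the Word files
    folder_to = os.path.join(os.getcwd(), 'txts')     # folder for the converted files

    def convert(file_name):
        # Full path to the source file
        str1 = os.path.join(folder_from, file_name)
        # Full path to the result, without the extension
        str2 = os.path.join(folder_to, os.path.splitext(file_name)[0])
        # Quote both paths so that spaces in them do not break the command
        os.system('converter.vbs "' + str1 + '" "' + str2 + '"')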

Then we simply apply this function to every file that needs to be converted.
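Applying the conversion to a whole folder could be sketched as follows. This is only an illustration under stated assumptions: the script is started through cscript (the console host of Windows Script Host), converter.vbs lies in the current directory, and only .doc and .docx files are of interest. The names plan_conversions and convert_all are hypothetical.

```python
import os
import subprocess

WORD_EXTENSIONS = (".doc", ".docx")  # assumption: adjust to the files you have

def plan_conversions(folder_from, folder_to):
    """Return (source, target_without_extension) pairs for every Word file."""
    pairs = []
    for name in sorted(os.listdir(folder_from)):
        stem, ext = os.path.splitext(name)
        # Skip non-Word files and the "~$..." lock files Word leaves behind.
        if ext.lower() in WORD_EXTENSIONS and not name.startswith("~$"):
            pairs.append((os.path.join(folder_from, name),
                          os.path.join(folder_to, stem)))
    return pairs

def convert_all(folder_from, folder_to, script="converter.vbs"):
    """Run the converter once per file (Windows with Word installed)."""
    os.makedirs(folder_to, exist_ok=True)
    for src, dst in plan_conversions(folder_from, folder_to):
        # Passing a list of arguments avoids manual quoting of paths with spaces.
        subprocess.run(["cscript", "//nologo", script, src, dst])
```

Both folders should be given as absolute paths, for example os.path.join(os.getcwd(), 'words'), since the script hands them to Word unchanged.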

Summary


  1. This solution works for all Word formats.
  2. You spent no more than 10 minutes reading this article.
  3. You can implement it in any programming language you know.

Source: https://habr.com/ru/post/441736/

