
How to convert a pdf of scanned images to a text (txt) file

You will say that the easiest way is to select all the text in the pdf, copy it to the clipboard and paste it into a text file. And you will be right. But that is not our case. Our pdf file is the result of scanning a multi-page document. That is, the pdf content consists of images of text.

The proposed solution was built on Windows 8, but I think it can be used on Linux and OS X with minor adjustments.
Abbyy FineReader, MS Word and MS OneNote can all convert images to text. There are also sites that convert images online: http://www.ocrconvert.com
The proposed solution uses only free utilities. Being able to work from the command line was also a priority.

Convert all pdf pages to image files


If there were only 2-3 pages, you could take screenshots. On Windows there is a separate PrintScreen key on the keyboard for this. On Mac OS X there is a trickier key combination: press Shift + Command + 4, select the desired part of the screen with the mouse, and look for the resulting file on the desktop. But if there are many pages, you need another way.
Fortunately, the StduViewer program can do this: File → Export → As Image. In the window that appears, select PNG as the file type and 300 dpi as the resolution, and set the folder where the resulting image files should go. In the file name template, change %PN% to %0PN% in case there are more than 10 pages.
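
The padded page number matters because the later steps process the page files in name order, and plain string sorting puts page 10 before page 2. Here is a small Python illustration. The file names are made up for the example, and the number of digits that the viewer actually writes may differ:

```python
# Hypothetical file names, as an export without zero padding might produce
unpadded = ["page_1.png", "page_2.png", "page_10.png"]
# The same pages with two-digit, zero-padded numbers
padded = ["page_{:02d}.png".format(n) for n in (1, 2, 10)]

# String sorting compares character by character, so "10" sorts before "2"
print(sorted(unpadded))  # ['page_1.png', 'page_10.png', 'page_2.png']
# With padding, name order and page order agree
print(sorted(padded))    # ['page_01.png', 'page_02.png', 'page_10.png']
```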

kolgrim99 suggested converting the pdf document into jpg files with a utility from the xpdf package, which can be used on the command line. Here is his suggestion:
"If the task is simply to gut a large PDF file of scans (or any other pictures), you can use a utility from the xpdf set. There are many of them, but for pictures you need pdfimages.exe. The syntax is something like this:
pdfimages.exe -j some_file.pdf C:\images\

In the last argument you must put '\' at the end of the path, otherwise it will not be accepted."
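
The same call can be made from Python, which fits the scripts used later in this article. This is only a sketch: the helper name and the paths are made up for illustration, and it assumes pdfimages.exe can be found on the PATH. The last argument of pdfimages is a prefix for the output file names rather than a plain directory, which may be why the trailing '\' is needed.

```python
import subprocess

def build_pdfimages_cmd(pdf_path, out_dir, exe="pdfimages.exe"):
    """Build the argument list for pdfimages (hypothetical helper)."""
    # Make sure the output path ends with a separator, as the quote advises
    if not out_dir.endswith(("\\", "/")):
        out_dir += "\\"
    # -j saves JPEG-compressed images from the PDF as .jpg files
    return [exe, "-j", pdf_path, out_dir]

cmd = build_pdfimages_cmd("some_file.pdf", "C:\\images")
print(cmd)
# To actually run it (requires xpdf to be installed):
# subprocess.run(cmd, check=True)
```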

Converting the page images to text


Tesseract is an open-source library for converting images to text (OCR). It was originally developed at HP, and its development was later supported by Google. Install the tesseract-ocr program.
To recognize Russian, check the box for Russian under "Additional language data" during installation.

On the command line, run a command like:

tesseract.exe image_01.png res_01 -l rus

Tesseract appends .txt to the output name itself, so this produces the text file res_01.txt. You can run the command manually for each page, but it is easier to run a Python script:

    import os
    import subprocess

    sTesseract = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"
    sPathIn = "D:/Pictures/pict"
    sPathOut = "D:/Pictures/txt"

    # Process the page images in name order
    for file in sorted(os.listdir(sPathIn)):
        filename, file_ext = os.path.splitext(file)
        # Tesseract appends ".txt" to the output name itself
        cmd = [sTesseract, sPathIn + "/" + file,
               sPathOut + "/" + filename, "-l", "rus"]
        print("run> " + " ".join(cmd))
        # An argument list avoids quoting problems with spaces in paths
        subprocess.run(cmd)

The result is a bunch of text files that still have to be merged into one. This can be done by hand, but it was easier to write a Python script:

    import os

    sPathIn = "D:/Pictures/txt"
    sFileOut = "D:/Pictures/res.txt"

    # "wb" starts a fresh result file, so re-running does not duplicate text
    with open(sFileOut, "wb") as fOut:
        # Append the page files in name order
        for file in sorted(os.listdir(sPathIn)):
            filename, file_ext = os.path.splitext(file)
            if file_ext == ".txt":
                with open(sPathIn + "/" + file, "rb") as f:
                    fOut.write(f.read())

One could stop here, because most of the text came out quite readable, but in some places it contained a mass of typos.
For example, a page image with text was converted into something like this:
management of the simulation process, including by temporarily interrupting, intermediate storing and restarting the simulation process from a suspended state, setting various initial conditions, introducing on-board system failures, weather conditions, time-runs, various disturbing factors (wind, turbulence, etc.);

So one more stage was needed.

Correction of errors in the text


We use the LanguageTool program. We are interested in working from the command line, so download the "standalone version". LanguageTool requires Java.

I ran it from its own directory (on Windows 8.1, for some reason, it refused to work when the current directory was a different one) and specified full file names, including the directory. If you run a command on the command line, for example this one:

 java -Dfile.encoding=UTF-8 -jar languagetool-commandline.jar --help 

... then an additional console window opens, dutifully prints the help, and closes again within a second. To see what it writes to the console, you need to put this line into a bat file and run that. Perhaps java has some command-line option that prevents the extra console from opening, but I do not know of one.

The command to correct errors in a text file is as follows:

 java -Dfile.encoding=UTF-8 -jar languagetool-commandline.jar -a -l ru original.txt > corrected.txt 

To stop it from changing lowercase letters to uppercase at the beginning of lines, I added the parameters --disablecategories CASING, and replaced the file name with %1 so that the name is passed to the bat file as an argument. The line in the bat file then looks like this:

 java -Dfile.encoding=UTF-8 -jar languagetool-commandline.jar -a -u --disablecategories CASING -l ru %1 > %1-res.txt 
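
If you would rather stay in Python than write a bat file, the same command can be started with subprocess and its output written to a file. This is a sketch under assumptions: the helper names are invented for this example, java is assumed to be on the PATH, and the jar path and directory must point to your own LanguageTool installation. Only the options already shown in this article are used.

```python
import subprocess

def build_languagetool_cmd(jar_path, text_path):
    """Return the java command line as a list (hypothetical helper)."""
    return ["java", "-Dfile.encoding=UTF-8", "-jar", jar_path,
            "-a", "-u", "--disablecategories", "CASING",
            "-l", "ru", text_path]

def correct_file(jar_path, text_path, lt_dir):
    """Run LanguageTool and save its output next to the input file."""
    out_path = text_path + "-res.txt"
    with open(out_path, "wb") as out:
        # cwd=lt_dir starts java from the LanguageTool directory
        subprocess.run(build_languagetool_cmd(jar_path, text_path),
                       stdout=out, cwd=lt_dir, check=True)
    return out_path

# Show the command that would be run, without running it
print(" ".join(build_languagetool_cmd("languagetool-commandline.jar",
                                      "original.txt")))
```

Because the output goes straight into a file, nothing is lost when a console window closes. Pass full paths for the jar and the text file, since java is started from lt_dir.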

With the -u argument, the string "Unknown words:" is added to the end of the corrected text file, followed by a comma-separated list of all the words that LanguageTool does not know. You can then improve the text further by fixing the misspelled words from this list.
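
To work through that list, it can help to pull it out of the result file. The sketch below assumes that the list follows the last "Unknown words:" marker and that the words are separated by commas, possibly wrapped in square brackets. Check this against your own output, since the exact layout may differ between LanguageTool versions. The function name and the sample text are made up for this example.

```python
def extract_unknown_words(text):
    """Return the words listed after the last "Unknown words:" marker."""
    marker = "Unknown words:"
    pos = text.rfind(marker)
    if pos == -1:
        return []
    tail = text[pos + len(marker):].strip()
    # Drop surrounding square brackets if present (assumed layout)
    tail = tail.strip("[]")
    return [w.strip() for w in tail.split(",") if w.strip()]

# Made-up sample of a corrected file with the list at the end
sample = "Corrected text here.\nUnknown words: [smulation, turbulense]\n"
print(extract_unknown_words(sample))  # ['smulation', 'turbulense']
```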

Python 3.5 and PyCharm were used.
Thanks for your attention!

Source: https://habr.com/ru/post/314274/

