Checking extracts from the Unified State Register of Legal Entities for unreliability records by merging the PDFs with Python

A very topical issue right now is that the tax authority can exclude a company from the Unified State Register of Legal Entities simply by “revealing” so-called unreliable information about it. According to statistics from September 2018, the Federal Tax Service had excluded 90,000 organizations that carried a record of unreliable information about the director, founder, or legal address. The only way to find out that such a record exists is to look at the company's extract from the Unified State Register of Legal Entities.

In the extract it looks something like this (screenshot in the original post).
The problem is aggravated by the fact that a record of unreliability can appear both at the request of an interested party and “by itself”, as a result of actions by the tax authority. To protect yourself from sudden exclusion from the register, you need to obtain an extract regularly. We covered how to do this quickly and painlessly for a holding with a large number of companies in the previous post.

This time, we look at how to search the extracts for records of unreliable information.

We assume that we have some number n of extracts downloaded from the FTS website. The extracts have the .pdf extension and arbitrary names.

All that is required of us is to search each PDF file for the word “unreliable” (in the Russian-language extracts, the stem «недостоверн»).

Opening every PDF extract and searching it by hand is not our method: it can take a lot of time. You could merge all the files in ABBYY FineReader, but that takes a fair amount of time too.

Let's write a program that will merge all pdf files into one. Python allows you to do it in seconds!

Afterwards, we can open the resulting file and search for the required word across all extracts from the Unified State Register at once.

Let's start.

* The extracts from the Unified State Register are located in the C:\1 directory.
In a new Python file, we import the modules for working with PDFs and with the operating system:

import PyPDF2, os

Next, we create an empty list and change to the C:\1 directory, which holds all of our extracts.

The directory may contain other files too; the program processes only those with the .pdf extension:

    pdfFiles = []
    os.chdir('C:\\1')
    for filename in os.listdir('.'):
        if filename.endswith('.pdf'):
            pdfFiles.append(filename)
    pdfFiles.sort()
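The same selection step can also be sketched with pathlib. This is a minimal variant under stated assumptions, not the code from the article: the helper name `collect_pdfs` and its `exclude` parameter are hypothetical, and it assumes you also want to match upper-case `.PDF` extensions and to skip a previously generated all.pdf so that a second run does not merge the output into itself.

```python
import tempfile
from pathlib import Path


def collect_pdfs(directory, exclude=("all.pdf",)):
    """Return the sorted names of PDF files in `directory` (hypothetical helper)."""
    excluded = {name.lower() for name in exclude}
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        # Keep regular files only, match ".pdf" case-insensitively,
        # and skip the merged output so it is not merged into itself.
        if entry.is_file()
        and entry.suffix.lower() == ".pdf"
        and entry.name.lower() not in excluded
    )


# Demo in a throwaway directory with made-up file names;
# in the article the directory would be C:\1.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt", "all.pdf"):
        (Path(tmp) / name).touch()
    found = collect_pdfs(tmp)

print(found)  # ['a.PDF', 'b.pdf']
```

The returned list can be used in place of `pdfFiles`, provided the working directory is the one that was scanned.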

The next block merges the extracts, appending each one to the end of the result:

    pdfWriter = PyPDF2.PdfFileWriter()
    # Loop through all the PDF files.
    for filename in pdfFiles:
        pdfFileObj = open(filename, 'rb')
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # Loop through all the pages and add them.
        for pageNum in range(0, pdfReader.numPages):
            pageObj = pdfReader.getPage(pageNum)
            pdfWriter.addPage(pageObj)
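Because the extracts are appended one after another, a hit on some page of the merged file has to be traced back to the company it belongs to. A minimal sketch of that bookkeeping follows. It is an illustration rather than part of the original program: the function name `source_of_page` is hypothetical, the file names and page counts are invented, and in the loop above the counts would presumably come from `pdfReader.numPages` for each file.

```python
from bisect import bisect_right
from itertools import accumulate


def source_of_page(page_counts, merged_page):
    """Map a 1-based page of the merged file to (filename, page within that file).

    `page_counts` is a list of (filename, number_of_pages) in merge order.
    """
    names = [name for name, _ in page_counts]
    # ends[i] is the last merged page number that belongs to file i.
    ends = list(accumulate(count for _, count in page_counts))
    total = ends[-1] if ends else 0
    if not 1 <= merged_page <= total:
        raise ValueError("page number is outside the merged document")
    # First file whose last page is not before the requested page.
    index = bisect_right(ends, merged_page - 1)
    start = ends[index - 1] if index else 0
    return names[index], merged_page - start


# Invented example data: three extracts with 3, 2, and 4 pages.
counts = [("alpha.pdf", 3), ("beta.pdf", 2), ("gamma.pdf", 4)]
print(source_of_page(counts, 4))  # ('beta.pdf', 1)
```

Recording the page count of each file during the merge is enough to answer this question later, without reopening the individual extracts.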

It remains only to save the result:

    pdfOutput = open('all.pdf', 'wb')
    pdfWriter.write(pdfOutput)
    pdfOutput.close()

So, once the program has run, we get the file all.pdf, which can be searched for records of unreliable information.
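If you would rather not search by eye, the check itself can be sketched in a few lines. The example below is an assumption-laden illustration, not the author's program: it presumes the text of each extract has already been pulled out with whatever PDF library you use, the function name `find_flagged` is hypothetical, the default stem «недостоверн» is an assumption about the wording of Russian-language extracts, and the sample strings are invented.

```python
def find_flagged(texts, stem="недостоверн"):
    """Return {filename: occurrence count} for texts that contain `stem`.

    `texts` maps a filename to text already extracted from that PDF.
    The default stem is an assumption; the match is case-insensitive.
    """
    needle = stem.casefold()
    hits = {}
    for filename, text in texts.items():
        count = text.casefold().count(needle)
        if count:
            hits[filename] = count
    return hits


# Invented sample texts standing in for extracted PDF content.
samples = {
    "company_a.pdf": "Сведения о юридическом адресе недостоверны",
    "company_b.pdf": "Сведения о регистрации",
}
hits = find_flagged(samples)
print(hits)  # {'company_a.pdf': 1}
```

Matching on a stem rather than a full word catches the different grammatical endings of the word. Text extraction from PDFs is not always reliable, so a manual search in all.pdf remains a useful cross-check.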

The program for merging PDFs into one file can be downloaded here.

Source: https://habr.com/ru/post/456060/
