
Automating the retrieval of EGRUL extracts with Free Pascal



In my (legal) work I am ready to automate everything that can be automated. But until neural-network-powered robots from Herman Gref's utopia arrive and take over the work of ordinary lawyers, routine will remain our main companion for a long time. Automating that routine is what I have been doing on and off for the past few years, whether with Excel spreadsheets full of formulas that let you quickly print hundreds of similar documents merged into Word, or with automatically generated reports. But some things cannot be done with formulas and substitutions alone. This is where programming comes in. I have been fond of it since childhood, and it so happened that I started with Delphi. Even now it is easier for me to knock out a quick project in Lazarus with Free Pascal than in C# or Python, which I began learning only recently. And yes, I quite seriously believe that this environment is more than capable enough. So, as you have guessed, we will automate EGRUL extracts with Pascal.

A lawyer at a consulting firm that serves dozens of legal entities, a freelance corporate lawyer, or any other lawyer who supports the day-to-day activities of organizations: they all know how easily dozens or hundreds of names, TIN (INN) numbers and OGRN numbers get mixed up, and how easy it is to forget who manages which company, when that manager's term of office is due for renewal, and whether there are any problems with the shares in an LLC or with the payment of its charter capital. On top of that, having to quickly produce a document full of constantly changing details leads to periodic errors and typos. To automate exactly these processes I needed a solution with a database that can generate documents from templates, keep various registers, track changes and never miss a deadline. One of the life simplifications it needs is quick retrieval of a fresh extract from the Unified State Register of Legal Entities (EGRUL) from the Federal Tax Service website. Of course, nobody claims that using the site directly is slow or difficult, but you will agree that clicking a single button without leaving the application is much more pleasant, and you can do it without interrupting a phone call (or a cup of coffee).

So, first let's decide what we want to get. The site lets you search the official EGRUL database by a unique OGRN or TIN and returns one matching result: a brief summary of the person and a link to download a PDF file with the extract. The search can also be fuzzy, by name, with an optional filter by region (subject of the Russian Federation). In that case the site produces a table of all matching persons with the same data set, including the PDF links.

This means that in the specific case the finished function must return the PDF as a file (or better, a stream), given the person's OGRN or TIN as input. But for generality and room for future extension we will not ignore the other capabilities of the site, and will also write a fuzzy search function that returns the data set found by organization name, with or without a region filter. Let's try to describe the interface for these functions:

    IEGRULstreamer = interface
      procedure GetExtractByOGRN(OGRN: string; X; isLegal: boolean; var Extract: TStream);
      procedure GetLegalsListByName(Name, Region: string; X; var LegalsList: TCollection);
    end;

To understand what the mysterious parameter X is, and what kind of collection the second function returns, let's look at exactly how the site executes a request.

1. The site contains a form with input fields for the search identifiers and a captcha check.



2. The captcha is built from a pre-generated hidden field called captchaToken; a JavaScript snippet uses this token to generate the captcha image.

3. Clicking the "find" button sends a POST request to the server, which responds with JSON containing an array of objects. Another JavaScript snippet uses this JSON response to fill the table that we see in the search results.

So, the first snag is the captcha check. To avoid burdening the methods that talk to the site with unrelated functionality, we move captcha handling into a separate function. X then becomes a parameter for a callback method that takes a stream with the captcha image as input and returns a string with the recognized captcha:

    TCapthcaRecognizeFunc = function(Captha: TStream): string of object;
    ...
    procedure GetExtractByOGRN(OGRN: string; CaptchaFunc: TCapthcaRecognizeFunc;
      isLegal: boolean; var Extract: TStream);

The captcha handler can work any way it likes: let the user type the answer manually, send the image to a paid automatic recognition service, or recognize it on its own with some unique know-how algorithm. To keep the picture simple, and since I do not expect captchas on an industrial scale, we choose the first option:

    function TForm1.RecognizeFunc(captcha: TStream): string;
    begin
      CaptchaImg.Picture.LoadFromStream(captcha);
      Result := InputBox('','    ', '');
    end;

The second question is the contents of the server's JSON response. Here is an example of what it contains:

The response, formatted as JSON
    {
      "query": {
        "captcha": "382915",
        "ogrninnfl": null,
        "fam": null,
        "nam": null,
        "otch": null,
        "region": null,
        "ogrninnul": null,
        "namul": "",
        "regionul": "73",
        "kind": "ul",
        "ul": true,
        "searchByOgrn": false,
        "nameEq": false,
        "searchByOgrnip": true
      },
      "rows": [
        {
          "T": "ED346E713D4A1AC851F9B589C6D2AECD1D809D5B6B5D1B98E697B6E0FD873E137B828AC59A60D159BB2894F11D00AB5639E2ACEE4E2ED5B7AC7A6EFE28FD987BC288B93C4D3D3EC1008DA0F128BA7E5E",
          "INN": "7325001144",
          "NAME": "  ",
          "OGRN": "1027301175110",
          "ADRESTEXT": "432017,  ,  ,  , 1",
          "CNT": "4",
          "DTREG": "03.12.2002",
          "KPP": "732501001"
        },
        {
          "T": "2ECB284C7682E5F1D1129AA3074FABB4B74BB28EA426AF79C091CEDEA0D9E391CA26FF405A7C9742466E19C78FBE5A59BDCBCD21268FFD8AFD3A8509CCA84541",
          "INN": "7303007375",
          "NAME": "      \"   \"",
          "OGRN": "1027301173283",
          "ADRESTEXT": "432063,  ,  ,   , 7",
          "CNT": "4",
          "DTREG": "27.11.2002",
          "KPP": "732501001",
          "DTEND": "01.09.2010"
        }
      ]
    }


As you can see, the result contains a "query" object, which holds the original search parameters (so that they stay in the form fields for reuse), and a "rows" array of objects. JavaScript builds the link to the PDF file by concatenating the string
    "https://egrul.nalog.ru/download/"
with the value of the object's "T" key. The generated PDF file lives for only a few minutes.

The two main difficulties I ran into when building the HTTP request were getting the header values right and assembling the string of POST parameters. But a simple inspection of the page with the browser's built-in developer tools (in Chrome they open with F12) gave me everything I needed. Here is an example of headers with which the server returns a correct response instead of 400 Bad Request:

    POST / HTTP/1.1
    Host: egrul.nalog.ru
    Connection: keep-alive
    Accept: application/json, text/javascript, */*; q=0.01
    Origin: https://egrul.nalog.ru
    X-Requested-With: XMLHttpRequest
    User-Agent: Chrome/67.0.3396.99 Safari/537.36
    Content-Type: application/x-www-form-urlencoded
    Referer: https://egrul.nalog.ru/
    Accept-Encoding: gzip, deflate, br
    Accept-Language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7

And here is the parameter string:

    kind=ul&srchUl=name&ogrninnul=7716819629&namul=%D0%BF%D1%80%D0%B0%D0%B2%D0%B8%D1%82%D0%B5%D0%BB%D1%8C%D1%81%D1%82%D0%B2%D0%BE&regionul=73&srchFl=ogrn&ogrninnfl=&fam=&nam=&otch=&region=&captcha=449023&captchaToken=DAEDA7504CACAC82CF09E08319B68DF5F9BD62B2F44D33DD679DDE55B5CF58B17FEC84E78CEEB963984D2B2BD8C3AA15
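Note that the organization name in namul is percent-encoded byte by byte (UTF-8). A minimal sketch of how such a string could be assembled is shown below. UrlEncode and BuildNameQuery are hypothetical helpers written for illustration, not part of Synapse or of the library described here, and the sketch assumes the input strings are UTF-8 encoded, as they are by default in Lazarus.

```pascal
program BuildParams;
{$mode objfpc}{$H+}

uses
  SysUtils;

{ Hypothetical helper: percent-encodes every byte that is not in the
  RFC 3986 "unreserved" set. Works on raw bytes, so a UTF-8 string with
  Cyrillic letters turns into %D0%BF... sequences. }
function UrlEncode(const S: string): string;
const
  Unreserved = ['A'..'Z', 'a'..'z', '0'..'9', '-', '_', '.', '~'];
var
  i: Integer;
begin
  Result := '';
  for i := 1 to Length(S) do
    if S[i] in Unreserved then
      Result := Result + S[i]
    else
      Result := Result + '%' + IntToHex(Ord(S[i]), 2);
end;

{ Hypothetical helper: builds the body of a search-by-name request using
  the field names seen in the captured browser request. }
function BuildNameQuery(const OrgName, Region, Captcha, Token: string): string;
begin
  Result := 'kind=ul&srchUl=name&ogrninnul=&namul=' + UrlEncode(OrgName) +
    '&regionul=' + UrlEncode(Region) +
    '&srchFl=ogrn&ogrninnfl=&fam=&nam=&otch=&region=' +
    '&captcha=' + UrlEncode(Captcha) +
    '&captchaToken=' + UrlEncode(Token);
end;

begin
  // Placeholder values; a real token comes from the hidden form field.
  WriteLn(BuildNameQuery('a b', '73', '449023', 'TOKEN'));
end.
```

Encoding every value, not only the name, keeps the request well-formed even if a user types an ampersand or an equals sign into the search field.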

Armed with this data, we can proceed to the implementation. I will use the following libraries for Free Pascal:

Synapse: a very convenient library with about the simplest possible function for sending HTTP requests to a server. It also works over SSL, but this requires the OpenSSL libraries to be present in the project folder or in the system, as well as an additional unit. It is enough to add the following library units to the project: httpsend, ssl_openssl, synautil.

The built-in fcl-json library: the units we need are fpjson and fpjsonrtti, for maximum convenience in processing the objects returned in JSON.

Some units of the built-in fcl-xml library: a few functions need to work with parts of the HTML as DOM objects, so we add the units SAX_HTML, DOM_HTML and DOM.

Here are the types and classes I ended up with:

    TEGRULItem = class(TCollectionItem)
    private
      fT, fINN, fNAME, fOGRN, fADRESTEXT, fCNT, fDTREG, fDTEND, fKPP: string;
    public
      function GetPdfLink: string;
    published
      property T: string read fT write fT;
      property INN: string read fINN write fINN;
      property NAME: string read fNAME write fNAME;
      property OGRN: string read fOGRN write fOGRN;
      property ADRESTEXT: string read fADRESTEXT write fADRESTEXT;
      property CNT: string read fCNT write fCNT;
      property DTREG: string read fDTREG write fDTREG;
      property DTEND: string read fDTEND write fDTEND;
      property KPP: string read fKPP write fKPP;
    end;

This class holds the objects returned in the rows array of the server's JSON response. We will read them with JSONToCollection, but for that each object has to be a collection item, and all the relevant properties have to be declared as published. RTTI functions in Free Pascal (as in Delphi) can access property names only when they are declared in that section. And the JSONToCollection function from the fpjsonrtti unit is exactly such an RTTI function: it matches the key names of a JSON object against the property names of the class.

The class interface also has a GetPdfLink function, which returns the link for downloading the PDF file with the EGRUL extract by concatenating the web address with the value of the "T" property.
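To make the RTTI mapping more tangible, here is a small self-contained sketch of reading a rows array into a collection. It is not the library's actual code: TRowItem is a trimmed-down stand-in for TEGRULItem, ParseRows is a hypothetical helper, the body of GetPdfLink is my guess based on the description above, and the sample response uses a short made-up "T" token.

```pascal
program RowsToCollection;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, fpjson, jsonparser, fpjsonrtti;

type
  { Stand-in for TEGRULItem with only three of its published properties. }
  TRowItem = class(TCollectionItem)
  private
    fT, fINN, fOGRN: string;
  public
    function GetPdfLink: string;
  published
    property T: string read fT write fT;
    property INN: string read fINN write fINN;
    property OGRN: string read fOGRN write fOGRN;
  end;

const
  // Shortened sample; the "T" token is made up, real ones are far longer.
  SampleResponse =
    '{"query":{"kind":"ul"},"rows":[' +
    '{"T":"ABC123","INN":"7325001144","OGRN":"1027301175110","KPP":"732501001"}]}';

function TRowItem.GetPdfLink: string;
begin
  // Assumed implementation: download address plus the value of "T".
  Result := 'https://egrul.nalog.ru/download/' + fT;
end;

{ Hypothetical helper: parses the response text and copies the "rows"
  array into a new collection. The caller owns the returned collection. }
function ParseRows(const Response: string): TCollection;
var
  Root: TJSONData;
  DeStreamer: TJSONDeStreamer;
begin
  Root := GetJSON(Response);
  Result := TCollection.Create(TRowItem);
  DeStreamer := TJSONDeStreamer.Create(nil);
  try
    // Each JSON object becomes one item; keys are matched to the
    // published property names, and KPP has no property in this stand-in.
    DeStreamer.JSONToCollection((Root as TJSONObject).Arrays['rows'], Result);
  finally
    DeStreamer.Free;
    Root.Free;
  end;
end;

var
  Items: TCollection;
begin
  Items := ParseRows(SampleResponse);
  try
    WriteLn(TRowItem(Items.Items[0]).GetPdfLink);
  finally
    Items.Free;
  end;
end.
```

The collection is created with TRowItem as its item class, which is what lets the de-streamer create items of the right type on its own.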


The main class implementing the interface declared above will be as follows:

    TEGRULStreamer = class(TInterfacedObject, IEGRULStreamer)
    private
      HTTPSender: THTTPSend;
      Doc: THTMLDocument;
      Inputs: TDOMNodeList;
      captchaURL, captchaToken, captcha, Params: string;
      function GetCaptchaToken: string;
      function GetLegalsList: TCollection;
      procedure PrepareHeaders;
      procedure ProcessCaptcha(CaptchaFunc: TCapthcaRecognizeFunc);
    public
      procedure GetExtractByOGRN(OGRN: string; CaptchaFunc: TCapthcaRecognizeFunc;
        isLegal: boolean; var Extract: TStream);
      procedure GetLegalsListByName(Name, Region: string; CaptchaFunc: TCapthcaRecognizeFunc;
        var LegalsList: TCollection);
      destructor Destroy; override;
    end;


As you can see, apart from the implementation of the two main interface methods, all other fields and methods of the class are hidden and needed only for the internal implementation. They could have been folded into the main methods, but we have already learned our lessons about duplicated code, visibility and refactoring in general.

With the preparatory steps encapsulated, the main methods differ only in how they build the parameter string of the HTTP request and in the type of data they return.

Code of the TEGRULStreamer.GetExtractByOGRN method
    procedure TEGRULStreamer.GetExtractByOGRN(OGRN: string; CaptchaFunc: TCapthcaRecognizeFunc;
      isLegal: boolean; var Extract: TStream);
    begin
      ProcessCaptcha(CaptchaFunc);
      // Search among legal entities ('ul') or individual entrepreneurs ('fl')
      if isLegal then
        Params := 'kind=ul'
      else
        Params := 'kind=fl';
      Params += '&srchUl=ogrn&srchFl=ogrn&ogrninnul=';
      if isLegal then
        Params += OGRN;
      Params += '&namul=&regionul=&ogrninnfl=';
      if not isLegal then
        Params += OGRN;
      Params += '&fam=&nam=&otch=&region&captcha=' + captcha + '&captchaToken=' + captchaToken;
      WriteStrToStream(HTTPSender.Document, Params);
      if not HTTPSender.HTTPMethod('POST', EGRUL_URL) then
        raise Exception.Create('   ');
      HTTPSender.Headers.Clear;
      // Download the PDF for the first (and only expected) search result
      if HTTPSender.HTTPMethod('GET', TEGRULItem(GetLegalsList.Items[0]).GetPdfLink) then
        Extract := HTTPSender.Document
      else
        Extract := nil;
    end;


As we can see, this method also takes the boolean parameter isLegal; if it is not true, the search runs against the register of individual entrepreneurs instead of legal entities.

Code of the TEGRULStreamer.GetLegalsListByName method
    procedure TEGRULStreamer.GetLegalsListByName(Name, Region: string; CaptchaFunc: TCapthcaRecognizeFunc;
      var LegalsList: TCollection);
    begin
      ProcessCaptcha(CaptchaFunc);
      Params := 'kind=ul&srchUl=name&srchFl=ogrn&ogrninnul=&namul=';
      Params += Name + '&regionul=' + Region + '&ogrninnfl=&fam=&nam=&otch=&region';
      Params += '&captcha=' + captcha + '&captchaToken=' + captchaToken;
      WriteStrToStream(HTTPSender.Document, Params);
      if not HTTPSender.HTTPMethod('POST', EGRUL_URL) then
        raise Exception.Create('   ');
      LegalsList := GetLegalsList;
    end;


The service methods play the following roles:

ProcessCaptcha: loads the start HTML page of the Federal Tax Service (FTS) site, finds the captcha token, downloads the image generated for this token, and passes it to the callback method for recognition. At the end, the method also sets the correct headers for the subsequent POST request.

GetCaptchaToken: loads all input fields from the page into a DOM structure, finds the hidden field with the identifier captchaToken and returns its value.

GetLegalsList: uses the RTTI function JSONToCollection to return a collection of objects of the TEGRULItem type described above.

GetPdfLink: a correct search by OGRN or TIN always returns exactly one result, so GetExtractByOGRN calls this function on the first element of the collection.
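The token lookup described for GetCaptchaToken can be sketched roughly as follows. This is an illustration under assumptions rather than the library's code: ExtractInputValue is a hypothetical helper, the HTML fragment is invented, and I assume the hidden field is identified by its name attribute (on the real page it may be the id attribute instead).

```pascal
program FindCaptchaToken;
{$mode objfpc}{$H+}

uses
  Classes, SysUtils, DOM, DOM_HTML, SAX_HTML;

const
  // Invented fragment standing in for the real start page.
  SamplePage =
    '<html><body><form>' +
    '<input type="text" name="captcha">' +
    '<input type="hidden" name="captchaToken" value="DAEDA7504C">' +
    '</form></body></html>';

{ Hypothetical helper: parses HTML text into a DOM tree and returns the
  value attribute of the first <input> whose name equals FieldName,
  or an empty string if there is no such field. }
function ExtractInputValue(const Html, FieldName: string): string;
var
  Source: TStringStream;
  Doc: THTMLDocument;
  Inputs: TDOMNodeList;
  Field: TDOMElement;
  i: Integer;
begin
  Result := '';
  Doc := nil;
  Source := TStringStream.Create(Html);
  try
    ReadHTMLFile(Doc, Source);
    Inputs := Doc.GetElementsByTagName('input');
    try
      for i := 0 to Inputs.Count - 1 do
      begin
        Field := TDOMElement(Inputs[i]);
        if Field.GetAttribute('name') = DOMString(FieldName) then
        begin
          Result := string(Field.GetAttribute('value'));
          Break;
        end;
      end;
    finally
      Inputs.Free;  // the node list is owned by the caller
    end;
  finally
    Doc.Free;
    Source.Free;
  end;
end;

begin
  WriteLn(ExtractInputValue(SamplePage, 'captchaToken'));
end.
```

In the library itself the page would come from HTTPSender.Document after a GET request rather than from a string constant.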

Since this is my first experience with networking in Free Pascal, I am very glad that everything turned out exactly as I intended. A working version of the library took less than a day (thanks to the members of freepascal.ru who told me about Synapse).

The archive with a test of the resulting library and its code is here.

As always, I will be glad to hear any constructive criticism of both the project and the implementation. I understand that there are many things still to be taken into account: a delayed response to an HTTP request that makes the application hang, incorrect HTTP responses, and other situations.

In the future I plan to connect the library with the FIAS address database and to add the ability to generate pre-filled application forms, which are normally edited in the Program for preparing documents for state registration.


P.S. Sorry, Sberbank, for playing the guinea pig and having your extract downloaded hundreds of times. All in the name of science, of course.

Source: https://habr.com/ru/post/419063/

