This is my second year working on open government data in the Russian Federation and with government agencies, and it is about time I started telling stories about how that data comes into being. Today, though, we will talk about something more familiar to developers: parsing data for the "Declarator" project, and the unexpected benefit that open data can bring.

Declarator is a constantly updated database of the income and property declarations of public officials: deputies, civil servants, judges, representatives of regional and municipal authorities and other bodies, state corporations and state-owned companies. The project serves as an information base for the media, for activists involved in public oversight, and for researchers.
In Russia, more than a million people are required to publish information about their income.
An interesting fact: there are uniform rules for how state websites must post income declarations (in particular, they always go in the "Countering Corruption" section), and the Ministry of Labor and Social Protection of the Russian Federation is responsible for the whole topic. Declarations are posted en masse in May. After that, the Ministry of Labor has only a month to monitor every site that is obliged to post the information. The monitoring is done manually.
There are several problems associated with the publication of declarations:
- each agency does this on its own website;
- there is no single standard for the publication of declarations;
- information may be deleted shortly after publication;
- although publicly accessible by law, declarations are still not published in machine-readable form.
Legal background
According to clause 2 of the Procedure for placing information about incomes, expenses, property and property obligations of certain categories of individuals and their family members on the official websites of federal state bodies, state authorities of the constituent entities of the Russian Federation and organizations and for providing this information to the all-Russian mass media for publication, approved by Decree of the President of the Russian Federation dated July 8, 2013 No. 613 "Anti-Corruption Issues", federal state authorities, the Central Bank of the Russian Federation, the Pension Fund of the Russian Federation, the Social Insurance Fund of the Russian Federation, the Federal Mandatory Medical Insurance Fund, state corporations (companies) and other organizations established under federal laws are required to publicly disclose income information for the past calendar year.
Those interested can read the details in order 530n of the Ministry of Labor. In brief: the "Countering Corruption" section must be one click away from the main page and must contain several mandatory subsections with fixed names, of which we are interested in one: "Information about incomes, expenses, property and property obligations". This is where the income of state officials is declared, without access restrictions and in tabular form, including in the formats .doc, .docx, .excel and .rtf (clause 15 of the Requirements).
At the same time, as the document states directly, "it must be possible to search through the text of the file and copy fragments of text", and that is what makes it possible to write parsers.
About declarations
Each official's income declaration contains information about the real estate they own or use, about their vehicles, and about their income and its sources. The declaration lists not only the officials but also members of their families. If you are lucky, the summary file was created from the income information form template proposed by the Ministry of Labor, which greatly simplifies the task of writing a universal parser.

The first object of interest was the Ministry of Education and Science. The trial parser, for the income declarations of federal civil servants for 2014 covering 306 people, was written quickly and fairly painlessly. After that, we decided to move on to what researchers ask for most often: information about the income of university rectors. And here the difficulties began, with the declarations for 2014 of the institutions subordinate to the Ministry of Education and Science.
A quick analysis showed that the rectors alone number 272 in this file, and they obviously need to be grouped somehow. We decided to group them by region. However, the declaration gives only the position and the name of the institution.

Looking up the regions of almost three hundred unique institutions of higher education by hand was not appealing at all. This is where the Rosobrnadzor open data set called "Register of organizations carrying out educational activities on accredited educational programs" came in useful (upd: the name and link were corrected on 05/24/2016). The registry is very informative, but its format is rather complicated and deserves a separate article. To the credit of Rosobrnadzor, it is published as XML rather than the CSV that government agencies traditionally use, and its structure contains no errors, so it could be used as a database without any preliminary processing. (As of May 16, 2016 the link is temporarily not working, so for now the registry can be taken here; upd: Rosobrnadzor fixed the link on 05/24/2016.)
About the task
So, the initial task is to extract the rectors' data from the income declaration, determine a region for each of them, and generate XML files in the special format used by the "Filling data" plugin, through which the data is uploaded to the Declarator website.

The plugin simulates user actions on the site by automatically filling out forms. And as it turned out during the testing process, it has limitations on the number of records that it can process at one time ...
As input, we have a .doc file containing income data for 1561 officials. Since we need data not only for the officials but also for their family members, we actually have to process information on 3347 people.
Implementation
You can see the implementation details of the declaration parser on GitHub. I will only say that it took a lot of preprocessing to remove unnecessary and often unexpected characters:
@"[\r\n\a\b\u000b]"
as well as some shamanism with vehicles, which is still unfinished because of cases like this one (this is a single table cell in MS Word):

Registry
We now turn to the main task: matching the university names in the declarations against the Rosobrnadzor registry. Solving it, including the selection of regular expressions, ended up taking much more time than writing the parser itself, even though it required much less code.
The entries for each license in the Rosobrnadzor registry have (in a very abbreviated form) the following structure:
<Certificate>
  <RegionName>. </RegionName>
  <RegionCode>77</RegionCode>
  <EduOrgFullName> « »</EduOrgFullName>
  <EduOrgShortName> «»</EduOrgShortName>
  <ActualEducationOrganization>
    <Id>58302c2c-16f2-0772-3cf1-ebacbde89ecd</Id>
    <FullName> « »</FullName>
    <ShortName> «»</ShortName>
    <RegionName>. </RegionName>
    <RegionCode>77</RegionCode>
  </ActualEducationOrganization>
  <ActualEducationOrganization> … </ActualEducationOrganization>
</Certificate>
EduOrgFullName holds information about the license itself, that is, about the parent organization. The ActualEducationOrganization tags contain information on all institutions subordinate to the parent organization (and there may be a lot of them: the institutes of a university, branches and much more). The solution therefore looked simple and obvious: find the names from the declarations in the FullName or ShortName tags of the registry and see which region corresponds to them.
Registry features: only guillemets (« ») are used as quotes, and the name of the type of institution ("federal state ...") in the FullName tag is written entirely in lower case.
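The lookup described above can be sketched with an XPath query over LINQ to XML. This is only an illustration, not the project's actual code: the class and method names are made up, and the query assumes the element names from the abbreviated structure shown earlier. Note that XPath 1.0 `contains()` compares case-sensitively.

```csharp
using System.Xml.Linq;
using System.Xml.XPath;

public static class RegistryXPath
{
    // Returns the RegionName of the first ActualEducationOrganization whose
    // FullName or ShortName contains the fragment (case-sensitive match),
    // or an empty string when nothing matches.
    // Assumption: the fragment contains no apostrophe, because XPath 1.0
    // string literals have no escape mechanism.
    public static string FindRegion(XDocument registry, string fragment)
    {
        string xpath =
            "//ActualEducationOrganization[contains(FullName, '" + fragment + "')" +
            " or contains(ShortName, '" + fragment + "')]/RegionName";
        var region = registry.XPathSelectElement(xpath);
        return region == null ? "" : region.Value;
    }
}
```

A call such as `RegistryXPath.FindRegion(XDocument.Load(path), name)` would then give the region for one cleaned-up declaration name, where `path` and `name` are whatever the caller supplies.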
Preprocessing
However, the way university names were written in the declarations was apparently limited only by the imagination of those who filed them. As a result, we have every possible mix of letter cases and spellings, from the Federal State Budgetary Institution of Higher Education "Moscow State University of Technology and Management named after K. G. Razumovsky (PKU)" and FSAEI HPE "National Nuclear Research University "MEPhI" to FSAEI "VPO BFU IM.I.KANTA" (which elsewhere in the same declaration is written as FSAEI HPO "Baltic Federal University named after Immanuel Kant"). For a long time the parser could not find my own university. Not surprising for FGBOU VPO "MGUDT" ...
The quotes showed great variety too. The declarations mostly contain straight "English" quotes instead of the usual Russian guillemets; the nastiest variant for parsing was “St. Petersburg State Electrotechnical University “LETI” im. V.I. Ulyanova (Lenina)" (it looks like the Habr parser could not cope with it either). The exotic " "" (Volgograd, Moscow) occurred twice, and in Rostov they came up with the <<Rostov State Economic University (RINH)>> version (by the way, 10 occurrences in the declaration).
It turned out that XPath searches XML very quickly and can find partial matches but, alas, only case-sensitively. If you override the search function, the problem disappears, and so does the speed.
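One way to get a case-insensitive partial match is to skip XPath and scan the elements directly. The sketch below uses LINQ to XML with an ordinal ignore-case comparison instead of an overridden XPath function, so it is a substitute technique and not the approach from the project; the names `RegistrySearch` and `FindRegionIgnoreCase` are invented for the example. It visits every organization in turn, which fits the observation that flexibility costs speed.

```csharp
using System;
using System.Xml.Linq;

public static class RegistrySearch
{
    // Linear scan over all ActualEducationOrganization elements.
    // Returns the RegionName of the first one whose FullName or ShortName
    // contains the fragment, ignoring letter case; "" when nothing matches.
    public static string FindRegionIgnoreCase(XDocument registry, string fragment)
    {
        foreach (XElement org in registry.Descendants("ActualEducationOrganization"))
        {
            // Missing tags are treated as empty names.
            string fullName = org.Element("FullName")?.Value ?? "";
            string shortName = org.Element("ShortName")?.Value ?? "";
            if (fullName.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0 ||
                shortName.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return org.Element("RegionName")?.Value ?? "";
            }
        }
        return "";
    }
}
```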
The first step was to remove the repetitive part, the type of institution, and leave only the actual name. And here a discovery awaited: in addition to the FSBEI (as well as the FSBEI, FSBE) VPO, VET and VO, it turned out that there are also institutions of "inclusive higher education", that is, FSBEII of HE ... be replaced by a "federal state budgetary institution of higher professional education" - can you find the difference?
The result is this construction of regular expressions:
orgname = Regex.Replace(orgname, @"(.*)((|)\s)((|)?)?", "");
orgname = Regex.Replace(orgname, @"([-]*\s|[-]*\s)?([-]*\s|[-]*\s)?(.*)?([-]*\s|[-]*\s)([-]*\s|[-]*\s)?([-]*\s|[-]*\s|[-]*\s|[-]*\s)([-]*\s|[-]*\s)?([-]*|[-]*)", "");
orgname = Regex.Replace(orgname, @"([-]*\s|[-]*\s)?([-]*\s|[-]*\s)?(.*)?([-]*\s|[-]*\s)", "");
The order of the commands matters a great deal here. Then comes the fight with quotes:
if (orgname.Contains("<<"))
    orgname = Regex.Replace(orgname, @"(.*<<)(.+)(>>.*)", "$2");
if (orgname.Contains('«'))
    orgname = Regex.Replace(orgname, @"(.*«)(.+)(».*)", "$2");
if (orgname.Contains('“'))
    orgname = Regex.Replace(orgname, @"(.*“)(.+)(”.*)", "$2");
Universities named after someone turned out to be a big problem, because:
- both the abbreviated "im." and the full "imeni" ("named after") occurred, in various letter cases;
- the person's name itself could be written in any form, in any letter case and with any number of spaces (or none at all), for example: IM.I.KANTA, imeni Ivana Fedorova, im. V.I. Ulyanova (Lenina), im.N.I. Lobachevskogo.
Because the spelling was so unpredictable, it was easier to remove the "named after someone" part entirely:
orgname = Regex.Replace(orgname, "(.*)", "");
orgname = Regex.Replace(orgname, @"(\..*)", "");
Fortunately, the university names stayed unique even without it.
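Cutting off the "named after" tail can be sketched as a single case-insensitive replacement. This is an assumption-laden illustration: it supposes that the markers in the source data are the Cyrillic "имени" and "им.", and the helper name `StripNamedAfter` is invented here.

```csharp
using System.Text.RegularExpressions;

public static class NameCleanup
{
    // Drops everything from the first "имени" or "им." marker to the end of
    // the string, in any letter case. The leading \b keeps the pattern from
    // firing inside longer words that merely contain "им".
    public static string StripNamedAfter(string orgname)
    {
        string cut = Regex.Replace(
            orgname,
            @"\s*\b(имени\b|им\.).*$",
            "",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
        return cut.Trim();
    }
}
```

Because the whole tail is dropped, the number of spaces and the letter case inside the person's name no longer matter.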
Still, there remained "stubborn" cases that refused to be matched in any way. These were universities whose names consist of abbreviations: LETI, NINH, STANKIN and, once again, my favorite MGUDT. The solution:
// A run of two or more capital letters is taken to be the abbreviation.
Match tempmatch = Regex.Match(orgname, @"[А-Я]{2,}");
if (tempmatch.Success) tempname = tempmatch.Value;
Strangely enough, even after that several dozen universities remained unrecognized. Analysis showed that some of them were written with typos, mainly omitted or transposed letters. The leader was the typo "Moskvosky". And in Kuzbass, judging by the capitals, they love their university very much, but their literacy let them down:
"The Federal State Budgetary Institution of Higher Professional Education KUZBASS STATE TECHNICAL UNIVERSITY AFTER T.F. Gorbachev".
There were many problems with spaces:
- in the Rosobrnadzor registry, some names contain double spaces;
- some university names contain a dash, and these dashes could be combined with spaces in completely different ways. For example, the declaration contains "Volgodonsk Engineering and Technical Institute - a branch of the Federal State Autonomous Educational Institution of Higher Professional Education "National Research Nuclear University MEPhI"", while in the registry the same name looks like "Volgodonsk Engineering and Technical Institute ...". There was also this case: "The Federal State Budgetary Educational Institution of Higher Professional Education "State University - Educational, Scientific and Industrial Complex"".
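The space and dash problems listed above can be handled by bringing both sides to one canonical spelling before comparing. A minimal sketch, assuming the same function is applied to the declaration name and to the registry name; the class name and the choice of " - " as the canonical form are arbitrary decisions made for this example.

```csharp
using System.Text.RegularExpressions;

public static class Whitespace
{
    public static string Normalize(string name)
    {
        // Any hyphen or Unicode dash (U+2010..U+2015), together with the
        // spaces around it, becomes " - ". Hyphenated words are affected
        // too, which is harmless as long as both sides are normalized.
        string s = Regex.Replace(name, @"\s*[\u2010-\u2015-]\s*", " - ");
        // Runs of whitespace, including double spaces, collapse to one space.
        s = Regex.Replace(s, @"\s+", " ");
        return s.Trim();
    }
}
```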
But the number of unidentified universities was for some reason in no hurry to decline. The cause was a completely unexpected problem, which eventually led to a change in the search algorithm. It turned out that the license for educational activity could be registered under one institution name while the official name of the university was slightly different. For example, there were pairs such as "academy - university" and "institute - university".
In the registry, it looked like this:
<EduOrgFullName> « - »</EduOrgFullName> <EduOrgShortName> «»</EduOrgShortName>
and
<FullName> « - »</FullName> <ShortName> « - », «», </ShortName>
And the declaration could contain either of them. For reasons still unknown, several universities had no FullName and ShortName tags at all.
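A search that copes with both situations could check every name a certificate is known under and simply skip tags that are absent. The following is a hedged sketch of that idea and not the modified algorithm from the project: the class and method names are hypothetical, and the element names are taken from the abbreviated registry structure shown earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class CertificateSearch
{
    // Yields every non-empty name attached to a certificate: the license
    // holder's names and the names of each subordinate organization.
    static IEnumerable<string> AllNames(XElement certificate)
    {
        foreach (string tag in new[] { "EduOrgFullName", "EduOrgShortName" })
        {
            string value = certificate.Element(tag)?.Value ?? "";
            if (value.Length > 0) yield return value;
        }
        foreach (XElement org in certificate.Elements("ActualEducationOrganization"))
        {
            foreach (string tag in new[] { "FullName", "ShortName" })
            {
                string value = org.Element(tag)?.Value ?? "";
                if (value.Length > 0) yield return value;
            }
        }
    }

    // Returns the certificate-level RegionCode of the first certificate with
    // any name containing the fragment (case-insensitive), or "" if none.
    public static string FindRegionCode(XDocument registry, string fragment)
    {
        foreach (XElement cert in registry.Descendants("Certificate"))
        {
            foreach (string name in AllNames(cert))
            {
                if (name.IndexOf(fragment, StringComparison.OrdinalIgnoreCase) >= 0)
                    return cert.Element("RegionCode")?.Value ?? "";
            }
        }
        return "";
    }
}
```

With this shape, a declaration that uses the "academy" name and one that uses the "university" name both resolve to the same certificate.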
A separate note about Crimea:
the Federal State Autonomous Educational Institution of Higher Education "Crimean Federal University named after V.I. Vernadsky" had no region indicated in the registry.
The final version of the parsing algorithm currently produces these results.
Here is an example of a finished file for the Filer (universities in the city of Moscow).
Some statistics
The declaration of the Ministry of Education and Science covers 1561 employees of organizations, or 3347 people together with their family members; the file has 770 pages and, unexpectedly, 32 tables.
272 rectors, i.e. over 272 unique universities, of which 8 were written with errors or typos.
4 universities do not appear in the Rosobrnadzor registry. They are:
- Federal State Budgetary Educational Institution of Higher Education "Arctic State Institute of Culture and Arts";
- Federal State Budgetary Educational Institution of Further Professional Education "State Institute of New Forms of Education";
- Federal State Budgetary Educational Institution of Further Professional Education “Institute of Continuing Adult Education”;
- Federal State Budgetary Educational Institution of Further Professional Education "Novomoskovsk Institute of Advanced Training for Executives and Chemical Industry Specialists".
In conclusion, once again, a link to the project. The parser will also work on the declarations of other government agencies that were created from the same template.
Useful links
Legal acts:
Presidential Decree of July 8, 2013 No. 613 "Anti-Corruption Issues", Art. 6, on the responsibilities of the Ministry of Labor and Social Protection.
Decree of the President of the Russian Federation of July 8, 2013 No. 613 "Anti-Corruption Issues", Procedure for placing information about incomes, expenses, property and property obligations of certain categories of individuals and their family members on the official websites of federal state bodies, state authorities of the constituent entities of the Russian Federation and organizations and for providing this information to the all-Russian mass media for publication, paragraph 4: "Information on income, expenses, property and property obligations ... is updated annually within 14 working days from the date of expiration of the deadline set for its submission."
Presidential Decree of July 8, 2013 No. 613 "Anti-Corruption Issues", Article 8. Obligation to provide information on income, property and property obligations.
All the regulations on anti-corruption legislation.
Legal acts of the Ministry of Labor
Programming:
Regular Expression Checker Service
Help on regular expressions in C#
Classes of regular expression characters for C#, as well as a project with a simple implementation of a case-insensitive XML search: link.