
Information for administrators of the "Citizens' Appeals" SED (electronic document management system)

On March 2, 2017, a working group under the Administration of the President of the Russian Federation approved a new standard all-Russian thematic classifier of appeals from citizens, organizations, and public associations. The file is distributed to the regions as a PDF.

For those who need to parse the text but rarely work with re, here is a reminder of a Python expression that does the job.
Sample of the source list:
...
0002.0013.0140.0282 Management in the field of scientific and technical activities
0002.0013.0140.0282.0006 Other sub-questions
...
r'((([\d]{4}\.){3,4}[\d]{4})([\s\S]+?))[\d]{4}\.'
returns a piece of text
having a beginning (question code):
(([\d]{4}\.){3,4}[\d]{4}) - three or four groups of 4 digits with a dot + 4 more digits

middle:
([\s\S]+?) - question text
and end:
[\d]{4}\. - 4 digits with a dot (code of the next question)


match.groups()[1] - question code
match.groups()[3] - question text
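
As a quick check, here is a minimal sketch that applies the expression to the two sample lines from the listing. It assumes the expression is written without any stray spaces; the names PATTERN, sample, code and text are only illustrative, and the whitespace trimming is an addition for readability.

```python
import re

# The expression from the post, written without stray spaces.
PATTERN = re.compile(r'((([\d]{4}\.){3,4}[\d]{4})([\s\S]+?))[\d]{4}\.')

# The two sample lines from the source list above.
sample = (
    "0002.0013.0140.0282 Management in the field of scientific and technical activities\n"
    "0002.0013.0140.0282.0006 Other sub-questions\n"
)

match = PATTERN.search(sample)
code = match.groups()[1]          # question code
text = match.groups()[3].strip()  # question text, surrounding whitespace trimmed

print(code)
print(text)
```

The first question is found because the code of the second question follows it and serves as the terminator.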

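Keep in mind that:
re does not search for overlapping matches, and each match consumes the first four digits of the next question's code, so re.findall and re.finditer return only the odd-numbered questions;
re does not treat the end of the file as a terminator, so the last question is lost (correct me if I'm wrong).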
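
One possible way around both limitations, offered as a sketch rather than as the method used for the published parsed text, is to turn the terminator into a lookahead and also accept the end of the string. The pattern named LOOKAHEAD and the helper parse are assumptions made for this illustration; the input is the same two sample lines.

```python
import re

# The expression from the post: the trailing [\d]{4}\. consumes the start of the next code.
ORIGINAL = re.compile(r'((([\d]{4}\.){3,4}[\d]{4})([\s\S]+?))[\d]{4}\.')

# Assumed variant: the terminator is a lookahead (nothing is consumed),
# and \Z lets the last question end at the end of the text.
LOOKAHEAD = re.compile(r'((?:\d{4}\.){3,4}\d{4})([\s\S]+?)(?=\d{4}\.|\Z)')

sample = (
    "0002.0013.0140.0282 Management in the field of scientific and technical activities\n"
    "0002.0013.0140.0282.0006 Other sub-questions\n"
)

def parse(pattern, code_group, text_group, source):
    """Return (code, text) pairs, reading the given group numbers from each match."""
    return [(m.group(code_group), m.group(text_group).strip())
            for m in pattern.finditer(source)]

original_result = parse(ORIGINAL, 2, 4, sample)    # groups()[1] and groups()[3]
lookahead_result = parse(LOOKAHEAD, 1, 2, sample)

print(len(original_result), "question(s) with the original expression")
print(len(lookahead_result), "question(s) with the lookahead variant")
```

On this sample the original expression yields only the first question, while the variant yields both. The variant still assumes that question text never contains four digits followed by a dot (a year at the end of a sentence, for example), so the result is worth checking against the PDF.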

You can download the original PDF and the already parsed text here.

P.S.
32 questions were added, but errors remain:
the question "0003.0009.0103.0613 - Funeral services"
still sits under the topic "0003.0009.0103.0000 - Catering".

I wrote to A.V. Popov asking him to fix it. Apparently that is difficult.

Source: https://habr.com/ru/post/324224/

