I am sure that many of you, perhaps all of you, have already come across the public services website.
From what I observe, whether the site is good or bad, there is interest in it.
However, to fully realize this interest, I personally believe that open data is necessary.
And such open data exists. It was not provided by the Ministry of Communications but extracted from the public services website by a special parser, yet it exists.
For example, a month ago this data let me produce some interesting figures on the organizations listed on this site and their contacts.
I will quote from that post:
The state services website lists 19989 registered state organizations.
Together, these organizations have 6730 unique email addresses (some structures share addresses, so only unique ones are counted). Of these:
- 412 (6%) - are filled in incorrectly and do not pass validation
- 59 (1%) - point to non-existent domains
- 1517 (22.5%) - are addresses at free email services such as Mail.ru, Google Mail, Yandex.Mail and Rambler Mail
In more detail, by service:
- 982 (64.7%) - Mail.ru
- 305 (20.1%) - Yandex.Mail
- 118 (7.8%) - Rambler Mail
- 112 (7.4%) - Google Mail
- 30 (1.97%) - HotMail
However, I looked at all this from one angle only, and I am quite sure there are many more problems there. For example, contact phone numbers are often completely wrong, a huge number of organizations have no places of service, many organizations are not connected to any services at all, most organizations have no contacts, and so on.
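The breakdown by free email service quoted above can be reproduced with a few lines of code. This is only a minimal sketch: the mapping from domains to services is my assumption for illustration (the quoted post does not say which domains were counted for each service), and the sample addresses are made up.

```python
from collections import Counter

# Assumption: this domain-to-service mapping is illustrative only;
# the quoted post does not list the domains counted for each service.
FREE_PROVIDERS = {
    "mail.ru": "Mail.ru",
    "yandex.ru": "Yandex.Mail",
    "rambler.ru": "Rambler Mail",
    "gmail.com": "Google Mail",
    "hotmail.com": "HotMail",
}


def provider_breakdown(emails):
    """Count unique addresses per free service; return {service: (count, share %)}."""
    # Deduplicate case-insensitively, since only unique addresses are counted.
    unique = {e.strip().lower() for e in emails if "@" in e}
    counts = Counter()
    for address in unique:
        domain = address.rsplit("@", 1)[1]
        provider = FREE_PROVIDERS.get(domain)
        if provider:
            counts[provider] += 1
    total = sum(counts.values())
    # Shares are relative to all free-service addresses, as in the list above.
    return {
        name: (n, round(100.0 * n / total, 1))
        for name, n in counts.most_common()
    }


# Made-up sample addresses, not taken from the dump.
sample = [
    "office@mail.ru", "Office@mail.ru", "adm@yandex.ru",
    "info@gmail.com", "reception@example.org",
]
breakdown = provider_breakdown(sample)
print(breakdown)
```

Addresses on domains outside the mapping are simply skipped, so the shares describe only the free-service subset.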
Surely many of you will find interesting data there for visualization and analysis.
The data itself is available in formats suitable for use in MongoDB:
- in JSON format, made with mongoexport - http://export.opengovdata.ru/raw/gs_json.7z
- in BSON format, made with mongodump - http://export.opengovdata.ru/raw/gs_bson.7z
The dataset is focused on analyzing organizations rather than public services, so the main collection there is orgs. There are also several auxiliary collections from which the statistics on domains, email addresses and so on were computed.
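If you do not want to set up MongoDB, the JSON export can be read directly. The sketch below assumes the usual mongoexport layout of one JSON document per line, with the identifier written in extended JSON as {"$oid": ...}. I have not checked that against the archive here, and the file name in the comment is hypothetical. The two records are made up.

```python
import io
import json


def iter_documents(lines):
    """Yield one dict per non-empty line of a mongoexport-style dump."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for a real export file.
sample_dump = io.StringIO(
    '{"_id": {"$oid": "4d0000000000000000000001"}, "key": "1", "name": "Org A", "level": 1}\n'
    '{"_id": {"$oid": "4d0000000000000000000002"}, "key": "2", "name": "Org B", "level": 2, "parent": "1"}\n'
)
docs = list(iter_documents(sample_dump))
print(len(docs), docs[1]["parent"])

# With a real file (hypothetical name, after unpacking the archive):
# with open("orgs.json", encoding="utf-8") as f:
#     for org in iter_documents(f):
#         ...
```

Reading line by line keeps memory use low, which matters for the pages collection with its HTML fragments.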
The data structure is as follows.
Collection orgs - organizations
- _id - unique organization code in the system, MongoDB identifier
- key - unique code of the organization on the Gosuslugi website
- name - organization name
- url - link to the organization on the state services website
- level - the organization's level of subordination
- parent - parent organization code, if any
- profile - an array of arrays of 2 strings each, listing the fields from the organization's profile
- childs - child organizations, as a dictionary
- childs.num - the number of organizations
- childs.list - list / array of organization codes
- services - dictionary describing the services provided by this organization
- services.exists - flag indicating that the organization has a services block
- services.items - an array of service dictionaries with name and url fields
- suborgs - dictionary of subordinate organizations
- suborgs.exists - flag indicating that the organization has this block
- suborgs.items - an array of dictionaries with the fields key, name and url
- unknown - the "unknown" block of the page, as a dictionary. Present only if there are no other blocks.
- unknown.exists - flag indicating that the block exists
- unknown.items - array, always empty
- contacts - dictionary of the organization's contacts
- contacts.exists - flag indicating that the organization has this block
- contacts.items - an array of strings with contacts
- places - dictionary of service locations
- places.exists - flag indicating that the organization has this block
- places.items - an array of strings describing the places of service
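The exists / items pairs make it easy to count organizations that lack contacts, places of service or services. Here is a minimal sketch over plain dictionaries shaped like the orgs documents described above. I am assuming that each profile pair is [field name, value], which the description does not state explicitly, and the two sample organizations are made up.

```python
def block_present(org, block):
    """True if the block is flagged as existing and has at least one item."""
    data = org.get(block) or {}
    return bool(data.get("exists")) and bool(data.get("items"))


def profile_as_dict(org):
    """Turn the profile list of 2-string pairs into a dictionary.

    Assumption: each pair is [field name, value].
    """
    return {pair[0]: pair[1] for pair in org.get("profile", []) if len(pair) == 2}


def missing_blocks(orgs, blocks=("contacts", "places", "services")):
    """For each block, count the organizations that do not have it."""
    return {b: sum(1 for o in orgs if not block_present(o, b)) for b in blocks}


# Made-up organizations following the field names listed above.
sample_orgs = [
    {
        "key": "1",
        "profile": [["Head", "I. Ivanov"], ["Address", "Moscow"]],
        "contacts": {"exists": True, "items": ["tel. 000-00-00"]},
        "places": {"exists": False, "items": []},
        "services": {"exists": True, "items": [{"name": "S", "url": "/s/1"}]},
    },
    {"key": "2", "profile": [], "unknown": {"exists": True, "items": []}},
]
report = missing_blocks(sample_orgs)
print(report)
print(profile_as_dict(sample_orgs[0]))
```

A block counts as missing both when the key is absent and when exists is set but items is empty, since either way there is nothing to show a visitor.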
Collection pages - pages
- _id - unique code in the system, MongoDB identifier
- url - link to the requested page
- rurl - the URL of the page after the redirect by the public services site
- page - a fragment of HTML code with the page content
Collection domains - website domains (derived from the email addresses)
- _id - unique code in the system, MongoDB identifier
- domain - domain
- has_a - flag indicating that an A record exists in DNS
- a - an array of dictionaries with a name field and a list of results of the DNS A query
- has_mx - flag indicating that an MX record exists in DNS
- mx - an array of dictionaries with the fields name (server name), l2_dom (second-level domain of the server), priority and a list of results of the DNS MX query
Collection mx_servers - mail servers
- _id - unique code in the system, MongoDB identifier
- domain - mail server domain
- l2_dom - second level domain
- num_domains - the number of domains using this MX server
- domains - an array of domains using this MX server
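The mx_servers collection is in effect an inversion of the mx field of domains. The sketch below shows how such a summary could be rebuilt from domains-shaped documents. It is my reconstruction from the field descriptions, not necessarily how the collection was actually produced, and the sample data is made up.

```python
from collections import defaultdict


def build_mx_servers(domains):
    """Group domains by the MX servers they use, mirroring the mx_servers fields."""
    servers = defaultdict(lambda: {"l2_dom": None, "domains": set()})
    for doc in domains:
        if not doc.get("has_mx"):
            continue  # domains without MX records contribute nothing
        for mx in doc.get("mx", []):
            entry = servers[mx["name"]]
            entry["l2_dom"] = mx.get("l2_dom")
            entry["domains"].add(doc["domain"])
    return [
        {
            "domain": name,
            "l2_dom": entry["l2_dom"],
            "num_domains": len(entry["domains"]),
            "domains": sorted(entry["domains"]),
        }
        for name, entry in sorted(servers.items())
    ]


# Made-up domains documents following the field names listed above.
sample_domains = [
    {"domain": "a.example", "has_mx": True,
     "mx": [{"name": "mx1.host.example", "l2_dom": "host.example", "priority": 10}]},
    {"domain": "b.example", "has_mx": True,
     "mx": [{"name": "mx1.host.example", "l2_dom": "host.example", "priority": 10}]},
    {"domain": "c.example", "has_mx": False, "mx": []},
]
mx_servers = build_mx_servers(sample_domains)
print(mx_servers)
```

Sorting the result by num_domains then shows which mail hosts serve the most organizations.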
Collection emails - email addresses from the contacts of organizations
- _id - unique code in the system, MongoDB identifier
- email - email address
- domain - domain of the email address
- parsed - flag indicating that the email address has been parsed
- valid - flag indicating that the email address is correct
- has_a - flag indicating that an A record exists in DNS
- a - an array of dictionaries with a name field and a list of results of the DNS A query
- has_mx - flag indicating that an MX record exists in DNS
- mx - an array of dictionaries with the fields name (server name), l2_dom (second-level domain of the server), priority and a list of results of the DNS MX query
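To give an idea of how the email, domain, parsed and valid fields might be filled, here is a rough sketch. The exact rules used for the dump are not described, so both the regular expression and my reading of the flags are assumptions: parsed means the address splits into a local part and a domain, valid means it passes a simple syntax check.

```python
import re

# Assumption: a deliberately simple ASCII-only syntax check; it is not a full
# RFC 5322 validator and rejects internationalized domains.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,}$")


def describe_email(raw):
    """Build an emails-style record with the parsed and valid flags."""
    email = raw.strip().lower()
    local, sep, domain = email.rpartition("@")
    parsed = bool(sep and local and domain)
    return {
        "email": email,
        "domain": domain if parsed else None,
        "parsed": parsed,
        "valid": bool(EMAIL_RE.match(email)),
    }


# Made-up inputs: a normal address, a domain without a dot, no "@" at all.
for raw in ["Info@Example.ORG ", "info@example", "no-at-sign"]:
    print(describe_email(raw))
```

The DNS fields (has_a, has_mx and so on) would need real lookups and are left out of this sketch.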
Collection services - government services
The description is still incomplete: the services have only names and links to organizations.
- _id - unique code in the system, MongoDB identifier
- name - the name of the public service
- url - link to the service on the public services website
- num_orgs - the number of organizations
- orgs - an array of codes of organizations providing this service
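Since the services here carry only names and links to organizations, the same index can be derived from the services.items block of each organization. This is a sketch under assumptions of mine: that the url identifies a service and that key is the organization code stored in orgs. The sample data is made up.

```python
def build_services(orgs):
    """Invert orgs -> services into services-style records keyed by service url."""
    index = {}
    for org in orgs:
        block = org.get("services") or {}
        if not block.get("exists"):
            continue
        for item in block.get("items", []):
            entry = index.setdefault(
                item["url"], {"name": item["name"], "url": item["url"], "orgs": []}
            )
            if org["key"] not in entry["orgs"]:
                entry["orgs"].append(org["key"])
    for entry in index.values():
        entry["num_orgs"] = len(entry["orgs"])
    return list(index.values())


# Made-up organizations following the orgs field names.
sample_orgs = [
    {"key": "1", "services": {"exists": True, "items": [
        {"name": "Passport", "url": "/s/10"}, {"name": "Permit", "url": "/s/20"}]}},
    {"key": "2", "services": {"exists": True, "items": [
        {"name": "Passport", "url": "/s/10"}]}},
    {"key": "3", "services": {"exists": False, "items": []}},
]
services = build_services(sample_orgs)
print(services)
```

Services come out in the order they are first seen, so the result is stable for a given input order.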
For those of you thinking about how to work with this data, I also suggest looking at the catalog on OpenGovData.ru, whose data you can try to use to improve or analyze the data on public services.
I can also send the code for retrieving and parsing the state services data to anyone who wants it. I will publish it openly soon in any case, but for now it is not really ready for publication: it has no comments or explanations.