
Unobvious uses of open data

I debated whether to write this post at all, and then decided it was worth it after all.

Even before I started working intensively with open data, I had spent many years on various tasks involving text classification and analysis, semi-structured data, data cleaning, and data enrichment.

For example, quite a while ago I built an algorithm for automatically parsing a full name written in any form, determining gender and, where possible, ethnicity. This is not the hardest of tasks; I mention it not as something outstanding, but as something routine and typical. The interesting question is how to solve such a typical problem well.
And this is where open data came in handy.

However, I'll start from the beginning.


1. Fuel for the algorithm


Almost everyone here knows that data from the public procurement website is accessible to all: it can be downloaded in huge volumes from an FTP server, parsed, and put to use for various purposes. Many projects have appeared since officials began publishing this data, and I have used it for a long time myself: for analyzing public procurement, automatically detecting violations, studying markets, and many other tasks.

And so, some time ago I wanted the ability to run a gender analysis on an arbitrary sample of data. Say we are analyzing a list of deputies: I want to assign gender not by hand, line by line, but to let a robot loose and get the list back annotated. From such a list you can then visualize how many men and women there are, activity levels by gender, income levels, and so on.

Part of the problem is solved by lists of the most popular names, or via the patronymic, and so on. This approach works well when the incoming stream of names is well structured, but when names are written in every possible way, from "Ivan Petrov" to "Petrov I.A." and a dozen more variants, simple head-on solutions turn out to be insufficient. So I concluded I needed reference databases of first names, patronymics, and surnames, plus recognition of the structure of each incoming name.
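The core idea can be sketched as follows. This is a minimal illustration, not the author's actual implementation: the tiny dictionaries are placeholders for the real reference books built from open data, and the patronymic-suffix fallback relies on the standard Russian endings (-вич for men, -вна for women).

```python
# Hypothetical miniature reference books; the real ones hold tens of
# thousands of entries mined from procurement data.
FIRST_NAMES = {"иван": "m", "олег": "m", "мария": "f", "анна": "f"}
PATRONYMICS = {"петрович": "m", "иванович": "m", "ивановна": "f"}
SURNAMES = {"петров", "иванова", "сидорова"}

def classify_token(token: str) -> str:
    """Label one token of a free-form full name."""
    t = token.lower().rstrip(".")
    if t in PATRONYMICS:
        return "patronymic"
    if t in FIRST_NAMES:
        return "first_name"
    if t in SURNAMES:
        return "surname"
    if len(t) <= 2:  # "И.", "ИА" and similar initials
        return "initials"
    return "unknown"

def guess_gender(full_name: str) -> str:
    """Infer gender from dictionary hits and patronymic suffixes."""
    genders = []
    for token in full_name.split():
        t = token.lower().rstrip(".")
        if t in PATRONYMICS:
            genders.append(PATRONYMICS[t])
        elif t in FIRST_NAMES:
            genders.append(FIRST_NAMES[t])
        elif t.endswith("вич"):   # male patronymic suffix
            genders.append("m")
        elif t.endswith("вна"):   # female patronymic suffix
            genders.append("f")
    if not genders:
        return "unknown"
    return genders[0] if len(set(genders)) == 1 else "ambiguous"

print(guess_gender("Петров Иван Петрович"))    # m
print(guess_gender("Сидорова Анна Ивановна"))  # f
```

Because several signals (first name, patronymic, suffix) are checked independently, the function can also flag contradictory inputs as ambiguous rather than guessing.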

To build such reference books, you need a structured database of full names that can be mined for this purpose.

The question was where to find one.

The answer was close at hand. Personal names appear in many of the large open data arrays published by the authorities. In particular, full names of contact persons, responsible officers, and so on appear in procurement announcements, procurement protocols, and contract descriptions, as well as in the contact details of organization cards.

Yes, many of them are duplicated, and it is not millions of people but only hundreds of thousands; still, the data is structured, and all that remains is to classify this initial sample correctly, parse it into reference books, and then use those to recognize first names, patronymics, and surnames. That, in turn, lets us understand the structure of any full name the algorithm receives and determine gender accurately.
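The reference books themselves can be extracted along these lines. The records, field order, and frequency threshold below are all illustrative assumptions; the point is only that cleanly structured "Surname FirstName Patronymic" records let you accumulate each dictionary by token position, with the patronymic suffix as a sanity check.

```python
from collections import Counter

# Hypothetical contact-person records from procurement notices,
# assumed to be in "Surname FirstName Patronymic" order.
records = [
    "Петров Иван Сергеевич",
    "Иванова Мария Петровна",
    "Петров Олег Иванович",
]

surnames, first_names, patronymics = Counter(), Counter(), Counter()
for rec in records:
    parts = rec.split()
    if len(parts) != 3:
        continue  # keep only cleanly structured records
    s, f, p = (w.lower() for w in parts)
    # A real patronymic carries a characteristic suffix; use that to
    # reject records whose field order is actually something else.
    if p.endswith(("вич", "вна", "ична")):
        surnames[s] += 1
        first_names[f] += 1
        patronymics[p] += 1

# Keep tokens seen at least twice to filter out one-off typos
# (the threshold of 2 is purely illustrative).
reference_surnames = {w for w, n in surnames.items() if n >= 2}
print(reference_surnames)  # {'петров'}
```

Run over hundreds of thousands of records instead of three, the same loop yields dictionaries of the scale mentioned below (tens of thousands of names and patronymics, hundreds of thousands of surnames).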

The algorithm now uses a reference book of 26 thousand first names, 40 thousand patronymics, and about 300 thousand surnames. With its help we, for example, enriched the database of district police officers: it now carries gender marks (where gender could be determined), along with an analysis of the gender structure of the force. Here it is: http://data.openpolice.ru/dataset/mvd-uchast

I do not mean to say, of course, that there are no other sources and databases of full names, but few of them are practically ready for quick use.

2. Reconstruction of reference books


Many open (and not-so-open) public datasets share a peculiarity: they are duly published, but their description is hard to find, and the reference books used inside them are even harder to find. Most often this comes not from malice (out of malice they simply try not to publish the data at all) but from a lack of understanding of what potential data users need.

I will give a few examples.
Budget reference books

The Ministry of Finance of Russia regularly publishes data on the state budget and its execution: large Excel spreadsheets on their website, in the "Budget List" section.

The files contain a great many rows, and their peculiarity is that each row references several reference books. Some top-level rows identify the main managers of budget funds (GRBS); others carry the functional classification of expenditures (FKR), target expense items (CSR), expenditure type codes (CWR), and many more.

How do you obtain these reference books? Some are available as open data from various state systems, but up-to-date versions cannot always be found. The most effective way, therefore, is to reconstruct the reference books from the data array itself. Since the budget description is arranged so that its rows are essentially the names of reference book entries, distinguished by their details, these reference books can be restored fairly quickly.
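A sketch of that reconstruction, under invented assumptions: real budget files signal a row's level through which classification columns are filled, but here a simple code-length rule stands in for that, and the codes and names are made up for illustration.

```python
# Hypothetical flattened budget rows: (code, name) pairs as they might
# come out of the Excel files after trivial parsing.
rows = [
    ("054", "Министерство культуры"),   # GRBS-style 3-digit chapter code
    ("0801", "Культура"),               # FKR-style 4-digit functional code
    ("0702", "Общее образование"),
    ("054", "Министерство культуры"),   # duplicates are common across sheets
]

grbs, fkr = {}, {}
for code, name in rows:
    if len(code) == 3:
        grbs[code] = name   # last occurrence wins; names rarely conflict
    elif len(code) == 4:
        fkr[code] = name

print(grbs)  # {'054': 'Министерство культуры'}
print(fkr)   # {'0801': 'Культура', '0702': 'Общее образование'}
```

One pass over the full dataset is enough to rebuild each directory, precisely because the rows themselves carry both the code and its official name.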

Why is this needed at all? First, these reference books are needed to visualize the budget itself. Second, many other disclosure systems refer to them without decoding, for example the old contract registry data. That data is hard to analyze when you do not know which reference books its registry entries point to.

3. Geolocation


Suppose we have a list of organizations with phone numbers and want to understand which cities and regions they belong to. This task comes up constantly, for all sorts of purposes. How to do it? The most effective way is to have a directory of city dialing codes and determine the city from the number's prefix. Such directories exist on several sites, for example on the Rostelecom site or on the Rossvyaz website in the ABC numbering section.

There is only one problem: those directories list cities and regions but nothing more detailed, and without any classification codes such as OKATO or KLADR. To achieve accuracy, the reference books must be mapped to OKATO. But there is another way. Among the data of the public procurement site already mentioned, and in the data of the state institutions website (bus.gov.ru), there are many organization cards. They contain both geolocation codes (KLADR and OKATO) and phone numbers. Hence the solution: first build a directory from these sources that matches phone number prefixes to geo-references; after that, an organization's phone number alone is enough to determine its likely location.
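The lookup itself is a longest-prefix match, the same idea as in routing tables. The prefix directory below is a hypothetical stand-in for one built from organization cards that carry both a phone number and an OKATO code; the prefixes and codes are illustrative.

```python
from typing import Optional

# Hypothetical directory: dialing-code prefix -> OKATO code, as it might
# be derived from organization cards on zakupki.gov.ru / bus.gov.ru.
PREFIX_TO_OKATO = {
    "495": "45000000000",   # Moscow
    "812": "40000000000",   # Saint Petersburg
    "8142": "86401000000",  # a longer prefix wins over its shorter prefixes
}

def locate(phone: str) -> Optional[str]:
    """Return the likely OKATO code for a free-form phone number."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    # Drop the country/trunk prefix ("7" or "8").
    if digits.startswith(("7", "8")):
        digits = digits[1:]
    # Longest-prefix match: try the most specific prefix first.
    for length in range(len(digits), 0, -1):
        okato = PREFIX_TO_OKATO.get(digits[:length])
        if okato:
            return okato
    return None

print(locate("+7 (495) 123-45-67"))  # 45000000000
```

Longest-match ordering matters: without it, a number starting with 8142 would be swallowed by a shorter 3-digit prefix and mapped to the wrong region.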

4. Empty data


When the World Bank held the Apps4Development contest in 2011, one of the submitted projects was Blind Data; its essence was to find holes, voids, and omissions in what the World Bank published. The project is no longer available anywhere except the contest site, but while it was up, the lack of data on many key indicators from a large number of countries was plain to see.

Another example is the ClearSpending project created by the Sunlight Foundation. Its specialists analyzed budget lines against government contract spending data and revealed "empty spaces": a lack of reporting on huge sums. Not even cases of corruption, but cases where there is simply no public information about what was purchased and how.
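The "empty spaces" idea reduces to comparing two datasets and keeping the rows one of them lacks. The figures and codes below are invented for illustration and are not from either project.

```python
# Invented figures: what a budget line allocates vs. what contract
# reporting actually shows spent against it.
budget = {"0801": 1_000_000, "0702": 500_000}
reported_contracts = {"0801": 980_000}  # nothing reported under 0702

# The "empty spaces": allocations with no reported spending at all.
unreported = [
    (code, allocated)
    for code, allocated in budget.items()
    if reported_contracts.get(code, 0) == 0
]
print(unreported)  # [('0702', 500000)]
```

The same comparison generalizes: any two datasets that should cover the same entities can be cross-checked this way, and the interesting findings are precisely the gaps.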

These are only two examples of many; there are plenty of others used for civic oversight, where data publishing is used to find things about which no data is published and never has been. Datasets can be compared and cross-checked to reveal glaring anomalies. You just need to switch from studying what is there to searching for what is not.

5. And much more


The list above is far from exhaustive. Open data, as the most accessible kind of data, is useful both for developing algorithms and for other tasks, and such uses should not be discounted, especially as other interesting datasets become available.
For example:

and much more.

All I wanted to say with this post is that the result of using open data need not be only websites and mobile applications. The result can also be algorithms and their improvement, and the use of data for far-from-obvious tasks.

Source: https://habr.com/ru/post/177129/

