While working on a mobile application that uses open data, I had to get closely acquainted with the contents of a number of portals. As a result, I came up with suggestions on how to improve the “inner world of open data portals” in the interests of the developer.
If you are interested and already have experience in this field, you can compare your own findings with those written below.
At the heart of working with any portal is the data set passport. If you want to access a data set, you find its passport and extract from it the name of the set, its number, the link to the set and the description of its fields.
Everything seems logical from the point of view of a manual card index kept by a person. From the point of view of an application, though, this is not enough, since there is no way to obtain any information about the contents of the data set programmatically.
The developer must first work out for himself what the set is, where it is located and what format its data is in.
The set passport should tell the application about the contents of the set.
This is possible. It is only necessary to create an online register of all open data portals of Russia, adopt a unified structure for data set numbers (or IDs), and standardize the naming and maintenance of data set fields.
1. Register of open data portals of Russia.
A place on the network where all open data portals are listed (a registry).
Today we look for links to portals with a search engine, and only after finding a portal do we get acquainted with its contents. There is no website or page with a list of all open data portals of Russia and links to them (or I am not aware of one).
2. Uniform structure of the number (or ID) of the data set.
Having arrived at a portal (using the registry from item 1), I would like to understand what is posted on it. Each set is identified by its name and its number / id.
Today everyone assigns numbers however they like. On one portal they are numbers, on another words, on a third whole sentences. On some portals the TIN (taxpayer identification number) is included in the number / name of the data set. That is already good: you can pull out the region (if you realize that the TIN is there), but it is not enough.
Sets are grouped into categories (not yet everywhere), which is very logical. But every portal implements its own category directory. The portals also differ in the level of coverage: there are federal, regional, city and township portals.
As a result, if you want to find and use information from different portals in one application, you have to develop your own numbering of portals, sets and categories.
Why not fill the number / id of the data set with meaning, so that software can understand the contents of the data set more easily?
To do this, it is enough to include the following in the number of the set (its id):
ID1 - the unique portal number (tied to the federal / region code / city / code ....), that is, the portal id in a unified Russian classification,
ID2 - a single category code for the information in the data set (that is, a single category directory should be developed and approved),
ID3 - the set number on the portal.
Any department that wants to organize an open data portal sends an application to the state registration agency and receives its ID1 (the portal id) and the single directory of categories. As a result, a data set on the portal receives a number in the form
ID1-ID2-ID3, and the developer receives a ready-made mechanism for quickly finding the necessary data sets on any portal.
There will be no need to create a separate wonderful application implementing a unique service for each city. Switch the portal link depending on the user's geolocation, and the application can be used in any region. It will easily find the right portal and the desired category. And if necessary, it will be easy to “pull out” all available portals in the selected category for any city, region or country.
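To illustrate how an application could use such a number, here is a minimal sketch. It assumes the three parts are joined with hyphens and contain no hyphens themselves. The sample values such as "77-012-0456" and the names DatasetId, parse_dataset_id and same_category are made up for the example and do not come from any existing portal.

```python
from typing import NamedTuple


class DatasetId(NamedTuple):
    portal: str    # ID1: portal number from the unified registry (hypothetical)
    category: str  # ID2: code from the single category directory (hypothetical)
    number: str    # ID3: set number on that portal


def parse_dataset_id(raw: str) -> DatasetId:
    """Split an 'ID1-ID2-ID3' string into its three parts."""
    parts = raw.strip().split("-")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected ID1-ID2-ID3, got {raw!r}")
    return DatasetId(*parts)


def same_category(ids, category):
    """Keep only the ids whose ID2 part equals the given category code."""
    return [i for i in ids if parse_dataset_id(i).category == category]
```

With ids like these, picking the same category on the portal of another region comes down to changing ID1, without studying the new portal by hand.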
3. A standardized approach to the name and content of the data set fields.
The desired set is found. Now you need to understand its internal structure.
And here, today, everyone does things in their own way.
- Fields with geographic coordinates. Some call them latitude and longitude (two separate fields), some call the field geo-coordinates and put both values in one field (x, y), some store them as an array, and some as a dictionary. Several variants can easily coexist on one portal (RosTurism, Moscow Government). The developer would be pleased to see a single way of storing geo-coordinates on all portals. Which one? Any, as long as it suits everyone. And the set passport must state not only that coordinates are present, but also their type: point, line or area.
- Content fields. They can contain anything except embedded html pages, which is what the portal of the Ministry of Culture fills them with today. The developer needs information in machine-readable form, as the defining documents themselves put it, and not html pages that then have to be parsed and searched for information. The developer is not going to replicate someone's site, he is making his own application. And do not forget about traffic, which grows dramatically when information arrives as html that drags markup and fonts along with it. A mobile application is a way of getting away from HTML and obtaining “bare” information, not one more way of displaying web pages.
- Links to pictures. Today this is another playground for the imagination of data portal owners. It goes as far as storing links to pdf files with pictures (Ministry of Culture), and without specifying the actual file extension. A developer who sees a link to an image expects to find anything there but a pdf file.
- Links to documents. Standardization is needed here as well.
- Size of photos. Unfortunately, they are almost always published as shot, with no thought given to the fact that web and mobile applications do not need high resolution. Whatever is available gets uploaded. A very illustrative example is the directory of employees on the Moscow Government portal. Take a look: from small scanned document photos to huge photos in office interiors. It would be nice to have some standard here.
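Until such a standard exists, the application has to bring the coordinate variants listed above to one form itself. Below is a rough sketch of that normalization. The field names "latitude", "longitude", "geo", "lat" and "lon" are assumptions chosen for the example, and so is the (latitude, longitude) order inside strings and arrays. Real portals use their own names and sometimes the opposite order.

```python
def extract_point(record):
    """Return (latitude, longitude) as floats, or None if no known layout matches."""
    # Layout 1: two separate fields.
    if "latitude" in record and "longitude" in record:
        return float(record["latitude"]), float(record["longitude"])

    geo = record.get("geo")
    # Layout 2: a dictionary with named values.
    if isinstance(geo, dict) and "lat" in geo and "lon" in geo:
        return float(geo["lat"]), float(geo["lon"])
    # Layout 3: an array of two values (order assumed: latitude first).
    if isinstance(geo, (list, tuple)) and len(geo) == 2:
        return float(geo[0]), float(geo[1])
    # Layout 4: one text field such as "(55.75, 37.62)".
    if isinstance(geo, str):
        parts = geo.strip("() ").split(",")
        if len(parts) == 2:
            return float(parts[0]), float(parts[1])
    return None
```

Every new portal adds another branch to a function like this, which is exactly the growth a single storage format would remove.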
3. Availability of data.
Portals sometimes undergo maintenance, and nobody is notified about it. Clearly, maintenance cannot be avoided. But during that time an application that uses open data stops working and takes on all the wrath of its user, who does not understand that the application is not at fault. For him, it is "buggy". Because of this, users abandon the application or write bad reviews about it. On the portal of the Government of Moscow, for example, this happens periodically on Saturdays.
Portals must be required to add an Open / Closed status request to their APIs. And if the portal also returns the estimated time of its reopening ... then we will live.
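No portal offers such a request today, so the following is only a sketch of how an application might use one. The JSON layout with the "status" and "reopens_at" fields is purely hypothetical, as is the date in the usage example. The point is that the application could show the user an honest message instead of looking broken.

```python
import json
from datetime import datetime


def describe_portal_status(body: str) -> str:
    """Turn a (hypothetical) status response into a message for the user."""
    info = json.loads(body)
    if info.get("status") == "open":
        return "Portal is available."
    reopens_at = info.get("reopens_at")  # assumed ISO 8601 timestamp
    if reopens_at:
        when = datetime.fromisoformat(reopens_at)
        return f"Portal is under maintenance until {when:%Y-%m-%d %H:%M}."
    return "Portal is under maintenance; reopening time unknown."
```

For a response such as {"status": "closed", "reopens_at": "2016-05-14T12:00:00"} the user would see the expected reopening time, and the anger would at least be addressed to the right party.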
4. Data quality.
The quality of data on portals suffers from two things: errors in the data itself and errors in the structure of the exported data.
Grammatical errors in field names and content, such as a misspelled "photograph", we will survive somehow.
Errors in the data.
It is bad when, for more than a year, the subway entrances / exits data set on the Moscow Government portal contains not separate entrances but entire metro stations. And from the “public transport stop” data set it turns out that the bus on route XX stops at only one stop. Where else its route goes is unknown. Or the last name is written in the patronymic field, and the patronymic in the name field (portal of the Ministry of Culture). Then there is redundancy: why have separate fields for first name, last name and middle name if the same set has a full name field? Little things, such as swapped latitude and longitude, we are not even considering.
Errors in the data structure.
This group of errors appears when data is exported to csv format and is related to the separators used in this format. It is very easy to find data with the separator "," where the same comma also occurs, unquoted, inside the fields themselves. As you understand, in that case the line cannot be split into fields correctly. The same goes for newline characters inside a field. A similar situation arises when a line simply does not contain all of its delimiters: there are too few of them. Apparently, when a data set is uploaded to the portal, a beautiful XLS file is taken and dumped as is, with all its headers. You cannot do it this way.
Because of these errors, a CSV file is a silent horror for the developer, a nightmare. Try explaining to the user that the received data cannot be parsed because its structure is broken. Yet this format is still the leading one.
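A check like the one below at least shows which lines are broken before the data reaches the user. It is a small sketch built on Python's standard csv module, which handles commas and newlines inside properly quoted fields. Rows with too few or too many fields are reported by comparing them with the header. The function name find_broken_rows is made up for the example, and the sketch assumes the first row is the header.

```python
import csv
import io


def find_broken_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows that do not match the header."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty file, nothing to check
    expected = len(header)
    broken = []
    for row in reader:
        # Skip completely empty lines, report everything else that differs.
        if row and len(row) != expected:
            # line_num is the physical line on which the row ended
            broken.append((reader.line_num, len(row)))
    return broken
```

It cannot repair a file with unquoted commas inside fields, because the information about where a field ends is already lost. It only tells you which rows not to trust.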
And the most unpleasant mistakes, which I do not even know how to categorize, are when the request description for developers on the portal of the Government of the Moscow Region says you should receive json, but the answer arrives as csv. You have to put a check straight into the code for what actually came, and process the received data depending on what was given (csv or json), not on what the API description promises. Being determines the operation of the application.
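Such a check can be quite short. The sketch below looks at the first meaningful character of the response and falls back to csv when the text is not valid json. This is a heuristic under the assumption that only these two formats can arrive, and the name parse_payload is invented for the example.

```python
import csv
import io
import json


def parse_payload(text):
    """Return ("json", data) or ("csv", rows) depending on what actually arrived."""
    # Drop a possible byte order mark and leading whitespace.
    stripped = text.lstrip("\ufeff \t\r\n")
    if stripped[:1] in ("{", "["):
        try:
            return "json", json.loads(stripped)
        except json.JSONDecodeError:
            pass  # looked like json but is not, so treat it as csv
    rows = list(csv.reader(io.StringIO(stripped, newline="")))
    return "csv", rows
```

The caller then branches on the first element of the result rather than on the format named in the API description.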
5. Relevance of data. Data for 2014 - 2015 is present everywhere, but 2016 and beyond ... This is difficult.
6. Storage and access to data.
Some use OpenData to query and update data, others MongoDB, and so on. For the developer, each new portal means a new parser, and the application keeps growing. Instead of working on new functionality, you have to debug yet another way of receiving data (request - response), even though the same scheme (Passport - Set) is present everywhere. You cannot do it this way.
It is necessary to agree on and use a single solution with a single API.
From the developer’s side, the best option is a cloud portal with one provider, based on one DBMS, with a single API, where any department can get space for its open data portal and a universal tool for working with it. I hope this would not be difficult to arrange for the regulatory organization responsible for the open data program in the country. In addition, it would gain real control over how the funds allocated to this program are spent.
This is easy to implement, for example, on Windows Azure. I am sure that the advantages of this option in terms of costs, speed of commissioning, reliability and cost of ownership are clear not only to developers.
And a little about where this experience came from, that is, about the mobile application whose development led to what is written above.
The application is written for iOS and works with five open data portals: 1147 sets plus information from the Central Bank of Russia.
- Moscow Government portal - 697 sets,
- Moscow Region Government - 266 sets,
- Ministry of Culture - 49 sets,
- Russian Federal Agency for Tourism - 135 sets,
- The portal of the Central Bank, while not officially called an open data portal, is essentially one, since it provides the rates of the currencies it quotes for any period. This is the information needed to convert the ruble statistics published on open data portals into any currency equivalent.