
Why nobody needs open data

While working on an open data project, I had to study many public data sources, both federal portals such as data.gov.ru and municipal resources such as data.mos.ru.



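All these resources suffer from the same ailments: data inconsistency, fragmentation of data and lack of standards, no single search mechanism, and no adequate data access API.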



That is enough to discourage anyone from using them and the data they host.
Now, more on each item and what to do about it.


Data inconsistency


The document statistics on data.gov.ru show that most of the data is published in CSV format.



And that is a huge problem, because most CSV files are malformed. It is easy to make a mistake in CSV, and if the person producing the file does not know the standard, the probability of an error is close to 100%. The most common mistakes are:


1st place: stray quotes. They are the bane of all CSV data; a single misplaced quote can break the entire document.


Example: the register of licenses for pharmaceutical activity in the Novgorod region, the very first line:


" "," "",... 

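To see how a single stray quote can break a whole file, here is a minimal sketch using Python's standard `csv` module. The sample rows are invented for illustration and are not taken from the register above.

```python
import csv
import io

# Hypothetical sample: the quote before Novgorod is never closed.
BROKEN = 'id,name,region\n1,Pharmacy,"Novgorod\n2,Other,Pskov\n'


def parse(text, strict=False):
    """Parse CSV text into a list of rows."""
    return list(csv.reader(io.StringIO(text, newline=""), strict=strict))


# Lenient mode (the default) does not fail: the unclosed quote silently
# swallows the rest of the file into a single field.
rows = parse(BROKEN)
swallowed = rows[-1][-1]

# Strict mode rejects the malformed input instead of guessing.
try:
    parse(BROKEN, strict=True)
    strict_error = None
except csv.Error as exc:
    strict_error = exc
```

In lenient mode the three-line file comes back as two rows, with the last data line buried inside a field. That is why such errors often go unnoticed until the data is actually used.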
2nd place: data rows whose number of columns differs from the header.


Example: State Drug Registry


 regnumber,regdate,enddate,cancellationdate,nameregcertificate,country,tradename,internationalname,formrelease,stages,barcodes,normativedocumentation,pharmacotherapeuticgroup
 N009886,28.04.2011,,,," """"",,,,~,,"   , .., Nobelweg 6, 3899 BN Zeewolde, the Netherlands,  ",," N009886-280411,2011,; ",

Matching the header against the data gives:


 regnumber = N009886
 regdate = 28.04.2011
 enddate =
 cancellationdate =
 nameregcertificate =
 country = ""
 tradename =
 internationalname =
 ...

80% of CSV files have to be fixed by hand before use. That is tolerable for small, rarely changing data sets, but when a set of a hundred thousand rows is updated every week, it becomes a serious problem.
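For a data set that is refreshed regularly, a check like this can be automated. Below is a minimal sketch that flags rows whose column count differs from the header; the function name and the sample data are hypothetical.

```python
import csv
import io


def find_ragged_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows that do not match the header."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader, None)
    if header is None:  # empty input: nothing to check
        return []
    problems = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num counts the source lines read so far
            problems.append((reader.line_num, len(row)))
    return problems


# Invented sample with a three-column header.
SAMPLE = (
    "regnumber,regdate,country\n"
    "A1,28.04.2011,NL\n"
    "A2,29.04.2011\n"           # one column short
    "A3,30.04.2011,NL,extra\n"  # one column too many
)
problems = find_ragged_rows(SAMPLE)
```

Running it before every import turns a manual editing chore into a report of exactly which lines need attention.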


Hence the question: why use CSV at all?


Fragmentation of data and lack of standards


Every service publishes data however it likes.


For example, these are the column headers from the CSV file of the quarantine zone list:


 "  ", " ", "       ()", "№                (№   )", "       (№   )", "         (№  )", " " 

Geographic coordinates may be given as two columns, as a single comma-separated column, or as GeoJSON.
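A consumer therefore has to normalize all three forms itself. Here is a minimal sketch of such a normalizer. The function name is hypothetical, and the latitude-first order for the column forms is an assumption, since each publisher may choose its own. GeoJSON itself stores coordinates longitude first.

```python
import json
from typing import Any, Tuple


def to_lat_lon(value: Any) -> Tuple[float, float]:
    """Normalize one of three coordinate encodings to (latitude, longitude)."""
    # Two separate columns, passed in as a pair (assumed order: lat, lon).
    if isinstance(value, (tuple, list)) and len(value) == 2:
        lat, lon = value
        return float(lat), float(lon)
    if isinstance(value, str):
        text = value.strip()
        # GeoJSON Point geometry: coordinates are [longitude, latitude].
        if text.startswith("{"):
            geometry = json.loads(text)
            if geometry.get("type") != "Point":
                raise ValueError("only Point geometries are handled here")
            lon, lat = geometry["coordinates"][:2]
            return float(lat), float(lon)
        # Single column, comma-separated (assumed order: lat, lon).
        lat, lon = (part.strip() for part in text.split(","))
        return float(lat), float(lon)
    raise ValueError(f"unsupported coordinate value: {value!r}")
```

For example, `to_lat_lon("55.75, 37.62")` and `to_lat_lon('{"type": "Point", "coordinates": [37.62, 55.75]}')` both yield the same pair.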


And here are a few of the ways lists are represented:


 "№ 223  02.09.2010 № 277  29.09.2011 № 136  14.10.2009 № 556  02.10.2013 № 452  19.10.2012" 

 "4 : 3 , 9 , 2   , 4 , 37  " 

 "OVDPhone": [ { "PhoneOVD": "(495) 601-05-36" }, { "PhoneOVD": "(495) 601-05-37" } ] 

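Each of these list encodings needs its own parser. Here is a minimal sketch for two of them, using the sample values shown above. The wrapping braces around the JSON fragment and the function names are added for illustration.

```python
import json
import re

# "№ <number> ... <dd.mm.yyyy>" pairs packed into a single string cell.
ORDER_RE = re.compile(r"№\s*(\d+)\D*?(\d{2}\.\d{2}\.\d{4})")


def parse_orders(cell):
    """Extract (number, date) pairs from a packed string cell."""
    return ORDER_RE.findall(cell)


def parse_phones(record):
    """Flatten the array-of-objects form into a plain list of strings."""
    return [item["PhoneOVD"] for item in record["OVDPhone"]]


orders = parse_orders("№ 223  02.09.2010 № 277  29.09.2011")
record = json.loads(
    '{"OVDPhone": [{"PhoneOVD": "(495) 601-05-36"}, {"PhoneOVD": "(495) 601-05-37"}]}'
)
phones = parse_phones(record)
```

The same kind of information ends up behind a regular expression in one data set and behind a nested JSON lookup in another, and that cost is paid again for every new source.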
On top of all that, the data is scattered across different resources.



How can one tell that these are official sites? And why not publish the data in one place?


Lack of a single search mechanism


Because the data is fragmented, there is no way to search across all public open data sources at once. Apparently what is missing is a national search engine for open data...


Lack of data access APIs


To use the data in your project, you have to download it, and from then on you must track its changes and update your copy. For large data sets this is a significant burden.
You can avoid these difficulties by not downloading the data at all and working with it through an API instead. For that, the API has to be functional enough to handle any data task.
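Without an API, keeping a local copy current comes down to re-downloading the file and checking whether it changed. Here is a minimal sketch of that check, assuming the file has already been fetched as bytes. The function names and sample payloads are hypothetical.

```python
import hashlib
from typing import Optional


def digest(payload: bytes) -> str:
    """Fingerprint a downloaded file by its SHA-256 hash."""
    return hashlib.sha256(payload).hexdigest()


def needs_update(new_payload: bytes, stored_digest: Optional[str]) -> bool:
    """True if there is no stored copy yet or the content has changed."""
    return stored_digest is None or digest(new_payload) != stored_digest


# Invented payloads standing in for two successive downloads.
first = b"id,name\n1,Alpha\n"
second = b"id,name\n1,Alpha\n2,Beta\n"
stored = digest(first)
```

Note that this still requires downloading the whole file every time just to learn whether anything changed, which is exactly the overhead a proper API would remove.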


The APIs that some resources do offer (data.mos.ru, for example) are not sufficient for fully working with the data, and they are not reliable enough for use in real projects.




As a result, open data exists, but judging by the download counts on data.gov.ru, only a handful of people use it.



To unlock the full potential of open data, it must be available in the most convenient form possible, so that people can start using it immediately instead of wasting time cleaning it up.


How to fix the situation


IMHO, a resource similar to GitHub but for data would give a strong impetus to the development of open data.


Yes, there is data.world, for example, but it does not yet have all the functionality that would make it a GitHub for data. Such a resource should have the following characteristics:



I am sure such a resource will appear soon, and open data will take a significant place in everyone's life.



Source: https://habr.com/ru/post/331036/

