On September 14-15, the first Yandex Hackathon will be held in Moscow. For two days and two nights, its participants will build projects based on open public data using Yandex technologies.
For many years I have been working to grow Russian developers' interest in open data. The Apps4Russia competition, organized by the non-profit partnership "Information Culture", was created for this purpose. This year it has a nomination for those who build applications on open data and Yandex technologies. These events prompted me to give a systematic account here of the history of open data, its sources, examples of its use and many other important things.

This is a graph from Learn eugenyboger. Today it is the norm that we can look up detailed election results for every polling station, yet until quite recently this was not possible even in highly developed countries.
Open Data: Background
There are several definitions of open data. One is given in Wikipedia: I translated it from English myself in order to cite it in a Russian-language article. Another appears on the Government's website, as given in the law. There are a few more, but the essence is this. Open data is information published by the organizations that own it (by the authorities, in the case of open government data), provided freely (that is, free of charge and under non-restrictive licenses) and in a machine-readable form suitable for repeated automatic processing. There are criteria that determine whether data counts as open. A Creative Commons license is almost a prerequisite for open data.
In principle, open data is not a new phenomenon: it has long existed in different forms, as has the ideology of openness. Open source and free licenses appeared not five or ten years ago, but much earlier. This is especially true of the scientific community, where it is important to be able to verify research results, publish them and work with them in every way. Research data usually comes in special formats that are exactly what we now call machine-readable.
Nowadays, developed countries around the world seek openness in various forms. In June of this year, at the G8 meeting in the UK, the host country proposed signing the Open Data Charter. Russia signed it too. The main principles set out in the Charter are openness of data by default, timely publication in machine-readable form, transparency, and an obligation to create conditions in which developers can build applications on open data.
“While accumulating huge amounts of data, governments and businesses do not always share it in a way that makes it easy to find, use and understand. These are missed opportunities. We are at a turning point that heralds a new era. People will be able to use open data to generate ideas and create services that will make our world a better place,” the Charter says.
Now all G8 countries must declare their willingness to disclose information on crime levels, company registers and land transactions. The leaders here are, of course, Britain and the USA, which have been doing this for many years. But many other countries, including Russia, have now begun to publish open data.
This was also driven by the growth of technology companies and, with it, the rising value of data, the growth of the knowledge economy, and the emergence of large companies such as Yandex, whose entire business is built on freedom of information. If every site on the Internet were paid and data were impossible to aggregate, the question of open data could never have arisen. Working with publicly available information has had a strong influence here.
As a result, several trends came together and this phenomenon appeared: freedom of access to information and open data. The idea is that information created primarily by the state, but in general by anyone, should in principle be available, and available in a form that allows reuse. If someone has conducted research and presented the results as a table, we should receive not a picture but the table itself, so that we can check it, use it and even build a business on those results. If the state discloses information about its activities, it is useful for citizens not only to know about it but also to do something with it. Maybe it will have a social effect, maybe an economic one, maybe the effect of “civil control”, “civil anti-corruption” and so on. But that is still an economic effect, albeit in a slightly different form. Open data portals, where huge amounts of state-generated information are published, are mostly created by governments. You can make something interesting and useful out of that information, and this is how the ideology of openness turns into concrete products.
But it all started not with officials and the state, but with people who began doing this much earlier. In Britain, before the open data portals appeared, there were many small groups of developers running projects in the spirit of “let's rewire the state”, such as Rewired State. Or, for example, ScraperWiki, which has existed for a long time: a special engine with which anyone who knows a little Python can write programs and scripts that retrieve data from websites.
Gradually this became so widespread that it no longer mattered whether states opened their data or not: people learned to extract it one way or another. In the US, before data.gov appeared, there were Sunlight Labs and the Knight Foundation, which extracted data from Congress reports, converted PDF files into Excel files, loaded the Excel files into a database and converted them into CSV. Strong public pressure brought officials and authorities in the Anglo-Saxon countries to the point where either they would do this themselves or it would be done for them. And if David Cameron had not seized on the topic of open data, included it in his party's program and come to power with it, then the Greens, whose program now includes data openness, would have come instead. And this is openness of data, not just of information.
Infographics: The Guardian Datablog
The right step for the state in such a situation is to try to lead the trend rather than resist it. And that is exactly what it does, trying to steer it in the directions it considers priorities. This is not so bad, but it has its own specifics.
In Russia, the situation is about the same. I have been working with open data since 2009; before that, our state took no action in this area. For two years we actively pushed the topic, and when it finally became clear that we had advanced so far that we no longer needed the state, its representatives suddenly realized that it was better to lead this trend.
Moscow has a certain claim to leadership here: for example, its budget portal was launched before the federal one. In my opinion, the data there is not published very conveniently, but it can be worked with.
Usually, the first to use open data are civic activists. For example, in the States they compare congressmen with one another and compile various ratings. Using transcripts of speeches, they find out how many words a congressman has spoken over a quarter.
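A count like that takes only a few lines once the transcript is machine-readable. The sketch below assumes a hypothetical format in which every line reads `Speaker: text`; real congressional transcripts are structured differently and would need their own parser, and the sample text is invented.

```python
from collections import Counter


def words_per_speaker(transcript: str) -> Counter:
    """Count words spoken by each speaker in a 'Name: text' transcript."""
    totals = Counter()
    for line in transcript.splitlines():
        # Skip lines that do not follow the assumed 'Name: text' layout.
        if ":" not in line:
            continue
        speaker, text = line.split(":", 1)
        totals[speaker.strip()] += len(text.split())
    return totals


# Hypothetical sample, not real transcript data.
sample = """Smith: I rise to support this bill.
Jones: I object.
Smith: Thank you."""

for speaker, count in words_per_speaker(sample).most_common():
    print(speaker, count)
```

The same totals can then be turned into a rating simply by sorting, which is what `most_common()` does here.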
Open Data Status
Data usually exists in three broad states.
The first. It is available and ready to work with. That is, the state or the data owner ensures that it is machine-readable. Here the entry threshold is minimal: we can take the data and put it on a map or use it in a mobile application. Everything is ready at once.
The second. The situation is worse: the information exists in principle, but it has to be pulled from various sites. For example, information on State Duma deputies is on the State Duma's website, but in the form of web pages, so it has to be extracted.
Information on water quality in Moscow by district is on the Mosvodokanal website, but only through a special service where you must first enter the street, then the house number, and only then are you given the district and the pollution levels for different indicators.
To collect all this information, activists write scrapers: programs that pull information from websites and turn it into databases.
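Such a program can be very small. Below is a minimal sketch that uses only the Python standard library to turn an HTML table into CSV. The page fragment, the names in it and the `TableScraper` class are made up for illustration; a real scraper would first download the page and would have to cope with that site's actual markup.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html: str) -> str:
    """Return the table rows found in the HTML as CSV text."""
    scraper = TableScraper()
    scraper.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(scraper.rows)
    return out.getvalue()


# Hypothetical page fragment with invented names.
page = """<table>
<tr><th>Deputy</th><th>Faction</th></tr>
<tr><td>Ivanov</td><td>A</td></tr>
<tr><td>Petrov</td><td>B</td></tr>
</table>"""

print(html_table_to_csv(page))
```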
The third. The information exists in some form, but is not publicly available. In general, everything we do is an attempt to achieve openness of information. I am speaking not only about myself, but also about the many other activists in Russia (including commercial companies) who are working toward this, that is, toward the following:
- So that data already published in machine-readable form is suitable and convenient to work with and contains as few errors as possible.
- So that data that is not machine-readable now gets converted. If it is published at all, let it be published in a useful way; this is the most important thing.
- So that what is not published now appears in the public space.
For this, we have the so-called Open Data Council under the Open Government. The state has declared that it is ready to take part, and some changes to laws and regulations are being adopted. In principle, nothing now prevents us from starting to work on opening data and on using it.
Open Data Sources
Open data is not only government data. To a large extent it is the data of huge crowdsourced Internet projects. Not everyone knows that, for example, the whole of Wikipedia is available as dumps. Or take Wikidata, a simply stunning project in terms of its ideology. DBpedia approaches the same goal from the other side. Wikidata is for people to gradually convert information into data themselves, whereas DBpedia tunes algorithms so that infoboxes that have already been filled in are turned into linked data. The Freebase project, which has since been bought by Google, was built entirely on DBpedia and Wikipedia. The guys simply downloaded the data, made an interface that lets you add more, and on that basis built a rather expensive product.
The OpenStreetMap project is similar: huge data dumps are publicly available and can be used. There are a few dozen more open crowdsourced projects from which data can be collected. These are mainly various encyclopedias, reference books and user-maintained databases.
For example, in France there are activists who track food products, enter their ingredients and their EAN and EPC codes into a separate database, and distribute it. The result is a directory where people with dietary restrictions can find out which foods they can eat.
That is, one part of the data is what activists create in various forms, and another is what the state provides. The state is the largest data owner. The third part is data published by commercial and non-profit organizations.
Commercial companies usually publish data for one of two reasons: either under compulsion, or out of social responsibility or some other motivation. Some, for example, attract developers this way. Nike publishes machine-readable information about its factories.
How open data is used in the world
Developers often ask: “What can be done with open data, what are the examples?” I always suggest looking at what others have done. It is enough to look at the sites of the competitions NYC BigApps, Apps4Development, Apps4Berlin, Apps4Finland and Apps4SanFrancisco, although not all of their examples can be transferred to Russia.
The guys who created the “Do not eat here” project did not even use ready-made open data: they parsed it from the website of New York's food inspection. They found where it lists addresses, company names and inspection results, put them on a map and made an application that works on the same principle as Foursquare. Based on the number of violation orders issued and not yet closed, it shows where you should not go. The application was even sold for a small fee, and people installed it.
There are a huge number of applications that are part of the City-Go-Round project. This is a small portal in the USA that aggregates information on transit companies and on applications built on their data: 2,000 companies are collected in a separate list. Of these, 270 regularly provide transit data in a special format, the General Transit Feed Specification (GTFS). Thanks to this, hundreds of applications have been built on this data.
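Part of the reason GTFS spread so widely is that a feed is just a set of CSV text files. As a rough illustration, here is a sketch that reads a feed's `stops.txt` file; the column names `stop_id`, `stop_name`, `stop_lat` and `stop_lon` come from the specification, while the stops themselves and the `load_stops` helper are invented for the example.

```python
import csv
import io

# Hypothetical contents of a GTFS stops.txt file (invented stops).
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,55.7558,37.6173
S2,River Park,55.7300,37.6000
"""


def load_stops(text: str) -> dict:
    """Return {stop_id: (name, latitude, longitude)} from stops.txt content."""
    stops = {}
    for row in csv.DictReader(io.StringIO(text)):
        stops[row["stop_id"]] = (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
    return stops


stops = load_stops(STOPS_TXT)
print(stops["S1"])
```

A real feed contains further files, such as routes and stop times, that are read in the same way.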
There are also, for example, new-media projects like Storify. A huge amount of open data has already been uploaded there that you can use in your own mini-newspaper, building charts or other complex visualizations on it. This lets you enrich your stories. Storify creates an environment in which people themselves come up with ways to use open data. In the same category are the many projects that create infographics online, let you draw charts, upload your own prepared data and work with data that is already open. These are Socrata, Factual, and the same Freebase that Google acquired together with Metaweb.

You cannot always make money on your application, because the data you use is not always enough to create a complete product. But the result can be monetized in other ways.
Data is like an ingredient. If you have no salt, the dish will be bland, but you can still eat it. If you have salt, you can sell the dish for more, or the people you feed will be more satisfied. Sometimes data is the dish itself, and sometimes it is the salt. Either way, data is rarely valuable on its own, and a great many projects that work with open data in fact use it only as a supplement.
For example, real estate services are changing rapidly in the United States and Great Britain. In addition to the usual criteria that everyone has long offered, they have begun to show, for example, the crime situation or weather data for the city where you plan to live. Where does all this information come from? In the United States, weather data has been publicly available for the past twenty years. It is the most monetized open data in the world.
Crime information is disclosed by police departments, and there are already several dozen projects based on it. Information about the environmental situation is also published; it comes either from state monitoring or from commercial sources. That is why I always tell developers to think not only about what they can build on their own, but also about what they can embed the result of their work into and how to earn additional money from it.
One way to apply what you have developed is indirect monetization: selling what you created. For example, the guys who made Chicago Crime, a crime monitoring project for Chicago, sold it to MSN, which made it part of its portal.
And the British are very proud that after data on the success of heart operations in various hospitals was opened, the number of deaths fell: people began choosing hospitals based on this information.
A huge number of the open data startups that appear in the US are created to enrich various existing ideas with open data.
Open data in Russia
One of the most important things in working with open data is a convenient format. In Russia, this is often neglected. Moreover, even though we have a law on open data, many government agencies pay little attention to the information on their sites. For example, they often forget to update it.
Some open data in our country is also published by commercial organizations. One example is the Russian National Corpus, which is supported by Yandex. RZD publishes all the information on the benefits that it provides. We can find out who received how many benefits, as well as information on tariffs and financial statements. You just have to go through the sites of corporations and see what is published there.
Graph based on turnout data for the Moscow mayoral election
The EGE, for all its shortcomings, has one important advantage: the quality of education in schools can be measured. But the data is scattered, so there are no decent projects built on it. Yet one could make a “Pick a school” application or add this information to real estate services.
Another area is housing and communal services. The Moscow authorities have begun to disclose a great deal of information about the housing and utilities sector. The gorod.mos.ru portal has information on every building. If you parse the data for all the buildings, you can find out how many people complain, how quickly their complaints are answered, and so on. You just need to build a database. And even though the developers of the portal have not yet set themselves such a goal, nothing prevents us from doing it ourselves.
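Once the records are collected, the database itself can be tiny. Here is a minimal sketch using SQLite from the Python standard library; the table layout, the addresses and the numbers are invented for illustration and do not reflect the portal's real data model.

```python
import sqlite3

# Hypothetical records: (building address, days until the complaint got a reply).
complaints = [
    ("Lenina 1", 3),
    ("Lenina 1", 7),
    ("Mira 5", 2),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE complaints (house TEXT, reply_days INTEGER)")
conn.executemany("INSERT INTO complaints VALUES (?, ?)", complaints)


def house_stats(connection):
    """Return (house, number of complaints, average reply time) per house."""
    return connection.execute(
        "SELECT house, COUNT(*), AVG(reply_days) "
        "FROM complaints GROUP BY house ORDER BY house"
    ).fetchall()


for house, count, avg_days in house_stats(conn):
    print(house, count, avg_days)
```

With the data in this shape, questions such as "which buildings wait longest for an answer" become a single query.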
Our country is one of the few where government procurement data is fully disclosed. Processing it is not a simple task, because it is big data. But it can be turned into convenient services, for example for suppliers.
Government data in Russia is currently scattered across a multitude of portals. Each ministry and each federal department has its own special section. We have several open data portals: the Moscow portal, the Ulyanovsk region portal, and soon there will be portals for the Tula region, Perm Territory and Perm. “Information Culture” has the portal hubofdata.ru, where we use bulk scripts to load dozens of gigabytes of data, some of it useful and some not so much. We have 3,000 datasets there for statistics alone, plus data on the votes of State Duma deputies, economic registers, all the data from Moscow and all the data from the Ulyanovsk region.

There is a similar portal, ar.gov.ru, which is run by the Ministry of Economic Development. For now they are simply cataloging and maintaining a catalog of everything that exists. Data on the budget of the city of Moscow is openly available on a special portal, budget.mos.ru, which even has a section for developers.
So far, publishing open data is mandatory only for federal agencies. The process is moving forward gradually. We have many laws that are not enforced. For example, federal law N 8- on openness of information. At best, 10% of government agencies comply with it fully. The rest violate it in some small ways, and not always deliberately, but rather through the negligence of the people who maintain official websites. Still, the signed Charter and the adopted law on open data show that working with open data has already become part of state policy. Our peculiarity is that we do not know what information exists in principle. For example, there are transcripts of deputies' speeches. The official version is not machine-readable now, but we have a machine-readable one.
If you have ideas and need help, write to me: I will always tell you what data you can use for your purposes and where to get it.
What is Apps4Russia
One of the important tasks is to spark interest in open data. This is why Apps4Russia was created: a long-running contest for developers, which we started before the state became interested in the topic. In 2011, seven people put up their own money to create a prize fund and held the first competition, which drew about fifteen meaningful entries. After it, we created the non-commercial partnership "Information Culture", and we are now holding the competition for the third time. Its main task is to motivate developers to turn to open data and to show them that it can and should be used in their projects.
One great project that took part in Apps4Russia is the Social Map. This application used the coordinates of a mobile phone to determine which government agencies were nearby and immediately showed their phone numbers: ECD, the local administration, the police station and so on. This is open data that was collected from different sites and systematized. We recently held a small competition based on police data, which produced several applications that help you find out who your district police officer is.
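The core of an application like this is a distance filter over a list of addresses with coordinates. The sketch below is not how that project was implemented; it is one simple way to do it with the haversine great-circle formula, and the agencies, phone numbers and coordinates are all made up.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (
        sin(dlat / 2) ** 2
        + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby(agencies, lat, lon, radius_km):
    """Return (name, phone) of agencies within radius_km, nearest first."""
    found = []
    for name, phone, a_lat, a_lon in agencies:
        distance = haversine_km(lat, lon, a_lat, a_lon)
        if distance <= radius_km:
            found.append((distance, name, phone))
    return [(name, phone) for _, name, phone in sorted(found)]


# Hypothetical agencies with made-up phone numbers and coordinates.
AGENCIES = [
    ("Police station", "000-00-01", 55.7600, 37.6200),
    ("District council", "000-00-02", 55.7560, 37.6180),
    ("Remote office", "000-00-03", 59.9300, 30.3300),
]

print(nearby(AGENCIES, 55.7558, 37.6173, radius_km=2))
```

For a city-sized dataset a plain loop like this is enough; a spatial index only becomes worthwhile with far more points.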
This year, Apps4Russia has a Yandex nomination, in which applications built on Yandex technologies will compete. The idea behind it is very specific: Yandex is a service company that also works with open data and gives developers many opportunities to improve the quality of their products. It is hard to measure how much projects have earned thanks to Yandex.Maps, but the product quality of many of them has certainly improved. You can use not only Yandex.Maps, but also the Yandex.Search API and the APIs of other services.

In addition to conventional APIs, Yandex also has technologies designed specifically for processing free-form language. For example, some time ago the Tomita parser, which was built for exactly this, became open. It helps to understand the meaning of a text, for example for Yandex.News.
And with the help of search and a registry of hospitals, you can build a hospital search engine. Or you can create a mobile application for prosecutors, or for people interested in prosecutors' offices, by collecting data from all the prosecutors' sites and adding news from RSS. And then sell it to the prosecutors themselves.
You can take a small piece from each data set and put it to use. If the registry of organizations includes their web addresses, you can launch a robot, collect the RSS feeds and make a mobile application called “Latest Moscow City News”: all Moscow departments have an RSS feed. All this can be done with Yandex technologies; you just need to go to api.yandex.ru. This year, applications for Apps4Russia close on September 16, but we will probably extend the deadline.
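The feed-collecting idea is easy to try. Here is a minimal sketch that merges several RSS 2.0 feeds into one list, newest first, using only the Python standard library. The two feeds are invented strings; a real robot would download each feed from the address found in the registry, and the `latest_news` helper is a name chosen for this example.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feeds keyed by source name (invented content).
FEEDS = {
    "Department A": """<rss version="2.0"><channel>
<item><title>Road repair starts</title>
<pubDate>Mon, 09 Sep 2013 10:00:00 +0000</pubDate></item>
</channel></rss>""",
    "Department B": """<rss version="2.0"><channel>
<item><title>New park opens</title>
<pubDate>Tue, 10 Sep 2013 09:00:00 +0000</pubDate></item>
</channel></rss>""",
}


def latest_news(feeds):
    """Merge items from all feeds into one list of (source, title), newest first."""
    items = []
    for source, xml_text in feeds.items():
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            # RSS dates use the RFC 822 format, which email.utils can parse.
            published = parsedate_to_datetime(item.findtext("pubDate"))
            items.append((published, source, item.findtext("title")))
    items.sort(reverse=True)
    return [(source, title) for _, source, title in items]


for source, title in latest_news(FEEDS):
    print(source, "-", title)
```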
Open data hackathon at Yandex
On September 14-15, the first Yandex Hackathon will be held in Moscow. For two days and two nights, developers will build applications based on open government data and Yandex technologies. You can take part in teams of up to five people. You can come as a ready-made team or form one on the spot.
In a competition you can think for a long time and then build something in two hours, but at a hackathon you have to think fast. As a rule, you need to come prepared. So think in advance about what you will build, work out where you will look for information, and learn the APIs. Of course, you will get help on the spot: there will be consultants on both open data and Yandex technologies.
I want to emphasize once again that you do not have to build a product that you will sell right away. You can make it part of another product. You can sell yourself by implementing a high-quality piece of a project. And not necessarily to an employer: it is simply work on your reputation. At the hackathon you can show that you are able to create cool things from some information and some tools.
The main task of both Apps4Russia and the Yandex Hackathon is to show that there is a lot of information and technology around that you can use to create something useful.