
Use of data sets from Russia's open data portal data.gov.ru

Last time I analyzed the data sets themselves: their distribution by category and file format, how completely the fields in the data set passports are filled in, and so on. Now I will try to understand how often data sets attract interest and how often they are actually used. Which data sets are of interest to portal users?

To make an assessment, I first need to decide on the criteria. The description of each data set includes the number of times it has been viewed. You do not need to be a genius to understand that if someone looked at the information about a data set, they probably did not do so entirely by accident. So the number of views will be the criterion of interest in a data set. And if a data set is not only interesting but may also be useful, it will be downloaded. Thus, the number of downloads will be the criterion of usefulness.

You can also think of the portal as a store. The items in the store are data sets. The price of an item is the effort required to download the data (find where the link is) and use it (for example, view it or use it as a data source for your own purposes). Accordingly, the number of views is the number of potential buyers, and the number of downloads is the number of purchases.

Buyers come into the store, look at the goods, and evaluate them. If a buyer cannot find a product or cannot tell whether it suits them, they will leave. If a product interests the buyer, they may buy it (download it), provided the price (the effort spent on downloading and using it) suits them. For example, a certain data set interests me and I want to download it. But it turns out to be in a format that is difficult for me to use. Meanwhile, another site has the same data in a more convenient form, or newer, or with a better description. As a result, the data set will not be downloaded.
First, the simplest statistical characteristics for the number of views:


The maximum is very large compared with the mean and the median, and the mean differs noticeably from the median. Both clearly hint at an uneven distribution of the number of views with a “long tail”.
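A minimal sketch of how such summary statistics could be computed and compared. The view counts below are invented purely for illustration (they are not the portal's data), and the "mean more than twice the median" rule is an arbitrary threshold I chose to flag a long right tail.

```python
import statistics

def describe(counts):
    """Return basic summary statistics for a list of non-negative counts."""
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
    }

# Hypothetical view counts, invented for illustration only.
views = [3, 5, 8, 12, 20, 40, 75, 150, 900, 12000]
stats = describe(views)

# A mean far above the median is a typical sign of a right-skewed
# ("long tail") distribution. The factor 2 is an arbitrary assumption.
is_right_skewed = stats["mean"] > 2 * stats["median"]
print(stats, is_right_skewed)
```

With a sample like this, a single heavily viewed data set pulls the mean far above the median, which is the pattern described above.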

To see this visually, I split the number of views into 1000 evenly spaced groups, averaging within each group, and get a fairly smooth curve. Then I plot two things against the average number of views in a group: the sum of all views and the number of data sets.

Distribution of views of open data sets from the data.gov.ru portal

What does the chart show?

A large number of data sets have close to zero views, yet the total number of views of these sets is large. Then, from roughly 100 to 1000 views, there is a decline. From 1000 to 5000 the distribution is fairly uniform. Above 5000 there is growth.

The boundaries were chosen by eye. This is how it looks as a diagram.

Distribution of views of open data sets from data.gov.ru portal. Diagram

Two thirds of the data sets were viewed fewer than 100 times.
About a third of the data sets were viewed from 100 to 1000 times.
About one percent were viewed from 1,000 to 5,000 times.
And less than a tenth of a percent of the data sets were viewed more than 5,000 times.
But if you count by the total number of views, the picture is different.
The sets that were viewed fewer than 100 times account for only 16% of all views.

Nearly two thirds, that is, the bulk of the views, fall on data sets that were viewed from 100 to 1000 times.

About 14% fall on data sets that were viewed from 1,000 to 5,000 times.

And almost 7% fall on the sets that were viewed more than 5,000 times (and these make up less than a tenth of a percent of all data sets).
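The two kinds of share above (share of data sets and share of total views per range) can be sketched roughly as follows. The boundaries 100, 1000 and 5000 come from the text; how values exactly on a boundary are assigned, and the function and label names, are my own assumptions. The input is assumed to be a non-empty list with at least one view in total.

```python
from bisect import bisect_right

# Range boundaries from the text; a value equal to a boundary goes
# into the higher range (an assumption, the text does not say).
EDGES = [100, 1000, 5000]
LABELS = ["<100", "100-1000", "1000-5000", ">=5000"]

def bucket_shares(views):
    """Return {label: (percent of data sets, percent of total views)}."""
    n_sets = [0] * len(LABELS)
    n_views = [0] * len(LABELS)
    for v in views:
        i = bisect_right(EDGES, v)  # index of the range that contains v
        n_sets[i] += 1
        n_views[i] += v
    total_sets, total_views = len(views), sum(views)
    return {
        label: (100 * s / total_sets, 100 * w / total_views)
        for label, s, w in zip(LABELS, n_sets, n_views)
    }

# Hypothetical sample: many rarely viewed sets, a few popular ones.
shares = bucket_shares([10, 20, 30, 40, 400, 500])
print(shares)
```

In this invented sample the rarely viewed sets are the majority by count but a small minority by views, the same kind of contrast as in the figures above.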

But this is not quite what is needed to evaluate the use of data sets. The data sets were published at different times, so comparing absolute values, in this case the total number of views, does not make much sense. For a fair comparison I will use a relative value: the number of views per month.
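A minimal sketch of such a normalization, assuming each data set has a publication date and that the counts were collected on a known snapshot date. Both field names are hypothetical, and treating anything younger than one month as exactly one month old is my own choice to avoid inflated rates, not something stated in the text.

```python
from datetime import date

AVG_DAYS_PER_MONTH = 30.44  # 365.25 / 12, an approximation

def views_per_month(views, published, as_of):
    """Average number of views per month between publication and snapshot."""
    days = (as_of - published).days
    # Assumption: data sets younger than one month count as one month old.
    months = max(days / AVG_DAYS_PER_MONTH, 1.0)
    return views / months

# Hypothetical example: 120 views over one year is about 10 per month.
rate = views_per_month(120, date(2016, 1, 1), date(2017, 1, 1))
print(round(rate, 2))
```

The same function works unchanged for downloads per month.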

Statistical characteristics for the number of views of data sets per month:


In fact, the picture for views per month resembles the one for total views: an uneven distribution with a long tail.

Number of views of open data sets from data.gov.ru per month

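I conditionally divide all data sets by their average number of views as follows:
less than once a month;
from once a month to once a week;
from once a week to once a day;
more than once a day;
more than once an hour.
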
Number of views of open data sets from data.gov.ru per month. Diagram

Data sets that are viewed less than once a month are apparently something nobody needs at all. There are about 6% of them, and logically they account for only 0.2% of the total number of views.

A third of the data sets are viewed from once a month to once a week, and they account for about 6% of the total number of views. It seems that someone looks at them now and then.

Slightly more than half of the data sets are viewed from once a week to once a day, and they account for almost half of the total number of views. Not too often, but people do look.

The data sets that are viewed more than once a day make up only 2.5% of the total, yet they account for more than one third of all views. These are the ones that are of real interest.

But the greatest interest is attracted by the data sets that are viewed more often than once an hour. They make up only 0.03% of the total, yet they account for almost 4% of all views.
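The grouping above can be sketched as a simple threshold function on views per month. The conversion of "once a week", "once a day" and "once an hour" into monthly rates uses an average month of 30.44 days, which is my approximation. The sketch also treats "more than once an hour" as a separate group, whereas the text counts those sets inside "more than once a day".

```python
# Thresholds expressed in views per month (approximate conversions).
PER_WEEK = 30.44 / 7    # about 4.35 views per month = once a week
PER_DAY = 30.44         # once a day
PER_HOUR = 30.44 * 24   # about 730 views per month = once an hour

def view_frequency_group(per_month):
    """Map an average number of views per month to a frequency group."""
    if per_month < 1:
        return "less than once a month"
    if per_month < PER_WEEK:
        return "once a month to once a week"
    if per_month < PER_DAY:
        return "once a week to once a day"
    if per_month < PER_HOUR:
        return "once a day to once an hour"
    return "more than once an hour"

for rate in (0.5, 2, 10, 100, 1000):  # hypothetical rates
    print(rate, "->", view_frequency_group(rate))
```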

Thus, only about 3% of all data sets can really be considered interesting. A third are of practically no interest. And a little more than half may interest someone from time to time.

There are a lot of products in the store. But more than a third of them are of little interest to buyers. More than half do not interest buyers much, but the interest in them is stable. And only 3% of the products attract real interest.

But this is only half the battle.

Even if a buyer came into the store and took an interest in a product, will they buy it?

If a data set was downloaded, it means that someone needed it (and perhaps even found it very useful). So, as mentioned above, I will judge the usefulness of a data set by its number of downloads.

First, as usual, some statistics:


What does this mean? Uneven distribution? A long tail?

No. It seems to me that when the median is equal to one, we can expect an interesting result.

Number of downloads of open data sets from data.gov.ru portal

It seems that most of the data sets are not downloaded by anyone at all.

Conventionally, I divided the number of downloads as follows:


Let's look at the diagram.

Number of downloads of open data sets from data.gov.ru portal. Diagram

And what do we see?

Half of the data sets have never been downloaded. Not even to check that the download works. Not even by accident. NEVER!

16% of the data sets were downloaded only once, perhaps by accident or to check that they exist. They account for about 3% of the total number of downloads.

7% of the data sets were downloaded twice, and they account for about 3% of the total number of downloads. Two downloads is also a doubtful result.

Nearly 17% of the data sets were downloaded more than twice but fewer than 10 times, and they account for 17% of the total number of downloads.

If you add this up, does it turn out that 90% of the data sets are of no interest at all, or practically none?

About 9% of the data sets were downloaded from 10 to 100 times, and their share of downloads is about 40%.
0.5% of the data sets were downloaded from 100 to 1000 times, but they account for a quarter of all downloads.

Only 0.02% of the data sets were downloaded more than 1000 times, and they account for about 8% of all downloads.
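A rough sketch of how download counts could be assigned to the groups discussed above. The group labels and the treatment of the exact boundaries (10, 100, 1000 go into the higher group) are my assumptions, and the sample counts are invented.

```python
from collections import Counter

def download_group(n):
    """Map a download count to one of the groups used in the text."""
    if n == 0:
        return "never"
    if n == 1:
        return "once"
    if n == 2:
        return "twice"
    if n < 10:
        return "3-9"
    if n < 100:
        return "10-99"
    if n < 1000:
        return "100-999"
    return "1000+"

def group_counts(downloads):
    """Count how many data sets fall into each download group."""
    return Counter(download_group(n) for n in downloads)

# Hypothetical download counts, for illustration only.
counts = group_counts([0, 0, 0, 1, 2, 5, 50, 500, 5000])
print(counts)
```

Dividing each count by the number of data sets gives the share of sets per group. Summing the downloads per group instead of counting sets gives the share of downloads.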

As a result, half of the data sets were never needed by anyone. About 10% of the data sets are in stable demand. Less than 1% of the data sets provide real value.

Half of the products in the store are never bought at all. A third of the goods are bought very rarely. 10% of the goods are in stable demand. And less than 1% of the goods are really sought after by buyers.

But, as with the number of views, it is more correct to consider not absolute values, but relative ones.

By analogy, instead of the number of downloads I will use the number of downloads per month.

Statistics briefly:


Predictably, it is the same story again.

The number of downloads per month of open data sets from the data.gov.ru portal

It is clear that half of the data sets are never downloaded, so the graph does not look very nice.

The diagram is more informative.

The number of downloads per month of open data sets from the data.gov.ru portal. Diagram

The same half of the sets is never downloaded (the difference in shares is apparently due to rounding error). This fact is already known.

Almost half of the data sets (45%) are downloaded less than once a month, and they account for 42% of the total number of downloads.

About 4% of the data sets are downloaded from once a month to once a week, but they account for almost a quarter of downloads.

About 0.8% of the data sets are downloaded from once a week to once a day, but they account for almost 23% of the total number of downloads.

And, finally, only 0.05% of the data sets are downloaded from once a day to once an hour, but they account for almost 11% of all downloads.

If, as suggested above, we treat the portal as a store, where the number of views is the number of visitors and the number of downloads is the number of purchases, then we can calculate the conversion:

Conversion rate
The conversion rate is the ratio of the number of visitors to a store, site, or marketing event who made a choice or a purchase to the total number of visitors.

Conversion in sales is the ratio of customers (of a store or company) to the total number of visitors (people who made an inquiry).

Conversion in advertising is the ratio of the number of inquiries to the advertiser to the number of ad impressions.

Conversion in Internet marketing is the ratio of site visitors who performed the “necessary” action (clicked on a link, voted, bought) to the total number of site visitors.

Usually the conversion rate is expressed as a percentage. The conversion rate for visitors of online stores (that is, the share of site visitors who made a purchase) is on average 2-5%. For example, suppose the purpose of the site is selling books, and in one day the site had 500 visitors and sold 35 books. Then the conversion is 35 * 100 / 500 = 7%.

The conversion rate shows how well the marketing efforts to attract visitors and customers, as well as the efforts to fill the site with information and the store with goods, perform their main task: ensuring sales.

Successful conversion is defined differently by sellers, advertisers, and content providers. For a seller, a successful conversion means a purchase. For a content provider, a successful conversion may mean a visitor registering on the site, on the forum, or for a marketing event, subscribing to a mailing list, downloading software, or performing some other expected action.

The concept of conversion rate applies not only to electronic media and online conversion, but to any case where attracting customers is not the final goal and it is more important to benefit from the attracted customers, as the end result of a multi-stage (attract-engage-sell) marketing process of working with clients.

K = N / N0 * 100%, where

K - conversion rate;
N - the number of real buyers (customers who bought goods or used the service);
N0 - the number of visitors to the store or site.
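The formula is simple enough to check directly. Here is a small sketch using the bookshop numbers from the definition above. For the portal, I am assuming the analogous calculation would be total downloads divided by total views, and the function name is mine.

```python
def conversion_rate(buyers, visitors):
    """K = N / N0 * 100, returned as a percentage."""
    if visitors <= 0:
        raise ValueError("visitors must be positive")
    # Multiply first so that simple cases like 35 of 500 come out exactly.
    return 100 * buyers / visitors

# Bookshop example from the text: 35 books sold to 500 visitors.
book_shop = conversion_rate(35, 500)
print(book_shop)  # 7.0
```

For the portal, `buyers` would be the number of downloads and `visitors` the number of views, under the store analogy used throughout.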

For the open data portal, the conversion rate comes out at about 3%. Whether that is a lot or a little, everyone can decide for themselves.

Findings


Only about 3% of the data sets are really interesting to anyone. At the same time, almost half are viewed from once a week to once a day.

Half of the data sets were never downloaded by anyone.

Less than 1% of the data sets are really in demand.

What's next?


Next we will look at how to evaluate the data sets and check whether the links to the data sets work. We will see how often the data sets are updated and how large the data set files are, and whether there is a relationship between the file format of a data set and the number of downloads.

P.S. As an illustration, I have published several analytical dashboards.
Resources are limited, so errors may occur during loading.
Leave your feedback in the comments.

Source: https://habr.com/ru/post/401543/

