Repeated use of data from a single source, even with minor changes in content, structure, and format, requires solving a range of instrumental, informational, engineering, managerial, and legal problems.
A single application of a data set can be handled in “manual” mode. But repeated and subsequent updates based on the same refreshed sets already force us to consider at least partial automation. An even higher level of tasks arises within a system that consumes many periodically updated public data sets from different suppliers.

This publication is another in a series on the topic of public data.
Obtaining and using data that is provided to an unlimited or conditionally limited circle of users is somewhat more difficult than working under a closed data transfer scheme: the dependence on the supplier is strong, while the interaction with it is minimal.
Let us turn to several principles for managing the use of public data.
Strategy
Developing a public data strategy is, for the recipient, the basis of rational activity aimed at extracting worthwhile and high-quality results. Obviously, if the user organizes this work seriously and competently, such a strategy follows from and continues the strategies for finding new data and knowledge, for knowledge management and business analytics, and for the scientific and technological development of the business as a whole.
Of course, specific cases of a “trial” search for and use of subject-area public data to solve a pressing special task are not excluded. If the need comes down only to the operational task of obtaining missing or clarifying data that is available openly and free of charge, then there is clearly no real motivation for building a whole system of “mining and processing public figures”. Nevertheless, even in such cases it is sometimes useful to understand what problems a one-time use of a public data set can involve.
Strategically, the following directions are important for a regular recipient of public data.
- Defining the objectives of data acquisition and the key subject areas within which new digital sets are sought. This must be tied to internal issues and to the business-intelligence system. In a sensible business, downloaded public data is applied specifically to economic and managerial analytics using proprietary or leased software.
- Formulating the large subtasks of data transfer under the public scheme, in accordance with the goals and subject areas, with a preliminary forecast of the expected results.
- Formalizing the criteria for selecting data to search for and obtain, covering content, structural, and format aspects, possibly even in the form of internal closed or public regulations (standards, rules); a sketch of such a machine-readable checklist follows this list.
- A plan for searching and selecting data, at the level of general principles or even of individual activities. Some professional public data providers may find it interesting to learn such plans of active and authoritative recipients.
- Building a system of direct and consistent quality control of public data. Through certain key and auxiliary operations across the entire processing pipeline, it should monitor quality comprehensively and, where necessary, make timely adjustments or mark the data as not applicable. It is important here to be able to give the supplier feedback on critical problems found in the data.
- A public data supervisor: a separate control and coordination function whose purpose is the overall and problem-oriented evaluation of the search and acquisition process against the user's goals. The “supervisor” needs defined benchmarks and the ability to observe and intervene not only in the direct data selection procedures, but also in the processes and objects within the user organization that absorb, or may absorb, the direct effect of new decisions and knowledge (products and services).
- Personnel support for public data, both by assigning the function to dedicated positions and by reasonably extending the duties of existing ones. The competence of individual employees in the field of public data should not be forgotten.
- Tooling support for searching, selecting, acquiring, and applying data, necessitated by the complexity of working directly with digital data sets.
- Technical support for data acquisition: evaluating and additionally allocating machine resources (storage, computing power, specialists).
- Legal support for obtaining and applying data, both at the level of accepting the general contract (terms of use) of public data transfer set by the supplier, and at the subsequent levels of processing and re-transferring the data or results based on it.
- Marketing support for data acquisition: identifying issues for potential suppliers and encouraging them to freely distribute and update digital data sets.
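For illustration only, the selection criteria mentioned above could be formalized as a machine-readable checklist. A minimal sketch, in which every field name, threshold, and license value is a hypothetical example rather than a prescribed standard:

```python
# Hypothetical internal "data selection regulation" as a machine-readable
# checklist; all field names and values here are illustrative assumptions.
SELECTION_CRITERIA = {
    "content": {"subject_areas": ["regional trade statistics", "logistics"]},
    "structure": {"required_fields": ["period", "region", "value"]},
    "format": {"accepted_formats": ["csv", "json"],
               "accepted_encodings": ["utf-8"]},
    "legal": {"accepted_licenses": ["CC-BY", "CC0", "ODbL"]},
}

def meets_criteria(dataset_card: dict) -> bool:
    """Check a found data set's metadata card against the regulation."""
    fmt_ok = dataset_card.get("format", "").lower() in SELECTION_CRITERIA["format"]["accepted_formats"]
    lic_ok = dataset_card.get("license") in SELECTION_CRITERIA["legal"]["accepted_licenses"]
    fields_ok = set(SELECTION_CRITERIA["structure"]["required_fields"]) <= set(dataset_card.get("fields", []))
    return fmt_ok and lic_ok and fields_ok

# Example: a card for a candidate set, as it might be taken from a catalog.
print(meets_criteria({"format": "CSV", "license": "CC-BY",
                      "fields": ["period", "region", "value", "unit"]}))
```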
It is worth noting that some of these items coincide with the public data provider's strategy, but with exactly the opposite direction. This is a consequence of a certain “mirrored” construction of the supplier's and the recipient's strategies.
The basic purpose of a competent public data management strategy for the recipient is, by and large, to effectively search for or retrieve the necessary data and then use it within its own business analytics in order to identify and formulate new knowledge (creating solutions, products, services, etc.) in free mode. Recipients (users) of public data differ, and each has its own strategy. If a large corporation collects and uses free digital data, it focuses on consistency, scale, algorithms, and competencies. If a private person (an expert or entrepreneur) does the same, he will most likely focus on specifics and one-time results.
Search
On the one hand, in our “digital” world there are almost no problems left with finding answers to simple textual questions. It is enough to enter a well-formed query into a search service, and then spend some time reviewing the returned links and iteratively refining the query.
On the other hand, the search for digital data sets is a completely different task, one that has to be solved in several different ways, turning not so much to a search service as to the domain from which the data is required. Dedicated search engines for public data are not yet in evidence, but consolidated catalogs and entire portals are already emerging. The community of experts and the exchange of links help significantly.
In many ways, the problem of data retrieval, beyond actually discovering sets on the desired subject, lies in determining and confirming the quality of the information found and in answering the question “can this data be used to solve my problem?”. It is therefore important to find data that is suitable in the sense of being accompanied by detailed metadata and, better still, by a reliable quality assessment.
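As an illustration of searching through a consolidated catalog: many government portals (data.gov among them) expose the CKAN API, whose package_search action returns dataset cards together with license and format metadata. A minimal sketch, assuming network access and the public data.gov endpoint:

```python
import requests

# Query a CKAN-based open data catalog; data.gov is one such portal.
CATALOG = "https://catalog.data.gov/api/3/action/package_search"

resp = requests.get(CATALOG, params={"q": "air quality", "rows": 5}, timeout=30)
resp.raise_for_status()

for pkg in resp.json()["result"]["results"]:
    # The metadata card answers part of "can I use this data?":
    # license, formats, and resource links come with each package.
    print(pkg["title"], "|", pkg.get("license_title"))
    for res in pkg.get("resources", []):
        print("   ", res.get("format"), res.get("url"))
```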
In this regard, understanding the type of public data also helps, in a certain way, to resolve the question of its applicability in a given situation. For example, the contents of shared data should be trusted with caution and with mandatory verification, at least against a number of simple criteria, judging on the principle of “believe it or not”. Attention can be focused on aggregated figures for the entire data set or for individual samples.
The recipient (user) should always monitor the public data at its source for possible changes. One must be prepared for the data to change, and the period during which the data remains conditionally stable can only be established by trial. Acting within this variability of the meaning, structure, and format of public data, one has to organize the work in a special way and choose more universal processing tools.
As a rule, the search for public data is always driven by content, and if the right sets of digital data can be found for a given topic, that is already good. Remember, however, that in data, besides the meaning, the structure and format are also important. Yet it seems an unaffordable luxury to refuse the data found just because the recipient is not satisfied with the structure of its organization or with one of the layers of its format. No matter: the user will apply restructuring and reformatting tools, provided, of course, he finds suitable ones (a sketch of such restructuring follows this paragraph). Meanwhile, the supplier solves this problem easily by replacing the static way of publishing data with a dynamic one, i.e. data files are replaced with an API offering different upload options. On the other hand, searching for a public data set packaged in a file and searching for an API from which the same public data set can be obtained on demand are two different stories.
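A minimal sketch of such restructuring with pandas, assuming a hypothetical downloaded CSV whose source column names, separator, and encoding differ from the recipient's target schema:

```python
import pandas as pd

# Bring a found data set to the structure the recipient needs; the file
# names, source columns, and target schema are illustrative assumptions.
raw = pd.read_csv("downloaded_set.csv", encoding="cp1251", sep=";")

target = (
    raw.rename(columns={"PERIOD_ID": "period", "REG_NAME": "region", "VAL": "value"})
       .astype({"value": "float64"})
       .loc[:, ["period", "region", "value"]]  # keep only the target structure
)
target.to_csv("restructured_set.csv", index=False, encoding="utf-8")
```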
Loading
When the user has found the necessary data and received a free copy of it onto his own storage, he has successfully carried out the so-called download, and with it discovered an endless sea of pleasures in solving various related problems.
What else will he have to do, besides simply obtaining the set of digital data, if he is trying to do things properly? Well, for example, he could additionally (a sketch automating several of these steps follows this list):
- formalize the successful and efficient way of searching for and finding the necessary data (when, in what sequence, with what series of queries, through which links, and what produced the result);
- record the time and place the data set was received, as well as the supplier and the conditions of data distribution;
- check the data format at each level (encoding, notation, schema);
- get and save the maximum available metadata related to the target data set;
- try to extract, from the environment in which the target data is located, additional possible metadata, references, or context descriptions;
- review the explicitly indicated or indirectly designated data context;
- obtain an assessment of the data's quality and give one's own preliminary assessment of the quality of the data found;
- find out the possible and preferred channels of feedback to the data provider (the data's owner or author);
- preliminarily determine the need for further updated data.
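A sketch of automating several of these steps at download time; the URL, file paths, license note, and record fields below are illustrative assumptions, not a fixed scheme:

```python
import datetime
import hashlib
import json
import pathlib

import requests

DATA_URL = "https://example.org/open-data/trade_2024.csv"  # hypothetical source

raw_bytes = requests.get(DATA_URL, timeout=60).content
pathlib.Path("raw").mkdir(exist_ok=True)
pathlib.Path("raw/trade_2024.csv").write_bytes(raw_bytes)  # original, byte for byte

provenance = {
    "source_url": DATA_URL,
    "retrieved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "sha256": hashlib.sha256(raw_bytes).hexdigest(),  # later proves authenticity
    "license": "CC-BY, as stated by the supplier",     # conditions of distribution
    "search_path": "portal catalog -> query 'trade statistics' -> third result",
}
pathlib.Path("raw/trade_2024.provenance.json").write_text(json.dumps(provenance, indent=2))
```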
And the more systematically the recipient (user) tries to organize his work with public data, the more clearly and consciously he will have to do these and other things not afterwards, but already at the moment of loading the found data.
In most cases, to confirm the quality and/or authenticity of the data, as well as for subsequent back-auditing, it is recommended to keep a direct copy of the downloaded data in an accessible place in the repository. It is best to do so in exactly the form and format in which the set of digital data was received from the supplier (from a network resource). Intermediate processing results can later be saved as needed, but the primary version is especially important if the supplier makes changes without notice or reservation. An exception can be made for public data that is unlikely to change in the future, when storing the original copy of the download incurs unacceptable storage costs. Even in this case, however, the risk of an “unexpected change in immutable data” remains, along with the likelihood of the adverse situations associated with it.
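Continuing the previous sketch, the saved hash makes it cheap to detect such an “unexpected change in immutable data” later; the paths and fields remain illustrative:

```python
import hashlib
import json
import pathlib

import requests

# Re-check a "supposedly immutable" set against the hash recorded in the
# provenance record from the sketch above (paths are illustrative).
record = json.loads(pathlib.Path("raw/trade_2024.provenance.json").read_text())
current = hashlib.sha256(requests.get(record["source_url"], timeout=60).content).hexdigest()

if current != record["sha256"]:
    # The supplier changed the file without notice: flag it for review
    # rather than silently feeding the new version into downstream analytics.
    print("WARNING: source data changed since the archived copy was made")
```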
The recipient of the data decides independently how his solution (product or service) will be built with respect to possible adjustment of the source data: whether it should be dynamic with respect to the data source, or static, working on data snapshots. Each option has its own advantages and risks.
Implementation
The processing and analysis of public data is very rarely limited to the received sets alone. Usually this process involves the full array of accumulated information, including additional internal or previously obtained data, structured in a targeted way.
Even if the development is carried out exclusively on public data, sets from different sources are mixed and “seasoned” with previous calculations, estimates, and aggregates. Therefore, with respect to a system of economic and managerial analysis built up over a long period, we can speak of the implementation of the loaded data sets into the general array of available (stored) information.
Three general schemes for implementing received public data into a common repository can be distinguished (a toy sketch of all three follows this list):
- capacitive (or historical): incrementally expanding the repository and saving all changes in the received data within a given subject area, including supporting all versions of structures and formats;
- managed (or updated): changing exactly the actual content, structure, and layout of the received data;
- user (or target): changing to reflect changes in the content, structure, and layout of the resulting subject data, while built, on the whole, to be independent of the sources and dependent on the existing objectives (goals).
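A toy sketch of the three schemes side by side, using in-memory structures in place of a real repository; all record and field names are illustrative:

```python
import datetime

# Toy stand-ins for a real repository, one per implementation scheme.
history: list[dict] = []       # capacitive (historical): nothing is deleted
current: dict[str, dict] = {}  # managed (updated): latest version only
target: dict[str, float] = {}  # user (target): only what the goal needs

def ingest(record: dict) -> None:
    key = record["region"]
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()

    # Capacitive: append every received version with a load stamp.
    history.append({**record, "loaded_at": stamp})

    # Managed: overwrite with the actual content, structure, and layout.
    current[key] = record

    # User: keep only the projection the analytical goal requires,
    # independent of how the supplier structured the source.
    target[key] = float(record["value"])

ingest({"region": "North", "period": "2024-Q1", "value": "12.5"})
```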
Traditionally, the implementation of data into one's own repositories and analytics models involves active processing: filtering, intermediate calculations, adjustments, filling gaps. This is not yet the direct processing and analytics of the data, but only a procedure for bringing it to a common denominator. And that common denominator depends on the specific goals, on the characteristics of the data's content, structure, and format, and on the tasks and parameters of the repository. It will probably also be necessary to form data “snapshots”, historical cuts that make it possible to control data quality along the chain and, if necessary, restore individual elements.
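A minimal sketch of such a “common denominator” step with a dated snapshot, assuming the restructured CSV from the earlier sketch and purely illustrative cleaning rules:

```python
import datetime
import pathlib

import pandas as pd

df = pd.read_csv("restructured_set.csv")

df["value"] = df["value"].interpolate()  # fill gaps between known points
df = df[df["value"] >= 0]                # a simple illustrative filtering rule

# Cut a dated historical snapshot so earlier states can be restored later.
pathlib.Path("snapshots").mkdir(exist_ok=True)
stamp = datetime.date.today().isoformat()
df.to_csv(f"snapshots/trade_{stamp}.csv", index=False)
```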
In addition to all this, at this stage it is already necessary to form additional derived internal metadata for the implemented public data set, according to the regulations of the corresponding repository and analysis model.
Implementation, as the preparation of data and its bringing to the target working condition, is an important step that requires professionalism and effective tools.
Feedback
As with a public data provider, two levels of feedback can be distinguished for the recipient (user).
At the first, simple level, the recipient of digital data returns to the supplier his opinion of the quality and quantity of the downloaded figures, sometimes accompanying it with wishes for subsequent publications.
The second, more complex level is the return to the supplier of the knowledge and solutions (products or services) obtained, including those built with the sets he published, in exchange for a new or additional portion of quality data, or for a new level of data quality.
Such a link can even grow into something more than just sharing data, knowledge, and competencies, but that is a matter of development and of combining business interests.
One of the indirect methods of complex feedback between the recipient and the public data provider is targeted re-transfer: retransmission to third parties in original or processed form, possibly even in the form of new solutions (knowledge). By observing the re-transfer conditions set by the supplier, such an intermediary, firstly, can notify the supplier of the redistribution and, secondly, expands the circle of competent contacts by involving new participants in the work with public data. This scheme makes it possible to gauge interest in the data and to reach a large target audience. Tracking such chains, however, requires the supplier to organize the relevant business processes at a sufficiently high level.