
Random databases. Oracle Enterprise Data Quality: the shield and sword of the corporate repository

Human thinking is hard to describe mathematically. Any business task generates a set of formal and informal documents, and the information in them ends up in the corporate repository. Each task that starts an information process surrounds itself with documents and with logic for processing them, and that logic is only loosely formalized in the repository environment. A data warehouse therefore needs structures for cleaning the information flow. Oracle Enterprise Data Quality, a product designed to clean "dirty" data, can help here, but its use is not limited to cleaning.

1. The concept of a random database.

A person's business relations are described by formal and informal documents such as a statement, a declaration, an employment contract, an application for placement, or a resource request. These documents create logical links between business processes but, as a rule, they are the product of office managers' thinking and are poorly formalized.

The task of any complex optimization is not only to understand the formal and informal rules but, often, to bring disparate knowledge into a common information base.

Definition. A random database is a set of facts, documents, handwritten notes, and formal documents that a person processes for a particular business process, but that cannot be processed fully automatically because of the strong influence of the human factor.

Example. The secretary formally takes a call. The caller is interested in a product or service. The caller is not known to the CRM. Question: what should the caller say in order to be put through to a specialist?

To be more precise: how can the secretary’s business instructions allow for a formal dialogue about the business if the responsible specialist is not ready for this type of activity?

It turns out that we again come to the definition of a random database.

It may contain more facts than the secretary can know, but it cannot contain superfluous information. In general, when random facts from a random database arrive at the input of a formalized system, information overload occurs, and that overload can affect the performance not only of the secretary but of the entire company.

If a machine is used for processing, then, reading the states of this information and reasoning by logical inference, it ends up in a state a person would avoid: information overload. Human logic is more flexible.

2. Application of the definition to real problems.

Imagine a store in which the price tags on random goods are noticeably too high or too low. On leaving the store, an inexperienced shopper will remember the prices of only 5-7 (or even 3) of the most popular goods on the shopping list, those that can affect the size of the total bill. It turns out that if one knew which goods' prices shoppers remember most often, the remaining prices could be varied within a relatively wide range.

Have you ever wondered why, before Lent, meat first becomes sharply cheaper, then rises sharply in price, and then disappears? The price of a product whose demand may fall to zero is first artificially heated up; then, once demand passes a certain level, it is held fixed; and after a while it is pushed up, since greed does not allow illiquid goods to be sold at a fair price.

The situation is almost the same in the data market. The most useful information almost always lies beneath a layer of secondary hypotheses about its applicability and recoverability.
It is enough to post information that interests 5,000-7,000 people on any relatively unprotected resource, and copy-paste sites will certainly pick it up.

Or take the well-known game with phone codes, "Who called me?". About a thousand Runet sites consist of nothing but telephone numbers of various operators, published in order to rank a little higher in search results and so to sell at least the domain name and advertising at a higher price.

3. The cost of working with "dirty" data.

According to the author's research, up to 10% of the effort on every project goes into writing data cleaning procedures. Even setting aside the entirely banal checks of type and length, there remain unique identifiers, database integrity rules and business integrity rules, quantitative and qualitative unit scales, systems of labor-intensity units, and all kinds of other states, influences, and transitions, whose compilation requires statistical, logical, and serious business analysis alike. Formalizing the requirements leads to the need to formalize the fact-dimension relationship, both for building repositories and for solving front-end issues.
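
As a rough illustration of the simplest checks named above (type, length, unique identifiers), here is a minimal Python sketch. The record layout, the field names `client_id` and `name`, and the length limit are assumptions made for the example; they come neither from the article nor from Oracle EDQ.

```python
from collections import Counter

# Hypothetical client records; field names and values are invented for illustration.
records = [
    {"client_id": 1, "name": "Ivanov Ivan"},
    {"client_id": 2, "name": ""},
    {"client_id": "3", "name": "Petrov Petr"},   # wrong type: id stored as text
    {"client_id": 1, "name": "Sidorov Sidor"},   # duplicate identifier
]

MAX_NAME_LENGTH = 100  # assumed limit


def check_records(rows):
    """Return (row_index, problem) pairs for basic type, length, and uniqueness rules."""
    problems = []
    # Count only well-typed identifiers so that uniqueness is checked on valid ids.
    id_counts = Counter(
        row["client_id"] for row in rows if isinstance(row["client_id"], int)
    )
    for index, row in enumerate(rows):
        client_id, name = row["client_id"], row["name"]
        if not isinstance(client_id, int):
            problems.append((index, "client_id is not an integer"))
        elif id_counts[client_id] > 1:
            problems.append((index, "client_id is not unique"))
        if not isinstance(name, str) or not 0 < len(name.strip()) <= MAX_NAME_LENGTH:
            problems.append((index, "name is empty or too long"))
    return problems


for index, problem in check_records(records):
    print(index, problem)
```

Even this toy version shows why such procedures multiply: every additional rule (business integrity, unit scales, and so on) needs its own check and its own decision about what to do with the rows that fail it.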

Agree: if ETL processes take up 70% of the operating time of any repository, then saving 5-7% of resources through proper data cleaning on a notional store of 200,000 clients is a good bonus, isn't it?

Let us touch briefly on "dirty" data in ready-made systems. Say you mail greetings for a national holiday to 10,000 clients. How many of them will throw your letter away, best postcard and all, right at the mailbox if you get the first name or surname wrong or fill in a field of the form incorrectly? A single mistake like that can bring both the value of your efforts and the recipient's mood down to zero!

4. Oracle Enterprise Data Quality - shield and sword of the corporate repository.

The screenshots we provide describe the capabilities of Oracle Enterprise Data Quality.

So, suppose someone has spilled water on your database or text document.


Here is a list of standard processors (logical units that let you check data against hypotheses or search for what you need):


Random database profiler action:


Basic check of financial solvency:


Work with zip code:


Cleaning the mailing address:


Cleaning user data:


Assigning a record to a certain confidence interval:


Determining user gender from indirect data:


Determining the city, state, and country:


Simplest search for keys in a random database:


Deduplication of user data:
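
Outside EDQ, the basic idea of deduplication can be sketched in a few lines of Python. This is only an illustration under assumed rules (two records match when their name and e-mail coincide after collapsing whitespace and ignoring case); it is not how the EDQ processors are implemented, and the sample records are invented.

```python
def normalize(value):
    """Collapse whitespace and ignore case so trivially different spellings compare equal."""
    return " ".join(value.split()).casefold()


def deduplicate(rows):
    """Keep the first record for every normalized (name, email) key."""
    seen = set()
    unique = []
    for row in rows:
        key = (normalize(row["name"]), normalize(row["email"]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented sample data.
users = [
    {"name": "Ivan  Ivanov", "email": "Ivan@Example.com"},
    {"name": "ivan ivanov ", "email": "ivan@example.com"},
    {"name": "Petr Petrov", "email": "petr@example.com"},
]

print(len(deduplicate(users)))
```

Exact matching on a normalized key catches only the easiest duplicates; misspellings and transposed fields would need fuzzy matching rules on top of it.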


5. Amusing observations from working with Oracle EDQ.

One way to compare the contributions of prose writers and poets to literature is to compare their vocabularies. Here are several frequency dictionaries compiled in our spare time while testing ready-made solutions for Oracle EDQ, Python, and Java. We would be grateful if philologists among the readers posted their own results in the comments.

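Sources: Leo Tolstoy, "War and Peace" (fragment of the author's frequency dictionary); I. Brodsky, "Urania"; I. Brodsky, complete works (fragment of the author's frequency dictionary); N. Nekrasov, complete works (fragment of the frequency dictionary). Each pair of columns gives a word and its number of occurrences.

| No. | Tolstoy, "War and Peace" | Freq. | Brodsky, "Urania" | Freq. | Brodsky, complete works | Freq. | Nekrasov, complete works | Freq. |
|-----|--------------------------|-------|-------------------|-------|-------------------------|-------|--------------------------|-------|
| 1   | and        | 10351 | at       | 1037 | at         | 5745 | and        | 3420 |
| 3   | at         | 5185  | and      | 647  | and        | 4500 | at         | 2108 |
| 4   | not        | 4292  | not      | 391  | not        | 3022 | not        | 1726 |
| 5   | what       | 3845  | on       | 341  | on         | 2239 | I          | 1040 |
| 6   | he         | 3730  | as       | 329  | as         | 1758 | with       | 883  |
| 7   | on         | 3305  | with     | 237  | with       | 1674 | on         | 854  |
| 8   | with       | 3030  | what     | 168  | what       | 1531 | as         | 763  |
| 9   | as         | 2097  | to       | 148  | AND        | 1200 | what       | 693  |
| 10  | I          | 1896  | from     | 147  | I          | 1040 | he         | 644  |
| 11  | him        | 1882  | of       | 104  | to         | 922  | you        | 475  |
| 12  | to         | 1771  | I        | 90   | from       | 810  | but        | 472  |
| 13  | that       | 1600  | Where    | 88   | everything | 748  | but        | 449  |
| 14  | she is     | 1564  | than     | 88   | by         | 744  | So         | 383  |
| 15  | but        | 1234  | behind   | 76   | you        | 721  | to         | 367  |
| 16  | this       | 1208  | by       | 74   | AT         | 713  | everything | 344  |
| 17  | said       | 1135  | But      | 72   | behind     | 687  | behind     | 313  |
| 18  | It was     | 1125  | neither  | 70   | of         | 635  | to me      | 309  |
| 19  | So         | 1032  | would    | 69   | but        | 617  | Yes        | 294  |
| 20  | the prince | 1012  | that     | 67   | he         | 592  | him        | 275  |
| 21  | behind     | 985   | you      | 67   | But        | 584  | that       | 232  |
| 22  | but        | 962   | about    | 66   | that       | 540  | was        | 229  |
| 23  | his        | 918   | but      | 63   | about      | 538  | by         | 224  |
| 24  | everything | 908   | there is | 61   | this       | 524  | not        | 223  |
| 25  | by         | 895   | I        | 61   | I          | 489  | neither    | 222  |
| 26  | her        | 885   |          |      | but        | 463  | about      | 213  |
| 27  | of         | 845   |          |      | Where      | 449  | their      | 212  |
| 28  |            |       |          |      | than       | 443  | of         | 209  |
| 29  |            |       |          |      | BUT        | 428  | from       | 207  |
| 30  |            |       |          |      | same       | 422  | we         | 206  |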




Conclusion: over the past hundred years the statistics of the Russian language have hardly changed in terms of the frequency of individual words, and poets use more "melodious" words. Incidentally, Darya Dontsova's statistics largely coincide with Leo Tolstoy's in the frequency dictionary of the complete works.
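
A frequency dictionary of this kind can be built with a few lines of Python. The sketch below is an assumption about the approach, not the code that produced the figures in this section: it lowercases the text, treats any run of letters as a word, and counts occurrences with `collections.Counter`. The sample sentence is a placeholder; in practice the text would be read from a file.

```python
import re
from collections import Counter


def frequency_dictionary(text, top=10):
    """Return the `top` most frequent words as (word, count) pairs."""
    # [^\W\d_]+ matches runs of Unicode letters, so Cyrillic text works as well.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(top)


sample = "The prince said that the prince was not there"  # placeholder text
for rank, (word, count) in enumerate(frequency_dictionary(sample), start=1):
    print(rank, word, count)
```

Lowercasing merges forms such as "But" and "but"; whether to do so is a design choice that changes the ranking, so it should be stated whenever results are compared.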

6. Several formal calculations as a conclusion.

About 60 thousand people named Ivan Ivanovich Ivanov live in our country. Suppose that an average database hypothetically stores 100 tables, with 10 key fields in each table, and that each key can take 60,000 values. Then the total number of unique key states within the database is approximately 60 million. If two keys are confused in even one table, they can generate up to 20 unique states in that table. In total, the number of such states across the database can run into several thousand. Agree that spending 10% of development time and 5-7% of ETL execution time to catch such trifles is an unaffordable luxury?
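
The estimate above is simple arithmetic; the Python sketch below only restates it with the figures from the text. Reading "several thousand" as 20 states in each of the 100 tables is our interpretation, not something the text states explicitly.

```python
TABLES = 100              # tables in the hypothetical database
KEYS_PER_TABLE = 10       # key fields per table
VALUES_PER_KEY = 60_000   # values each key can take

# Total number of unique key states, counted the way the text counts them.
total_key_states = TABLES * KEYS_PER_TABLE * VALUES_PER_KEY
print(total_key_states)   # 60000000

# Up to 20 states per table when two keys are confused (figure from the text).
# Multiplying by the number of tables is an assumed reading of "several thousand".
STATES_PER_CONFUSED_TABLE = 20
confused_states = TABLES * STATES_PER_CONFUSED_TABLE
print(confused_states)    # 2000
```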

Source: https://habr.com/ru/post/444700/

