Random databases. Oracle Enterprise Data Quality: the shield and sword of the corporate repository

A person's thinking process is hard to express mathematically. Any business task generates a set of formal and informal documents whose information is reflected in the corporate repository. Each task that gives rise to an information process surrounds itself with a set of documents and with logic for processing them, and that logic is poorly formalized in the corporate repository environment. The data warehouse should therefore contain structures for cleaning the information flow. Oracle Enterprise Data Quality, a product designed to clean "dirty" data, can help here, but its use is not limited to that.

1. The concept of a random database.

A person's business relations are described by formal and informal documents such as a statement, a declaration, an employment contract, a placement request, or a resource request. These documents create logical connections between business processes but, as a rule, they are the product of office managers' thinking and are poorly formalized.
The task of any complex optimization is not only to understand formal and informal rules but often also to bring disparate knowledge into a common information base.

Definition. A random database is a set of facts, documents, handwritten notes, and formal documents that a person processes for a particular business process but that cannot be processed fully automatically because of the strong influence of the human factor.

Example. A secretary takes a call. The caller is interested in a product or service. The caller is not known to the CRM. Question: what should the caller say in order to be heard by a specialist?

More precisely: how can the secretary's business instructions support a formal dialogue about the business if the responsible specialist is not prepared for this type of activity?

It turns out that we arrive once again at the definition of a random database.

It may contain more facts than the secretary can know, but it cannot contain superfluous information. In general, when random facts from a random database arrive at the input of a formalized system, information overload occurs, and that overload can affect the performance not only of the secretary but of the entire company.

If a machine is used to process it, the machine reads the states of this information and, drawing logical conclusions, arrives at a state a person would not: information overload. Human logic is more flexible.

2. Application of the definition to real problems.

Imagine a store in which the price tags on random goods are noticeably overpriced or underpriced. On leaving the store, an inexperienced buyer remembers the prices of only 5-7 (or even 3) of the most popular goods on the shopping list, those whose price can affect the total bill. It follows that if one knew which goods' prices buyers remember most often, the remaining prices could be varied within a relatively wide range.

Have you ever wondered why, before Lent, meat first becomes sharply cheaper, then can rise sharply in price, and then disappears? The price of a product whose demand may fall to zero is first artificially heated up; then, once demand passes a certain level, it is held fixed; and after a while it is forced up, because greed does not allow illiquid goods to be sold at a fair price.

The situation in the data market is almost the same. The most useful information almost always lies beneath a surface of secondary hypotheses about its applicability and recoverability. Post any information of interest to 5,000-7,000 people on any relatively unprotected resource, and copy-paste sites will certainly pick it up.

Or take the well-known game with phone codes, "Who called me?". About a thousand sites on the Runet consist of nothing but telephone numbers of various operators, just to rank a little higher in search results and to sell at least the domain name and advertising at a higher price.

3. The cost of working with "dirty" data.

According to the author's own research, up to 10% of each project's labor goes into writing data cleaning procedures. Even setting aside the trivial checks of type and length, there remain unique identifiers, database integrity rules and business integrity rules, quantitative and qualitative unit scales, systems of units of labor intensity, and all the other states, influences, and transitions, whose compilation requires statistical, logical, and serious business analysis alike. Formalizing the requirements leads to the need to formalize the fact-dimension relationship, both for building repositories and for solving front-end issues.
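The trivial checks named above (type, length, unique identifiers) can be sketched in a few lines of Python. This is only an illustration of the kind of hand-written cleaning procedure the paragraph describes, not Oracle EDQ code; the `Rule` class, the `validate` function, and the field names `client_id` and `name` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    """A hypothetical per-field rule: expected type and maximum length."""
    field: str
    expected_type: type
    max_length: int


def validate(records, rules, key_field):
    """Return a list of (row_index, field, problem) tuples."""
    problems = []
    seen_keys = set()
    for index, record in enumerate(records):
        for rule in rules:
            value = record.get(rule.field)
            if not isinstance(value, rule.expected_type):
                problems.append((index, rule.field, "wrong type"))
            elif len(str(value)) > rule.max_length:
                problems.append((index, rule.field, "too long"))
        # Uniqueness check on the identifier field.
        key = record.get(key_field)
        if key in seen_keys:
            problems.append((index, key_field, "duplicate key"))
        seen_keys.add(key)
    return problems


# Made-up sample data, for illustration only.
records = [
    {"client_id": 1, "name": "Ivanov Ivan"},
    {"client_id": 1, "name": "Petrov Petr"},
    {"client_id": "3", "name": "S" * 60},
]
rules = [Rule("client_id", int, 10), Rule("name", str, 50)]
problems = validate(records, rules, "client_id")
for problem in problems:
    print(problem)
```

Even this toy version needs a rule table per field, which hints at why such procedures absorb a noticeable share of project labor once business integrity rules are added.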

You will agree that if ETL processes take up 70% of a repository's operating time, then saving 5-7% of resources by cleaning data properly, on a notional repository of 200,000 clients, is a good bonus.

Let us touch briefly on "dirty" data in ready-made systems. Say you mail greetings on a national holiday to 10,000 clients. How many of them will throw away your letter, best postcard and all, if you get the first name or surname wrong or fill in the form incorrectly? A mistake like that can bring the value of your efforts, and the client's mood, down to zero!

4. Oracle Enterprise Data Quality - shield and sword of the corporate repository.

The screenshots we provide describe the capabilities of Oracle Enterprise Data Quality.

So, suppose someone has spilled water on your database or text document.

Here is a list of standard processors (logical units that let you test hypotheses about the data or search for what is required):

Random database profiler action:

Basic check of financial solvency:

Work with zip code:

Cleaning the mailing address:

Clearing user data:

Assigning a record to a certain confidence interval:

Determining user gender from indirect data:

Determining the city, state, and country:

Simplest search for keys in a random database:

Deduplication of user data:
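
As a rough illustration of what a deduplication step does, here is a minimal Python sketch. It is not Oracle EDQ's implementation or API: the functions `normalize` and `deduplicate` and the sample records are hypothetical, and the matching is deliberately naive (exact match after normalizing case, punctuation, and whitespace).

```python
import re


def normalize(value):
    """Casefold, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", value.casefold())
    return " ".join(text.split())


def deduplicate(records, key_fields):
    """Keep the first record per normalized key; return (unique, duplicates)."""
    seen = {}
    duplicates = []
    for record in records:
        key = tuple(normalize(record[field]) for field in key_fields)
        if key in seen:
            duplicates.append(record)
        else:
            seen[key] = record
    return list(seen.values()), duplicates


# Made-up sample data, for illustration only.
clients = [
    {"name": "Ivanov  Ivan", "city": "Moscow"},
    {"name": "ivanov, ivan", "city": "MOSCOW"},
    {"name": "Petrov Petr", "city": "Kazan"},
]
unique, duplicates = deduplicate(clients, ["name", "city"])
print(len(unique), "unique,", len(duplicates), "duplicate")
```

A production tool would add fuzzy matching and confidence scoring on top of this; the sketch only shows why normalization has to come before comparison.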

5. Amusing observations from working with Oracle EDQ.

One way to compare the contributions of prose writers and poets to literature is to compare their vocabularies. Below are several frequency dictionaries compiled in spare time while testing ready-made solutions in Oracle EDQ, Python, and Java. We would be grateful if philologists among the readers posted their own results in the comments.

No.	Word	L. Tolstoy, "War and Peace" (fragment of the author's frequency dictionary)	I. Brodsky, "Urania"	I. Brodsky, complete works (fragment of the author's frequency dictionary)	N. Nekrasov, complete works (fragment of the frequency dictionary)
1	and	10351	at 1037	at 5745	and 3420
3	at	5185	and 647	and 4500	at 2108
4	not	4292	not 391	not 3022	not 1726
5	what	3845	on 341	on 2239	I 1040
6	he	3730	as 329	as 1758	with 883
7	on	3305	with 237	with 1674	on 854
8	with	3030	what 168	what 1531	as 763
9	as	2097	to 148	AND 1200	what 693
10	I	1896	from 147	I 1040	he 644
11	him	1882	of 104	to 922	you 475
12	to	1771	I 90	from 810	but 472
13	that	1600	Where 88	everything 748	but 449
14	she is	1564	than 88	by 744	So 383
15	but	1234	behind 76	you 721	to 367
16	this	1208	by 74	AT 713	everything 344
17	said	1135	But 72	behind 687	behind 313
18	It was	1125	neither 70	of 635	to me 309
19	So	1032	would 69	but 617	Yes 294
20	the prince	1012	that 67	he 592	him 275
21	behind	985	you 67	But 584	that 232
22	but	962	about 66	that 540	was 229
23	his	918	but 63	about 538	by 224
24	everything	908	there is 61	this 524	not 223
25	by	895	I 61	I 489	neither 222
26	her	885		but 463	about 213
27	of	845		Where 449	their 212
28				than 443	of 209
29				BUT 428	from 207
30				same 422	we 206

Conclusion: over the past hundred years the statistics of the Russian language have hardly changed in terms of the frequency of individual words, and poets use more "melodious" words. Incidentally, the frequency dictionary of Darya Dontsova's complete works largely coincides with Leo Tolstoy's.
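
A frequency dictionary of this kind can be built in Python with the standard library alone. The sketch below is an assumption about the general approach, not the procedure actually used for the table; the function name `frequency_dictionary` and the sample sentence are hypothetical. The pattern `[^\W\d_]+` matches runs of letters in any alphabet, including Cyrillic.

```python
import re
from collections import Counter


def frequency_dictionary(text, top_n=30):
    """Return the top_n most frequent words as (word, count) pairs."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words).most_common(top_n)


# Made-up sample sentence, for illustration only.
sample = "And he said yes, and she said no, and he said nothing and left."
table = frequency_dictionary(sample, top_n=3)
for rank, (word, count) in enumerate(table, start=1):
    print(rank, word, count)
```

For a real corpus the text would be read from a file, and word forms would usually be reduced to a base form first, which this sketch does not attempt.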

6. A few formal calculations by way of conclusion.

About 60 thousand people named Ivanov Ivan Ivanovich live in our country. Suppose an average database stores 100 tables, each table has 10 key fields, and each key can take 60,000 values; then the total number of unique key states within the database is approximately 60 million. If two keys are confused in even one table, they can generate up to 20 unique states in that table. Across the whole database, the number of such states can run to several thousand. Do you agree that spending 10% of development time and 5-7% of ETL execution time on catching such trifles is an unaffordable luxury?
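
The arithmetic above can be checked in a few lines of Python. The figures are taken from the paragraph; the reading of "several thousand" as one confused pair of keys in every table is an assumption made for this sketch, and the variable names are hypothetical.

```python
TABLES = 100
KEYS_PER_TABLE = 10
VALUES_PER_KEY = 60_000
# Figure quoted in the text: states produced by two confused keys in one table.
CONFUSED_STATES_PER_TABLE = 20

# Total number of unique key states in the database.
total_key_states = TABLES * KEYS_PER_TABLE * VALUES_PER_KEY

# Assumption: every table contains one such pair of confused keys.
worst_case_confused_states = TABLES * CONFUSED_STATES_PER_TABLE

print(f"{total_key_states:,}")
print(worst_case_confused_states)
```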

Source: https://habr.com/ru/post/444700/
